CN116013251A - Audio simulation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN116013251A
CN116013251A (application CN202211704212.9A)
Authority
CN
China
Prior art keywords
information
text
phoneme
representation
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211704212.9A
Other languages
Chinese (zh)
Inventor
刘广厚
冯小琴
殷昊
陈云琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd
Priority to CN202211704212.9A
Publication of CN116013251A
Status: Pending


Abstract

The present disclosure provides an audio simulation method, apparatus, device, and storage medium. The method comprises: acquiring first phoneme information corresponding to a first text, and encoding the first phoneme-level information into a language representation; acquiring first text information corresponding to the first text, and encoding the first text information into a text feature representation; and adding acoustic features to the language representation encoded from the first phoneme-level information, based on the acoustic features in the speech synthesis model and the text feature representation corresponding to the first text, and predicting, by a decoder, a mel frequency spectrum from the language representation augmented with the acoustic features for audio output.

Description

Audio simulation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech synthesis, and in particular, to an audio simulation method, apparatus, device, and storage medium.
Background
Existing speech synthesis methods generally fall into two categories: sequence-to-sequence synthesis schemes based on autoregressive structures, and sequence-to-sequence synthesis schemes based on non-autoregressive structures.
Sequence-to-sequence synthesis schemes based on autoregressive structures, such as Tacotron 2, Transformer TTS, and Deep Voice, benefit from the autoregressive structure and synthesize speech with high naturalness and expressiveness, but suffer from slow inference.
Sequence-to-sequence synthesis schemes based on non-autoregressive structures, such as FastSpeech 2 and Glow-TTS, explicitly predict the duration of the synthesized speech and synthesize faster, but their rendering of the prosody of the synthesized audio is not fully satisfactory.
Both kinds of schemes are limited by the prosodic modeling capability of the model, and they find it difficult to synthesize speech close to a real dialogue on highly expressive dialogue data. In a real dialogue a person thinks while speaking, so the dialogue always contains spoken behaviors such as hesitation and filler words, and existing acoustic models have difficulty modeling these behaviors effectively.
Disclosure of Invention
The present disclosure provides an audio simulation method, apparatus, device, and storage medium to solve at least the above technical problems in the prior art.
According to a first aspect of the present disclosure, there is provided an audio simulation method, wherein the method comprises:
acquiring first phoneme information corresponding to a first text, and encoding the first phoneme-level information into a language representation;
acquiring first text information corresponding to the first text, and extracting a text feature representation from the first text information;
establishing, based on the acoustic feature information in the speech synthesis model and the text feature representation, a connection between the language representation encoded from the first phoneme-level information and the acoustic information, and performing audio output by predicting a mel frequency spectrum with a decoder from the language representation to which the acoustic information has been added.
In an embodiment, the method further comprises:
extracting a phoneme feature vector corresponding to the first text and a word-level feature vector corresponding to the first text;
and performing first model training on the phoneme-level feature vector and the word-level feature vector to generate the first phoneme information with the spoken language information label corresponding to the first text.
In an embodiment, the method further comprises:
setting corresponding quantity of user identity information according to the quantity of the dialogues in the first text;
and performing feature conversion on the corresponding quantity of user identity information to generate feature representation of the corresponding quantity of user identity information.
In an embodiment, obtaining the first text information corresponding to the first text includes:
and respectively acquiring the context information corresponding to each sentence in the first text, and taking the context information corresponding to each sentence as the first text information.
In an embodiment, establishing the connection between the language representation encoded from the first phoneme-level information and the acoustic information includes:
calling acoustic information in a speech synthesis model according to the text feature representation extracted from the context information corresponding to each sentence in the first text, the language representation encoded from the first phoneme-level information, and the feature representations of the corresponding number of user identity information, and establishing a connection between the language representation encoded from the first phoneme-level information and the acoustic information.
According to a second aspect of the present disclosure, there is provided an audio simulation apparatus, wherein the apparatus comprises:
the phoneme coding unit is used for obtaining first phoneme information corresponding to the first text and coding the first phoneme level information into language representation;
the text information coding unit is used for acquiring first text information corresponding to a first text and extracting text characteristic representations in the first text information;
an acoustic synthesis unit for establishing a connection between the language representation of the first phoneme level information encoding and the acoustic information based on the acoustic feature information and the text feature representation in the speech synthesis model;
an audio output unit, configured to perform audio output by predicting a mel frequency spectrum with the decoder from the language representation encoded from the first phoneme-level information to which the acoustic information has been added.
In an embodiment, the apparatus further comprises:
the phoneme labeling unit is used for extracting a phoneme feature vector corresponding to the first text and a word-level feature vector corresponding to the first text;
performing first model training on the phoneme-level feature vector and the word-level feature vector to generate first phoneme information with a spoken language information label corresponding to a first text;
the user identity information coding unit is used for setting corresponding quantity of user identity information according to the quantity of the dialogues in the first text;
and performing feature conversion on the corresponding quantity of user identity information to generate feature representation of the corresponding quantity of user identity information.
In an embodiment,
the text information coding unit is further used for respectively acquiring context information corresponding to each sentence in the first text, and taking the context information corresponding to each sentence as the first text information;
the acoustic synthesis unit is further configured to invoke acoustic information in the speech synthesis model according to the text feature representation extracted from the context information corresponding to each sentence in the first text, the language representation encoded by the first phoneme level information, and the feature representation of the corresponding number of user identity information, and establish a relationship between the language representation encoded by the first phoneme level information and the acoustic information.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods described in the present disclosure.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the present disclosure.
With the audio simulation method, apparatus, device, and storage medium of the present disclosure, by labeling spoken events and introducing context semantic information to simulate the dialogue context, audio close to real speech can be synthesized; at the same time, because the audio is synthesized with a non-autoregressive model, synthesis is fast and the real-time rate requirement can be met.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic diagram of an implementation flow of an audio simulation method according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of an implementation flow of model training for an audio simulation method in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic diagram showing the constitution of an audio simulation device according to an embodiment of the present disclosure;
fig. 4 shows a schematic diagram of a composition structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order to make the objects, features and advantages of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure will be clearly described below in conjunction with the accompanying drawings. It is apparent that the described embodiments are only some, but not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by a person skilled in the art without inventive effort fall within the protection scope of the present disclosure.
FIG. 1 shows a flow chart of the implementation of an audio simulation method according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes the following steps.
Step 101, obtaining first phoneme information corresponding to a first text, and encoding the first phoneme level information into language representation.
In this embodiment, according to a first text corresponding to the speech to be synthesized, first phoneme information corresponding to the first text is extracted, where the first phoneme information includes labels for spoken events. Preferably, a text encoder is used to encode the first phoneme information into a vector representation that is later linked to the acoustic information.
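By way of illustration, the following is a minimal PyTorch sketch of such a text encoder; the embedding-plus-Transformer architecture, the layer sizes, and all identifiers (PhonemeTextEncoder, num_phonemes, etc.) are assumptions, since the disclosure only states that the first phoneme information is encoded into a language representation.

```python
import torch
import torch.nn as nn

class PhonemeTextEncoder(nn.Module):
    """Encodes a sequence of phoneme IDs (with spoken-event labels merged in)
    into a per-phoneme language representation."""

    def __init__(self, num_phonemes: int, hidden: int = 256, layers: int = 4, heads: int = 2):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, hidden)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, seq_len) -> (batch, seq_len, hidden)
        return self.encoder(self.phoneme_emb(phoneme_ids))

# Example: a batch of 2 utterances, 10 phoneme IDs each.
encoder = PhonemeTextEncoder(num_phonemes=100)
language_repr = encoder(torch.randint(0, 100, (2, 10)))  # (2, 10, 256)
```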
Step 102, obtaining first text information corresponding to a first text, and extracting text feature representation in the first text information.
In this embodiment, according to the first text, the context information of the sentences in the first text is extracted, and the context information corresponding to each sentence is used as the first text information. Preferably, a BERT encoder is employed to extract sentence-level representations of the context in the first text, which are used to model the context information of a real dialogue and enhance the expressiveness of the synthesized speech.
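The disclosure does not fix a particular BERT checkpoint or pooling strategy; the sketch below assumes a HuggingFace bert-base-chinese model and uses the [CLS] hidden state as the sentence-level representation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical checkpoint; the description only says "a BERT encoder".
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")

def sentence_embedding(sentence: str) -> torch.Tensor:
    """Return a single sentence-level vector (the [CLS] token state)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]  # shape (1, 768)

context_vec = sentence_embedding("你好，今天过得怎么样？")
```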
And step 103, establishing a connection between the language representation of the first phoneme-level information code and the acoustic information based on the acoustic feature information and the text feature representation in the speech synthesis model.
In this embodiment, a variance adaptor is preferably used to establish the connection between the feature representation and acoustic information such as audio duration, and the sentence-level representation of the context is introduced to improve the expressiveness of the synthesized speech.
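A minimal sketch of the duration branch of such a variance adaptor is shown below (FastSpeech 2-style); the pitch and energy branches are omitted, and the convolutional layer sizes and the length_regulate helper are assumptions.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Predicts a log-duration (in mel frames) for every phoneme position."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden) -> (batch, seq_len) log-durations
        h = self.net(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h).squeeze(-1)

def length_regulate(x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme representation by its frame count so the sequence
    matches the mel-spectrogram length (single utterance shown here)."""
    return torch.repeat_interleave(x, durations, dim=1)

# Example: one utterance of 4 phonemes, 256-dimensional features.
phoneme_repr = torch.randn(1, 4, 256)
frames = torch.tensor([3, 5, 2, 4])
expanded = length_regulate(phoneme_repr, frames)   # (1, 14, 256)
```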
Step 104, performing audio output by predicting a mel frequency spectrum with a decoder from the language representation encoded from the first phoneme-level information to which the acoustic information has been added.
In this embodiment, the features combined with the acoustic information are sent to the decoder for decoding, generating speech that corresponds to the first text, contains real spoken events, and has higher naturalness and expressiveness.
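The sketch below illustrates one possible mel decoder of this kind; the Transformer-based architecture, the 80-bin mel setting, and the MelDecoder name are assumptions, and the vocoder mentioned in the final comment is only a typical follow-up step rather than something the disclosure specifies.

```python
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    """Maps frame-level features (language representation expanded to mel length
    and combined with acoustic information) to an 80-bin mel-spectrogram."""

    def __init__(self, hidden: int = 256, n_mels: int = 80, layers: int = 4, heads: int = 2):
        super().__init__()
        dec_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=layers)
        self.mel_proj = nn.Linear(hidden, n_mels)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, n_frames, hidden) -> (batch, n_frames, 80)
        return self.mel_proj(self.decoder(frame_features))

mel = MelDecoder()(torch.randn(1, 14, 256))   # predicted mel-spectrogram, (1, 14, 80)
# A neural vocoder (e.g. HiFi-GAN) would typically convert `mel` to a waveform.
```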
Fig. 2 is a schematic diagram showing the implementation flow of model training for an audio simulation method according to an embodiment of the present disclosure. As shown in Fig. 2, the model training flow for anthropomorphic dialogue synthesis according to an embodiment of the present disclosure includes the following steps:
Step 201, extracting phoneme information corresponding to the text.
In this embodiment, a text to be subjected to dialogue synthesis is selected, and phonemes corresponding to each word in the text are extracted.
Step 202, performing model training on the phoneme information corresponding to the text to obtain a feature vector.
In this embodiment, model training is performed on the phonemes extracted from the text; preferably, feature conversion is performed on the phonemes using an embedding table to obtain a phoneme feature vector representation corresponding to the text.
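A minimal sketch of such an embedding-table feature conversion is given below; the table sizes are assumptions, and the table is trained jointly with the rest of the model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the description only names an "embedding table" for phonemes.
NUM_PHONEMES, PHONEME_DIM = 100, 256
phoneme_table = nn.Embedding(NUM_PHONEMES, PHONEME_DIM)

phoneme_ids = torch.tensor([[12, 7, 33, 50]])      # one utterance, 4 phonemes
phoneme_vectors = phoneme_table(phoneme_ids)        # (1, 4, 256) phoneme feature vectors
```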
Step 203, extracting word-level feature vectors from the text.
In this embodiment, the text is segmented into words, and a language model is used to extract a feature vector for each word; preferably, BERT (a pre-trained language representation model) is used to extract the word vectors of the text, that is, the BERT word embeddings corresponding to the text.
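A minimal sketch of extracting per-word BERT vectors follows; the bert-base-chinese checkpoint and the sub-token averaging strategy are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")

def word_embeddings(words: list) -> torch.Tensor:
    """Return one vector per word by averaging the BERT states of its sub-tokens."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        states = bert(**enc).last_hidden_state[0]      # (n_subtokens, 768)
    vectors = []
    for i in range(len(words)):
        idx = [j for j, w in enumerate(enc.word_ids()) if w == i]
        vectors.append(states[idx].mean(dim=0))
    return torch.stack(vectors)                         # (n_words, 768)

word_vecs = word_embeddings(["今天", "天气", "不错"])
```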
Step 204, performing model training on the feature vectors of the phoneme information corresponding to the text and the word-level feature vectors in the text to obtain phoneme-level information containing the spoken labels.
In this embodiment, in order to label the spoken parts in the text phonemes, model training is performed on the word vectors corresponding to the text together with the phoneme feature vector representation to obtain phoneme information that includes the spoken-part labels. Preferably, the model is built from fully connected layers (FC) and long short-term memory networks (LSTM): specifically, three fully connected layers are applied first, then two LSTM layers, and finally one more fully connected layer.
In this embodiment, the spoken-language labeling includes labeling the liaison (continuous-reading) positions of words and labeling the pause positions; other spoken events in the data may also be labeled for modeling as required.
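The sketch below shows one way to realize the FC x3 -> LSTM x2 -> FC structure described above as a per-phoneme spoken-event tagger; the input dimensions, the bidirectional LSTM, and the three label classes (none / liaison / pause) are assumptions beyond what the description fixes.

```python
import torch
import torch.nn as nn

class SpokenLabelTagger(nn.Module):
    """Fuses word-level BERT vectors (aligned to phonemes) with phoneme vectors
    and predicts a spoken-event label per phoneme."""

    def __init__(self, in_dim: int = 256 + 768, hidden: int = 256, num_labels: int = 3):
        super().__init__()
        self.fc_stack = nn.Sequential(          # three fully connected layers
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2,   # two LSTM layers
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)         # final fully connected layer

    def forward(self, phoneme_vecs, word_vecs_per_phoneme):
        x = torch.cat([phoneme_vecs, word_vecs_per_phoneme], dim=-1)
        h, _ = self.lstm(self.fc_stack(x))
        return self.out(h)                      # (batch, seq_len, num_labels) label logits

tagger = SpokenLabelTagger()
logits = tagger(torch.randn(1, 4, 256), torch.randn(1, 4, 768))
```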
Step 205, performing feature vector conversion on the phoneme-level information containing the spoken labels.
In this embodiment, feature vector conversion is performed on the phoneme information containing the spoken labels; preferably, a text encoder is used to encode the phoneme information containing the spoken labels into a language representation.
Step 206, extracting sentence-level feature vectors of the context in the text.
In this embodiment, in order to simulate the context information of a real dialogue and improve the expressiveness of the synthesized speech, the context information needs to be extracted to reconstruct the real dialogue scene. Preferably, a BERT encoder is used to extract sentence-level representations of the context, which include: Next Sentence Embedding, Current Sentence Embedding, and Previous Sentence Embedding.
In this embodiment, the context semantic information may instead be extracted with any currently available model capable of expressing semantic information, such as GPT or BERT.
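A minimal sketch of assembling the Previous/Current/Next Sentence Embeddings for one utterance follows, reusing the sentence_embedding helper sketched earlier; the zero-vector padding at dialogue boundaries is an assumption.

```python
import torch

def dialogue_context_vector(sentences: list, i: int) -> torch.Tensor:
    """Concatenate the previous, current and next sentence embeddings of utterance i.
    `sentence_embedding` is the BERT helper sketched above; a zero vector stands in
    at the dialogue boundaries."""
    zero = torch.zeros(1, 768)
    prev_vec = sentence_embedding(sentences[i - 1]) if i > 0 else zero
    curr_vec = sentence_embedding(sentences[i])
    next_vec = sentence_embedding(sentences[i + 1]) if i + 1 < len(sentences) else zero
    return torch.cat([prev_vec, curr_vec, next_vec], dim=-1)   # (1, 3 * 768)
```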
Step 207, setting different person IDs corresponding to the speakers in the text.
In this embodiment, in order to distinguish different speakers in the text dialogue, person IDs matching the number of speakers in the text are set, and feature vector conversion is performed on these person IDs through a look-up table.
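A minimal sketch of the person-ID look-up table; the speaker list, embedding size, and identifiers are illustrative.

```python
import torch
import torch.nn as nn

# One embedding row per speaker appearing in the dialogue text (sizes are assumed).
speakers = ["speaker_a", "speaker_b"]                  # distinct speakers found in the text
speaker_table = nn.Embedding(num_embeddings=len(speakers), embedding_dim=256)

speaker_id = torch.tensor([speakers.index("speaker_b")])
speaker_vec = speaker_table(speaker_id)                # (1, 256), trained with the rest of the model
```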
Step 208, establishing a connection between the feature representation and the acoustic information.
In this embodiment, according to the language representation encoded by the text encoder, the context sentence representation, and the converted person-ID feature vectors, a connection is established with the acoustic information, where the acoustic information includes audio duration and audio features and is collected from the audio corresponding to the text to complete the model training. Preferably, a variance adaptor is used to establish the connection between the feature representation and the acoustic information, and a decoding operation is performed on the resulting features to complete the model training.
In this embodiment, the basic acoustic model used for speech synthesis may be replaced by another acoustic model such as Tacotron 2 or FastPitch.
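To tie the pieces together, the sketch below shows one possible training step that fuses the language representation, the context sentence representation, and the person-ID embedding, passes the result through the variance adaptor and decoder sketched earlier, and computes mel and duration losses; the additive fusion, the projection sizes, and the loss weighting are assumptions rather than details fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

# Reuses DurationPredictor, length_regulate and MelDecoder defined in the earlier sketches.
language = torch.randn(1, 4, 256)            # per-phoneme language representation
context = torch.randn(1, 1, 256)             # projected prev/current/next sentence vector
speaker = torch.randn(1, 1, 256)             # person-ID embedding

fused = language + context + speaker         # broadcast utterance-level info to every phoneme

target_durations = torch.tensor([[3, 5, 2, 4]])   # frames per phoneme, from forced alignment
target_mel = torch.randn(1, 14, 80)               # ground-truth mel-spectrogram

duration_predictor = DurationPredictor()
mel_decoder = MelDecoder()

log_dur = duration_predictor(fused)
frames = length_regulate(fused, target_durations[0])   # ground-truth durations while training
mel_pred = mel_decoder(frames)

loss = F.l1_loss(mel_pred, target_mel) + F.mse_loss(log_dur, target_durations.float().log())
loss.backward()
```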
Fig. 3 is a schematic diagram showing the constitution of an audio simulation device according to an embodiment of the present disclosure, and as shown in fig. 3, an audio simulation device according to an embodiment of the present disclosure includes the following units.
The phoneme encoding unit 301 is configured to obtain first phoneme information corresponding to a first text, and encode the first phoneme level information into a language representation.
A text information encoding unit 302, configured to obtain first text information corresponding to a first text, and extract text feature representations in the first text information;
the text information encoding unit 302 is further configured to obtain context information corresponding to each sentence in the first text, and use the context information corresponding to each sentence as the first text information.
An acoustic synthesis unit 303 for establishing a connection between the language representation of the first phoneme level information encoding and the acoustic information based on the acoustic feature information and the text feature representation in the speech synthesis model;
the acoustic synthesis unit 303 is further configured to invoke acoustic information in the speech synthesis model according to the text feature representation extracted from the context information corresponding to each sentence in the first text, the language representation encoded by the first phoneme level information, and the feature representation of the corresponding number of user identity information, and establish a relationship between the language representation encoded by the first phoneme level information and the acoustic information.
An audio output unit 304, configured to perform audio output by predicting a mel frequency spectrum with the decoder from the language representation encoded from the first phoneme-level information to which the acoustic information has been added.
A phoneme labeling unit 305, configured to extract a phoneme feature vector corresponding to the first text and a word-level feature vector corresponding to the first text;
performing first model training on the phoneme-level feature vector and the word-level feature vector to generate first phoneme information with a spoken language information label corresponding to a first text;
a user identity information encoding unit 306, configured to set a corresponding number of user identity information according to the number of dialogues in the first text;
and performing feature conversion on the corresponding quantity of user identity information to generate feature representation of the corresponding quantity of user identity information.
In an exemplary embodiment, the phoneme encoding unit 301, the text information encoding unit 302, the acoustic synthesis unit 303, the audio output unit 304, the phoneme labeling unit 305, and the user identity information encoding unit 306 may be implemented by one or more Central Processing Units (CPUs), Graphics Processing Units (GPUs), Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components.
The specific manner in which the various modules and units perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be described again here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.
Fig. 4 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as an audio simulation method. For example, in some embodiments, the audio simulation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the audio simulation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the audio simulation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing describes only specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto; any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope of the disclosure shall fall within the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method of audio simulation, the method comprising:
acquiring first phoneme information corresponding to a first text, and encoding the first phoneme-level information into a language representation;
acquiring first text information corresponding to the first text, and extracting a text feature representation from the first text information;
establishing, based on the acoustic feature information in the speech synthesis model and the text feature representation, a connection between the language representation encoded from the first phoneme-level information and the acoustic information, and performing audio output by predicting a mel frequency spectrum with a decoder from the language representation to which the acoustic information has been added.
2. The method according to claim 1, wherein the method further comprises:
extracting a phoneme feature vector corresponding to the first text and a word-level feature vector corresponding to the first text;
and performing first model training on the phoneme-level feature vector and the word-level feature vector to generate the first phoneme information with the spoken language information label corresponding to the first text.
3. The method according to claim 1, wherein the method further comprises:
setting corresponding quantity of user identity information according to the quantity of the dialogues in the first text;
and performing feature conversion on the corresponding quantity of user identity information to generate feature representation of the corresponding quantity of user identity information.
4. The method of claim 3, wherein the obtaining the first text information corresponding to the first text comprises:
and respectively acquiring the context information corresponding to each sentence in the first text, and taking the context information corresponding to each sentence as the first text information.
5. The method of claim 4, wherein establishing the connection between the language representation encoded from the first phoneme-level information and the acoustic information comprises:
calling acoustic information in a speech synthesis model according to the text feature representation extracted from the context information corresponding to each sentence in the first text, the language representation encoded from the first phoneme-level information, and the feature representations of the corresponding number of user identity information, and establishing a connection between the language representation encoded from the first phoneme-level information and the acoustic information.
6. An audio simulation apparatus, the apparatus comprising:
the phoneme coding unit is used for obtaining first phoneme information corresponding to the first text and coding the first phoneme level information into language representation;
the text information coding unit is used for acquiring first text information corresponding to a first text and extracting text characteristic representations in the first text information;
an acoustic synthesis unit for establishing a connection between the language representation of the first phoneme level information encoding and the acoustic information based on the acoustic feature information and the text feature representation in the speech synthesis model;
an audio output unit, configured to perform audio output by predicting a mel frequency spectrum with the decoder from the language representation encoded from the first phoneme-level information to which the acoustic information has been added.
7. The apparatus of claim 6, wherein the apparatus further comprises:
the phoneme labeling unit is used for extracting a phoneme feature vector corresponding to the first text and a word-level feature vector corresponding to the first text;
performing first model training on the phoneme-level feature vector and the word-level feature vector to generate first phoneme information with a spoken language information label corresponding to a first text;
the user identity information coding unit is used for setting corresponding quantity of user identity information according to the quantity of the dialogues in the first text;
and performing feature conversion on the corresponding quantity of user identity information to generate feature representation of the corresponding quantity of user identity information.
8. The apparatus of claim 6, wherein
the text information coding unit is further used for respectively acquiring context information corresponding to each sentence in the first text, and taking the context information corresponding to each sentence as the first text information;
the acoustic synthesis unit is further configured to invoke acoustic information in the speech synthesis model according to the text feature representation extracted from the context information corresponding to each sentence in the first text, the language representation encoded by the first phoneme level information, and the feature representation of the corresponding number of user identity information, and establish a relationship between the language representation encoded by the first phoneme level information and the acoustic information.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio simulation method of any one of claims 1 to 5.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the audio simulation method of any one of claims 1 to 5.
CN202211704212.9A 2022-12-29 2022-12-29 Audio simulation method, device, equipment and storage medium Pending CN116013251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211704212.9A CN116013251A (en) 2022-12-29 2022-12-29 Audio simulation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116013251A true CN116013251A (en) 2023-04-25

Family

ID=86035729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211704212.9A Pending CN116013251A (en) 2022-12-29 2022-12-29 Audio simulation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116013251A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination