CN113053373A

CN113053373A - Intelligent vehicle-mounted voice interaction system supporting voice cloning

Info

Publication number: CN113053373A
Application number: CN202110216036.3A
Authority: CN
Inventors: 孙琪
Original assignee: Shanghai Shengtong Information Technology Co ltd
Current assignee: Shanghai Shengtong Information Technology Co ltd
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2021-06-29

Abstract

An embodiment of the invention provides an intelligent vehicle-mounted voice interaction system supporting voice cloning, which improves the quality and service experience of vehicle-mounted voice interaction. The system comprises a corpus collection module, a text feature extraction module, a voice feature extraction module, an instruction receiving module, an instruction analysis module, an instruction execution module, a text response module, a cloning synthesis module, a voice output module and a basic support module. Compared with the prior art, the system not only carries out real-time voice interaction with a user, but can also respond in a specific voice and with specific habitual phrasing customized to the user's requirements. The system can therefore quickly and conveniently convert a user's voice instruction into an actual driving operation, helping to ensure driving safety. More importantly, it can provide intelligent, personalized and humanized interaction with "a thousand voices for a thousand users" (a distinct voice for each user), giving the vehicle's responses emotional color, greatly improving the driving experience and making the journey warmer and more comfortable.

Description

Intelligent vehicle-mounted voice interaction system supporting voice cloning

Technical Field

The invention relates to the technical field of voice interaction, in particular to an intelligent vehicle-mounted voice interaction system supporting voice cloning.

Background

In recent years, with the rapid growth of China's economy and the rise in living standards, automobiles have become an essential means of daily transportation and play a major role in scenarios such as commuting, holiday travel and cargo transport. A vehicle-mounted intelligent interaction system can provide convenient and fast driving assistance, greatly improve the experience of drivers and passengers, and turn the automobile from a cold means of conveyance into a humanized, emotional companion. Voice interaction offers fast input, simple operation and greater safety, so it is naturally suited to the in-vehicle environment and can provide services such as application queries, intelligent navigation, music playback and execution of driving operations.

At present, the functions of vehicle-mounted voice interaction systems are very limited. Some vehicle models can perform simple operations through an attached voice control terminal, but such systems have poor speech recognition, simple functions, insufficient stability and a mechanical interaction process, and cannot meet the growing demand for intelligent, humanized and personalized interaction.

Voice cloning technology can extract the voice characteristics and phrasing logic of a specific speaker and imitate that speaker's unique voice and habitual phrasing. Applying voice cloning to a vehicle-mounted voice interaction system makes it possible to provide customized "a thousand voices for a thousand users" service according to user preference, to interact emotionally with the user while intelligently interpreting and reliably executing user instructions, to improve the driving experience, and to strengthen the bond between the user and the vehicle.

Disclosure of Invention

In order to solve the above problems, the embodiment of the present invention provides an intelligent vehicle-mounted voice interaction system supporting voice cloning, so as to improve the quality of vehicle-mounted voice interaction and service experience.

In order to achieve the above purpose, the embodiment of the invention provides the following technical scheme:

An intelligent vehicle-mounted voice interaction system supporting voice cloning comprises a corpus collection module, a text feature extraction module, a voice feature extraction module, an instruction receiving module, an instruction analysis module, an instruction execution module, a text response module, a cloning synthesis module, a voice output module and a basic support module.

Corpus collection module: collects, through an external voice receiver, the original target corpus that the user wants to clone; preprocesses it (noise reduction, filtering, volume equalization and the like); and inputs the preprocessed target corpus into the text feature extraction module and the voice feature extraction module.

Text feature extraction module: receives the target corpus input by the corpus collection module and performs speech recognition on it to obtain its text information. It then converts the text information into text feature vectors, forms a text feature vector space, and stores it.

Voice feature extraction module: receives the target corpus input by the corpus collection module and extracts its acoustic features (linear predictive coding features, Mel-frequency cepstral coefficients, glottal waves and the like), prosodic features (intonation, time-domain distribution, accents and the like), energy features (short-time energy, short-time average amplitude and the like) and timbre features (pitch period, pitch frequency, formants and the like), then forms and stores a voice feature vector space.

Instruction receiving module: receives, through an external voice receiver, the original voice instruction issued by the user while driving; performs preprocessing such as user identity verification, user permission determination and separation of environmental sound; and inputs the preprocessed voice instruction into the instruction analysis module.

Instruction analysis module: receives the voice instruction input by the instruction receiving module, intelligently analyzes the user's intention to obtain a corresponding instruction processing result, then activates the instruction execution module and/or the text response module and inputs the instruction processing result into it.

Instruction execution module: connected to the automobile's control ports. After being activated by the instruction analysis module, it receives the instruction processing result input by that module and sends an execution command to the corresponding control port according to the content of the result.

Text response module: after being activated by the instruction analysis module, it receives the instruction processing result input by that module, calls the text feature vector space stored by the text feature extraction module, intelligently generates a response text whose wording and phrasing resemble those of the clone target, and inputs the response text into the cloning synthesis module.

Cloning synthesis module: receives the response text input by the text response module, calls the voice feature vector space stored by the voice feature extraction module, trains a speech synthesis model (Merlin, WaveNet, Tacotron, ClariNet or another speech synthesis model) according to the parameters of the voice feature vector space, generates a spectrogram of speech resembling the clone target's voice, and inputs the spectrogram into the voice output module.

Voice output module: receives the spectrogram input by the cloning synthesis module, decodes it with a vocoder (WaveRNN, a Griffin-Lim vocoder or the like) to generate a speech signal, and gives the voice response through an external voice player, thereby achieving intelligent voice interaction with the user.

Basic support module: supports the basic functions required by the intelligent vehicle-mounted voice interaction system supporting voice cloning provided by the invention, such as deletion, selection, memory cleanup, version update, self-check and error reporting.

Compared with the prior art, the invention has the following technical effects and advantages: the intelligent vehicle-mounted voice interaction system supporting voice cloning not only carries out real-time voice interaction with a user, but can also respond in a specific voice and with specific habitual phrasing customized to the user's requirements. The system can therefore quickly and conveniently convert a user's voice instruction into an actual driving operation, helping to ensure driving safety. More importantly, it can provide intelligent, personalized and humanized interaction with "a thousand voices for a thousand users" (a distinct voice for each user), giving the vehicle's responses emotional color, greatly improving the driving experience and making the journey warmer and more comfortable.

Drawings

Fig. 1 is a schematic flow chart of an intelligent vehicle-mounted voice interaction system supporting voice cloning in a specific application scenario according to an embodiment of the present invention.

Detailed Description

For the convenience of understanding and implementing the embodiment of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of some, and not necessarily all, embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

To build a vehicle-mounted voice interaction system that provides customized "a thousand voices for a thousand users" service according to user preference and interacts emotionally with the user while intelligently interpreting and reliably executing user instructions, thereby improving the driving experience, the invention provides Embodiment 1 of the intelligent vehicle-mounted voice interaction system supporting voice cloning. Fig. 1 is a schematic flow chart of Embodiment 1 implementing intelligent voice interaction according to the invention. As shown in Fig. 1, the following modules and steps may be included:

The invention provides an intelligent vehicle-mounted voice interaction system supporting voice cloning, which comprises a corpus collection module, a text feature extraction module, a voice feature extraction module, an instruction receiving module, an instruction analysis module, an instruction execution module, a text response module, a cloning synthesis module, a voice output module and a basic support module.

Corpus collection module: the original target corpus that the user wishes to clone is collected, in the vehicle or in another environment, by an external voice receiver (such as an on-board microphone array, which is outside the scope of the invention). To ensure that the original target corpus is usable, it should be recorded in a relatively quiet environment, and about 10 to 50 different utterances of the clone target should be recorded. After recording is finished, the corpus collection module automatically preprocesses the original target corpus (noise reduction, filtering, volume equalization and the like) and inputs the preprocessed target corpus into the text feature extraction module and the voice feature extraction module.
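The preprocessing chain described for the corpus collection module can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names, the three-sample moving average used as a stand-in for noise reduction and filtering, and the target RMS level of 0.1 are all assumptions introduced here.

```python
import math

def remove_dc(samples):
    """Subtract the mean so the signal is centred on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

def moving_average(samples, width=3):
    """Crude low-pass filter: average each sample with its neighbours."""
    half = width // 2
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def equalize_volume(samples, target_rms=0.1):
    """Scale the signal so its root-mean-square level equals target_rms."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / rms
    return [s * gain for s in samples]

def preprocess(samples):
    """Hypothetical chain: DC removal, smoothing, then volume equalization."""
    return equalize_volume(moving_average(remove_dc(samples)))
```

Equalizing every recording to the same RMS level means that utterances captured at different distances from the microphone contribute comparably to later feature extraction.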

Text feature extraction module: receives the target corpus input by the corpus collection module and performs speech recognition on it to obtain its text information. It then converts the text information into text feature vectors, and forms and stores a text feature vector space.
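The patent does not say how text is converted into feature vectors. As one possible reading, the sketch below uses a plain bag-of-words count, assuming speech recognition has already produced transcripts; the function names and the sample transcripts are hypothetical.

```python
from collections import Counter

def build_vocabulary(transcripts):
    """Map each distinct word in the transcripts to a column index."""
    words = sorted({w for t in transcripts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def text_feature_vector(transcript, vocabulary):
    """Count how often each vocabulary word occurs in one transcript."""
    counts = Counter(transcript.lower().split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector

# Hypothetical transcripts recognized from the clone target's recordings.
transcripts = ["all set for you", "all done is that ok for you"]
vocabulary = build_vocabulary(transcripts)
feature_space = [text_feature_vector(t, vocabulary) for t in transcripts]
```

A real system would more likely use learned embeddings, but word counts are enough to show what a stored "text feature vector space" could look like.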

Voice feature extraction module: receives the target corpus input by the corpus collection module and extracts its acoustic features (linear predictive coding features, Mel-frequency cepstral coefficients, glottal waves and the like), prosodic features (intonation, time-domain distribution, accents and the like), energy features (short-time energy, short-time average amplitude and the like) and timbre features (pitch period, pitch frequency, formants and the like), then forms and stores a voice feature vector space.
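Three of the listed features (short-time energy, short-time average amplitude and pitch period) are simple enough to sketch directly. The code below is an illustration under stated assumptions: the framing parameters and the autocorrelation-based pitch estimate are choices made here, not details given in the patent.

```python
import math

def frames(samples, frame_len, hop):
    """Split a signal into overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)

def short_time_average_amplitude(frame):
    """Mean absolute sample value in one frame."""
    return sum(abs(s) for s in frame) / len(frame)

def pitch_period(frame, min_lag, max_lag):
    """Return the lag, in samples, with the highest autocorrelation."""
    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
    return max(range(min_lag, max_lag + 1), key=autocorr)

# Usage: a 200 Hz tone sampled at 8000 Hz has a period of 40 samples.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
period = pitch_period(tone, min_lag=20, max_lag=100)
pitch_frequency = sample_rate / period
```

The pitch frequency is the sample rate divided by the pitch period, which is why the patent lists the two together as timbre features.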

Instruction receiving module: receives, through an external voice receiver, the original voice instruction issued by the user while driving; performs preprocessing such as user identity verification, user permission determination and separation of environmental sound; and inputs the preprocessed voice instruction into the instruction analysis module. For example, if an unauthorized user asks to open the window, the instruction is ignored.
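The gating behaviour in the window example can be sketched as a simple check performed before anything reaches the instruction analysis module. Speaker identification itself is assumed to have happened upstream; the speaker IDs and the function name are hypothetical.

```python
# Hypothetical set of enrolled speakers who are allowed to issue instructions.
AUTHORIZED_SPEAKERS = {"driver_01"}

def receive_instruction(speaker_id, audio):
    """Forward the audio only when the speaker is authorized; otherwise drop it."""
    if speaker_id not in AUTHORIZED_SPEAKERS:
        return None  # unauthorized: the instruction is ignored
    return {"speaker_id": speaker_id, "audio": audio}
```

Returning None for an unauthorized speaker means no downstream module is ever activated, which matches the "instruction is ignored" behaviour in the example.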

Instruction analysis module: receives the voice instruction input by the instruction receiving module, intelligently analyzes the user's intention to obtain a corresponding instruction processing result, then activates the instruction execution module and/or the text response module and inputs the instruction processing result into it. For example, if an authorized user asks to open the window, the module activates the instruction execution module and sends it a window-opening instruction; at the same time it activates the text response module and inputs the processing result "window opening requested" into it.
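The "and/or" routing can be made concrete with a small rule table. The patent leaves the intent analysis method open, so keyword matching is used here purely as a placeholder; the transcript is assumed to come from speech recognition, and the rule table, intent names and ProcessingResult type are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProcessingResult:
    intent: str
    activate_execution: bool  # route to the instruction execution module
    activate_response: bool   # route to the text response module

# Hypothetical rules: (keywords that must all appear, intent, needs execution).
RULES = [
    (("open", "window"), "open_window", True),
    (("close", "window"), "close_window", True),
    (("what", "time"), "tell_time", False),
]

def analyze(transcript):
    """Match the transcript against the rules and decide which modules to activate."""
    words = set(transcript.lower().split())
    for keywords, intent, needs_execution in RULES:
        if all(k in words for k in keywords):
            return ProcessingResult(intent, needs_execution, True)
    return ProcessingResult("unknown", False, True)
```

Under these rules a window request activates both downstream modules, while a pure question such as asking the time activates only the text response module.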

Instruction execution module: connected to the automobile's other control ports. After being activated by the instruction analysis module, it receives the instruction processing result input by that module and sends an execution command to the corresponding control port according to the content of the result. For example, if the processing result indicates that the window is to be opened, it connects to the window control module and automatically lowers the window.
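Sending a command to "the corresponding control port" amounts to a dispatch table. The sketch below assumes an in-memory stand-in for a window port; the port class, the intent names and the command strings LOWER and RAISE are invented for illustration and do not correspond to any real vehicle bus interface.

```python
class WindowPort:
    """Stand-in for a window control port; records commands instead of moving glass."""
    def __init__(self):
        self.log = []

    def send(self, command):
        self.log.append(command)

def execute(intent, ports, dispatch):
    """Send the command mapped to the intent; return False if no port handles it."""
    if intent not in dispatch:
        return False
    port_name, command = dispatch[intent]
    ports[port_name].send(command)
    return True

ports = {"window": WindowPort()}
# Hypothetical mapping from intent to (port name, command).
dispatch = {
    "open_window": ("window", "LOWER"),
    "close_window": ("window", "RAISE"),
}
```

Keeping the mapping in data rather than in branching code makes it straightforward to add further control ports without touching the execute function.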

Text response module: after being activated by the instruction analysis module, it receives the instruction processing result input by that module, calls the text feature vector space stored by the text feature extraction module, intelligently generates a response text whose wording and phrasing resemble those of the clone target, and inputs the response text into the cloning synthesis module. For example, for the processing result "window opening requested", after the user's intention has been analyzed, the module generates the response text "The window has been opened for you. Is this height suitable?"
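One simple way to make a reply resemble the clone target's phrasing is to score candidate replies against the target's word usage and pick the best match. This is only a sketch: the patent does not specify the generation method, and the candidate replies, the word-count table standing in for the stored text features, and the scoring rule are all assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def style_score(candidate, target_word_counts):
    """Sum the clone target's usage counts over the distinct words in the candidate."""
    return sum(target_word_counts.get(w, 0) for w in set(tokenize(candidate)))

def respond(intent, candidates, target_word_counts):
    """Pick the reply for this intent that best matches the target's wording."""
    options = candidates.get(intent, ["Sorry, I did not catch that."])
    return max(options, key=lambda c: style_score(c, target_word_counts))

# Hypothetical candidate replies and word counts from the clone target's corpus.
candidates = {
    "open_window": [
        "Window opened.",
        "The window is open for you, is this height ok?",
    ],
}
target_word_counts = {"for": 3, "you": 4, "ok": 2, "is": 1}
```

With these counts the longer, more conversational reply wins because it reuses words the clone target says often.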

Cloning synthesis module: receives the response text input by the text response module, calls the voice feature vector space stored by the voice feature extraction module, trains a speech synthesis model (Merlin, WaveNet, Tacotron, ClariNet or another speech synthesis model) according to the parameters of the voice feature vector space, generates a spectrogram of speech resembling the clone target's voice, and inputs the spectrogram into the voice output module.

Voice output module: receives the spectrogram input by the cloning synthesis module, decodes it with a vocoder (WaveRNN, a Griffin-Lim vocoder or the like) to generate a speech signal, and gives the voice response through an external voice player, thereby achieving intelligent voice interaction with the user. After the response "The window has been opened for you. Is this height suitable?" has been given, if the user replies further, processing continues again from the instruction receiving module.

Basic support module: supports the basic functions required by the intelligent vehicle-mounted voice interaction system supporting voice cloning provided by the embodiment of the invention, such as deletion, selection, memory cleanup, version update, self-check and error reporting.
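The connections between the modules described above can be summarized as a small directed graph. The sketch below encodes only the data flow stated in the description; the identifier names are chosen here, and the basic support module is left out because the description gives it no data input or output.

```python
# Each module maps to the modules it feeds, following the description above.
DATA_FLOW = {
    "corpus_collection": ["text_feature_extraction", "voice_feature_extraction"],
    "text_feature_extraction": ["text_response"],
    "voice_feature_extraction": ["cloning_synthesis"],
    "instruction_receiving": ["instruction_analysis"],
    "instruction_analysis": ["instruction_execution", "text_response"],
    "instruction_execution": [],
    "text_response": ["cloning_synthesis"],
    "cloning_synthesis": ["voice_output"],
    "voice_output": [],
}

def downstream(module, graph=DATA_FLOW):
    """Return every module reachable from the given one, in breadth-first order."""
    seen, queue = [], list(graph[module])
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(graph[current])
    return seen
```

Two paths meet at the cloning synthesis module: the enrollment path that starts at corpus collection and the runtime path that starts at instruction receiving. Only the runtime path reaches the instruction execution module.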

The embodiments described above illustrate only several embodiments of the present application. Although they are described specifically and in detail, they are not to be construed as limiting the scope of the application. It should be noted that the various embodiments of the invention can be combined freely, and any such combination should be regarded as part of the disclosure of the invention as long as it does not depart from the idea of the invention.

Claims

1. An intelligent vehicle-mounted voice interaction system supporting voice cloning, used to improve the quality and service experience of vehicle-mounted voice interaction.

2. The intelligent vehicle-mounted voice interaction system supporting voice cloning as claimed in claim 1, comprising a corpus collection module, a text feature extraction module, a voice feature extraction module, an instruction receiving module, an instruction analysis module, an instruction execution module, a text response module, a cloning synthesis module, a voice output module, and a basic support module.