CN113053373A - Intelligent vehicle-mounted voice interaction system supporting voice cloning

Info

Publication number
CN113053373A
Authority
CN
China
Prior art keywords
voice
module
cloning
user
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110216036.3A
Other languages
Chinese (zh)
Inventor
孙琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shengtong Information Technology Co ltd
Original Assignee
Shanghai Shengtong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shengtong Information Technology Co ltd
Priority to CN202110216036.3A
Publication of CN113053373A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)

Abstract

An embodiment of the invention provides an intelligent vehicle-mounted voice interaction system supporting voice cloning, which improves the quality and service experience of vehicle-mounted voice interaction. The system comprises a corpus collection module, a text feature extraction module, a voice feature extraction module, an instruction receiving module, an instruction analysis module, an instruction execution module, a text response module, a cloning synthesis module, a voice output module and a basic support module. Compared with the prior art, the system not only carries out real-time voice interaction with the user, but can also respond in a specific voice and with specific habitual expressions customized to the user's requirements. The system can therefore convert a user's voice instruction into an actual driving operation quickly and conveniently, which helps keep the user safe while driving. More importantly, it can provide intelligent, personalized and humanized interaction with "a thousand voices for a thousand people" (a distinct voice for each user), lending the vehicle's responses emotional color, improving the driving experience, and making the journey warmer and more comfortable.

Description

Intelligent vehicle-mounted voice interaction system supporting voice cloning
Technical Field
The invention relates to the technical field of voice interaction, in particular to an intelligent vehicle-mounted voice interaction system supporting voice cloning.
Background
In recent years, with the rapid development of China's economy and the improvement of people's quality of life, automobiles have become an essential means of transportation in daily life and play a major role in scenarios such as daily commuting, holiday travel and cargo transportation. A vehicle-mounted intelligent interaction system can provide convenient and fast driving assistance services, greatly improving the experience of drivers and passengers and upgrading the automobile from a cold means of conveyance into a humanized emotional companion. Voice interaction offers fast input, simple operation and good safety. It is an interaction mode naturally suited to the in-vehicle environment, and can provide services such as application queries, intelligent navigation, music playback and the execution of driving operations.
At present, the functions of vehicle-mounted voice interaction systems are very limited. Some vehicle models can perform a few simple operations through an attached voice control terminal, but such systems suffer from poor voice recognition, simple functions, insufficient stability and a mechanical interaction process, and cannot meet the growing demand for intelligent, humanized and personalized interaction.
Voice cloning technology can extract the voice features and logic features of a specific speaker and imitate that speaker's distinctive voice and habitual expressions. Applying voice cloning to a vehicle-mounted voice interaction system makes it possible to provide customized "a thousand voices for a thousand people" services according to user preferences, to interact emotionally with the user while intelligently interpreting and stably executing user instructions, to improve the driving experience, and to strengthen the bond between the user and the vehicle.
Disclosure of Invention
To solve the above problems, an embodiment of the present invention provides an intelligent vehicle-mounted voice interaction system supporting voice cloning, so as to improve the quality and service experience of vehicle-mounted voice interaction.
To achieve this purpose, the embodiment of the invention provides the following technical solution:
An intelligent vehicle-mounted voice interaction system supporting voice cloning comprises a corpus collection module, a text feature extraction module, a voice feature extraction module, an instruction receiving module, an instruction analysis module, an instruction execution module, a text response module, a cloning synthesis module, a voice output module and a basic support module.
Corpus collection module: collects, through an external voice receiver, the original target corpus that the user wants to clone; preprocesses it with steps such as noise reduction, filtering and volume equalization; and inputs the preprocessed target corpus into the text feature extraction module and the voice feature extraction module.
Text feature extraction module: receives the target corpus from the corpus collection module and performs voice recognition on it to obtain its text information. The text information is converted into text feature vectors, which form a text feature vector space that is then stored.
Voice feature extraction module: receives the target corpus from the corpus collection module and extracts its acoustic features (such as linear predictive coding features, Mel-frequency cepstral coefficients and glottal waves), prosodic features (such as intonation, time-domain distribution and stress), energy features (such as short-time energy and short-time average amplitude) and timbre features (such as pitch period, pitch frequency and formants), forming a voice feature vector space that is then stored.
Instruction receiving module: receives, through an external voice receiver, the original voice instruction issued by the user while driving; performs preprocessing such as user identity verification, user permission determination and separation of environmental sound; and inputs the preprocessed voice instruction into the instruction analysis module.
Instruction analysis module: receives the voice instruction from the instruction receiving module, intelligently analyzes the user's intention to obtain the corresponding instruction processing result, and activates the instruction execution module and/or the text response module, passing the result to whichever modules it activates.
Instruction execution module: connected to the automobile's control ports. Once activated by the instruction analysis module, it receives the instruction processing result and sends an execution command to the corresponding control port according to the content of the result.
Text response module: once activated by the instruction analysis module, it receives the instruction processing result, retrieves the text feature vector space stored by the text feature extraction module, intelligently generates a response text whose wording and phrasing resemble those of the cloning target, and inputs the response text into the cloning synthesis module.
Cloning synthesis module: receives the response text from the text response module, retrieves the voice feature vector space stored by the voice feature extraction module, trains a speech synthesis model (such as Merlin, WaveNet, Tacotron or ClariNet) on the parameters of that space, generates a spectrogram that resembles the voice of the cloning target, and inputs the spectrogram into the voice output module.
Voice output module: receives the spectrogram from the cloning synthesis module, decodes it with a vocoder (such as WaveRNN or Griffin-Lim) to generate a voice signal, and delivers the voice response through an external voice player, completing the intelligent voice interaction with the user.
Basic support module: supports the basic functions that the system requires, such as deletion, selection, memory cleanup, version updates, self-checking and error reporting.
Compared with the prior art, the invention has the following technical effects and advantages: the intelligent vehicle-mounted voice interaction system supporting voice cloning not only carries out real-time voice interaction with the user, but can also respond in a specific voice and with specific habitual expressions customized to the user's requirements. The system can therefore convert a user's voice instruction into an actual driving operation quickly and conveniently, which helps keep the user safe while driving. More importantly, it can provide intelligent, personalized and humanized interaction with "a thousand voices for a thousand people", lending the vehicle's responses emotional color, improving the driving experience, and making the journey warmer and more comfortable.
Drawings
Fig. 1 is a schematic flow chart of an intelligent vehicle-mounted voice interaction system supporting voice cloning in a specific application scenario according to an embodiment of the present invention.
Detailed Description
For the convenience of understanding and implementing the embodiment of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of some, and not necessarily all, embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
To build a vehicle-mounted voice interaction system that provides customized "a thousand voices for a thousand people" services according to user preferences, interacts emotionally with the user while intelligently interpreting and stably executing user instructions, and thereby improves the driving experience, the invention provides Embodiment 1 of the intelligent vehicle-mounted voice interaction system supporting voice cloning. Fig. 1 is a schematic flow chart of intelligent voice interaction in Embodiment 1. As shown in Fig. 1, the following modules and steps may be included:
The system comprises a corpus collection module, a text feature extraction module, a voice feature extraction module, an instruction receiving module, an instruction analysis module, an instruction execution module, a text response module, a cloning synthesis module, a voice output module and a basic support module.
Corpus collection module: the original target corpus that the user wishes to clone is collected, in the vehicle or in another environment, through a peripheral voice receiver (such as an on-board microphone array; the receiver itself is outside the scope of the present invention). To ensure that the original target corpus is usable, it should be recorded in a relatively quiet environment, and about 10-50 different utterances of the cloning target should be recorded. After recording, the corpus collection module automatically performs preprocessing such as noise reduction, filtering and volume equalization on the original target corpus, and inputs the preprocessed target corpus into the text feature extraction module and the voice feature extraction module.
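As an illustration only, the filtering and volume equalization steps could be sketched as follows. The patent does not specify any algorithm, so the centered moving-average filter, the RMS target of 0.1 and the function names `moving_average`, `equalize_volume` and `preprocess` are all assumptions made for this sketch; samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def moving_average(samples, window=3):
    """Smooth a signal with a centered moving average (a crude low-pass filter)."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out


def equalize_volume(samples, target_rms=0.1):
    """Scale a recording so that its RMS level equals target_rms."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / rms
    return [s * gain for s in samples]


def preprocess(samples):
    """Filter first, then equalize, so every recording ends at the same level."""
    return equalize_volume(moving_average(samples))
```

Equalizing after filtering means that recordings made at different microphone distances reach the feature extraction modules at a comparable level.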
Text feature extraction module: receives the target corpus from the corpus collection module and performs voice recognition on it to obtain its text information. The text information is converted into text feature vectors, which form a text feature vector space that is then stored.
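The patent does not say how text is turned into feature vectors. One simple possibility, shown here purely as an assumption, is a bag-of-words representation in which each transcript becomes a vector of word counts. The transcripts, the whitespace tokenization (which would not suit Chinese text without a word segmenter) and the names `build_vocabulary` and `text_to_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(transcripts):
    """Collect every distinct word in the transcripts, in sorted order."""
    words = set()
    for text in transcripts:
        words.update(text.lower().split())
    return sorted(words)


def text_to_vector(text, vocabulary):
    """Map one transcript to a term-frequency vector over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical transcripts produced by the voice recognition step.
transcripts = ["open the window please", "please close the door"]
vocabulary = build_vocabulary(transcripts)
feature_space = [text_to_vector(t, vocabulary) for t in transcripts]
```

Here `feature_space` plays the role of the stored text feature vector space: one row per recorded utterance, one column per vocabulary word.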
Voice feature extraction module: receives the target corpus from the corpus collection module and extracts its acoustic features (such as linear predictive coding features, Mel-frequency cepstral coefficients and glottal waves), prosodic features (such as intonation, time-domain distribution and stress), energy features (such as short-time energy and short-time average amplitude) and timbre features (such as pitch period, pitch frequency and formants), forming and storing a voice feature vector space.
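Three of the features named above have compact textbook definitions, sketched below under stated assumptions: short-time energy is the sum of squared samples in a frame, short-time average amplitude is the mean absolute sample value, and the pitch period is estimated as the lag that maximizes the frame's autocorrelation. The autocorrelation estimator is a common choice, not one the patent names, and the frame sizes, lag bounds and function names are hypothetical.

```python
import math


def frames(samples, frame_len, hop):
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)


def short_time_average_amplitude(frame):
    """Mean absolute sample value in one frame."""
    return sum(abs(s) for s in frame) / len(frame)


def pitch_period(frame, min_lag, max_lag):
    """Estimate the pitch period (in samples) as the lag with maximum autocorrelation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# A synthetic 400 Hz tone sampled at 8000 Hz repeats every 20 samples.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 400 * n / sample_rate) for n in range(200)]
period = pitch_period(tone, min_lag=10, max_lag=30)
pitch_hz = sample_rate / period  # pitch frequency is the reciprocal of the period
```

The lag bounds matter: a search range that admits multiples of the true period can pick an octave error, which is why practical systems restrict the range to plausible speaking pitches.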
Instruction receiving module: receives, through an external voice receiver, the original voice instruction issued by the user while driving; performs preprocessing such as user identity verification, user permission determination and separation of environmental sound; and inputs the preprocessed voice instruction into the instruction analysis module. For example, if an unauthorized user gives an instruction to open the car window, the instruction is ignored.
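The permission check in this step can be pictured as a lookup against a registry of known speakers. The sketch below assumes that a separate speaker verification step has already mapped the voice to a speaker ID; the registry contents, the permission names and the function `admit_instruction` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    name: str
    permissions: set = field(default_factory=set)


# Hypothetical registry keyed by the ID that speaker verification returns.
REGISTERED_USERS = {
    "speaker-001": UserProfile("owner", {"window", "navigation", "music"}),
    "speaker-002": UserProfile("passenger", {"music"}),
}


def admit_instruction(speaker_id, required_permission):
    """Return True only if the speaker is registered and holds the permission."""
    user = REGISTERED_USERS.get(speaker_id)
    return user is not None and required_permission in user.permissions
```

With this registry, `admit_instruction("speaker-002", "window")` is False, which corresponds to the ignored window instruction in the example above.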
Instruction analysis module: receives the voice instruction from the instruction receiving module, intelligently analyzes the user's intention to obtain the corresponding instruction processing result, and activates the instruction execution module and/or the text response module, passing the result to whichever modules it activates. For example, if an authorized user gives an instruction to open the car window, the module activates the instruction execution module and sends it a window-opening instruction; at the same time it activates the text response module and passes it the processing result "window opening requested".
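A minimal sketch of this routing, assuming the voice instruction has already been transcribed to text: keyword rules stand in for the "intelligent" intent analysis, which the patent leaves unspecified, and the result records which downstream modules to activate. The rule table, the intent names and the `ProcessingResult` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessingResult:
    intent: str          # e.g. "open_window", or "unknown"
    run_execution: bool  # activate the instruction execution module?
    run_response: bool   # activate the text response module?


# Hypothetical rules: (keywords that must all be present, resulting intent).
RULES = [
    (("open", "window"), "open_window"),
    (("close", "window"), "close_window"),
    (("play", "music"), "play_music"),
]


def analyze(instruction_text):
    """Match the transcribed instruction against the rules, first match wins."""
    words = set(instruction_text.lower().split())
    for keywords, intent in RULES:
        if all(keyword in words for keyword in keywords):
            return ProcessingResult(intent, True, True)
    # Unrecognized request: nothing to execute, but still reply to the user.
    return ProcessingResult("unknown", False, True)
```

Returning the two activation flags separately mirrors the "and/or" in the description: some requests need only a spoken reply, others need both an action and a reply.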
Instruction execution module: connected to the automobile's other control ports. Once activated by the instruction analysis module, it receives the instruction processing result and sends an execution command to the corresponding control port according to the content of the result. For example, if the processing result indicates that the window should be opened, the module connects to the window control module and automatically lowers the window.
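Dispatching a processing result to a control port can be sketched as a table lookup. Real vehicle control interfaces are not described in the patent, so `WindowPort` below is only a stand-in that records the commanded state, and the command table and intent names are hypothetical.

```python
class WindowPort:
    """Stand-in for a window control port; it only records the commanded state."""

    def __init__(self):
        self.state = "closed"

    def send(self, command):
        if command == "lower":
            self.state = "open"
        elif command == "raise":
            self.state = "closed"
        else:
            raise ValueError(f"unsupported window command: {command}")


# Hypothetical mapping: intent -> (port name, command).
COMMAND_TABLE = {
    "open_window": ("window", "lower"),
    "close_window": ("window", "raise"),
}


def execute(intent, ports):
    """Send the command mapped to the intent; return False if nothing is mapped."""
    entry = COMMAND_TABLE.get(intent)
    if entry is None:
        return False
    port_name, command = entry
    ports[port_name].send(command)
    return True
```

Keeping the mapping in a table means a new controllable function needs only a new port object and a new table row, not a change to `execute`.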
Text response module: once activated by the instruction analysis module, it receives the instruction processing result, retrieves the text feature vector space stored by the text feature extraction module, intelligently generates a response text whose wording and phrasing resemble those of the cloning target, and inputs the response text into the cloning synthesis module. For example, for the processing result "window opening requested", the module analyzes the user's intention and generates the response text "The window has been opened for you. Is this height suitable?"
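How the response text is made to resemble the cloning target's phrasing is not specified. As a deliberately simple stand-in, the sketch below fills a fixed template and prefixes it with the word that most often opens the target's recorded sentences. The templates, the sample transcripts and the functions `most_common_opening` and `respond` are hypothetical; a real system would presumably use a far richer style model.

```python
from collections import Counter


def most_common_opening(transcripts):
    """Return the word that most often starts the target's recorded sentences."""
    openings = Counter(t.split()[0].lower() for t in transcripts if t.split())
    return openings.most_common(1)[0][0] if openings else ""


# Hypothetical response templates keyed by intent.
TEMPLATES = {
    "open_window": "the window has been opened for you. Is this height suitable?",
}


def respond(intent, transcripts):
    """Fill the template for the intent and prefix the target's usual opening word."""
    base = TEMPLATES.get(intent, "sorry, I did not understand that.")
    opening = most_common_opening(transcripts)
    text = f"{opening}, {base}" if opening else base
    return text[0].upper() + text[1:]


corpus = ["Okay let's go", "Okay sure thing", "Well that is fine"]
reply = respond("open_window", corpus)
```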
Cloning synthesis module: receives the response text from the text response module, retrieves the voice feature vector space stored by the voice feature extraction module, trains a speech synthesis model (such as Merlin, WaveNet, Tacotron or ClariNet) on the parameters of that space, generates a spectrogram that resembles the voice of the cloning target, and inputs the spectrogram into the voice output module.
Voice output module: receives the spectrogram from the cloning synthesis module, decodes it with a vocoder (such as WaveRNN or Griffin-Lim) to generate a voice signal, and delivers the voice response through an external voice player, completing the intelligent voice interaction with the user. After the response "The window has been opened for you. Is this height suitable?", if the user replies further, processing continues again from the instruction receiving module.
Basic support module: supports the basic functions that the system of this embodiment requires, such as deletion, selection, memory cleanup, version updates, self-checking and error reporting.
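The flow through the modules for a single interaction turn can be summarized in a short orchestration sketch. Each module is passed in as a plain callable so that the sketch stays independent of any particular implementation; the function `handle_turn`, its parameter names and the stub lambdas in the usage example are hypothetical, and authorization is simplified to a per-speaker check.

```python
def handle_turn(speaker_id, utterance, authorize, analyze,
                execute, respond, synthesize, play):
    """Run one interaction turn; return the response text, or None if no reply."""
    if not authorize(speaker_id):
        return None                  # unauthorized instructions are ignored
    result = analyze(utterance)      # dict with keys: intent, execute, respond
    if result["execute"]:
        execute(result["intent"])    # instruction execution module
    if not result["respond"]:
        return None
    text = respond(result["intent"])  # text response module
    play(synthesize(text))            # cloning synthesis, then voice output
    return text


# Usage with stub modules that only record what was called.
log = []
reply = handle_turn(
    "owner", "open the window",
    authorize=lambda sid: sid == "owner",
    analyze=lambda u: {"intent": "open_window", "execute": True, "respond": True},
    execute=lambda intent: log.append(("execute", intent)),
    respond=lambda intent: "The window has been opened for you.",
    synthesize=lambda text: ("spectrogram", text),
    play=lambda spec: log.append(("play", spec)),
)
```

Calling `handle_turn` again with the user's next utterance corresponds to the loop back to the instruction receiving module described for the voice output module.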
The embodiments described above illustrate only several embodiments of the present application. Although they are described specifically and in detail, they are not to be construed as limiting the scope of the present application. It should be noted that the embodiments of the present invention can be combined freely, and any such combination should be regarded as part of the disclosure of the present invention as long as it does not depart from the idea of the invention.

Claims (2)

1. An intelligent vehicle-mounted voice interaction system supporting voice cloning, used for improving the quality and service experience of vehicle-mounted voice interaction.
2. The intelligent vehicle-mounted voice interaction system supporting voice cloning as claimed in claim 1, comprising a corpus collection module, a text feature extraction module, a voice feature extraction module, an instruction receiving module, an instruction analysis module, an instruction execution module, a text response module, a cloning synthesis module, a voice output module, and a basic support module.
CN202110216036.3A 2021-02-26 2021-02-26 Intelligent vehicle-mounted voice interaction system supporting voice cloning Pending CN113053373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110216036.3A CN113053373A (en) 2021-02-26 2021-02-26 Intelligent vehicle-mounted voice interaction system supporting voice cloning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110216036.3A CN113053373A (en) 2021-02-26 2021-02-26 Intelligent vehicle-mounted voice interaction system supporting voice cloning

Publications (1)

Publication Number Publication Date
CN113053373A true CN113053373A (en) 2021-06-29

Family

ID=76509182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110216036.3A Pending CN113053373A (en) 2021-02-26 2021-02-26 Intelligent vehicle-mounted voice interaction system supporting voice cloning

Country Status (1)

Country Link
CN (1) CN113053373A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011186143A (en) * 2010-03-08 2011-09-22 Hitachi Ltd Speech synthesizer, speech synthesis method for learning user's behavior, and program
CN106790938A (en) * 2016-11-16 2017-05-31 上海趣讯网络科技有限公司 A kind of man-machine interaction onboard system based on artificial intelligence
CN108711423A (en) * 2018-03-30 2018-10-26 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
KR20190107289A (en) * 2019-08-30 2019-09-19 엘지전자 주식회사 Artificial robot and method for speech recognitionthe same
CN111399798A (en) * 2020-03-10 2020-07-10 上海博泰悦臻电子设备制造有限公司 Vehicle-mounted voice assistant personalized realization method, system, medium and vehicle-mounted equipment
CN111429882A (en) * 2019-01-09 2020-07-17 北京地平线机器人技术研发有限公司 Method and device for playing voice and electronic equipment
CN112233646A (en) * 2020-10-20 2021-01-15 携程计算机技术(上海)有限公司 Voice cloning method, system, device and storage medium based on neural network

Similar Documents

Publication Publication Date Title
Delić et al. Speech technology progress based on new machine learning paradigm
US20230230572A1 (en) End-to-end speech conversion
JP3479691B2 (en) Automatic control method of one or more devices by voice dialogue or voice command in real-time operation and device for implementing the method
US9570066B2 (en) Sender-responsive text-to-speech processing
JP2004525412A (en) Runtime synthesis device adaptation method and system for improving intelligibility of synthesized speech
CN112581963B (en) Voice intention recognition method and system
KR19980070329A (en) Method and system for speaker independent recognition of user defined phrases
US20040098259A1 (en) Method for recognition verbal utterances by a non-mother tongue speaker in a speech processing system
CN110539721A (en) vehicle control method and device
Nafis et al. Speech to text conversion in real-time
Lee MLP-based phone boundary refining for a TTS database
JP6993376B2 (en) Speech synthesizer, method and program
Bou-Ghazale et al. HMM-based stressed speech modeling with application to improved synthesis and recognition of isolated speech under stress
Al-Anzi et al. The capacity of mel frequency cepstral coefficients for speech recognition
Kothadiya et al. Different methods review for speech to text and text to speech conversion
Wan et al. Building HMM-TTS voices on diverse data
CN113053373A (en) Intelligent vehicle-mounted voice interaction system supporting voice cloning
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
CN115938340A (en) Voice data processing method based on vehicle-mounted voice AI and related equipment
Westphal et al. Towards spontaneous speech recognition for on-board car navigation and information systems
Flanagen Talking with computers: Synthesis and recognition of speech by machines
Atal et al. Speech research directions
Matsumoto et al. Speech-like emotional sound generator by WaveNet
Lee The conversational computer: an apple perspective.
CN112185368A (en) Self-adaptive man-machine voice conversation device and equipment, interaction system and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210629