CN112927688A - Voice interaction method and system for vehicle - Google Patents

Voice interaction method and system for vehicle

Info

Publication number
CN112927688A
CN112927688A (application CN202110096485.9A)
Authority
CN
China
Prior art keywords
voice
scoring
preset threshold
vehicle
scoring result
Prior art date
Legal status
Granted
Application number
CN202110096485.9A
Other languages
Chinese (zh)
Other versions
CN112927688B (en
Inventor
符晓乐
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority claimed from application CN202110096485.9A
Publication of CN112927688A
Application granted; publication of granted patent CN112927688B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 21/0272 Voice signal separating
    • G10L 2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 Microphone arrays; Beamforming

Abstract

An embodiment of the invention provides a voice interaction method for a vehicle. The method comprises the following steps: collecting sound information and video information from each seat in the vehicle; performing speech recognition on the sound information, and giving the sound information a first score based on the speech recognition result; giving the passenger's mouth shape in the video information a second score based on a reference mouth-shape video image corresponding to the speech recognition result; and judging the first and second scoring results together to determine whether to perform voice interaction. An embodiment of the invention also provides a voice interaction system for a vehicle. Embodiments of the invention use an image recognition algorithm to detect and recognize the mouth shape of the user in each seat in real time. Whether to perform voice interaction is judged from the combined factors of mouth shape and sound, which improves the speech recognition effect and the interaction accuracy across the whole vehicle, and the various factors in the vehicle that affect the sound and video signals are handled "elastically", thereby further improving the voice interaction effect.

Description

Voice interaction method and system for vehicle
Technical Field
The invention relates to the field of intelligent voice, in particular to a voice interaction method and system for a vehicle.
Background
Mouth-shape-assisted speech recognition combines mouth-shape image recognition with speech recognition: speech recognition and judgment are performed first, and then mouth-shape recognition and judgment, which reduces the false-trigger rate of voice wake-up. This is of great help for speech recognition in cars.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
in the prior art, image recognition only detects whether the mouth shape changes in order to reject a voice signal; that is, it only checks whether the mouth is open or closed at the moment sound is detected from the user, and recognition is judged with this coarse mouth-shape assistance. Mouth-shape recognition data are not used to strengthen the voice wake-up model, the ability to localize sound sources across the whole vehicle is lacking, the voice signals of the whole vehicle cannot be directionally enhanced and reversely suppressed, and the voice signals at different positions are not separated. Meanwhile, a front-end voice signal processing method is adopted, improving the voice wake-up effect by raising the signal-to-noise ratio of the signal. Practitioners in the industry mainly improve the voice wake-up effect from the front-end signal side, but starting from front-end signal processing alone, it is difficult to improve the speech recognition effect under extremely low signal-to-noise ratios.
Disclosure of Invention
Embodiments of the invention at least solve the problem in the prior art that the speech recognition effect is not improved by a multi-modal method combining visual recognition, front-end signal processing and speech recognition.
In a first aspect, an embodiment of the present invention provides a voice interaction method for a vehicle, including:
collecting sound information and video information from each seat in the vehicle;
performing voice recognition on the sound information, and performing first scoring on the sound information based on the voice recognition result;
performing second scoring on the mouth shape of the passenger in the video information based on the video image of the reference mouth shape corresponding to the voice recognition result;
and comprehensively judging the first scoring result and the second scoring result to determine whether to perform voice interaction.
In a second aspect, an embodiment of the present invention provides a voice interaction system for a vehicle, including:
the information acquisition program module is used for acquiring sound information and video information from each seat in the vehicle;
the voice scoring program module is used for carrying out voice recognition on the voice information and carrying out first scoring on the voice information based on the voice recognition result;
the video image scoring program module is used for carrying out second scoring on the mouth shape of the passenger in the video information based on the video image of the reference mouth shape corresponding to the voice recognition result;
and the judging program module is used for comprehensively judging the first scoring result and the second scoring result and determining whether voice interaction is performed.
In a third aspect, an electronic device is provided, comprising: the vehicle voice interaction system comprises at least one processor and a memory which is in communication connection with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the voice interaction method for the vehicle according to any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is used to implement the steps of the voice interaction method for a vehicle according to any embodiment of the present invention when executed by a processor.
The embodiments of the invention have the following beneficial effects: the mouth shape of the user in each seat is detected and recognized in real time with an image recognition algorithm. Whether to perform voice interaction is judged from the combined factors of mouth shape and sound, improving the speech recognition effect and the interaction accuracy across the whole vehicle, and the various factors in the vehicle that may affect the sound and video signals are handled "elastically", thereby further improving the speech recognition effect. Sound signals and image signals for every seat are acquired through the vehicle-wide distributed microphone array and the in-car camera. The voice signals at all positions of the vehicle are separated by a multi-sound-zone front-end signal processing algorithm, and enhanced clean audio for each seat is obtained through directional enhancement and reverse suppression.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for voice interaction in a vehicle according to an embodiment of the present invention;
FIG. 2 is a block diagram of an overall process of a voice interaction method for a vehicle according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a voice interaction system for a vehicle according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a voice interaction method for a vehicle according to an embodiment of the present invention, including the following steps:
s11: collecting sound information and video information from each seat in the vehicle;
s12: performing voice recognition on the sound information, and performing first scoring on the sound information based on the voice recognition result;
s13: performing second scoring on the mouth shape of the passenger in the video information based on the video image of the reference mouth shape corresponding to the voice recognition result;
s14: and comprehensively judging the first scoring result and the second scoring result to determine whether to perform voice interaction.
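The four steps above can be sketched as a minimal pipeline. All function names, the stand-in scores, and the 0.6 thresholds are illustrative assumptions; the patent does not prescribe a concrete recognizer or scoring model.

```python
from typing import List, NamedTuple

class SeatSignals(NamedTuple):
    audio: List[float]   # per-seat audio samples (S11)
    video: List[bytes]   # per-seat video frames (S11)

def recognize(audio):
    # Stand-in for the real recognizer: returns (text, first score).
    return "wake-word", 0.736

def mouth_shape_score(video, reference_text):
    # Stand-in for comparison against the stored reference mouth-shape video.
    return 0.8

def decide(first_score, second_score, t1=0.6, t2=0.6):
    # S14 in its simplest form: both scores must exceed their thresholds.
    return first_score > t1 and second_score > t2

def interact(seat: SeatSignals) -> bool:
    text, s1 = recognize(seat.audio)           # S12: recognition + first score
    s2 = mouth_shape_score(seat.video, text)   # S13: mouth-shape second score
    return decide(s1, s2)                      # S14: comprehensive judgment
```

Each seat's `(audio, video)` pair runs through the pipeline independently, matching the per-seat collection in S11.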
In this embodiment, the method can be adapted to various types of vehicles, such as two-seat, four-seat and six-seat vehicles, and the arrangement of the cabin space is not limited: most often all seats share one cabin, but the driver may also have a separate compartment with the other passengers in another. For these different vehicle types, adaptation only requires a corresponding microphone and camera in each cabin space.
For step S11, sound information and video information must be collected from each seat in real time so that the user's dialogue can be answered in real time. A camera for each seat may be installed at the front of that seat, for example on the back of the seat in front, or on the center rear-view mirror, so that each camera captures the video of one seat. Alternatively, a special position may be chosen for the given vehicle type that captures video of all seats, with the partial video corresponding to each seat extracted from it. The sound information may be obtained by microphones arranged in the vehicle; various configurations are possible, such as one microphone at each seat, or a microphone in the middle of the vehicle with different sound pick-up zones divided to obtain separate audio.
Consider a vehicle with four seats, each occupied by a person. During acquisition, sound information and video information can be obtained separately for each of the four persons.
For step S12, the four persons yield four separate streams of video information and sound information. Taking one user as an example, speech recognition is performed on that user's sound information; suppose the user speaks the wake-up word. After recognition, the speech recognition result is compared with the expected wake-up word, and the sound information is given a first score: a similarity score between what the user said and the corresponding word, for example 0.736.
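The first score is some similarity measure between what was said and the expected word; the patent does not specify which. As a hedged stand-in only, a character-level similarity from the standard library:

```python
from difflib import SequenceMatcher

def first_score(recognized: str, expected: str) -> float:
    # Illustrative stand-in: fraction of matching characters, in [0, 1].
    # A production system would use an acoustic/model confidence instead.
    return SequenceMatcher(None, recognized, expected).ratio()
```

This gives 1.0 for an exact match and values between 0 and 1 as the recognized text drifts from the expected word, which is the shape of score the thresholding in step S14 assumes.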
For step S13, continuing the wake-up word example from step S12: since the speech recognition result is the wake-up word, the reference mouth-shape video image preset in the vehicle for that word is looked up (in a concrete implementation, a large library of reference mouth-shape video images is pre-built in the vehicle, covering voice commands as well as wake-up words). The passenger's mouth shape in the captured video is then compared against this reference to produce the second score.
With respect to step S14, the voice score and the mouth-shape score obtained in steps S12 and S13 are judged together, and whether voice interaction is required is determined from the judgment result.
According to the embodiment, the mouth shape of each seat user is detected and identified in real time by using an image identification algorithm. Whether voice interaction is carried out or not is judged through multiple factors of the mouth shape and the sound, and the voice recognition effect and the interaction accuracy rate in the whole vehicle are improved.
As an implementation manner, in this embodiment, the comprehensively determining the first scoring result and the second scoring result includes:
when the first scoring result exceeds a first preset threshold and the second scoring result exceeds a second preset threshold, executing a voice action corresponding to the voice recognition result;
and when the first scoring result does not exceed a first preset threshold value and the second scoring result does not exceed a second preset threshold value, rejecting the voice action corresponding to the voice recognition result.
When the first scoring result exceeds a first preset threshold value, the second scoring result does not exceed a second preset threshold value, or the first scoring result does not exceed the first preset threshold value, and the second scoring result exceeds the second preset threshold value, performing secondary verification;
in the secondary verification, when the first scoring result exceeds a first preset threshold and the second scoring result does not exceed a second preset threshold, if the error between the second scoring result and the second preset threshold does not exceed a preset mouth shape error, executing a voice action corresponding to the voice recognition result, otherwise, rejecting the voice action;
and when the first scoring result does not exceed a first preset threshold value and the second scoring result exceeds a second preset threshold value, if the error between the first scoring result and the first preset threshold value does not exceed a preset voice error, executing the voice action corresponding to the voice recognition result, otherwise, rejecting the voice action.
In the present embodiment, the comprehensive judgment is made from the voice score (the first score) and the mouth-shape score (the second score); take both the first and second preset thresholds to be 0.6. By comparison, when the voice score and the mouth-shape score both exceed their corresponding preset thresholds, the voice action corresponding to the speech recognition result can be executed, and a wake-up response is given to the wake-up word spoken by the user.
For another example, assume that the voice score is 0.564 and the mouth-shape score is 0.473. Again taking both preset thresholds to be 0.6, the comprehensive judgment finds that neither the voice score nor the mouth-shape score exceeds its corresponding threshold. The voice action corresponding to the speech recognition result is therefore rejected, and no corresponding operation is performed.
There are also special cases. For example, road noise while the vehicle is traveling can affect the score of the sound information: say the voice score is 0.536 and the mouth-shape score is 0.731. The mouth shape is clearly correct, but the voice score falls slightly short, so a certain flexibility is given. For example, with the preset voice error set to 0.1, the voice score shortfall is 0.6 − 0.536 = 0.064, which is less than the preset 0.1. The voice action corresponding to the speech recognition result is therefore executed, solving the problem of the voice score being degraded by noise or other conditions while the vehicle is running.
The collected video information is likewise affected by various factors, such as shaking, jolting, or lights in the vehicle. For example, say the voice score is 0.862 and the mouth-shape score is 0.514. The voice is clearly correct, but the mouth-shape score falls slightly short, so the same flexibility is given. For example, with the preset mouth-shape error set to 0.1, the mouth-shape score shortfall is 0.6 − 0.514 = 0.086, which is less than the preset 0.1. The voice action corresponding to the speech recognition result is therefore executed, solving the problem of the image score being degraded by shaking, jolting, lighting and similar conditions while the vehicle is driving.
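The decision rules and worked numbers above can be sketched as a single judgment function. The 0.6 thresholds and 0.1 "elastic" error margins are the example values from this embodiment; the function names are illustrative assumptions, not the patent's interfaces.

```python
def comprehensive_judgment(voice_score, mouth_score,
                           t_voice=0.6, t_mouth=0.6,
                           voice_err=0.1, mouth_err=0.1):
    voice_ok = voice_score > t_voice
    mouth_ok = mouth_score > t_mouth
    if voice_ok and mouth_ok:
        return True                  # both pass: execute the voice action
    if not voice_ok and not mouth_ok:
        return False                 # both fail: reject the voice action
    # Secondary verification: one score passed; the other is allowed a
    # small "elastic" margin below its threshold (noise, jolting, lights).
    if voice_ok:
        return (t_mouth - mouth_score) <= mouth_err
    return (t_voice - voice_score) <= voice_err
```

Checking it against the worked examples: (0.564, 0.473) is rejected, (0.536, 0.731) passes via the voice margin 0.064 ≤ 0.1, and (0.862, 0.514) passes via the mouth-shape margin 0.086 ≤ 0.1.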
According to the embodiment, the mouth shape of each seat's user is detected and recognized in real time by an image recognition algorithm, and image recognition and speech recognition each output their current result in real time. When the two recognition results are consistent and both confidences are high, a voice response is given; when both confidences are low, the response is rejected. When one recognition scores high and the other low, a secondary judgment is made and the corresponding action performed, with "elastic" handling of the various factors in the vehicle that affect the sound and video signals. The speech recognition effect is thereby further improved.
As an embodiment, in this embodiment, the collecting the sound information and the video information from each seat in the vehicle includes:
collecting sound information of each seat through a distributed microphone array in the vehicle;
video information of each seat is collected through the camera.
Performing voice front-end signal processing on the sound information through a distributed microphone array, and eliminating background system sound in the sound information to obtain pure audio;
and forming beams of the pure audio to obtain enhanced audio of each seat, wherein the enhanced audio is used for enhancing the human voice in the pure audio.
The voice interaction includes: wake-up word interaction and in-vehicle operation instruction interaction.
In the present embodiment, as shown in fig. 2, the vehicle-wide distributed microphone array and the in-vehicle camera collect the sound signal and the image signal for each seat of the entire vehicle. When neither the sound signal nor the video signal detects human features, acquisition is restarted.
For the audio signals, echo cancellation is first performed against the background system audio through the distributed microphone array and the vehicle-wide voice front-end signal processing algorithm, yielding effective human-voice audio and reducing interference from the device itself. Beamforming is then performed to directionally enhance the audio of each seat, returning enhanced speech for the current seat in preparation for improving the subsequent recognition rate. Next, the audio at each position is obtained by a blind source separation algorithm, so that position images and position audio correspond one-to-one, ready for the combined scoring of the two subsequent recognitions. Finally, noise suppression is performed to obtain clean audio for each seat: interfering voices are removed and the signal-to-noise ratio at the current position is raised, laying a foundation for more accurate subsequent voice scoring.
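The directional-enhancement step can be illustrated with the simplest beamformer, delay-and-sum: each microphone channel is shifted by an integer-sample delay (assumed pre-computed for the target seat) and the channels are averaged, reinforcing the seat's voice and attenuating off-axis sound. This is a sketch of the general technique only, not the patent's specific multi-sound-zone algorithm.

```python
def delay_and_sum(channels, delays):
    # channels: list of per-microphone sample lists.
    # delays: integer sample delays aligning each channel to the target seat.
    n = max(len(c) - d for c, d in zip(channels, delays))
    out = []
    for i in range(n):
        acc = 0.0
        for c, d in zip(channels, delays):
            j = i + d
            if 0 <= j < len(c):   # out-of-range samples contribute zero
                acc += c[j]
        out.append(acc / len(channels))
    return out
```

When the delays are chosen correctly, the target seat's signal adds coherently across microphones while uncorrelated noise averages down.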
When a wake-up word is detected in the audio signal, the wake-up audio of each channel is scored against the wake-up word. Meanwhile, for the video signal, the mouth-shape image of each seat is scored against the wake-up word in real time; this scoring of the video image signal greatly reduces false voice wake-ups. When both the audio signal and the image signal have output a scoring result, comprehensive scoring is performed. When the voice score and the image score both exceed their thresholds, the voice action is executed; when neither exceeds its threshold, the voice action is rejected. When one score exceeds its threshold and the other does not, the voice action is answered if the shortfall between the failing score and its qualifying threshold is at most 0.1, for example responding to the user's wake-up, or performing voice interaction according to the vehicle operation instruction the user spoke; if the shortfall is greater than 0.1, the voice action is rejected. Through this mutual verification, the speech recognition rate in a noisy environment can be improved and the misrecognition rate reduced.
It can be seen from the embodiment that the sound signals and the image signals of each seat of the whole car are collected through the whole car distributed microphone array and the car camera. Separating the voice signals of all positions of the whole vehicle through a multi-sound-zone front-end signal processing algorithm, and acquiring the enhanced clean audio of each seat through directional enhancement and reverse suppression.
Fig. 3 is a schematic structural diagram of a voice interaction system for a vehicle according to an embodiment of the present invention, which can execute the voice interaction method for a vehicle according to any of the above embodiments and is configured in a terminal.
The present embodiment provides a voice interaction system 10 for a vehicle, which includes: an information acquisition program module 11, a sound scoring program module 12, a video image scoring program module 13 and a judgment program module 14.
The information acquisition program module 11 is used for acquiring sound information and video information from each seat in the vehicle; the sound scoring program module 12 is configured to perform speech recognition on the sound information, and perform a first scoring on the sound information based on the speech recognition result; the video image scoring program module 13 is configured to perform a second scoring on the mouth shape of the passenger in the video information based on the video image of the reference mouth shape corresponding to the voice recognition result; the judging program module 14 is configured to perform comprehensive judgment on the first scoring result and the second scoring result, and determine whether to perform voice interaction.
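The wiring of the four program modules can be sketched as follows; the class name, callable signatures, and the stand-in values are assumptions for illustration, not the patent's interfaces.

```python
class VoiceInteractionSystem:
    def __init__(self, collector, voice_scorer, video_scorer, judge):
        self.collector = collector        # information acquisition module 11
        self.voice_scorer = voice_scorer  # sound scoring module 12
        self.video_scorer = video_scorer  # video image scoring module 13
        self.judge = judge                # judging module 14

    def run_once(self):
        audio, video = self.collector()
        text, s1 = self.voice_scorer(audio)       # first score
        s2 = self.video_scorer(video, text)       # second score
        return self.judge(s1, s2)                 # comprehensive judgment

# Wiring with trivial stand-ins for the four modules:
system = VoiceInteractionSystem(
    collector=lambda: ([0.0] * 16, ["frame"]),
    voice_scorer=lambda audio: ("wake-word", 0.736),
    video_scorer=lambda video, text: 0.8,
    judge=lambda s1, s2: s1 > 0.6 and s2 > 0.6,
)
```

Injecting the modules as callables keeps each one independently replaceable, mirroring the per-module decomposition of the system claim.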
Further, the judging program module includes:
when the first scoring result exceeds a first preset threshold and the second scoring result exceeds a second preset threshold, executing a voice action corresponding to the voice recognition result;
and when the first scoring result does not exceed a first preset threshold value and the second scoring result does not exceed a second preset threshold value, rejecting the voice action corresponding to the voice recognition result.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the voice interaction method for the vehicle in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
collecting sound information and video information from each seat in the vehicle;
performing voice recognition on the sound information, and performing first scoring on the sound information based on the voice recognition result;
performing second scoring on the mouth shape of the passenger in the video information based on the video image of the reference mouth shape corresponding to the voice recognition result;
and comprehensively judging the first scoring result and the second scoring result to determine whether to perform voice interaction.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-transitory computer-readable storage medium and, when executed by a processor, perform the voice interaction method for a vehicle in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: the vehicle voice interaction system comprises at least one processor and a memory which is in communication connection with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the voice interaction method for the vehicle according to any embodiment of the invention.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone. Based on this understanding, the technical solutions above may be embodied in the form of a software product stored in a computer-readable storage medium such as a ROM/RAM, magnetic disk, or optical disk, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice interaction method for a vehicle, comprising:
collecting sound information and video information from each seat in the vehicle;
performing voice recognition on the sound information, and performing a first scoring on the sound information based on a result of the voice recognition;
performing a second scoring on the mouth shape of the passenger in the video information based on a video image of a reference mouth shape corresponding to the voice recognition result;
and comprehensively judging the first scoring result and the second scoring result to determine whether to perform voice interaction.
2. The method of claim 1, wherein the comprehensively determining the first scoring result and the second scoring result comprises:
when the first scoring result exceeds a first preset threshold and the second scoring result exceeds a second preset threshold, executing a voice action corresponding to the voice recognition result;
and when the first scoring result does not exceed a first preset threshold value and the second scoring result does not exceed a second preset threshold value, rejecting the voice action corresponding to the voice recognition result.
3. The method of claim 2, wherein the method further comprises:
when the first scoring result exceeds the first preset threshold but the second scoring result does not exceed the second preset threshold, or when the first scoring result does not exceed the first preset threshold but the second scoring result exceeds the second preset threshold, performing secondary verification;
in the secondary verification, when the first scoring result exceeds the first preset threshold but the second scoring result does not exceed the second preset threshold, if the error between the second scoring result and the second preset threshold does not exceed a preset mouth-shape error, executing the voice action corresponding to the voice recognition result, and otherwise rejecting the voice action;
and when the first scoring result does not exceed the first preset threshold but the second scoring result exceeds the second preset threshold, if the error between the first scoring result and the first preset threshold does not exceed a preset voice error, executing the voice action corresponding to the voice recognition result, and otherwise rejecting the voice action.
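The two-stage decision of claims 2 and 3 can be sketched as follows. The patent specifies only the comparisons, not concrete values, so the threshold and error-margin defaults, the function name, and the return labels are all illustrative assumptions.

```python
def decide(first_score, second_score,
           first_threshold=0.8, second_threshold=0.8,
           voice_error=0.1, mouth_error=0.1):
    """Combine the speech-recognition score (first) and the mouth-shape
    score (second) into an accept/reject decision, following the
    two-stage logic of claims 2 and 3 (values are illustrative)."""
    above1 = first_score > first_threshold
    above2 = second_score > second_threshold
    if above1 and above2:
        return "execute"   # both scores pass: run the voice action
    if not above1 and not above2:
        return "reject"    # both scores fail: refuse the voice action
    # Secondary verification: exactly one score passed the threshold.
    if above1:
        # Speech passed; accept if the mouth-shape deficit is small.
        return "execute" if second_threshold - second_score <= mouth_error else "reject"
    # Mouth shape passed; accept if the speech deficit is small.
    return "execute" if first_threshold - first_score <= voice_error else "reject"
```

In this reading, the secondary check acts as a tolerance band around each threshold, so a near-miss in one modality can be rescued by a clear pass in the other.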
4. The method of claim 1, wherein said collecting sound and video information from each seat in the vehicle comprises:
collecting sound information of each seat through a distributed microphone array in the vehicle;
and collecting video information of each seat through a camera.
5. The method of claim 4, wherein the method comprises:
performing voice front-end signal processing on the sound information through the distributed microphone array, and eliminating background system sound from the sound information to obtain clean audio;
and performing beamforming on the clean audio to obtain enhanced audio for each seat, wherein the beamforming enhances the human voice in the clean audio.
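Claim 5 calls for beamforming the per-seat audio but does not name a specific beamformer. A minimal sketch under that gap is delay-and-sum, shown below with integer-sample delays; the function name, geometry inputs, and the 343 m/s sound speed are illustrative assumptions, not from the patent.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air

def delay_and_sum(channels, mic_positions, source_position, fs):
    """Time-align the microphone channels toward one seat and average
    them, boosting speech from that seat relative to diffuse noise.
    A minimal delay-and-sum beamformer (integer-sample delays only)."""
    channels = np.asarray(channels, dtype=float)
    # Relative propagation delay of each mic with respect to the closest one.
    dists = np.linalg.norm(np.asarray(mic_positions) - np.asarray(source_position), axis=1)
    delays = (dists - dists.min()) / SPEED_OF_SOUND
    shifts = np.round(delays * fs).astype(int)  # delays in samples
    n = channels.shape[1]
    out = np.zeros(n)
    for ch, s in zip(channels, shifts):
        out[: n - s] += ch[s:]  # advance later-arriving channels into alignment
    return out / len(channels)
```

Signals arriving from the targeted seat add coherently after alignment, while sound from other directions stays misaligned and is attenuated by the averaging; that is the "enhanced audio of each seat" effect the claim describes.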
6. The method of claim 1, wherein the voice interaction comprises: wake-up word interaction and in-vehicle operation instruction interaction.
7. A voice interaction system for a vehicle, comprising:
the information acquisition program module is used for acquiring sound information and video information from each seat in the vehicle;
the voice scoring program module is used for performing voice recognition on the sound information and performing a first scoring on the sound information based on the voice recognition result;
the video image scoring program module is used for carrying out second scoring on the mouth shape of the passenger in the video information based on the video image of the reference mouth shape corresponding to the voice recognition result;
and the judging program module is used for comprehensively judging the first scoring result and the second scoring result and determining whether voice interaction is performed.
8. The system of claim 7, wherein the judging program module is configured to:
when the first scoring result exceeds a first preset threshold and the second scoring result exceeds a second preset threshold, executing a voice action corresponding to the voice recognition result;
and when the first scoring result does not exceed a first preset threshold value and the second scoring result does not exceed a second preset threshold value, rejecting the voice action corresponding to the voice recognition result.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-6.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202110096485.9A 2021-01-25 2021-01-25 Voice interaction method and system for vehicle Active CN112927688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110096485.9A CN112927688B (en) 2021-01-25 2021-01-25 Voice interaction method and system for vehicle


Publications (2)

Publication Number Publication Date
CN112927688A true CN112927688A (en) 2021-06-08
CN112927688B CN112927688B (en) 2022-05-10

Family

ID=76166445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110096485.9A Active CN112927688B (en) 2021-01-25 2021-01-25 Voice interaction method and system for vehicle

Country Status (1)

Country Link
CN (1) CN112927688B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783458A (en) * 2022-03-28 2022-07-22 小米汽车科技有限公司 Voice signal processing method and device, storage medium, electronic equipment and vehicle

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060104454A1 (en) * 2004-11-17 2006-05-18 Siemens Aktiengesellschaft Method for selectively picking up a sound signal
CN102324035A (en) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 Method and system of applying lip posture assisted speech recognition technique to vehicle navigation
CN104598796A (en) * 2015-01-30 2015-05-06 科大讯飞股份有限公司 Method and system for identifying identity
US20170024608A1 (en) * 2015-07-20 2017-01-26 International Business Machines Corporation Liveness detector for face verification
CN108346427A (en) * 2018-02-05 2018-07-31 广东小天才科技有限公司 A kind of audio recognition method, device, equipment and storage medium
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN111045639A (en) * 2019-12-11 2020-04-21 深圳追一科技有限公司 Voice input method, device, electronic equipment and storage medium
EP3716159A1 (en) * 2017-11-24 2020-09-30 Genesis Lab, Inc. Multi-modal emotion recognition device, method and storage medium using artificial intelligence


Also Published As

Publication number Publication date
CN112927688B (en) 2022-05-10

Similar Documents

Publication Publication Date Title
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
CN110459234B (en) Vehicle-mounted voice recognition method and system
CN107240398B (en) Intelligent voice interaction method and device
CN108109613B (en) Audio training and recognition method for intelligent dialogue voice platform and electronic equipment
CN105529026B (en) Speech recognition apparatus and speech recognition method
CN110660201B (en) Arrival reminding method, device, terminal and storage medium
CN110648553B (en) Site reminding method, electronic equipment and computer readable storage medium
CN111063341A (en) Method and system for segmenting and clustering multi-person voice in complex environment
CN110600054B (en) Sound scene classification method based on network model fusion
CN112397065A (en) Voice interaction method and device, computer readable storage medium and electronic equipment
CN108648760B (en) Real-time voiceprint identification system and method
CN110880328B (en) Arrival reminding method, device, terminal and storage medium
CN110767215A (en) Method and device for training voice recognition model and recognizing voice
US9830925B2 (en) Selective noise suppression during automatic speech recognition
CN112562742A (en) Voice processing method and device
CN112927688B (en) Voice interaction method and system for vehicle
CN110111782B (en) Voice interaction method and device
CN116386277A (en) Fatigue driving detection method and device, electronic equipment and medium
CN116153311A (en) Audio processing method, device, vehicle and computer readable storage medium
US20220036877A1 (en) Speech recognition device, speech recognition system, and speech recognition method
CN112951219A (en) Noise rejection method and device
CN113053402A (en) Voice processing method and device and vehicle
CN109243457B (en) Voice-based control method, device, equipment and storage medium
Loh et al. Speech recognition interactive system for vehicle
CN111477226A (en) Control method, intelligent device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant