CN109941231B - Vehicle-mounted terminal equipment, vehicle-mounted interaction system and interaction method
- Publication number
- CN109941231B (application CN201910130763.0A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- unit
- image
- signal
- input signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- User Interface Of Digital Computer (AREA)
Abstract
The embodiment of the invention discloses a vehicle-mounted terminal device, a vehicle-mounted interaction system and an interaction method. The vehicle-mounted terminal device comprises a voice input unit, an image input unit, a sound output unit, a display unit, a processing unit and a communication unit. These units cooperate to perform intelligent speech recognition, machine-vision-based image recognition and multi-modal information processing, thereby achieving accurate recognition of input information, accurate semantic understanding and personalized output, and improving the user's human-computer interaction experience.
Description
Technical Field
The invention relates to the field of the Internet of Vehicles, and in particular to an intelligent vehicle-mounted human-computer interaction system and interaction method.
Background
In modern society, vehicles increasingly permeate people's life, study and work, and the automobile, as an important class of vehicle, has gradually become part of daily life. With the rapid, explosive development of Internet of Vehicles technology, interconnected and intelligent automobiles have become possible. An important component of this technology is the vehicle-mounted interaction system, which serves as the key bridge between people and vehicles and plays an irreplaceable role in vehicle safety, comfort, performance and user experience. The rise of artificial intelligence has injected new vitality into vehicle-mounted interaction systems and is driving a new generation of intelligent designs. The degree of intelligence and the user experience of the vehicle-mounted interaction system are key to improving comfort.
In the prior art, the mainstream interaction mode realizes common functions through a combination of mechanical buttons and a touch screen. Because mechanical buttons are cumbersome to operate, both the number installed and their frequency of use have dropped significantly, and they are gradually being replaced entirely by touch screens as in-vehicle screens grow larger. Touch-screen interaction usually requires the driver to visually locate the touch position and to check the result after touching, since no other feedback is given. However, viewing the screen while the vehicle is moving distracts the driver and reduces driving safety. In addition, the touch screen is usually arranged in the front row of the automobile, so only the driver and the front-row passenger can operate it; rear-row passengers have no interaction entrance, which degrades the user experience.
Another emerging interaction mode is voice interaction: voice is the input, processed by speech recognition technology, and voice broadcast is the output. However, voice interaction places high demands on the in-vehicle environment. Noise in its various forms, such as wind noise, tire noise, engine noise and interference from vehicle-mounted speakers, strongly affects the accuracy of speech recognition; for non-factory-installed voice interaction systems in particular, a good voice interaction experience is difficult to achieve.
In addition, some vehicle-mounted interaction systems provide gesture interaction and lip-language interaction as operation entrances. However, gestures can express only very limited content and see little use, while lip-language recognition alone cannot reach a high recognition rate and usually must be combined with other means such as speech recognition, so it is rarely applied at present. Therefore, the prior art needs a vehicle-mounted interaction system and interaction method capable of accurate recognition, accurate semantic understanding and personalized output of input information.
Disclosure of Invention
The embodiments of the invention provide a vehicle-mounted terminal device, a vehicle-mounted interaction system and an interaction method that achieve accurate recognition, accurate semantic understanding and personalized output of input information, improving the user's human-computer interaction experience.
In one aspect, an embodiment of the present invention provides a vehicle-mounted terminal device, comprising: a voice input unit for acquiring a voice input signal; an image input unit for acquiring an image input signal, where the image input signal comprises one or more of a face image signal, an expression image signal, a lip image signal and a pupil image signal; a sound output unit for generating a sound output signal; a display unit for displaying interaction information; a processing unit for controlling the voice input unit, the image input unit, the sound output unit and the display unit, and for processing the voice input signal and the image input signal, where the processing unit comprises a machine learning model building unit that can build a machine learning model for one or more of the face image signal, the expression image signal, the lip image signal and the pupil image signal; and a communication unit for connecting with a cloud service device. Through this scheme, intelligent speech recognition, machine-vision-based image recognition and multi-modal information processing can be realized, achieving accurate recognition, accurate semantic understanding and personalized output of input information and improving the user's human-computer interaction experience.
In one possible design, the voice input unit is further configured to remove or reduce noise, which is one of the inventive points of the present invention.
In one possible design, the vehicle-mounted terminal device further includes a sound-emitting unit configured to emit the sound output signal.
In another aspect, an embodiment of the present invention provides a vehicle-mounted interaction system, where the system includes a cloud service device and the vehicle-mounted terminal device in the foregoing aspect.
In another aspect, an embodiment of the present invention provides an identity recognition method based on the vehicle-mounted terminal device of the foregoing aspect, the method comprising: the voice input unit and the image input unit respectively collect the voice input signal and the image input signal; the processing unit extracts facial features from the image input signal; the processing unit performs facial recognition and matching according to the facial features and determines a user identity and identity feature information associated with the user identity, wherein the identity feature information comprises voiceprint information; the processing unit extracts voiceprint features from the voice input signal; and the processing unit compares the voiceprint features with the voiceprint information and verifies the user identity through the comparison. Through this scheme, image-based facial recognition and voice-based voiceprint recognition are combined to verify the user's identity, meeting the identity recognition accuracy required in high-security scenarios.
In one possible design, the facial features in the image input signal are extracted through machine learning and/or a neural network, improving the efficiency and accuracy of image recognition, which is one of the inventive points of the present invention.
In one possible design, the voiceprint features in the voice input signal are extracted through machine learning and/or a neural network, improving the efficiency and accuracy of voiceprint recognition, which is one of the inventive points of the present invention.
In one possible design, the identity feature information includes one or more of the user's gender, age, character and hobbies, which is one of the inventive points of the present invention.
In one possible design, the identity feature information includes biometric information of the user, which is one of the inventive points of the present invention.
In one possible design, the identity feature information includes voiceprint information of the user, which is one of the inventive points of the present invention.
In another aspect, an embodiment of the present invention provides an in-vehicle positioning method based on the vehicle-mounted terminal device of the foregoing aspect, the method comprising: the image input unit acquires the image input signal; the processing unit extracts the user's lip movements from the image input signal; and the processing unit determines the user's position area in the vehicle according to the lip movements and the mapping between in-vehicle position areas and the viewing-angle range of the image input unit. Through this scheme, image-based estimation of the in-vehicle position area is realized, improving positioning accuracy.
In one possible design, the method further includes: the voice input unit collects the voice input signal; and the processing unit performs sound source localization according to the voice input signal to determine the user's position in the vehicle. Image-based facial recognition and voice-based sound source localization are thereby combined into an interactive method for locating people in the vehicle, improving positioning accuracy and efficiency, which is one of the inventive points of the present invention.
In one possible design, the processing unit performs sound source localization using TDOA, beamforming or high-resolution spectral estimation, which is one of the inventive points of the present invention.
In another aspect, an embodiment of the present invention provides a speech recognition method based on the vehicle-mounted terminal device of the foregoing aspect, the method comprising: the voice input unit and the image input unit respectively collect the voice input signal and the image input signal; the processing unit performs lip-language recognition and expression recognition according to the image input signal; the processing unit performs speech recognition according to the voice input signal; and the processing unit performs weighted synthesis of the lip-language recognition, expression recognition and speech recognition results to generate an output text. Through this scheme, composite speech recognition assisted by lip-language recognition and expression recognition is realized, improving the accuracy and personalization of speech recognition and the user experience.
In one possible design, the voice input unit performs noise reduction on the voice input signal to ensure the accuracy of the received voice input signal, which is one of the inventive points of the present invention.
In one possible design, the voice input unit performs noise reduction on the voice input signal through spectrum screening or a noise reduction filter, which is one of the inventive points of the present invention.
In one possible design, the voice input unit performs noise reduction on the voice input signal through an artificial intelligence algorithm, machine learning and/or a neural network, which is one of the inventive points of the present invention.
In one possible design, the voice input unit performs noise reduction on the voice input signal according to the positioning result of the in-vehicle positioning method, which is one of the inventive points of the present invention.
In another aspect, an embodiment of the present invention provides a feedback generation method based on the vehicle-mounted terminal device of the foregoing aspect, the method comprising: the voice input unit and the image input unit respectively collect the voice input signal and the image input signal; the processing unit determines a user identity and identity feature information associated with the user identity according to the voice input signal and the image input signal; the processing unit performs expression recognition according to the image input signal; the processing unit performs speech recognition according to the voice input signal and the image input signal; the processing unit performs semantic understanding according to the identity feature information, the expression recognition result and the speech recognition result; and the processing unit generates a feedback result according to the semantic understanding result, the identity feature information and the user's position in the vehicle. Through this scheme, multi-dimensional semantic understanding based on the user's identity feature information, expression recognition result and speech recognition result is realized, so that a personalized feedback result can be generated, remarkably improving the user experience.
In one possible design, the processing unit determines the user identity and the identity feature information using the above identity recognition method, which is one of the inventive points of the present invention.
In one possible design, the processing unit recognizes the user's expression using the above speech recognition method, which is one of the inventive points of the present invention.
In one possible design, the processing unit recognizes the user's voice input signal according to the above speech recognition method, which is one of the inventive points of the present invention.
In one possible design, the feedback result is voice feedback or display feedback, which is one of the inventive points of the present invention.
In one possible design, the display unit outputs the display feedback in the form of text and/or images, which is one of the inventive points of the present invention.
In one possible design, the sound output unit outputs the voice feedback as a simulated voice or machine sound, which is one of the inventive points of the present invention.
According to the technical scheme provided by the embodiments of the invention, accurate recognition, accurate semantic understanding and personalized output of input information are achieved through the multi-modal vehicle-mounted interaction system, improving the user's human-computer interaction experience. One of the inventive points of the present invention is a system that integrates the above input signals and output units for a number of in-vehicle applications, providing a convenient riding experience for the people in the vehicle.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below.
Fig. 1 is a schematic diagram of a vehicle-mounted interaction system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an identity recognition method according to an embodiment of the present invention;
Fig. 3 is a flowchart of an identity recognition method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an in-vehicle positioning method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a speech recognition method according to an embodiment of the present invention;
Fig. 6 is a flowchart of a speech recognition method according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a feedback generation method according to an embodiment of the present invention;
Fig. 8 is a flowchart of a feedback generation method according to an embodiment of the present invention.
Detailed Description of Embodiments
The embodiments of the present invention will be described fully and in detail below with reference to the accompanying drawings.
The solution proposed by the embodiment of the present invention is based on the vehicle-mounted interaction system 100 shown in fig. 1. The vehicle-mounted interaction system 100 may be installed in the front-row area of an automobile or in the rear-row area; the embodiment of the present invention does not specifically limit this. The vehicle-mounted interaction system 100 comprises a vehicle-mounted terminal device 200 and a cloud service device 300. Specifically, the vehicle-mounted terminal device 200 includes a voice input unit 201, an image input unit 202, a sound output unit 203, a display unit 204, a processing unit 205, and a communication unit 206.
The voice input unit 201 is used to collect a voice input signal, which may be a voice command issued by a user in the vehicle. The voice input unit 201 may be a microphone or a microphone array. A microphone can be installed in the front-row area of the automobile to better receive the driver's instructions, or in the rear-row area to receive voice signals from rear-row passengers and improve their interaction experience. A microphone array can be distributed around the entire cabin so that sound signals from all angles in the whole vehicle can be collected, improving the richness and comprehensiveness of sound collection. The voice input unit 201 may also be an audio collector or a sound pickup with higher collection accuracy.
Optionally, the voice input unit 201 has a noise reduction function. The noise reduction function may reduce or remove ambient noise as well as unwanted interference; for example, the background noise of a passenger conversation is reduced or removed when the driver issues a voice command to the vehicle-mounted terminal device 200.
Optionally, the voice input unit 201 removes the influence of noise from the signal spectrum through spectrum screening. It may also attenuate noise through a noise reduction filter, or adopt an artificial intelligence approach, training a noise reduction model through machine learning and neural networks to remove the noise components of the voice signal.
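By way of illustration, the following minimal sketch shows one way the spectrum-screening idea could be realized with classical spectral subtraction (the frame length, over-subtraction factor and spectral floor are illustrative assumptions, not limitations of the embodiment):

```python
import numpy as np

def spectral_subtraction(signal, noise_sample, frame_len=512, alpha=2.0):
    """Suppress stationary noise by subtracting an estimated noise magnitude
    spectrum from the magnitude spectrum of each signal frame."""
    noise_mag = np.abs(np.fft.rfft(noise_sample[:frame_len]))
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        spec = np.fft.rfft(signal[start:start + frame_len])
        mag, phase = np.abs(spec), np.angle(spec)
        # Over-subtract the noise estimate, but keep a small spectral floor
        # so that musical-noise artifacts stay bounded.
        clean = np.maximum(mag - alpha * noise_mag, 0.05 * mag)
        out[start:start + frame_len] = np.fft.irfft(clean * np.exp(1j * phase), n=frame_len)
    return out
```

A trained noise reduction model, as mentioned above, would replace the fixed noise estimate with a learned mapping from noisy to clean spectra.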
The image input unit 202 is used for acquiring an image input signal. The image input unit 202 may be a camera, a video camera, or other devices having an image or video capturing function, such as a color camera, a black and white camera, an infrared camera, a 3D camera, or any combination thereof. The image signal acquired by the image input unit 202 includes, but is not limited to, the following:
a face image, which includes the features and expressions of the face of a person in the vehicle;
an expression image, which includes facial movements when the user expresses various emotions;
a lip image, which includes the movement characteristics of the speaker's lips, such as opening and closing and mouth shape;
a pupil image, which includes pupil-related actions, such as pupil contraction and focus position.
Optionally, the image input unit 202 further has a light supplement function, and is configured to enhance brightness and definition of the acquired image under the condition of low ambient brightness.
The sound output unit 203 is configured to generate a sound output signal, which may be a simulated voice or a machine sound such as an alarm tone or music. The sound output unit 203 may be a speaker, and it can output the sound signal in one of two ways: as a speaker of the in-vehicle sound system, in which case the sound output signal reaches the vehicle occupants through that speaker; or as an independent sound-emitting unit separate from the in-vehicle sound system, in which case the sound output signal reaches the occupants through the independent unit.
The display unit 204 is configured to display interaction information, including but not limited to: information required during human-computer interaction, the user's query information, answer information in question-and-answer interaction, expression information and the like. The display unit may be a common terminal display device such as an LCD screen, an LED screen, an OLED screen or a touch screen, which are not enumerated here.
The processing unit 205 is configured to control the voice input unit 201, the image input unit 202, the sound output unit 203 and the display unit 204, and to process the voice input signal and the image input signal and generate the sound output signal.
It is to be appreciated that the processing unit 205 may be a processor, which may be a Central Processing Unit (CPU), a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
The communication unit 206 is configured to connect with the cloud service device 300, so as to implement information interaction between the vehicle-mounted terminal device 200 and the cloud service device 300, thereby completing local and network cooperative processing.
It should be noted that the numbers of voice input units 201 and image input units 202 in the vehicle-mounted interaction system 100 shown in fig. 1 are merely an example, and the embodiment of the present invention is not limited thereto. For example, more voice input units 201 and image input units 202 may be included according to the needs of signal acquisition; for simplicity, these are not depicted in the drawings.
Optionally, the vehicle-mounted interaction system 100 further includes sensors that are combined with the interaction system to realize algorithm fusion. The sensors may include, for example, inertial sensors and velocity sensors that enable the localization and tracking of signal sources.
In this embodiment, through the vehicle-mounted interaction system 100, intelligent voice recognition and image recognition based on machine vision and multi-modal information interaction processing can be realized, so that accurate recognition, accurate semantic understanding and personalized output of input information are realized, and human-computer interaction experience of a user is improved.
An important technology for vehicle intelligence is identity recognition, i.e., acquiring and confirming the user's identity so that corresponding functions can be enabled or use can be authorized. Vehicle start-up authorization, in-vehicle payment authorization, identity-based personalized interaction information and the like all require confirming the user's identity as a prerequisite. Fig. 2 and fig. 3 respectively show a schematic diagram and a flowchart of an identity recognition method based on the vehicle-mounted interaction system 100 according to an embodiment of the present invention; the method is described in detail below with reference to fig. 2 and fig. 3.
S21, the voice input unit 201 and the image input unit 202 capture a voice input signal and an image input signal, respectively.
Specifically, the voice input unit 201 acquires voice of a person in the vehicle, and the image input unit 202 acquires a facial image of the person in the vehicle.
S22, the processing unit 205 extracts facial features in the image input signal.
Specifically, the processing unit 205 performs image recognition on the image input signal: it locates and determines the face contour through feature extraction and edge detection, optimizes and enhances the face image within the contour, performs pattern recognition against a face model, and extracts facial feature values. The face model is built through machine learning, so its accuracy can improve continuously as the amount of training increases.
S23, the processing unit 205 performs facial recognition and matching according to the facial features, and determines the user identity and identity feature information associated with the user identity.
It is understood that the processing unit 205 searches the face information stored in the vehicle-mounted interaction system for an entry matching the extracted facial feature values and applies the mapping between face information and user identities, thereby determining the user's identity, i.e., facial recognition and matching succeed.
Optionally, the processing unit 205 constructs a training model through machine learning and a neural network, and performs recognition and matching of facial information, thereby improving efficiency and accuracy of facial recognition.
Optionally, if the face recognition and matching fails, the processing unit 205 ends the identity recognition process. The processing unit 205 may also re-perform face recognition and matching after the face recognition and matching fails.
In one possible implementation, the identity feature information includes the user's personalized information, for example gender, age, personality and hobbies. The identity feature information may also include the user's biometric information, such as iris information, fingerprint information, pupil information and voiceprint information.
S24, the processing unit 205 extracts a voiceprint feature in the speech input signal.
It is understood that the processing unit 205 preprocesses the speech signal and extracts its voiceprint features through feature extraction and pattern recognition. The pattern recognition can be realized with a voiceprint model built using techniques such as deep learning and neural networks, and the recognition accuracy improves continuously as the amount of training increases.
S25, the processing unit 205 compares the voiceprint feature with the voiceprint information, and verifies the user identity through the comparison.
Optionally, if the voiceprint feature comparison fails, the processing unit 205 ends the identity recognition process. The processing unit 205 may also perform voiceprint feature comparison again after the voiceprint feature comparison fails.
In this embodiment, image-based facial recognition and voice-based voiceprint recognition are combined to verify the user's identity, meeting the identity recognition accuracy required in high-security scenarios (such as in-vehicle payment).
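By way of illustration, the two-stage verification of steps S21-S25 can be sketched as follows (the embedding models, similarity thresholds and user-database layout are illustrative assumptions; the embodiment does not prescribe them):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_identity(face_img, voice_clip, user_db, face_model, voice_model,
                    face_thr=0.6, voice_thr=0.7):
    """Stage 1: match a facial feature vector against enrolled users (S22-S23).
    Stage 2: confirm the matched user with a voiceprint comparison (S24-S25)."""
    face_vec = face_model(face_img)            # facial feature extraction
    best_user, best_sim = None, -1.0
    for user in user_db:                       # facial recognition and matching
        sim = cosine(face_vec, user["face_vec"])
        if sim > best_sim:
            best_user, best_sim = user, sim
    if best_sim < face_thr:
        return None                            # matching failed: end or retry
    voice_vec = voice_model(voice_clip)        # voiceprint feature extraction
    if cosine(voice_vec, best_user["voice_vec"]) < voice_thr:
        return None                            # voiceprint comparison failed
    return best_user["identity"]               # user identity verified
```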
Fig. 4 is a schematic diagram illustrating an in-vehicle positioning method based on the vehicle-mounted interaction system 100 according to an embodiment of the present invention, and the method according to the embodiment is described in detail with reference to fig. 4.
S41, the image input unit 202 acquires an image input signal.
Specifically, the image input unit 202 acquires a face image of a person in the vehicle.
S42, the processing unit 205 extracts the user's lip movements from the image input signal.
The processing unit 205 recognizes the user's lip movements and facial movements through image recognition, and from them identifies the user's position in the vehicle, emotion and lip language.
Specifically, the processing unit 205 performs image recognition on the image input signal, locates and determines the lip movements through feature extraction and edge detection, performs pattern recognition against a lip-language model, and extracts lip-movement feature values. The lip-language model is built through machine learning, so its accuracy can improve continuously as training increases. The machine learning may, for example, use a neural network model; such a model not only improves lip-language recognition accuracy but can also be used to recognize facial expressions and driving motions inside the vehicle. That is, a single neural network model serves multiple facial recognition tasks, improving the utilization of the computing unit. This is one of the inventive points of the present invention.
S43, the processing unit 205 determines the position area of the user in the vehicle according to the lip movement and the mapping relationship between the position area in the vehicle and the view angle range of the image input unit 202.
Optionally, calibration is performed when the image input unit 202 is installed, establishing the mapping between in-vehicle position areas and the viewing-angle range of the image input unit 202. By identifying the user's lip movements, the viewing angle at which they were captured is determined, and the user's position area in the vehicle then follows from the mapping.
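By way of illustration, the angle-to-area lookup could take the following form (the calibration table and field of view below are hypothetical values fixed at installation; real values depend on where the image input unit is mounted):

```python
# Hypothetical calibration: horizontal viewing-angle ranges (degrees) -> seat area.
SEAT_BY_ANGLE = [
    ((-60.0, -20.0), "driver"),
    ((-20.0, 20.0), "front passenger"),
    ((20.0, 60.0), "rear row"),
]

def locate_seat(lip_center_x, image_width, horizontal_fov=120.0):
    """Map the horizontal pixel position of detected lip movement to a
    viewing angle, then to a calibrated in-vehicle position area."""
    angle = (lip_center_x / image_width - 0.5) * horizontal_fov
    for (lo, hi), seat in SEAT_BY_ANGLE:
        if lo <= angle < hi:
            return seat
    return "unknown"
```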
In one possible implementation, the positioning method further includes:
s44, the voice input unit 201 collects a voice input signal.
S45, the processing unit 205 performs sound source localization according to the voice input signal, and determines the position of the user in the vehicle.
Optionally, the voice input unit 201 is a microphone array, and sound source localization is realized with it: the processing unit 205 determines the sound source direction from the Time Difference of Arrival (TDOA) of the voice input signal at the individual microphones. The processing unit 205 may instead perform sound source localization through beamforming or high-resolution spectral estimation, which is not detailed here. Applying the TDOA algorithm to in-cabin sound source localization is non-trivial: the cabin is narrow, the occupants (i.e., the sound sources) sit close together, and other interfering noise is present, yet accurate localization is required. The inventors verified the algorithm against a large number of models and experimental data, effectively solving the difficulty of localizing sound sources in the in-vehicle environment. This is one of the inventive points of the present invention.
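By way of illustration, a two-microphone TDOA estimate using the widely used GCC-PHAT weighting might look as follows (the patent names only TDOA; the PHAT weighting and two-microphone geometry are assumptions made for this sketch):

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time difference of arrival between two microphone
    signals via the phase-transform-weighted cross-correlation."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12        # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds

def direction_of_arrival(tau, mic_distance, speed_of_sound=343.0):
    """Convert the TDOA of one microphone pair into a bearing angle."""
    arg = np.clip(speed_of_sound * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(arg))
```

With a full microphone array, the pairwise delays would be intersected to obtain an in-cabin position rather than a single bearing.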
In this embodiment, image-based facial recognition and voice-based sound source localization are combined to realize an interactive method for locating people in the vehicle, improving positioning accuracy and efficiency.
Fig. 5 and fig. 6 respectively show a schematic diagram and a flowchart of a speech recognition method based on the vehicle-mounted interaction system 100 according to an embodiment of the present invention, and the method according to the embodiment is described in detail below with reference to fig. 5 and fig. 6.
S51, the voice input unit 201 and the image input unit 202 capture a voice input signal and an image input signal, respectively.
Specifically, the voice input unit 201 acquires a voice signal of a person in the vehicle, and the image input unit 202 acquires a facial image of the person in the vehicle.
S52, the processing unit 205 performs lip language recognition and expression recognition according to the image input signal.
Through image recognition, the processing unit 205 identifies the user's lip movements and facial movements; it can identify the user's lip language from the lip movements, the user's expressions from the facial movements, and the user's current emotion from the expressions.
S53, the processing unit 205 performs speech recognition based on the speech input signal.
Optionally, noise reduction is performed on the voice input signal. Noise can be reduced through spectrum screening, a noise reduction filter and the like, or an artificial intelligence approach can be adopted, training a noise reduction model through machine learning and neural networks to remove the noise from the voice input signal.
The processing unit 205 may also remove noise originating away from the located sound source, based on the positioning result of the in-vehicle positioning method shown in fig. 4, to achieve more targeted noise reduction.
S54, the processing unit 205 performs weighted synthesis of the lip-language recognition, expression recognition and speech recognition results and generates an output text.
Optionally, the processing unit 205 assigns a weight to each of the lip-language, expression and speech recognition results, synthesizes the recognition result according to these weights, and represents and outputs it as text; the processing unit 205 may control the display unit 204 to display the text. Each of lip-language recognition, expression recognition and speech recognition serves as one basis for recognition, but no single one is used alone and they are not weighted equally; the proportion of each can be adjusted according to experimental data to optimize recognition accuracy, which is one of the inventive points of the present invention.
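By way of illustration, the weighted synthesis could be realized as follows (the weights and the candidate-confidence representation are illustrative assumptions; in practice the proportions would be tuned against experimental data as described above):

```python
def fuse_recognition(lip_hyps, expr_hyps, speech_hyps, weights=(0.2, 0.1, 0.7)):
    """Weighted synthesis of candidate transcripts from lip-language,
    expression and speech recognition (step S54)."""
    combined = {}
    for w, hyps in zip(weights, (lip_hyps, expr_hyps, speech_hyps)):
        for text, confidence in hyps.items():
            combined[text] = combined.get(text, 0.0) + w * confidence
    return max(combined, key=combined.get)   # highest weighted score wins

# Example: speech recognition dominates, lip reading resolves the ambiguity.
best = fuse_recognition(
    lip_hyps={"open the window": 0.8, "open the widow": 0.2},
    expr_hyps={"open the window": 1.0},
    speech_hyps={"open the window": 0.55, "open the windows": 0.45},
)
```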
In this embodiment, lip-language recognition and expression recognition assist speech recognition, realizing composite recognition that improves the accuracy and personalization of speech recognition and the user experience.
Fig. 7 and fig. 8 respectively show a schematic diagram and a flowchart of a feedback generation method based on the vehicle-mounted interaction system 100 according to an embodiment of the present invention, and the method according to the embodiment is described in detail below with reference to fig. 7 and fig. 8.
S71, the voice input unit 201 and the image input unit 202 respectively capture the voice input signal and the image input signal.
S72, the processing unit 205 determines a user identity and identity feature information associated with the user identity from the voice input signal and the image input signal.
Specifically, the processing unit 205 may determine the user identity and the identity feature information using the identity recognition method described with fig. 2. As described in connection with step S23, the identity feature information includes the user's personalized information, such as gender, age, character and hobbies.
S73, the processing unit 205 performs expression recognition according to the image input signal.
It is understood that the processing unit 205 may recognize the user's expression from the user's facial movements using the speech recognition method described with fig. 5, and recognize the user's current emotion through the expression.
S74, the processing unit 205 performs speech recognition based on the speech input signal and the image input signal.
Optionally, the processing unit 205 recognizes the voice input signal of the user according to the voice recognition method described in fig. 5.
S75, the processing unit 205 performs semantic understanding according to the identity feature information, the expression recognition result, and the voice recognition result.
It can be understood that the processing unit 205 can more accurately understand the meaning and intention of the user according to the identity information, emotion and voice recognition result of the user.
S76, the processing unit 205 generates a feedback result according to the semantic understanding result, the identity feature information and the position of the user in the vehicle.
Optionally, the processing unit 205 obtains the position of the user in the vehicle according to the in-vehicle positioning method described in fig. 4.
The feedback result may be voice feedback, display feedback, or a combination of the two. The feedback result also includes the orientation toward which the voice feedback and display feedback should be directed; for example, the output of the feedback result is oriented toward the user's position in the vehicle. The display unit 204 outputs the display feedback in the form of text and/or images, and the sound output unit 203 outputs the voice feedback as a simulated voice or machine sound.
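By way of illustration, step S76 might combine the three inputs into a position-oriented, personalized feedback result as follows (all field names and personalization rules here are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    text: str         # content for display and/or speech output
    mode: str         # "voice", "display", or "both"
    target_seat: str  # in-vehicle position toward which output is oriented

def generate_feedback(intent, identity, emotion, seat):
    """Combine the semantic understanding result, identity feature
    information and the user's in-vehicle position (step S76)."""
    text = f"{identity['name']}, {intent['answer']}"
    if emotion == "stressed":            # soften wording for a stressed user
        text = "No problem. " + text
    # Prefer pure voice output for the driver to avoid glances at the screen.
    mode = "voice" if seat == "driver" else "both"
    return Feedback(text=text, mode=mode, target_seat=seat)

fb = generate_feedback(
    intent={"answer": "the window is now open."},
    identity={"name": "Alex"},
    emotion="stressed",
    seat="driver",
)
```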
In this embodiment, multi-dimensional semantic understanding is realized from the user's identity feature information, expression recognition result and speech recognition result, so that a personalized feedback result can be generated, remarkably improving the user experience.
The above description mainly introduces the scheme provided by the embodiments of the present invention from the perspective of the interaction between the parts in each step. It is to be understood that, to implement the above functions, each part includes a corresponding hardware structure and/or software module. Those skilled in the art will readily appreciate that the various example units and algorithm steps described in connection with the embodiments disclosed herein can be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processing unit, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processing unit such that the processing unit can read information from, and write information to, the storage medium; alternatively, the storage medium may be integral to the processing unit. The processing unit and the storage medium may reside in an ASIC, and the ASIC may reside in a terminal device.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media that facilitate the transfer of a computer program from one place to another.
The objects, technical solutions and advantages of the present invention have been described above in further detail. It should be understood that the above are merely exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention shall be included in the scope of the present invention.
Claims (4)
1. An in-vehicle positioning method based on a vehicle-mounted terminal device, wherein the vehicle-mounted terminal device comprises:
the voice input unit is used for acquiring a voice input signal;
the image input unit is used for acquiring an image input signal; the image input signal comprises one or more of a face image signal, an expression image signal, a lip image signal and a pupil image signal;
a sound output unit for generating a sound output signal;
the display unit is used for displaying the interactive information;
a processing unit for controlling the voice input unit, the image input unit, the sound output unit, and the display unit, and for processing the voice input signal and the image input signal; wherein the processing unit comprises a machine learning model building unit; the machine learning model building unit can build a machine learning model for one or more of a face image signal, an expression image signal, a lip image signal and a pupil image signal;
the communication unit is used for being connected with the cloud service equipment;
characterized in that the method comprises:
the image input unit acquires the image input signal;
the processing unit extracts lip movements of a user in the image input signal;
and the processing unit determines the position area of the user in the vehicle according to the lip action and the mapping relation between the position area in the vehicle and the visual angle range of the image input unit.
2. The method of claim 1, wherein the speech input unit is further used to remove or reduce noise.
3. The method according to any one of claims 1-2, characterized in that the vehicle-mounted terminal device further comprises a sound emitting unit for emitting the sound output signal.
4. The method of claim 1, wherein the method further comprises:
the voice input unit collects the voice input signal;
and the processing unit carries out sound source positioning according to the voice input signal and determines the position of the user in the vehicle.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910130763.0A | 2019-02-21 | 2019-02-21 | Vehicle-mounted terminal equipment, vehicle-mounted interaction system and interaction method
Publications (2)
Publication Number | Publication Date
---|---
CN109941231A (en) | 2019-06-28
CN109941231B (en) | 2021-02-02
Family
ID=67007623
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910130763.0A | Vehicle-mounted terminal equipment, vehicle-mounted interaction system and interaction method | 2019-02-21 | 2019-02-21
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109941231B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110335600A (en) * | 2019-07-09 | 2019-10-15 | 四川长虹电器股份有限公司 | The multi-modal exchange method and system of household appliance |
CN110641476A (en) * | 2019-08-16 | 2020-01-03 | 广汽蔚来新能源汽车科技有限公司 | Interaction method and device based on vehicle-mounted robot, controller and storage medium |
CN110444212A (en) * | 2019-09-10 | 2019-11-12 | 安徽大德中电智能科技有限公司 | A kind of smart home robot voice identification device and recognition methods |
CN110827823A (en) * | 2019-11-13 | 2020-02-21 | 联想(北京)有限公司 | Voice auxiliary recognition method and device, storage medium and electronic equipment |
CN112655000B (en) * | 2020-04-30 | 2022-10-25 | 华为技术有限公司 | In-vehicle user positioning method, vehicle-mounted interaction method, vehicle-mounted device and vehicle |
CN111696548A (en) * | 2020-05-13 | 2020-09-22 | 深圳追一科技有限公司 | Method and device for displaying driving prompt information, electronic equipment and storage medium |
CN115604660A (en) * | 2021-06-25 | 2023-01-13 | 比亚迪股份有限公司(Cn) | Human-vehicle interaction method and system, storage medium, electronic device and vehicle |
CN114255753A (en) * | 2021-12-21 | 2022-03-29 | 北京地平线机器人技术研发有限公司 | Voice interaction instruction processing method and device and computer readable storage medium |
CN117174092B (en) * | 2023-11-02 | 2024-01-26 | 北京语言大学 | Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis |
CN117370961B (en) * | 2023-12-05 | 2024-03-15 | 江西五十铃汽车有限公司 | Vehicle voice interaction method and system |
CN117672180A (en) * | 2023-12-08 | 2024-03-08 | 广州凯迪云信息科技有限公司 | Voice communication control method and system for digital robot |
CN118197315A (en) * | 2024-05-16 | 2024-06-14 | 合众新能源汽车股份有限公司 | Cabin voice interaction method, system and computer readable medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104691444A (en) * | 2013-12-09 | 2015-06-10 | 奇点新源国际技术开发(北京)有限公司 | Vehicle-mounted terminal based on electric car and vehicle-mounted terminal system |
CN106681483A (en) * | 2015-11-05 | 2017-05-17 | 芋头科技(杭州)有限公司 | Interaction method and interaction system for intelligent equipment |
CN107665295A (en) * | 2016-07-29 | 2018-02-06 | 长城汽车股份有限公司 | Identity identifying method, system and the vehicle of vehicle |
CN107933501A (en) * | 2016-10-12 | 2018-04-20 | 德尔福电子(苏州)有限公司 | A kind of automobile initiating means identified based on recognition of face and vocal print cloud |
DE102017200909A1 (en) * | 2017-01-20 | 2018-07-26 | Bayerische Motoren Werke Aktiengesellschaft | System for monitoring and controlling vehicle functions |
- 2019-02-21: Application CN201910130763.0A filed in China; patent CN109941231B granted (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109941231A (en) | 2019-06-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | |
Effective date of registration: 2022-03-14
Address after: 215100 Floor 23, Tiancheng Times Business Plaza, No. 58 Qinglonggang Road, High-speed Rail New Town, Xiangcheng District, Suzhou, Jiangsu Province
Patentee after: MOMENTA (SUZHOU) TECHNOLOGY Co.,Ltd.
Address before: Room 601-a32, Tiancheng Information Building, No. 88 South Tiancheng Road, High-speed Rail New Town, Xiangcheng District, Suzhou City, Jiangsu Province
Patentee before: MOMENTA (SUZHOU) TECHNOLOGY Co.,Ltd.