CN116080672A - Man-machine interaction method, related device, system and storage medium - Google Patents


Info

Publication number: CN116080672A
Authority: CN (China)
Prior art keywords: gaze, user, executed, image, condition
Legal status: Pending
Application number: CN202211099197.XA
Other languages: Chinese (zh)
Inventors: 宁传光, 陈云飞, 蒋正中
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd; priority to CN202211099197.XA; publication of CN116080672A

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08 Interaction between the driver and the control system

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a human-machine interaction method, a related device, a system and a storage medium. The human-machine interaction method includes: in response to the gaze interaction function being in an on state, capturing image data containing the user's face while the user's voice data is collected; recognizing the voice data to obtain a user instruction, and detecting, based on a vehicle cabin model and the image data, the user's gaze fixation condition in the vehicle, where the gaze fixation condition includes whether a gaze position is detected and, when gaze positions are detected, the duration of each gaze position; and, in response to the gaze fixation condition including a detected gaze position, determining that the user instruction needs to be executed, determining a first executed object of the user instruction based on the duration of each gaze position, and executing the user instruction on the first executed object. This scheme improves the freedom of switching between human-machine interaction and human-human interaction, and improves the accuracy and timeliness of human-machine interaction.

Description

Man-machine interaction method, related device, system and storage medium
Technical Field
The present disclosure relates to the field of intelligent interaction technologies, and in particular, to a human-computer interaction method, and related devices, systems, and storage media.
Background
As products and technologies in the intelligent automobile field continue to advance, users' expectations for intelligent automobiles keep rising. In particular, with the continuing spread of voice interaction applications, users' demands on in-vehicle voice interaction are also increasing.
However, users want more freedom in how they use voice interaction, and in particular more freedom in switching between human-machine interaction and human-human conversation. Current mainstream systems rely on wake words to initiate human-machine interaction. On the one hand, they only allow wake-free interaction for a short period after the initial wake-up and cannot sustain it for longer, which greatly limits the freedom of switching between human-machine and human-human interaction. On the other hand, because users speak casually, a voice instruction may be incomplete (for example, "open the window" does not specify which window), which affects the accuracy and timeliness of voice interaction. How to improve the freedom of switching between human-machine and human-human interaction, and the accuracy and timeliness of human-machine interaction, is therefore a problem to be solved.
Disclosure of Invention
The main technical problem addressed by this application is to provide a human-machine interaction method and related device, system and storage medium that improve the freedom of switching between human-machine interaction and human-human interaction, and improve the accuracy and timeliness of human-machine interaction.
In order to solve the above technical problem, a first aspect of the present application provides a human-machine interaction method, including: in response to the gaze interaction function being in an on state, capturing image data containing the user's face while the user's voice data is collected; recognizing the voice data to obtain a user instruction, and detecting, based on a vehicle cabin model and the image data, the user's gaze fixation condition in the vehicle, where the gaze fixation condition includes whether a gaze position is detected and, when gaze positions are detected, the duration of each gaze position; and, in response to the gaze fixation condition including a detected gaze position, determining that the user instruction needs to be executed, determining a first executed object of the user instruction based on the duration of each gaze position, and executing the user instruction on the first executed object.
In order to solve the above technical problem, a second aspect of the present application provides a human-machine interaction device, including a data acquisition module, a voice recognition module, a gaze detection module and an instruction execution module. The data acquisition module is configured to, in response to the gaze interaction function being in an on state, capture image data containing the user's face while the user's voice data is collected; the voice recognition module is configured to recognize the voice data to obtain a user instruction; the gaze detection module is configured to detect, based on the vehicle cabin model and the image data, the user's gaze fixation condition in the vehicle, where the gaze fixation condition includes whether a gaze position is detected and, when gaze positions are detected, the duration of each gaze position; and the instruction execution module is configured to, in response to the gaze fixation condition including a detected gaze position, determine that the user instruction needs to be executed, determine a first executed object of the user instruction based on the duration of each gaze position, and execute the user instruction on the first executed object.
In order to solve the above technical problem, a third aspect of the present application provides a human-machine interaction system, which includes a microphone, a camera and a vehicle head unit, the microphone and the camera each being coupled to the head unit; the microphone is used to collect voice data, the camera is used to capture image data, and the head unit is used to execute the human-machine interaction method of the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being used to implement the human-machine interaction method of the first aspect.
According to the above scheme, when the gaze interaction function is in the on state, image data containing the user's face is captured while the user's voice data is collected. A user instruction is recognized from the voice data, and the user's gaze fixation condition in the vehicle is detected based on the vehicle cabin model and the image data; the gaze fixation condition includes whether a gaze position is detected and, when gaze positions are detected, the duration of each. In response to the gaze fixation condition including a detected gaze position, it is determined that the user instruction needs to be executed, the first executed object of the user instruction is determined based on the duration of each gaze position, and the user instruction is executed on that object. Human-machine interaction is thus triggered by combining the user's gaze with the user's speech, without relying on a wake word, which improves the freedom of switching between human-machine and human-human interaction. Determining the first executed object from the duration of each gaze position also reduces interference with the machine's response when several gaze positions are detected, improving the accuracy and timeliness of human-machine interaction.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of a human-computer interaction method of the present application;
FIG. 2 is a process diagram of one embodiment of identifying user instructions;
FIG. 3 is a process diagram of another embodiment of identifying user instructions;
FIG. 4 is a process schematic of one embodiment of a gaze interaction preparation phase;
FIG. 5 is a schematic diagram of a human-machine interaction device according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a framework of one embodiment of a human-machine interaction system of the present application;
FIG. 7 is a schematic diagram of a framework of one embodiment of a computer readable storage medium of the present application.
Detailed Description
The following describes the embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "/" herein generally indicates that the associated object is an "or" relationship. Further, "a plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a flow chart of an embodiment of a human-machine interaction method of the present application. It should be noted that, in the embodiments of the present disclosure, the human-machine interaction method may be executed by an in-vehicle head unit, which may run an embedded operating system such as Android or WinCE; this is not limited here. Specifically, embodiments of the present disclosure may include the following steps:
step S11: and in response to the staring interaction function being in an on state, shooting image data containing the face of the user when voice data of the user are acquired.
In one implementation scenario, the vehicle may include a touch screen through which the user can turn the gaze interaction function on or off. Specifically, when the gaze interaction function is on, human-machine interaction can be achieved through the steps in the embodiments of the present disclosure; when it is off, human-machine interaction can instead rely on a wake word, the details of which are not repeated here.
In one implementation scenario, when the gaze interaction function is on, the microphone and the camera may remain active at all times: the microphone collects audio and the camera captures images of the user's head. Meanwhile, the head unit can perform voice activity detection on the collected audio to detect the start and end times of the user's speech, intercept the audio between the start and end times as the user's voice data, and intercept the images between the start and end times as the image data.
In another implementation scenario, unlike the foregoing manner, when the gaze interaction function is on, only the microphone may remain active at all times. While audio is being collected, the head unit keeps performing voice activity detection on it to detect the start time of the user's speech; in response to detecting the start time, it turns on the camera to capture images of the user's head, and when it detects the end time of the user's speech, it turns off the camera. The audio from the start time to the end time is then used as the user's voice data, and the images from the start time to the end time as the image data.
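As a minimal sketch of this second capture flow (the mic, camera and vad objects and their methods are hypothetical interfaces, not part of the patent), the camera is only running while voice activity is detected:

```python
def capture_utterance(mic, camera, vad, frame_ms=30, silence_frames=20):
    """Gate audio/image capture on voice activity detection (VAD)."""
    audio, images = [], []
    speaking = False
    silent = 0
    while True:
        frame = mic.read_frame(frame_ms)          # one short audio frame
        if vad.is_speech(frame):
            if not speaking:                      # speech start detected
                speaking = True
                camera.start()                    # only film while the user speaks
            silent = 0
            audio.append(frame)
            images.append(camera.grab())
        elif speaking:
            silent += 1
            audio.append(frame)
            images.append(camera.grab())
            if silent >= silence_frames:          # speech end detected
                camera.stop()
                return b"".join(audio), images
```

The returned audio and image lists correspond to the voice data and image data used in the following steps.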
It should be noted that when the user turns on the gaze interaction function, the user may be prompted that the function requires hardware support such as a microphone and a camera, and that while the function is on, the user's voice, images and other information will be collected; confirming that the function should be turned on is treated as acknowledging and agreeing to this collection. The gaze interaction function is kept on only if the user confirms agreement; otherwise, it is switched to the off state.
Step S12: recognize the voice data to obtain a user instruction, and detect, based on the vehicle cabin model and the image data, the user's gaze fixation condition in the vehicle.
In one implementation scenario, the voice data alone may be recognized to obtain the user instruction. To improve recognition efficiency, a speech recognition model may be pre-trained. The model may include, but is not limited to, a recurrent neural network, a time-delay neural network, and so on; model frameworks such as encoder-decoder or CTC (Connectionist Temporal Classification) may of course also be used, and the network structure of the speech recognition model is not limited here. Sample speech can be collected in advance and annotated with the corresponding sample text; the sample speech is input into the speech recognition model to obtain the recognized text, and the network parameters of the model are adjusted based on the difference between the sample text and the recognized text. After the speech recognition model converges, the voice data can be input into it to recognize the user instruction.
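Purely as an illustration of the training procedure described above, the sketch below shows one CTC training step for a small recurrent model; the architecture, feature dimensions and vocabulary size are assumptions and not fixed by the patent:

```python
import torch
import torch.nn as nn

class TinyASR(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, vocab=5000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab + 1)   # +1 for the CTC blank label

    def forward(self, feats):                     # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h).log_softmax(-1)        # (B, T, vocab + 1)

model = TinyASR()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats, feat_lens, targets, target_lens):
    """One gradient step on a batch of (acoustic features, sample text labels)."""
    log_probs = model(feats).transpose(0, 1)      # CTCLoss expects (T, B, C)
    loss = ctc(log_probs, targets, feat_lens, target_lens)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```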
In another implementation scenario, referring to fig. 2, fig. 2 is a process diagram of an embodiment of recognizing a user instruction. As shown in fig. 2, to further improve recognition accuracy, the user instruction may be obtained from both the voice data and lip images extracted from the image data, so that multi-modal data are recognized jointly, which helps improve recognition accuracy. Specifically, first feature extraction is performed on the current lip image to obtain a first image feature, where the first image feature includes first sub-features of several channels. The first sub-features located in the target channels of the first image feature extracted from the current lip image are replaced with the first sub-features located in the target channels of the first image feature extracted from a reference lip image, giving a second image feature; the reference lip image precedes the current lip image. On this basis, second feature extraction is performed on the second image feature to obtain the image feature of the current lip image, and the user instruction is obtained from the image features of the lip images together with the speech features extracted from the voice data. In this way, replacing the target-channel sub-features of the current lip image with those of the reference lip image makes full use of the temporal context of the previous frame to assist feature extraction for the current frame and helps build associations between video frames; recognizing with both image features and speech features then helps improve recognition accuracy, especially in complex scenes such as noisy environments.
In a specific implementation scenario, referring to fig. 3, fig. 3 is a process diagram of another embodiment of recognizing a user instruction. As shown in fig. 3, the image data may include several video frames. Face detection can be performed on each video frame to locate the face region, and feature point detection can then be performed on the face region to detect feature points (e.g., the 68 facial key points), so that a lip image can be extracted from the video frame based on the feature points related to the lips.
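A minimal sketch of cropping the lip image from 68-point landmarks is shown below. Using indices 48 to 67 for the mouth follows the common 68-point convention; the patent only says "feature points related to lips", so treat that mapping and the margin value as assumptions:

```python
import numpy as np

def crop_lip_image(frame, landmarks, margin=0.25):
    """frame: HxWx3 image array; landmarks: (68, 2) array of (x, y) points."""
    mouth = np.asarray(landmarks)[48:68]          # outer + inner lip points
    x0, y0 = mouth.min(axis=0)
    x1, y1 = mouth.max(axis=0)
    mx = (x1 - x0) * margin                       # pad the tight box a little
    my = (y1 - y0) * margin
    h, w = frame.shape[:2]
    x0, x1 = int(max(0, x0 - mx)), int(min(w, x1 + mx))
    y0, y1 = int(max(0, y0 - my)), int(min(h, y1 + my))
    return frame[y0:y1, x0:x1]
```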
In a specific implementation scenario, the target channels may be the first 1/8 of the channels; for example, if the first image feature includes first sub-features of 16 channels, the first 2 channels may be selected as the target channels. Of course, in practice this is not a limitation: the first 1/4 or the first 1/2 of the channels could also be selected, which is not limited here.
In a specific implementation scenario, still referring to fig. 3, the first and second feature extraction may be performed with a MobileNet backbone; the backbone is not limited to MobileNet and may also be, for example, a ResNet, which is not limited here. The backbone network may include several sequentially connected network blocks for feature extraction, and may further be combined with a temporal convolution (Time Conv) and a gated recurrent unit (GRU). Take the case where the backbone contains two network blocks and N lip frames are extracted from the image data, and denote the first image feature of the i-th lip frame by f1_i and its second image feature by f2_i. The first network block performs first feature extraction on lip frames 1 to N, giving first image features f1_1, ..., f1_N. For the i-th lip frame, the first sub-features of the first 1/8 channels of f1_i are replaced with the first sub-features of the first 1/8 channels of f1_{i-1}, the first image feature of frame i-1, giving the second image feature f2_i of the i-th lip frame. The same applies when the backbone contains a different number of network blocks or a different number of lip frames, which is not enumerated here.
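The channel replacement itself is a simple tensor operation; the sketch below shows it for a stack of per-frame first image features, assuming a channels-first layout (the layout and the 1/8 fraction as a parameter are illustrative choices):

```python
import torch

def replace_target_channels(first_feats, fraction=8):
    """first_feats: (N, C, H, W) first image features of N lip frames."""
    second_feats = first_feats.clone()
    c = first_feats.shape[1] // fraction          # e.g. 16 channels -> first 2
    # frame 0 has no reference frame and is left unchanged
    second_feats[1:, :c] = first_feats[:-1, :c]
    return second_feats
```

Each frame thus carries a slice of the previous frame's features forward, which is how the temporal context described above enters the second feature extraction.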
In a specific implementation scenario, still referring to fig. 3, Fb40 (40-dimensional filterbank) features of the voice data may be extracted as the audio features. Of course, in practice other acoustic features, such as MFCC (Mel-Frequency Cepstral Coefficients), may also be extracted as the audio features, which is not limited here.
In a specific implementation scenario, still referring to fig. 3, to combine the audio and image features, they may be fed into a convolutional neural network (CNN), a deep neural network (DNN), a long short-term memory network with a projection layer (LSTMP), and the like, so as to map them into the same feature space and fuse them; the fused features are then processed by the speech recognition network to obtain the user instruction. For the specific structure of the speech recognition network, refer to the foregoing description, which is not repeated here.
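As an illustration of this fusion step only: the sketch below projects both modalities into a shared space, concatenates them, and passes the result through a recurrent layer before the recognition network. The linear projections, layer sizes and simple concatenation are assumptions standing in for the CNN/DNN/LSTMP mentioned above:

```python
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    def __init__(self, img_dim=512, aud_dim=40, shared=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared)   # map image features to shared space
        self.aud_proj = nn.Linear(aud_dim, shared)   # map audio features to shared space
        self.fuse = nn.GRU(2 * shared, shared, batch_first=True)

    def forward(self, img_feats, aud_feats):         # (B, T, img_dim), (B, T, aud_dim)
        x = torch.cat([self.img_proj(img_feats),
                       self.aud_proj(aud_feats)], dim=-1)
        fused, _ = self.fuse(x)                      # handed to the recognition network
        return fused
```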
In the disclosed embodiments, the gaze fixation condition may include whether a gaze position is detected and, when gaze positions are detected, the duration of each gaze position. The duration characterizes how long the user's gaze stays continuously at that gaze position. Specifically, the image data may be analyzed to obtain the spatial position of the pupil and the pose information of the head, and an eye image may be extracted from the image data. On this basis, the pose information and the eye image are processed by a gaze detection model to obtain the gaze direction, and the gaze fixation condition is then obtained from the spatial position, the gaze direction and the vehicle cabin model: the gaze position is the landing point on the cabin model of a ray cast from the spatial position along the gaze direction. Determining the gaze fixation condition by combining the spatial position of the pupil, the user's gaze direction and the cabin model helps improve the accuracy of gaze interaction.
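A minimal sketch of locating the landing point follows. Modelling cabin components as labeled axis-aligned boxes in the cabin coordinate system is an assumption made for illustration; a real cabin model could equally use meshes:

```python
import numpy as np

def ray_box_hit(origin, direction, box_min, box_max):
    """Return the ray parameter t of the nearest hit with a box, or None."""
    direction = np.where(direction == 0, 1e-9, direction)
    t1 = (box_min - origin) / direction
    t2 = (box_max - origin) / direction
    t_near = np.maximum(np.minimum(t1, t2), 0).max()
    t_far = np.minimum(np.maximum(t1, t2).min(), np.inf)
    return t_near if t_near <= t_far else None

def gaze_component(pupil_pos, gaze_dir, cabin_model):
    """cabin_model: {component name: (box_min, box_max)} in cabin coordinates."""
    hits = []
    for name, (bmin, bmax) in cabin_model.items():
        t = ray_box_hit(np.asarray(pupil_pos, float), np.asarray(gaze_dir, float),
                        np.asarray(bmin, float), np.asarray(bmax, float))
        if t is not None:
            hits.append((t, name))
    return min(hits)[1] if hits else None         # nearest component hit, if any
```

The component returned per video frame feeds the per-frame aggregation described further below.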
In one implementation scenario, referring to fig. 4, fig. 4 is a process diagram of an embodiment of the gaze interaction preparation phase. As shown in fig. 4, to further improve the accuracy of gaze detection, the camera may be calibrated in advance to obtain its intrinsic parameters. Further, a calibration board with an integrated laser rangefinder may be used to build a three-dimensional model of the cabin and to relate the camera coordinate system to the cabin coordinate system. Finally, to handle deviations caused by installation errors and by user adjustments after the vehicle leaves the factory, a fixed reference object in the cabin (such as the instrument panel) can be used as the reference point of the cabin coordinate system, and the angle of the camera relative to the cabin coordinate system can be computed back using the parallax principle. With this processing, the gaze perception areas can be customized to the real vehicle layout to meet different needs; gaze perception can cover the interior rearview mirror, the left/right side mirrors, the center control screen, the instrument panel, the road surface, the HUD, the left/right windows, and so on.
In one implementation scenario, as described above, the camera coordinate system may be calibrated in advance. On this basis, the face can be three-dimensionally modeled from the image data in the camera coordinate system, so that the spatial position of the pupil can be located. For the specific three-dimensional modeling process, refer to technical details of approaches such as SFM (Structure From Motion), which are not described here.
In one implementation scenario, as described above, feature point detection (e.g., the 68 facial key points) may be performed on the image data, so that an eye image can be extracted based on the feature points related to the eyes; for details, refer to the lip image extraction process described above, which is not repeated here.
In one implementation scenario, the pose information of the head may include, but is not limited to, the head's yaw angle, roll angle and pitch angle, which are not limited here.
In one implementation scenario, the gaze detection model may include, but is not limited to, a convolutional neural network and the like; its network structure is not limited here. On this basis, the eye image can be analyzed with the gaze detection model, fusing the head pose information, to detect the gaze direction.
In one implementation scenario, as described above, the image data may include several video frames. After the spatial position of the pupil and the user's gaze direction have been detected for each video frame, a ray is extended from the pupil's spatial position along the gaze direction and its intersection point with the cabin model is determined. If the intersection points of N consecutive video frames all fall on the same vehicle component, the position of that component is taken as a gaze position of the user during those N frames, and the duration of the N consecutive frames is taken as the duration of that gaze position.
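A minimal sketch of this aggregation, assuming one landing component (or None) per video frame: a run of consecutive frames on the same component becomes one gaze position whose duration is the length of the run. The fps and min_frames values are illustrative parameters, not values fixed by the text:

```python
def gaze_fixations(frame_components, fps=30, min_frames=5):
    """frame_components: per-frame component name or None (no landing point)."""
    fixations = []                                # list of (component, seconds)
    run_comp, run_len = None, 0
    for comp in frame_components + [None]:        # sentinel flushes the last run
        if comp is not None and comp == run_comp:
            run_len += 1
            continue
        if run_comp is not None and run_len >= min_frames:
            fixations.append((run_comp, run_len / fps))
        run_comp, run_len = comp, 1
    return fixations
```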
Step S13: in response to the gaze fixation condition including a detected gaze position, determine that the user instruction needs to be executed, determine the first executed object of the user instruction based on the duration of each gaze position, and execute the user instruction on the first executed object.
In one implementation scenario, when the gaze fixation condition includes a detected gaze position, a gaze position in the gaze fixation condition may be selected as the first target position, and in response to the duration of the first target position meeting the first condition, the vehicle component at the first target position is determined to be the first executed object. Specifically, the first condition may be that the duration is not less than a first duration; for example, the first duration may be set to 0.5 seconds, 1 second, 2 seconds, etc., which is not limited here. In this way, selecting a gaze position in the gaze fixation condition as the first target position, and determining the vehicle component at the first target position as the first executed object only when the duration of the first target position meets the first condition, improves the accuracy of determining the executed object.
In a specific implementation scenario, when the gaze fixation condition includes several gaze positions, the last detected gaze position in the gaze fixation condition may be selected as the first target position. Selecting the last detected gaze position when there are several gaze positions reduces interference from earlier glances as much as possible and improves the accuracy of gaze interaction. When the gaze fixation condition includes only one gaze position, that gaze position can be selected directly as the first target position.
In a specific implementation scenario, in response to the duration of the first target position not meeting the first condition, it may be determined that the user instruction does not need to be executed; in that case no response is made. Determining that the user instruction does not need to be executed when the duration of the first target position does not meet the first condition eliminates the interference of brief glances on gaze interaction and further improves its accuracy.
In one implementation scenario, in practice the gaze fixation condition may also indicate that no gaze position was detected; in response to this, it may be determined that the user instruction does not need to be executed, and no response is made. Determining that the user instruction does not need to be executed when no gaze position is detected eliminates the interference of voice data produced while the gaze is not resting anywhere, and further improves the accuracy of human-machine interaction.
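Putting the rules of step S13 together as a sketch: no detected gaze position means no execution; otherwise the last detected gaze position is the first target, and the instruction runs only if its duration meets the first condition. The 1-second threshold is one of the example values given above, and the execute callback is a hypothetical interface:

```python
FIRST_DURATION_S = 1.0                            # example value of the first duration

def handle_user_instruction(user_instruction, fixations, execute):
    """fixations: list of (component, duration_s) in detection order."""
    if not fixations:
        return None                               # no gaze position: do nothing
    component, duration = fixations[-1]           # last detected gaze position
    if duration < FIRST_DURATION_S:
        return None                               # too brief: do nothing
    return execute(user_instruction, component)   # e.g. ("open the window", "passenger window")
```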
In one implementation scenario, suppose the driver says "open the window" while gazing at the front passenger window. The voice data "open the window" is collected and image data containing the driver's face is captured. The voice data is recognized to obtain the user instruction (i.e., open the window), and the gaze fixation condition of the user in the vehicle is detected from the cabin model and the image data: a gaze position is detected, located at the front passenger window, with a duration of 1 second. In response to the gaze fixation condition including a detected gaze position, and since there is only one gaze position, that gaze position is taken as the first target position; because its duration meets the first condition, the vehicle component at the first target position (i.e., the front passenger window) is taken as the first executed object and the user instruction is executed on it, so the front passenger window is opened. Other cases can be deduced similarly and are not enumerated here. In practical tests, the accuracy of human-machine interaction can reach more than 90%, the response time is reduced to within 600 milliseconds, and the gaze position error is within 3 degrees.
In one implementation scenario, when no voice data is collected and a user-triggered control key is detected, it may be checked whether the control key has an associated adjustment object. In response to the control key having no associated adjustment object and the gaze fixation condition including a detected gaze position, a gaze position whose duration meets the second condition may be selected as the second target position, the vehicle component at the second target position is determined to be the second executed object, and the control instruction defined by the control key is executed on the second executed object. In this way, when no voice data is collected and a user-triggered control key is detected, checking whether the key is associated with an adjustment object, and otherwise selecting a gaze position whose duration meets the second condition as the second target position on which the instruction is subsequently executed, allows gaze to take priority when both gaze and physical control are present. This supports both the multi-modal scene with voice and images and the scene where no voice is collected, widening the scope of human-machine interaction.
In a specific implementation scenario, the control key may be either a physical key of the vehicle or a virtual key of the head unit. For example, a physical key defined as "cool down" may be integrated in the vehicle, or a virtual key defined as "navigation" may be provided on the head unit's touch screen; no further examples are given here.
In a specific implementation scenario, the adjustment object is a vehicle component such as the front passenger window, the driver window or the air conditioner. A control key may be associated with an adjustment object in advance; for example, the physical key defined as "cool down" mentioned above may be associated in advance with the adjustment object "air conditioner". A control key may also have no adjustment object associated in advance; for example, a virtual key defined as "open window" may be provided on the head unit's touch screen without being associated with a specific adjustment object such as the driver window, the front passenger window or a rear window. Other cases are similar and not enumerated here.
In a specific implementation scenario, the second condition may be that the duration is not less than a second duration, which may be set to 0.5 seconds, 1 second, 2 seconds, etc., without limitation. That is, if the duration of a gaze position is less than the second duration, the gaze position is considered invalid and is not taken as the second target position.
In a specific implementation scenario, if the gaze fixation condition includes several gaze positions whose durations meet the second condition, the last detected of them may be taken as the second target position, and the vehicle component at the second target position is determined to be the second executed object. For example, if the user gazes at the left mirror for the second duration and then switches to gazing at the right mirror for the second duration, the right mirror is the second executed object; other cases are similar and not enumerated here. Selecting the last detected gaze position when several durations meet the second condition removes ambiguity as far as possible and improves the accuracy of the human-machine interaction response.
In a specific implementation scenario, unlike the foregoing manner, when it is detected that the control key is already associated with an adjustment object, that adjustment object may be directly determined to be the third executed object, and the control instruction defined by the control key is executed on it. Directly determining the associated adjustment object as the third executed object and executing the control instruction on it allows the human-machine interaction response to be made directly when the key already has an associated adjustment object, improving the response speed.
In a specific implementation scenario, for convenience of later use, when the control key is not associated with an adjustment object and the gaze fixation condition includes a detected gaze position, the gaze position whose duration meets the second condition may, as above, be selected as the second target position and the vehicle component there determined to be the second executed object; at the same time, the second executed object may be associated with the control key as its adjustment object, so that the next time the user triggers the key, the associated adjustment object can be used directly as the third executed object and the control instruction defined by the key executed on it. If the user wants to remove the association, the adjustment object associated with the control key can be disassociated through the head unit's touch screen; after the association is removed, a new adjustment object can be associated again through the steps above, which are not repeated here.
In a specific implementation scenario, in response to the control key not being associated with an adjustment object and the gaze fixation condition including no detected gaze position, the key press may be treated as an invalid trigger and the control instruction defined by the control key is not executed. Not executing the control instruction in this case effectively filters out invalid triggers and improves the accuracy of the human-machine interaction response.
In a specific implementation scenario, in response to the control key not being associated with an adjustment object and the gaze fixation condition including detected gaze positions, if none of the durations of those gaze positions meets the second condition, the key press may likewise be treated as invalid and the control instruction defined by the control key is not executed. This also filters out invalid triggers and improves the accuracy of the human-machine interaction response.
In a specific implementation scenario, take a control key defined with the control instruction "adjust mirror inward" as an example. When it is detected that the user has pressed the key, it is checked whether the key already has an associated adjustment object. If it is already associated with, say, the right mirror, the associated adjustment object "right mirror" is directly determined to be the third executed object and the control instruction "adjust mirror inward" is executed on it; the same applies if the key is associated with another adjustment object (e.g., the left mirror). Conversely, if the key has no associated adjustment object, then in response to the gaze fixation condition including a detected gaze position, the gaze position whose duration meets the second condition (e.g., not less than 2 seconds) is selected as the second target position, and the vehicle component at that position is determined: if it is the left mirror, the left mirror is the second executed object and the control instruction is executed on it, i.e., the left mirror is adjusted inward; if it is the right mirror, the right mirror is the second executed object and the right mirror is adjusted inward. Other cases are similar and not enumerated here.
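A sketch of this control-key path, under stated assumptions: the key, execute and associations objects are hypothetical interfaces, and the 2-second threshold is one of the example values given above. Keys with an associated adjustment object are executed directly; otherwise the last gaze position whose duration meets the second condition picks the executed object, and the association is remembered for the next press:

```python
SECOND_DURATION_S = 2.0                           # example value of the second duration

def handle_control_key(key, fixations, execute, associations):
    """associations: {key_id: adjustment_object}, persisted between presses."""
    target = associations.get(key.id)
    if target is not None:
        return execute(key.instruction, target)   # e.g. "adjust mirror inward"
    valid = [c for c, d in fixations if d >= SECOND_DURATION_S]
    if not valid:
        return None                               # invalid trigger: do nothing
    target = valid[-1]                            # last qualifying gaze position
    associations[key.id] = target                 # reuse on the next key press
    return execute(key.instruction, target)
```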
In one implementation scenario, when the gaze interaction function is on, it may further be checked whether the vehicle's gear meets the third condition; specifically, the third condition may be that the gear is P or N. If the gear meets the third condition, the step of capturing image data containing the user's face while collecting the user's voice data, and the subsequent steps, are executed. Otherwise, a prompt such as "the vehicle is currently being driven; the gaze interaction function is not supported" may be given, which helps improve driving safety. In addition, to further improve driving safety, after the user turns the gaze interaction function on, a prompt such as "You have turned on the gaze interaction function. To ensure driving safety, it only takes effect when the vehicle is in P or N gear; please pay attention to road safety while driving" may be given.
According to the above scheme, when the gaze interaction function is in the on state, image data containing the user's face is captured while the user's voice data is collected. A user instruction is recognized from the voice data, and the user's gaze fixation condition in the vehicle is detected based on the vehicle cabin model and the image data; the gaze fixation condition includes whether a gaze position is detected and, when gaze positions are detected, the duration of each. In response to the gaze fixation condition including a detected gaze position, it is determined that the user instruction needs to be executed, the first executed object of the user instruction is determined based on the duration of each gaze position, and the user instruction is executed on that object. Human-machine interaction is thus triggered by combining the user's gaze with the user's speech, without relying on a wake word, which improves the freedom of switching between human-machine and human-human interaction; determining the first executed object from the duration of each gaze position also reduces interference with the machine's response when several gaze positions are detected, improving the accuracy and timeliness of human-machine interaction.
Referring to fig. 5, fig. 5 is a schematic diagram of a framework of an embodiment of a human-machine interaction device 50 of the present application. The human-machine interaction device 50 includes a data acquisition module 51, a voice recognition module 52, a gaze detection module 53 and an instruction execution module 54. The data acquisition module 51 is configured to, in response to the gaze interaction function being in the on state, capture image data containing the user's face while the user's voice data is collected; the voice recognition module 52 is configured to recognize the voice data to obtain a user instruction; the gaze detection module 53 is configured to detect, based on the cabin model and the image data, the user's gaze fixation condition in the vehicle, where the gaze fixation condition includes whether a gaze position is detected and, when gaze positions are detected, the duration of each gaze position; the instruction execution module 54 is configured to, in response to the gaze fixation condition including a detected gaze position, determine that the user instruction needs to be executed, determine the first executed object of the user instruction based on the duration of each gaze position, and execute the user instruction on the first executed object.
According to the above scheme, human-machine interaction is triggered by combining the user's gaze with the user's speech, without relying on a wake word, which improves the freedom of switching between human-machine and human-human interaction; determining the first executed object of the user instruction from the duration of each gaze position also reduces interference with the machine's response when several gaze positions are detected, improving the accuracy and timeliness of human-machine interaction.
In some disclosed embodiments, the instruction execution module 54 includes a position selection sub-module configured to select a gaze position in the gaze fixation condition as the first target position, and an object determination sub-module configured to determine, in response to the duration of the first target position meeting the first condition, that the vehicle component at the first target position is the first executed object.
In this way, selecting a gaze position in the gaze fixation condition as the first target position, and determining the vehicle component at the first target position as the first executed object in response to its duration meeting the first condition, improves the accuracy of determining the executed object.
In some disclosed embodiments, when the gaze fixation condition includes several gaze positions, the first target position is the last detected gaze position in the gaze fixation condition.
In this way, selecting the last detected gaze position as the first target position when there are several gaze positions reduces interference from earlier glances as much as possible and improves the accuracy of gaze interaction.
In some disclosed embodiments, the instruction execution module 54 includes a silent processing sub-module configured to determine, in response to the duration of the first target position not meeting the first condition, that the user instruction does not need to be executed.
In this way, determining that the user instruction does not need to be executed when the duration of the first target position does not meet the first condition eliminates the interference of brief glances on gaze interaction and further improves its accuracy.
In some disclosed embodiments, the human-machine interaction device 50 further includes an object detection module configured to detect, when no voice data is collected and a user-triggered control key is detected, whether the control key has an associated adjustment object, and a first control module configured to, in response to the control key having no associated adjustment object and the gaze fixation condition including a detected gaze position, select a gaze position whose duration meets the second condition as the second target position, determine the vehicle component at the second target position to be the second executed object, and execute the control instruction defined by the control key on the second executed object.
In this way, when no voice data is collected and a user-triggered control key is detected, checking whether the key is associated with an adjustment object, and otherwise selecting a gaze position whose duration meets the second condition as the second target position on which the instruction is subsequently executed, allows gaze to take priority when both gaze and physical control are present; this supports both the multi-modal scene with voice and images and the scene where no voice is collected, widening the scope of human-machine interaction.
In some disclosed embodiments, the human-machine interaction device 50 further includes a second control module configured to, in response to the control key being associated with an adjustment object, directly determine that adjustment object to be the third executed object and execute the control instruction defined by the control key on it.
In this way, directly determining the adjustment object associated with the control key as the third executed object and executing the control instruction on it allows the human-machine interaction response to be made directly when the key already has an associated adjustment object, improving the response speed.
In some disclosed embodiments, the human-machine interaction device 50 further includes a first silencing module configured to, in response to the control key not being associated with an adjustment object and the gaze fixation condition including no detected gaze position, not execute the control instruction defined by the control key.
In this way, not executing the control instruction when the control key has no associated adjustment object and no gaze position is detected effectively filters out invalid triggers and improves the accuracy of the human-machine interaction response.
In some disclosed embodiments, the human-machine interaction device 50 further includes a second silencing module configured to, in response to the control key not being associated with an adjustment object and the gaze fixation condition including detected gaze positions, not execute the control instruction defined by the control key when none of the durations of those gaze positions meets the second condition.
In this way, not executing the control instruction when the control key has no associated adjustment object and gaze positions are detected but none of their durations meets the second condition effectively filters out invalid triggers and improves the accuracy of the human-machine interaction response.
In some disclosed embodiments, the human-machine interaction device 50 further includes a silent response module configured to determine, in response to the gaze fixation condition including no detected gaze position, that the user instruction does not need to be executed.
In this way, determining that the user instruction does not need to be executed when no gaze position is detected eliminates the interference of voice data produced while the gaze is not resting anywhere and further improves the accuracy of human-machine interaction.
In some disclosed embodiments, the user instruction is recognized from the voice data and the lip images extracted from the image data.
In this way, recognizing the user instruction from the voice data together with the lip images extracted from the image data allows multi-modal data to be recognized jointly, which helps improve recognition accuracy.
In some disclosed embodiments, the voice recognition module 52 includes a first extraction sub-module configured to perform first feature extraction on the current lip image to obtain a first image feature, where the first image feature includes first sub-features of several channels; a feature replacement sub-module configured to replace the first sub-features located in the target channels of the first image feature extracted from the current lip image with the first sub-features located in the target channels of the first image feature extracted from a reference lip image, to obtain a second image feature, where the reference lip image precedes the current lip image; a second extraction sub-module configured to perform second feature extraction on the second image feature to obtain the image feature of the current lip image; and an instruction recognition sub-module configured to obtain the user instruction based on the image features of the lip images and the speech features extracted from the voice data.
In this way, replacing the target-channel sub-features of the first image feature extracted from the current lip image with those extracted from the reference lip image to obtain the second image feature makes full use of the temporal context of the previous frame to assist feature extraction for the current frame and helps build associations between video frames; recognizing with both image and speech features then helps improve recognition accuracy, especially in complex scenes such as noisy environments.
In some disclosed embodiments, the gaze detection module 53 includes an image analysis sub-module configured to analyze the image data to obtain the spatial position of the pupil and the pose information of the head, and to extract an eye image from the image data; a direction detection sub-module configured to detect the gaze direction from the pose information and the eye image using a gaze detection model; and a condition determination sub-module configured to obtain the gaze fixation condition from the spatial position, the gaze direction and the vehicle cabin model, where the gaze position is the landing point on the cabin model of a ray cast from the spatial position along the gaze direction.
In this way, determining the gaze fixation condition by combining the spatial position of the pupil, the user's gaze direction and the cabin model helps improve the accuracy of gaze interaction.
Referring to fig. 6, fig. 6 is a schematic diagram illustrating a framework of an embodiment of a man-machine interaction system 60 according to the present application. The man-machine interaction system 60 comprises a microphone 61, a camera 62 and a car machine 63, which are coupled to each other, wherein the microphone 61 is used for collecting voice data, the camera 62 is used for shooting image data, and the car machine 63 is used for executing the steps in any of the above-mentioned man-machine interaction method embodiments.
Specifically, the vehicle head unit 63 is configured to control itself as well as the microphone 61 and the camera 62 to implement the steps in any of the above man-machine interaction method embodiments. The vehicle head unit 63 may be integrated with a processor (not shown), which may also be called a CPU (Central Processing Unit). The processor may be an integrated circuit chip having signal processing capabilities. The processor may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor may be jointly implemented by integrated circuit chips.
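A minimal sketch of how the vehicle head unit might tie the pieces together is given below; it is not the patent's implementation. The callables standing in for the microphone 61, the camera 62 and the recognition/gaze models, the 0.5 s threshold used as the "first condition", and the choice of the longest fixation as the first executed object are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class InteractionResult:
    instruction: str
    executed_object: str

def run_interaction(collect_voice: Callable[[], Optional[bytes]],
                    capture_frames: Callable[[], List[object]],
                    recognize: Callable[[bytes, List[object]], str],
                    detect_gaze: Callable[[List[object]], Dict[str, float]],
                    first_condition_s: float = 0.5) -> Optional[InteractionResult]:
    """One pass of the head unit's gaze-plus-voice loop."""
    voice = collect_voice()
    if voice is None:
        return None
    frames = capture_frames()                 # face images captured while the user speaks
    instruction = recognize(voice, frames)    # multi-modal speech + lip recognition
    durations = detect_gaze(frames)           # gaze location -> fixation duration (seconds)
    if not durations:
        return None                           # no gaze location detected: do not execute
    target, duration = max(durations.items(), key=lambda kv: kv[1])
    if duration < first_condition_s:
        return None
    return InteractionResult(instruction, target)

# Toy run with stubbed devices.
result = run_interaction(
    collect_voice=lambda: b"open this",
    capture_frames=lambda: ["frame0", "frame1"],
    recognize=lambda voice, frames: "open",
    detect_gaze=lambda frames: {"left_window": 0.2, "sunroof": 0.8},
)
print(result)  # InteractionResult(instruction='open', executed_object='sunroof')
```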
According to the above scheme, on the one hand, human-computer interaction is triggered by combining gaze fixation with the user's voice, so that switching to human-computer interaction does not rely on a wake-up word, which improves the freedom of switching between human-human interaction and human-computer interaction; on the other hand, gaze fixation can compensate for the incompleteness of a voice instruction, and the first executed object of the user instruction is determined from the duration of each gaze location, so that interference with the machine response is reduced as much as possible when there are multiple gaze locations, improving the accuracy and timeliness of human-computer interaction.
Referring to FIG. 7, FIG. 7 is a schematic diagram of an embodiment of a computer-readable storage medium 70 of the present application. The computer-readable storage medium 70 stores program instructions 71 executable by a processor, and the program instructions 71 are used to implement the steps in any of the man-machine interaction method embodiments described above.
According to the above scheme, on the one hand, human-computer interaction is triggered by combining gaze fixation with the user's voice, so that switching to human-computer interaction does not rely on a wake-up word, which improves the freedom of switching between human-human interaction and human-computer interaction; on the other hand, gaze fixation can compensate for the incompleteness of a voice instruction, and the first executed object of the user instruction is determined from the duration of each gaze location, so that interference with the machine response is reduced as much as possible when there are multiple gaze locations, improving the accuracy and timeliness of human-computer interaction.
In some embodiments, functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to perform the methods described in the foregoing method embodiments; for specific implementations, reference may be made to the descriptions of the foregoing method embodiments, which are not repeated here for brevity.
The foregoing descriptions of the embodiments tend to emphasize the differences between them; for parts that are the same as or similar to one another, reference may be made between the embodiments, and the details are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
If the technical solution of the present application involves personal information, a product applying the technical solution of the present application shall clearly inform the user of the personal information processing rules and obtain the individual's separate consent before processing the personal information. If the technical solution of the present application involves sensitive personal information, a product applying the technical solution of the present application shall obtain the individual's separate consent before processing the sensitive personal information and shall at the same time satisfy the requirement of "explicit consent". For example, a clear and conspicuous sign may be set at a personal information collection device such as a camera to inform that the device is within the personal information collection range and that personal information will be collected; if the individual voluntarily enters the collection range, it is deemed that the individual consents to the collection of his or her personal information. Alternatively, on a device that processes personal information, with obvious signs or notices informing the user of the personal information processing rules, personal authorization may be obtained by means of a pop-up message or by asking the individual to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information to be processed.

Claims (14)

1. A human-computer interaction method, comprising:
in response to a gaze interaction function being in an on state, shooting image data containing the face of a user when voice data of the user is collected;
identifying based on the voice data to obtain a user instruction, and detecting based on a vehicle cabin model and the image data to obtain the gaze fixation condition of the user in the vehicle; wherein the gaze fixation condition comprises whether a gaze location is detected, and a duration of each of the gaze locations upon detection of the gaze location;
in response to the gaze fixation condition including that the gaze location is detected, determining that the user instruction needs to be executed, determining a first executed object of the user instruction based on the duration of each of the gaze locations, and executing the user instruction on the first executed object.
2. The method of claim 1, wherein the determining the first executed object of the user instruction based on the duration of each of the gaze locations comprises:
selecting a gaze location in the gaze fixation condition as a first target location;
in response to the duration of the first target location meeting a first condition, determining that the vehicle component located at the first target location is the first executed object.
3. The method of claim 2, wherein in a case where the gaze fixation condition includes a plurality of the gaze locations, the first target location is the last detected gaze location in the gaze fixation condition.
4. The method according to claim 2, wherein the method further comprises:
and determining that the user instruction does not need to be executed in response to the duration of the first target location not meeting the first condition.
5. The method of claim 1, wherein in the event that the voice data is not collected and a user-activated control key is detected, the method further comprises:
detecting whether the control key is associated with an adjustment object or not;
in response to the control key not being associated with the adjustment object and the gaze fixation condition including that the gaze location is detected, selecting a gaze location whose duration satisfies a second condition as a second target location, determining a vehicle component located at the second target location as a second executed object, and executing a control instruction defined by the control key on the second executed object.
6. The method of claim 5, wherein the method further comprises:
in response to the control key being associated with the adjustment object, directly determining the adjustment object associated with the control key as a third executed object, and executing the control instruction defined by the control key on the third executed object.
7. The method of claim 5, further comprising at least one of:
in response to the control key not being associated with the adjustment object and the gaze fixation condition including that the gaze location is not detected, not executing the control instruction defined by the control key;
in response to the control key not being associated with the adjustment object and the gaze fixation condition including that the gaze location is detected, not executing the control instruction defined by the control key if the duration of each of the gaze locations does not satisfy the second condition.
8. The method according to claim 1, wherein the method further comprises:
in response to the gaze fixation condition including that the gaze location is not detected, determining that the user instruction does not need to be executed.
9. The method of claim 1, wherein the user instruction is identified based on the voice data and lip images extracted from the image data.
10. The method of claim 9, wherein a plurality of the lip images are extracted from the image data, and the step of identifying the user instruction comprises:
performing first feature extraction on the current lip image to obtain a first image feature; wherein the first image feature comprises first sub-features of several channels;
replacing the first sub-feature in the target channel of the first image feature extracted from the current lip image with the first sub-feature in the target channel of the first image feature extracted from the reference lip image, to obtain a second image feature; wherein the reference lip image is located before the current lip image;
performing second feature extraction based on the second image features to obtain image features of the current lip image;
and obtaining the user instruction based on the image features of the respective lip images and the voice features extracted from the voice data.
11. The method according to claim 1, wherein the detecting based on the vehicle cabin model and the image data to obtain the gaze fixation condition of the user in the vehicle comprises:
analyzing based on the image data to obtain the spatial position of the pupil and the posture information of the head, and extracting an eye image based on the image data;
detecting the posture information and the eye image based on a sight line detection model to obtain a sight line direction;
obtaining the gaze fixation condition based on the spatial position, the sight line direction and the vehicle cabin model; wherein the gaze location is a landing point position on the vehicle cabin model along the sight line direction starting from the spatial position.
12. A human-machine interaction device, comprising:
the data acquisition module is used for shooting, in response to the gaze interaction function being in an on state, image data containing the face of the user when the voice data of the user is collected;
the voice recognition module is used for recognizing based on the voice data to obtain a user instruction;
the sight line detection module is used for detecting based on the vehicle cabin model and the image data to obtain the gaze fixation condition of the user in the vehicle; wherein the gaze fixation condition comprises whether a gaze location is detected, and a duration of each of the gaze locations upon detection of the gaze location;
an instruction execution module for determining that the user instruction needs to be executed in response to the gaze fixation condition including that the gaze location is detected, determining a first executed object of the user instruction based on the duration of each of the gaze locations, and executing the user instruction on the first executed object.
13. A human-computer interaction system, comprising a microphone, a camera and a vehicle head unit, wherein the microphone and the camera are respectively coupled with the vehicle head unit, the microphone is used for collecting voice data, the camera is used for shooting image data, and the vehicle head unit is used for executing the human-computer interaction method according to any one of claims 1 to 11.
14. A computer readable storage medium, characterized in that the computer readable storage medium stores program instructions executable by a processor, and the program instructions are used for implementing the human-machine interaction method according to any one of claims 1 to 11.
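For orientation only and not as part of the claims, the control-key branch described in claims 5 to 7 can be sketched as follows; the ControlKey structure, the 0.3 s threshold standing in for the "second condition", and the dictionary representation of the gaze fixation condition are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class ControlKey:
    command: str                      # control instruction defined by the key
    adjustment_object: Optional[str]  # pre-associated adjustment object, if any

def handle_control_key(key: ControlKey,
                       gaze_durations: Dict[str, float],
                       second_condition_s: float = 0.3) -> Optional[Tuple[str, str]]:
    """Decide the executed object for a control key activated without voice input.

    gaze_durations maps detected gaze locations (vehicle components) to fixation
    durations; an empty dict means no gaze location was detected.
    Returns (command, executed_object), or None when nothing should be executed.
    """
    # Claim 6: a key already associated with an adjustment object acts on it directly.
    if key.adjustment_object is not None:
        return key.command, key.adjustment_object

    # Claim 7: no associated object and no detected gaze location -> do not execute.
    if not gaze_durations:
        return None

    # Claim 5: pick a gaze location whose duration satisfies the second condition.
    for component, duration in gaze_durations.items():
        if duration >= second_condition_s:
            return key.command, component

    # Claim 7 (second branch): no duration satisfies the second condition.
    return None

# Example: a "volume up" key with no associated object, pressed while looking at the screen.
print(handle_control_key(ControlKey("volume_up", None), {"center_screen": 0.6}))
# -> ('volume_up', 'center_screen')
```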
CN202211099197.XA 2022-09-08 2022-09-08 Man-machine interaction method, related device, system and storage medium Pending CN116080672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211099197.XA CN116080672A (en) 2022-09-08 2022-09-08 Man-machine interaction method, related device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211099197.XA CN116080672A (en) 2022-09-08 2022-09-08 Man-machine interaction method, related device, system and storage medium

Publications (1)

Publication Number Publication Date
CN116080672A true CN116080672A (en) 2023-05-09

Family

ID=86210870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211099197.XA Pending CN116080672A (en) 2022-09-08 2022-09-08 Man-machine interaction method, related device, system and storage medium

Country Status (1)

Country Link
CN (1) CN116080672A (en)

Similar Documents

Publication Publication Date Title
US10908677B2 (en) Vehicle system for providing driver feedback in response to an occupant's emotion
EP3485351B1 (en) Command processing using multimodal signal analysis
JP7245209B2 (en) Systems and methods for authenticating vehicle occupants
Braunagel et al. Ready for take-over? A new driver assistance system for an automated classification of driver take-over readiness
CN110047487B (en) Wake-up method and device for vehicle-mounted voice equipment, vehicle and machine-readable medium
CN104816694B (en) One kind is driven condition intelligent adjusting apparatus and method
US9865258B2 (en) Method for recognizing a voice context for a voice control function, method for ascertaining a voice control signal for a voice control function, and apparatus for executing the method
US20230038039A1 (en) In-vehicle user positioning method, in-vehicle interaction method, vehicle-mounted apparatus, and vehicle
CN110936797B (en) Automobile skylight control method and electronic equipment
CN105654753A (en) Intelligent vehicle-mounted safe driving assistance method and system
US9275274B2 (en) System and method for identifying handwriting gestures in an in-vehicle information system
EP3018629B1 (en) Imaging system
US10528047B1 (en) Method and system for monitoring user activity
US20200074060A1 (en) User authentication device and method for triggering user-specific target operation
CN112041201B (en) Method, system, and medium for controlling access to vehicle features
CN111142655A (en) Interaction method, terminal and computer readable storage medium
JP2017090614A (en) Voice recognition control system
CN114760417B (en) Image shooting method and device, electronic equipment and storage medium
CN114187637A (en) Vehicle control method, device, electronic device and storage medium
CN113157080A (en) Instruction input method for vehicle, storage medium, system and vehicle
JP2017068359A (en) Interactive device and interaction control method
CN116080672A (en) Man-machine interaction method, related device, system and storage medium
CN116501167A (en) In-vehicle interaction system based on gesture operation and vehicle
EP4029716A1 (en) Vehicle interactive system and method, storage medium, and vehicle
CN107832726B (en) User identification and confirmation device and vehicle central control system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination