CN112114672A - Eye movement and voice combined auxiliary interaction device and method - Google Patents
Eye movement and voice combined auxiliary interaction device and method Download PDFInfo
- Publication number
- CN112114672A (Application CN202011008098.7A)
- Authority
- CN
- China
- Prior art keywords
- voice
- point signal
- unit
- fixation point
- position information
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/013—Eye tracking input arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
The application discloses an eye movement and voice combined auxiliary interaction device and method. The device comprises a voice unit, a positioning unit and a main control unit. The voice unit acquires the user's voice signal and transmits it to the main control unit. The positioning unit acquires the fixation point signal, identifies the position information corresponding to the fixation point signal, and transmits the fixation point signal containing the position information to the main control unit. The main control unit acquires the fixation point signal and converts the fixation point signal containing the position information into screen coordinates; it also acquires the sound signal, identifies the control instruction in the sound signal, and applies the control instruction at the corresponding screen coordinates. The application solves the technical problems that the existing eye movement interaction device has a single operation mode and cannot easily perform complex operations.
Description
Technical Field
The application relates to the technical field of human-computer interaction, in particular to an eye movement and voice combined auxiliary interaction device and method.
Background
Eye-movement-based human-computer interaction relies on eyeball tracking: an external device (such as an optical camera or an infrared emitting and receiving device) collects an electrical signal containing eye information, and an algorithm processes the signal to extract eye feature signals such as fixation, saccade (eye jump) and blink. Eye-movement human-computer interaction converts these feature signals into commands that move a cursor on the screen or select a control (such as clicking or long pressing), thereby realizing human-computer interaction.
The related art provides much eyeball-tracking work; however, the following problems exist when eye movement is used as the only interaction channel. First, when blinking is used as the trigger signal, the two eyes do not close synchronously but one after the other, with a phase difference of a few milliseconds. Because blink recognition algorithms are based on pupil detection — a blink is recognized when no pupil is detected — the actual recognition result of one blink is: single-eye closure → both-eyes closure → single-eye opening → both-eyes opening, which causes misrecognition even for normal blinking. Second, as the eyes close, the pupils disappear gradually, and the computer detection algorithm registers this as a change in pupil size; since the screen fixation-point positioning algorithm determines the fixation position from the pupil shape as the eyes gaze at different screen positions, eye closure is easily confused with fixation-point positioning. The result is that the actual recognition result during a blink is: fixation-point change → pupil disappearance (blink) → pupil appearance → fixation-point change. In other words, during eye closure the fixation position on the screen shifts, so the user's intended click position deviates from the actual click position.
A second disadvantage is that eye movement alone is a single control channel that realizes only a simple clicking function. In actual scenarios, common interaction modes besides clicking include long pressing, dragging and scrolling; with eye movement alone, the user's operations are limited, and extra interaction scenes or repeated control switching must be designed, which is inconvenient. Multi-modal human-machine interaction based on eye movement is a promising direction for solving this problem, but no detailed and definite solution has specified how to do so.
Disclosure of Invention
The application provides an eye movement and voice combined auxiliary interaction device and method, which solve the technical problems that the existing eye movement interaction device has a single operation mode and cannot easily perform complex operations.
In view of the above, a first aspect of the present application provides an eye-movement-combined voice-assisted interaction apparatus, comprising:
the method comprises the following steps: the device comprises a voice unit, a positioning unit and a main control unit;
the voice unit is used for acquiring a voice signal of a user and transmitting the voice signal to the main control unit;
the positioning unit is used for acquiring a fixation point signal, identifying position information corresponding to the fixation point signal and transmitting the fixation point signal containing the position information to the main control unit;
the main control unit is used for acquiring the fixation point signal and converting the fixation point signal containing the position information into a screen coordinate; it is also used for acquiring the sound signal, identifying a control instruction in the sound signal and implementing the control instruction on the corresponding screen coordinate.
Optionally, the positioning unit is further configured to filter the gaze point signal containing the position information.
Optionally, the main control unit further includes a cache unit;
the buffer unit is used for buffering the fixation point signal in a preset time period.
Optionally, the device further comprises a display unit configured to display an identifier at the screen coordinate position corresponding to the gaze point signal.
Optionally, the interactive device further comprises a power supply unit for providing stable power supply for the interactive device.
A second aspect of the present application provides an eye movement combined voice-assisted interaction method, including:
acquiring the number of pupils of a user and a fixation point signal;
identifying position information corresponding to the gazing point signal;
identifying that the gaze point signal corresponds to screen coordinates on a screen;
and acquiring a sound signal of a user, identifying a control instruction in the sound signal, and implementing the control instruction on the corresponding screen coordinate.
Optionally, after acquiring the number of pupils of the user and the gaze point signal, the method further includes:
if neither pupils nor a fixation point signal are acquired, judging that the user's current state is eye closure;
and recording the duration for which the pupils and the fixation point signal remain absent, and stopping the device when the duration exceeds the preset closing duration.
Optionally, after the identifying the position information corresponding to the gaze point signal, the method further includes:
filtering the gaze point signal containing the position information.
Optionally, the filtering the gaze point signal including the position information specifically includes:
P̂ = (Σ_{i=1}^{N} E_i·P_i) / (Σ_{i=1}^{N} E_i)
where P̂ is the position of the fixation point after the filtering processing; N is the sliding-window length; P_0 is the first gaze location; E_i is the influence coefficient of the i-th fixation point P_i.
Optionally, the control instruction includes a left key, a right key, a double click, and a release.
According to the technical scheme, the method has the following advantages:
in this application, an eye movement combines supplementary interactive installation of pronunciation includes: the device comprises a voice unit, a positioning unit and a main control unit; the voice unit is used for acquiring a voice signal of a user and transmitting the voice signal to the main control unit; the positioning unit is used for acquiring the fixation point signal, identifying the position information corresponding to the fixation point signal and transmitting the fixation point signal containing the position information to the main control unit; the main control unit is used for acquiring a fixation point signal and converting the fixation point signal containing the position information into a screen coordinate; and the system is also used for acquiring the sound signals, identifying the control instructions in the sound signals and implementing the control instructions on corresponding screen coordinates.
Through the combination of the positioning unit and the voice unit, the positioning unit determines the position information on the screen, and the voice unit then triggers the corresponding operation at that position, completing the execution of one operation instruction. By continuously identifying position information and the operation instructions in the voice signals, the device can perform continuous operation, so that multiple and complicated operations can be completed without relying on a mouse or keyboard.
Drawings
FIG. 1 is a block diagram of an eye movement integrated voice assisted interaction device according to an embodiment of the present application;
FIG. 2 is a block diagram of another embodiment of an eye movement combined voice-assisted interaction apparatus according to the present application;
FIG. 3 is a flow chart of a method in an embodiment of an eye movement integrated voice assisted interaction method of the present application;
fig. 4 is a schematic diagram of a positioning process of the positioning unit according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Embodiment I
Fig. 1 is a block diagram of an apparatus of an embodiment of an eye movement combined voice-assisted interaction apparatus according to the present application, where fig. 1 includes:
the positioning unit 101 is configured to acquire a gaze point signal, identify position information of the gaze point signal, and transmit a signal including the position information to the main control unit 103.
It should be noted that the positioning unit 101 may be a conventional commercial eye tracker, whose principle is to collect electrical signals of eye information through an optical camera and infrared emitting and receiving devices, thereby obtaining the gaze point signal of the user's eyes. After collecting the gaze point signal, the positioning unit 101 may send it to the main control unit 103 for analysis.
The voice unit 102 is configured to acquire a voice signal of a user and transmit the voice signal to the main control unit.
It should be noted that the voice unit 102 may employ a microphone and a speaker to collect a sound signal, and transmit the collected sound signal to the main control unit 103.
The main control unit 103 is configured to acquire a gaze point signal, identify position information corresponding to the gaze point signal, and convert the gaze point signal into a screen coordinate; and the system is also used for acquiring the sound signals, identifying the control instructions in the sound signals and implementing the control instructions on corresponding screen coordinates.
It should be noted that, in the present application, the main control unit analyzes and identifies the gaze point signal to determine the pixel coordinate on the screen corresponding to the gaze point. Once the pixel position is obtained, if the voice unit 102 picks up a voice signal, the main control unit 103 identifies the control instruction corresponding to that voice signal and completes the corresponding operation at the pixel position, i.e. executes one operation instruction. When the user needs to complete more operations, these steps can be repeated to execute multiple operation instructions, so that multiple and complex operations can be completed.
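The cycle just described — map the gaze point signal to a pixel coordinate, then apply whatever control instruction the voice channel yields at that coordinate — can be sketched as follows. The function and callback names here are hypothetical illustrations for clarity, not part of the patent.

```python
def dispatch(gaze, command, to_pixels, apply_command):
    """One cycle of the main control unit (hypothetical API).

    gaze          -- (x, y) gaze point in normalized [0, 1] coordinates, or None
    command       -- recognized voice instruction string (e.g. "left key"), or None
    to_pixels     -- callback converting the gaze point to screen pixel coordinates
    apply_command -- callback executing the instruction at a pixel position
    """
    if gaze is None:
        return None                     # no pupil detected this frame
    coords = to_pixels(gaze)            # convert gaze point to screen coordinates
    if command is not None:
        apply_command(command, coords)  # execute the instruction at the gaze point
    return coords
```

Repeating this cycle, as the description notes, strings single instructions into the continuous, complex operations the device targets.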
Through the combination of the positioning unit and the voice unit, the positioning unit determines the position information on the screen, and the voice unit then triggers the corresponding operation at that position, completing the execution of one operation instruction. By continuously identifying position information and the operation instructions in the voice signals, the device can perform continuous operation, so that multiple and complicated operations can be completed without relying on a mouse or keyboard.
Embodiment II
The present application further provides another embodiment of an eye movement combined voice-assisted interaction apparatus, as shown in fig. 2, fig. 2 includes a positioning unit 201, a voice unit 202, and a main control unit 203.
The positioning process of the positioning unit 201 may refer to the schematic flow chart shown in fig. 4; in addition, the positioning unit 201 is further configured to filter the gaze point signal containing the position information.
It should be noted that, considering the real-time performance of eye movement data collection and random noise (error caused by blinking), the actually collected data may have obvious and frequent offset, and a filtering algorithm may be designed to perform real-time processing on the collected gaze point signal, specifically:
the first fixation point is named as P0Assigning the initial influence coefficient E of the point of regard to 1, and aiming at each newly collected point of regard PiDefining a cluster center PclusterThe average calculation is performed through a sliding window with the length of N, and the formula is as follows:
in the formula (I), the compound is shown in the specification,the position of the fixation point after the filtering processing; e is the influence coefficient of the fixation point; i is the ith gaze point. The influence coefficient E is actually an empirical 0-1 weight, and passes through the newly acquired point of regard PiAnd PclusterThe Euclidean distance between the two elements is determined, and the Euclidean distance d satisfies the following relation:
in the formula (I), the compound is shown in the specification,for newly acquired point of regard PiIs the cluster center PclusterPosition of
d is the distance between two points measured in pixels. When d is greater than a certain distance threshold T (which can be considered an empirical value), the coefficient E will be influencediSetting 0, namely refusing to adopt the current fixation point; when d is smaller than the distance threshold, setting E to 1, specifically defined as follows:
calculating Euclidean distance between the clustering center and newly acquired data, and comparing distance threshold to obtain influence coefficient EiAnd then the new collection point of regard PiCalculating a gaze location after filteringThe calculation formula is as follows:
in this embodiment, the main control unit 203 further includes a buffer unit, configured to buffer the gaze point signal within a preset time period.
It should be noted that the buffer unit may be used to buffer a position sequence of gaze point information from the positioning unit with a certain length (in the format (x, y), corresponding to two-dimensional coordinates on the screen). When the first fixation point P_0 is collected, it is simultaneously assigned to the cache location P_T in the cache unit. For each filtered gaze point P̂, given a plateau threshold S, if ‖P̂ − P_T‖ > S, the current filtered gaze point is assigned to the cache location, i.e. P_T = P̂; otherwise, the cache location keeps its original value, i.e. P_T is unchanged. This allows slight coordinate changes due to nystagmus to be filtered out while long-distance eye jumps (and blinks) are still recognized, resulting in improved selection accuracy.
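The cache-update rule above amounts to a small plateau filter; a sketch with a hypothetical value for the threshold S:

```python
import math

def update_cached_gaze(cached, filtered, plateau=30.0):
    """Cache-unit rule: update the cached position P_T only when the filtered
    gaze point has moved farther than the plateau threshold S (in pixels), so
    small jitter from nystagmus is ignored while long eye jumps still register.
    The threshold value is an assumed example."""
    dx = filtered[0] - cached[0]
    dy = filtered[1] - cached[1]
    if math.hypot(dx, dy) > plateau:
        return filtered                 # long-distance jump: adopt the new position
    return cached                       # slight drift: keep the original value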
This embodiment still includes: and the display unit 204 is configured to display the identifier at the screen coordinate position corresponding to the gazing point signal.
It should be noted that, in the present application, the display unit may be a display screen, for example one showing a mouse arrow: the user's gaze point serves as the mouse arrow, the operation of which is controlled by the user's voice signal, and the control instructions in the voice signal may include "left key", "right key", "double click" and "release".
And the power supply unit 205 is used for providing stable power supply for the interaction device.
By introducing a filtering algorithm, the method and the device avoid the position-identification deviation caused by the change of the screen fixation point when the eyes are closed. Through the combination of the positioning unit and the voice unit, the positioning unit determines the position information on the screen, and the voice unit then triggers the corresponding operation at that position, completing the execution of one operation instruction. By continuously identifying position information and the operation instructions in the voice signals, the device can perform continuous operation, so that multiple and complicated operations can be completed without relying on a mouse or keyboard.
The application also includes an embodiment of practical application. Specifically, the voice unit may consist of a microphone and a speaker: the microphone records the voice signal sent by the user in real time and transmits it back to the main control unit, while the speaker serves as the acoustic feedback output of the interaction state. The main control unit mainly detects the following characteristic phrases in the voice unit's microphone signal: "left key", "right key", "double click" and "release".
When the Left key is recognized, the main control unit outputs a Left key pressing command, and meanwhile, a cache unit in the main control unit records that LD is 1, namely, the Left key is pressed (Left Down);
when the 'Right key' is recognized, the main control unit outputs a 'Right key press' command, and meanwhile, a buffer unit in the main control unit records that RD is 1, namely, a Right key press (Right Down);
when the double-click is recognized, the main control unit continuously outputs a left key pressing command to a left key lifting command for 2 times at intervals of 0.2s, and simultaneously records that LD is 0 to realize the double-click of the left key;
when the release is identified, if the LD is equal to 1, the main control unit outputs a left key lifting command and simultaneously records the LD is equal to 0; if RD equals 1, the main control unit outputs a 'right key lifting' command, and simultaneously, RD equals 0 is recorded; through the above commands, the functions of clicking, long-time pressing, dragging and scrolling can be realized. The specific implementation logic is as follows:
clicking: the user watches a certain area, speaks a 'left/right key' and 'releases';
long press: the user gazes at a certain area, and speaks a 'left/right key';
dragging: the user watches a certain area to say a 'left key', and says 'release' after the user shifts through the positioning module;
and (3) rolling: a user watches a certain area and speaks a 'right key', at the moment, when the position of an x axis is recorded by the interaction device, the interaction device is uniformly multiplied by 0, namely, only the movement action on a y axis (vertical axis) is recorded, the main control unit outputs a rolling command, and the length is the distance difference corresponding to the front watching point and the rear watching point on the vertical axis. After the scrolling is finished, the 'release' is spoken, and the scrolling is finished.
The foregoing is an embodiment of the apparatus of the present application, which further includes an embodiment of an eye movement combined voice assisted interaction method, as shown in fig. 3, where fig. 3 includes:
301. acquiring the number of pupils of a user and a fixation point signal;
302. identifying position information corresponding to the fixation point signal;
303. identifying that the gaze point signal corresponds to screen coordinates on a screen;
304. acquiring a voice signal of a user, identifying a control instruction in the voice signal, and implementing the control instruction on a corresponding screen coordinate.
In a specific embodiment, after step 301, the method further includes:
if the pupil number and the fixation point signal are not acquired, judging that the current state of the user is eye closing;
and recording the number of the pupils and the duration of the gazing point signal, and stopping the device when the duration is longer than the preset closing duration.
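The eye-closure safeguard above — accumulate the time during which neither pupils nor a fixation point are detected, and stop the device once it exceeds the preset closing duration — can be sketched as follows; the duration and frame-interval values are assumed examples:

```python
def eye_closure_monitor(frames, shutdown_after=3.0, frame_dt=0.1):
    """Return True when the device should stop: the eyes have been judged
    closed (no pupils, no fixation point detected) for longer than the preset
    closing duration.  `frames` is one boolean per camera frame; the threshold
    and frame interval are hypothetical values."""
    closed = 0.0
    for pupils_detected in frames:
        if pupils_detected:
            closed = 0.0                # eyes open again: reset the timer
        else:
            closed += frame_dt          # accumulate closed-eye duration
            if closed > shutdown_after:
                return True             # duration exceeded: stop the device
    return False
```

Under this rule, ordinary blinks (a few frames without pupils) never approach the shutdown threshold, while a deliberate sustained eye closure does.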
After step 302, further comprising:
the gaze point signal containing the position information is filtered.
The specific steps of filtering the gaze point signal containing the position information are as follows:
P̂ = (Σ_{i=1}^{N} E_i·P_i) / (Σ_{i=1}^{N} E_i)
where P̂ is the position of the fixation point after the filtering processing; N is the sliding-window length; P_0 is the first gaze location; E_i is the influence coefficient of the i-th fixation point P_i.
The control instruction in this embodiment may include a left key, a right key, a double click, and a release.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The terms "comprises," "comprising," and "having," and any variations thereof, in this application are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. An eye-movement-combined voice-assisted interaction device, comprising: the device comprises a voice unit, a positioning unit and a main control unit;
the voice unit is used for acquiring a voice signal of a user and transmitting the voice signal to the main control unit;
the positioning unit is used for acquiring a fixation point signal, identifying position information corresponding to the fixation point signal and transmitting the fixation point signal containing the position information to the main control unit;
the main control unit is used for acquiring the fixation point signal and converting the fixation point signal containing the position information into a screen coordinate; and the control module is also used for acquiring the sound signal, identifying a control instruction in the sound signal and implementing the control instruction on the corresponding screen coordinate.
2. The eye movement and voice-assisted interaction device according to claim 1, wherein the positioning unit is further configured to filter the gaze point signal containing the position information.
3. The eye movement and voice-assisted interaction device according to claim 1, wherein the main control unit further comprises a buffer unit;
the buffer unit is used for buffering the fixation point signal in a preset time period.
4. The eye movement and voice-assisted interaction device according to claim 1, further comprising a display unit for displaying an identifier on the screen coordinate position corresponding to the gaze point signal.
5. The eye movement and voice-assisted interaction device according to claim 1, further comprising a power supply unit for supplying stable power to the interaction device.
6. An eye movement combined voice assisted interaction method is characterized by comprising the following steps:
acquiring a pupil count of a user and a fixation point signal;
identifying position information corresponding to the fixation point signal;
identifying the screen coordinates corresponding to the fixation point signal;
and acquiring a voice signal of the user, identifying a control instruction in the voice signal, and executing the control instruction at the corresponding screen coordinates.
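One pass of the claimed method can be sketched as a single function: gaze in, screen coordinates out, voice command applied there. The keyword spotting, instruction vocabulary, and default screen size below are illustrative assumptions; real speech and gaze recognizers sit behind these inputs.

```python
INSTRUCTIONS = ("left", "right", "double", "release")  # cf. claim 10


def run_method(pupil_count, fixation_point, voice_text, screen=(1920, 1080)):
    """One pass of the claimed method, sketched under assumed input forms."""
    if pupil_count == 0 or fixation_point is None:
        return None  # no fixation point acquired (cf. claim 7)
    # Identify the screen coordinates corresponding to the fixation point.
    x = round(fixation_point[0] * (screen[0] - 1))
    y = round(fixation_point[1] * (screen[1] - 1))
    # Identify a control instruction in the voice signal.
    words = voice_text.lower().split()
    instruction = next((w for w in words if w in INSTRUCTIONS), None)
    return (instruction, x, y) if instruction else None
```

If no instruction keyword is found, or no fixation point is available, the pass is a no-op rather than an error, which matches the method's role as a continuous assistive loop.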
7. The eye movement and voice combined auxiliary interaction method according to claim 6, further comprising, after the acquiring of the pupil count and the fixation point signal of the user:
if neither the pupil count nor the fixation point signal is acquired, determining that the user's eyes are currently closed;
and recording the duration for which the pupil count and the fixation point signal are absent, and shutting down the device when the duration exceeds a preset eye-closure duration.
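The eye-closure shutdown of claim 7 is a timeout on the absence of both signals. A minimal sketch, where the threshold value and the caller-supplied timestamps are assumptions (the patent leaves the preset duration unspecified):

```python
class ClosureMonitor:
    """Signals shutdown once the eyes have stayed closed (no pupils, no
    fixation point) for longer than a preset duration."""

    def __init__(self, max_closed_s=3.0):
        self.max_closed_s = max_closed_s
        self._closed_since = None  # timestamp when closure began, if any

    def update(self, pupil_count, fixation_point, now):
        """Return True when the device should stop."""
        if pupil_count > 0 and fixation_point is not None:
            self._closed_since = None  # eyes open again; reset the timer
            return False
        if self._closed_since is None:
            self._closed_since = now   # closure just started
        return now - self._closed_since > self.max_closed_s
```

Blinks shorter than the threshold never trip the shutdown, because the timer resets the moment a pupil and fixation point reappear.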
8. The eye movement and voice combined auxiliary interaction method according to claim 6, further comprising, after the identifying of the position information corresponding to the fixation point signal:
filtering the fixation point signal containing the position information.
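The patent does not specify the filter used on the fixation point signal, so the following is just one plausible choice: an exponential moving average, which damps sensor jitter at the cost of a little lag. The smoothing factor `alpha` is an assumed parameter.

```python
class EMAFilter:
    """Smooths a raw fixation-point stream with an exponential moving
    average; one illustrative filter, not the one claimed by the patent."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # higher alpha = less smoothing, less lag
        self._state = None

    def filter(self, point):
        if self._state is None:
            self._state = point  # first sample passes through unchanged
        else:
            a = self.alpha
            self._state = (a * point[0] + (1 - a) * self._state[0],
                           a * point[1] + (1 - a) * self._state[1])
        return self._state
```

Stronger alternatives for gaze data (e.g. a one-euro filter) adapt the smoothing to signal speed, but the structure is the same: raw points in, stabilized points out.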
9. The eye movement and voice combined auxiliary interaction method according to claim 8, wherein the filtering of the fixation point signal containing the position information comprises:
10. The eye movement and voice combined auxiliary interaction method according to claim 6, wherein the control instruction comprises left click, right click, double click and release.
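The four control instructions of claim 10 map naturally onto a dispatch table. The handler tuples below are placeholders for real input events (e.g. OS-level mouse calls), which the claim does not specify:

```python
# Hypothetical dispatch table; the handler results stand in for real
# OS-level mouse events at the resolved screen coordinates.
DISPATCH = {
    "left": lambda x, y: ("left_click", x, y),
    "right": lambda x, y: ("right_click", x, y),
    "double": lambda x, y: ("double_click", x, y),
    "release": lambda x, y: ("release", x, y),
}


def execute(instruction, x, y):
    """Apply a recognized control instruction at screen coordinates (x, y)."""
    handler = DISPATCH.get(instruction)
    if handler is None:
        raise ValueError(f"unknown control instruction: {instruction!r}")
    return handler(x, y)
```

Keeping the vocabulary in one table makes it easy to extend (say, with "drag") without touching the recognition or coordinate pipeline.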
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011008098.7A CN112114672A (en) | 2020-09-23 | 2020-09-23 | Eye movement and voice combined auxiliary interaction device and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112114672A (en) | 2020-12-22 |
Family
ID=73801521
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011008098.7A Pending CN112114672A (en) | 2020-09-23 | 2020-09-23 | Eye movement and voice combined auxiliary interaction device and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112114672A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113109943A (en) * | 2021-04-06 | 2021-07-13 | 深圳市思麦云科技有限公司 | XR-based simulation multi-person interaction system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109814448A (en) * | 2019-01-16 | 2019-05-28 | 北京七鑫易维信息技术有限公司 | A kind of vehicle multi-mode state control method and system |
CN111240477A (en) * | 2020-01-07 | 2020-06-05 | 北京汽车研究总院有限公司 | Vehicle-mounted human-computer interaction method and system and vehicle with system |
Similar Documents
Publication | Title |
---|---|
EP3616050B1 (en) | Apparatus and method for voice command context |
US10095033B2 (en) | Multimodal interaction with near-to-eye display |
CN102932212A (en) | Intelligent household control system based on multichannel interaction manner |
CN104410883A (en) | Mobile wearable non-contact interaction system and method |
US9910490B2 (en) | System and method of cursor position control based on the vestibulo-ocular reflex |
CN102830797A (en) | Man-machine interaction method and system based on sight judgment |
US20130169532A1 (en) | System and Method of Moving a Cursor Based on Changes in Pupil Position |
CN111898407B (en) | Human-computer interaction operating system based on human face action recognition |
CN103336581A (en) | Human eye movement characteristic design-based human-computer interaction method and system |
CN103336582A (en) | Motion information control human-computer interaction method |
WO2016163068A1 (en) | Information processing apparatus, information processing method, and program |
CN108829239A (en) | Control method, device and the terminal of terminal |
US20220236801A1 (en) | Method, computer program and head-mounted device for triggering an action, method and computer program for a computing device and computing device |
WO2021073743A1 (en) | Determining user input based on hand gestures and eye tracking |
JP5077879B2 (en) | Gaze input device, gaze input method, and gaze input program |
CN106681509A (en) | Interface operating method and system |
CN109144262B (en) | Human-computer interaction method, device, equipment and storage medium based on eye movement |
CN112114672A (en) | Eye movement and voice combined auxiliary interaction device and method |
CN110727349A (en) | Man-machine interaction method and AR glasses based on bone conduction interaction |
CN101446859B (en) | Machine vision based input method and system thereof |
JPH1039995A (en) | Line-of-sight/voice input device |
CN115993886A (en) | Control method, device, equipment and storage medium for virtual image |
KR102586144B1 (en) | Method and apparatus for hand movement tracking using deep learning |
CN113448427B (en) | Equipment control method, device and system |
CN116360603A (en) | Interaction method, device, medium and program product based on time sequence signal matching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-12-22 |