CN112053689A - Method and system for operating equipment based on eyeball and voice instruction and server - Google Patents

Info

Publication number: CN112053689A
Application number: CN202010953494.0A
Priority application: CN202010953494.0A
Authority: CN (China)
Prior art keywords: user, voice, server, equipment, UID
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 黄石磊, 刘轶, 程刚
Current assignee: Shenzhen Raisound Technology Co., Ltd.
Original assignee: Shenzhen Raisound Technology Co., Ltd.
Application filed by: Shenzhen Raisound Technology Co., Ltd.

Classifications

    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 17/24: Interactive procedures; man-machine interfaces; the user being prompted to utter a password or a predefined phrase
    • G10L 2015/223: Execution procedure of a spoken command
    • G06F 3/013: Eye tracking input arrangements

Abstract

The invention discloses a method, a system and a server for operating equipment based on eyeball and voice instructions. The system comprises a mobile terminal, a server and an AR/VR device. The mobile terminal sends the collected user voice, the user position and the user identifier (UID) to the server. The server recognizes and parses the user voice to obtain a wake-up word and an instruction; if the wake-up word matches the wake-up word set for the UID, the server determines the nearby operable devices according to the user position. The AR/VR device displays the operable devices and performs eye tracking to detect the user's gaze point. The server then determines the target device according to the gaze point and sends the instruction to that device. Because multiple devices are operated through voice and eye commands and the wake-up word is bound to the user rather than to the device, the invention solves the problems of wake-up interference and hard-to-remember wake-up words in multi-device scenarios, and is suitable for VR/AR scenes.

Description

Method and system for operating equipment based on eyeball and voice instruction and server
Technical Field
The invention relates to the technical field of voice recognition and voice control, in particular to a method, a system and a server for operating equipment based on eyeballs and voice instructions.
Background
Speech recognition technology, also known as automatic speech recognition (ASR), aims to convert the content of speech into computer-readable input, such as keystrokes, binary codes or character sequences (text), on which the system then acts.
Voice wake-up is technically known as keyword spotting (KWS). One definition is: detecting specified segments (wake-up words) spoken by the user in real time in a continuous speech stream. The "real-time" nature of the detection is the key point: the purpose of voice wake-up is to bring a device from a sleep state to a working state, so the wake-up word should be detected as soon as it is spoken, which gives a better user experience. The effectiveness of voice wake-up is currently evaluated on four aspects: wake-up rate, false wake-ups, response time, and power consumption.
Voice wake-up can be viewed as a specific application of speech recognition. In general, voice wake-up is a speech recognition task for one specific word (for a specific system or device), with all other words ignored whether or not they are recognized; typical speech recognition involves many words, for example voice command control may cover tens to hundreds of words, and large-vocabulary continuous speech recognition (LVCSR) may even involve hundreds of thousands of words.
The mainstream technology of speech recognition is based on Hidden Markov Models (HMMs), usually the continuous-density HMM (CDHMM). A speech recognition task generally requires an acoustic model (AM) and a language model (LM). The acoustic model is one of the most important parts of a speech recognition system, and mainstream systems mostly model it with HMMs. Language models can be divided into statistical language models and the neural network language models that are now common. Current speech recognition is gradually moving to a framework of WFSTs (weighted finite-state transducers) plus deep neural networks; an HMM is easily expressed in WFST form.
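As a rough illustration of how an acoustic model and a language model are combined in such a recognizer, the following minimal sketch rescores candidate transcripts by adding the acoustic log-likelihood and a weighted language-model log-probability. The candidate texts, the scores and the weight are illustrative assumptions, not the output of any real decoder described in this patent.

```python
import math

# Minimal sketch: pick the transcript W maximizing
#   log P(W | X) ~ log P(X | W) + lm_weight * log P(W)
# where log P(X | W) comes from the acoustic model and log P(W) from the
# language model.  All numbers below are illustrative placeholders.

def rescore(hypotheses, lm_weight=0.8):
    """hypotheses: list of (text, acoustic_logprob, lm_logprob)."""
    best_text, best_score = None, -math.inf
    for text, am_logprob, lm_logprob in hypotheses:
        score = am_logprob + lm_weight * lm_logprob
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score

candidates = [
    ("xiaorui increase flow", -120.5, -9.2),   # placeholder scores
    ("xiaorui in peace slow",  -121.0, -14.7),
]
print(rescore(candidates))   # -> ("xiaorui increase flow", ...)
```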
A voice-controlled device is generally given an instruction by voice; the system determines the content of the instruction through speech recognition and performs the corresponding action. This requires at least speech recognition, and in some cases voice wake-up as well; voice wake-up is itself an application of speech recognition in the broad sense.
Voice command operation of a single device is a well-established solution, but simultaneous voice manipulation of multiple devices, in particular voice wake-up, is challenging. One difficulty is how to wake up the intended device when multiple devices are present at the same time.
Since each device has a specific wake-up word, different devices need to be activated with different wake-up words; if there are many devices, remembering all the wake-up words becomes a problem. Some scenarios also contain several devices of the same type (for example, a multi-bed hospital ward where every patient has the same medical equipment), which makes setting wake-up words troublesome: devices of the same type usually share the same wake-up word, so they interfere with each other and may be woken up simultaneously.
In VR (Virtual Reality) and AR (Augmented Reality) scenarios, where the user is in an immersive experience and typically does not have both hands free, voice wake-up and voice operation are a very effective means of interaction.
Eye tracking is currently becoming an important component of VR/AR technology. Eye tracking is a sensor technology that enables a device to measure eye position and eye movement, so as to determine where a person's attention is directed, what the person is interested in, and certain biometric features. It may be implemented using infrared devices and/or image capture devices such as cameras.
Disclosure of Invention
The invention aims to provide a method, a system and a server for operating equipment based on eyeball and voice instructions, which operate multiple devices through voice and eye commands and solve the problems of wake-up interference and hard-to-remember wake-up words in multi-device scenarios. The invention is suitable for VR/AR scenes, can wake up and operate multiple physical or virtual devices simultaneously, and allows multiple devices to be operated hands-free.
In order to achieve the purpose, the invention adopts the following technical scheme.
In a first aspect, a system for operating a device based on eyeball and voice commands is provided, the system being used for operating a plurality of devices, the system comprising: the system comprises a mobile terminal, a server and an AR/VR device;
the mobile terminal is used for collecting user voice, determining the position of the user, and sending the collected user voice, the position of the user and the user identifier UID to the server;
the server is used for receiving the user voice, the user position and the user identification UID sent by the mobile terminal, identifying the user voice, analyzing the identification result, acquiring a wake-up word and an instruction, determining operable equipment near the user according to the user position if the wake-up word is detected to be consistent with the wake-up word set by the UID, and sending display content containing information of the operable equipment to the AR/VR device;
the AR/VR device is used for displaying the operable equipment in a multi-device mode, performing eye tracking on the user, and sending the detected gaze point information of the user to the server;
the server is further used for determining a target device which the user wants to operate according to the display content of the AR/VR device and the user's gaze point information, and then sending the instruction to the target device to instruct the target device to perform corresponding operation in response to the user voice.
The AR/VR device is either an AR device or a VR device; multi-device display refers to displaying, on the display module, a virtual scene of the plurality of operable devices according to their positions.
In a possible implementation manner, the AR/VR apparatus is further configured to collect a user perspective and send the collected user perspective to the server; the server is further used for finding out devices (including devices visible to the user and devices possibly invisible and blocked by other objects) within the range of the user visual angle as operable devices according to the user visual angle when the operable devices located near the user are determined according to the user position.
In a possible implementation manner, the server is further configured to perform voiceprint recognition on the user voice, and when the voiceprint of the user voice belongs to the UID, the server performs an operation of analyzing and responding the recognition result, otherwise, the server does not analyze the recognition result and does not perform a subsequent operation.
In a second aspect, a method for operating a device based on eyeball and voice instructions is provided, the method being used for operating a plurality of devices and comprising the following steps: the mobile terminal collects the user voice, determines the user position, and sends the collected user voice, the user position and the user identifier UID to the server; the server receives the user voice, the user position and the user identifier UID sent by the mobile terminal and recognizes the user voice; the server parses the recognition result to obtain a wake-up word and an instruction; if the server detects that the wake-up word is consistent with the wake-up word set for the UID, it determines the operable devices near the user according to the user position; the server sends display content containing the operable device information to the AR/VR device for multi-device display; the AR/VR device performs eye tracking on the user to detect the user's gaze point and sends the gaze point information to the server; the server determines the target device that the user wants to operate according to the display content of the AR/VR device and the user's gaze point information; and the server, in response to the user voice, sends the instruction to the target device to instruct the target device to perform the corresponding operation.
In one possible implementation, the method further includes: the AR/VR device collects the user visual angle and sends the collected user visual angle to the server; when determining operable equipment near the user according to the position of the user, the server finds out equipment (including equipment visible to the user and equipment possibly invisible and shielded by other objects) within the range of the user visual angle as the operable equipment by combining the user visual angle.
In one possible implementation, the method further includes: and the server carries out voiceprint recognition on the user voice, and executes the operation of analyzing and responding the recognition result when the voiceprint of the user voice belongs to the UID, otherwise, does not analyze the recognition result and does not carry out subsequent operation.
In a third aspect, a server is provided, including: the receiving module is used for receiving user voice, user position and User Identification (UID) sent by the mobile terminal; the voice processing module is used for identifying the voice of the user, analyzing the identification result, acquiring a wake-up word and a command, and detecting whether the wake-up word is consistent with the wake-up word set by the UID; the position selection module is used for determining operable equipment near the user according to the position of the user if the awakening word is consistent with the awakening word set by the UID; the sending module is used for sending the display content containing the operable equipment information to the AR/VR device for multi-equipment display; the receiving module is further used for receiving the user's gaze point information returned by the AR/VR device; the position selection module is further used for determining target equipment which the user wants to operate according to the display content of the AR/VR device and the user's gaze point information; the sending module is further configured to send the instruction to the target device in response to a user voice to instruct the target device to perform a corresponding operation.
In a possible implementation manner, the receiving module is further configured to receive user perspective information sent by the AR/VR apparatus; the position selection module is further used for finding out the equipment which is near the user and within the user visual angle range as the operable equipment by combining the user visual angle when the operable equipment near the user is determined according to the user position.
In a possible implementation manner, the voice processing module is further configured to perform voiceprint recognition on the user voice, and when the voiceprint of the user voice belongs to the UID, the voiceprint of the user voice is analyzed and a response operation is performed on the recognition result, otherwise, the recognition result is not analyzed and no subsequent operation is performed.
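As an informal illustration of how the server-side modules in this aspect could fit together, the following sketch wires a wake-word check, position-based device selection and gaze-based target resolution into one handler. The class, method names, data shapes, the toy recognizer and the 5-metre radius are assumptions made for illustration only, not the actual implementation.

```python
import math

class Server:
    """Minimal sketch of the server flow: check the UID's wake word,
    select operable devices near the user, resolve the gazed device."""

    def __init__(self, device_registry, wake_words_by_uid, radius=5.0):
        self.devices = device_registry        # {device_id: {"pos": (x, y)}}
        self.wake_words = wake_words_by_uid   # {uid: "xiaorui", ...}
        self.radius = radius                  # "near the user", assumed metres

    def recognize(self, voice):
        return voice                          # stand-in for real ASR

    def parse(self, text):
        wake_word, _, command = text.partition(",")
        return wake_word.strip(), command.strip()

    def select_by_position(self, user_pos):
        return [d for d, info in self.devices.items()
                if math.dist(user_pos, info["pos"]) <= self.radius]

    def handle(self, voice, user_pos, uid, gaze_device):
        wake_word, command = self.parse(self.recognize(voice))
        if wake_word != self.wake_words.get(uid):
            return None                       # wrong wake word: stay inactive
        operable = self.select_by_position(user_pos)
        if gaze_device in operable:           # gazed device becomes the target
            return gaze_device, command
        return None

server = Server({"bed19_infusion": {"pos": (1.0, 2.0)},
                 "bed17_bed": {"pos": (30.0, 2.0)}},
                {"D001": "xiaorui"})
print(server.handle("xiaorui, increase flow", (1.5, 2.0), "D001",
                    "bed19_infusion"))
```

In a real deployment the recognize step would be the ASR discussed above, and the gaze information would arrive asynchronously from the AR/VR device rather than as a function argument.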
According to the technical scheme, the embodiment of the invention has the following advantages:
1. Wake-up itself does not depend on the device, and the same device may respond to multiple wake-up words, because the wake-up word is defined by the user and is bound to the user rather than to the device.
2. Each authorized user can operate multiple devices without a wake-up word being defined for each device; since each user generally has a single, fixed wake-up word, the user does not have to memorize per-device wake-up words.
3. Wake-up is independent of the distance to the device. Conventionally, a voice acquisition device is installed on the equipment to be controlled; in this scheme, the mobile terminal carried by the user performs voice acquisition (optionally, a voice feedback device may still be installed on the equipment to be voice-controlled). The voice acquisition microphone can be kept very close to the user (for example, by using a wearable device), which avoids the problems that arise when the microphone is installed at the equipment: the acquisition point is usually far from the user (the speaker) and may be blocked, the acquisition quality is poor, and nearby sounds are loud while distant sounds cannot be captured at all.
4. Since each user carries their own mobile terminal as the voice acquisition device, mutual interference is small: even if two users are in the same room (at some distance from each other) and speak their respective wake-up words at the same time, each user's own voice is loud for their own device while the interfering user's voice is faint. Furthermore, if the two users are close enough that user A's voice is picked up by user B's acquisition device, voiceprint recognition prevents false triggering (even if the two users have set the same wake-up word).
5. The user looks at the displayed devices through the AR/VR device, and eye tracking determines the target device the user wants to operate, without the device having to be specified by hand, by voice or by other means, which is fast and convenient.
6. The method is suitable for VR/AR scenes, can wake up and operate multiple physical or virtual devices simultaneously, and allows multiple devices to be operated hands-free.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of a system for operating a device based on eye and voice commands according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for operating a device based on eye and voice commands in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of a method for operating a device based on eye and voice commands in accordance with an embodiment of the present invention;
FIG. 4 is a schematic illustration of a multi-device display in an application scenario in accordance with an embodiment of the present invention;
fig. 5 is a block diagram of a server according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," and the like in the description and in the claims, and in the above-described drawings, are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The following are detailed descriptions of the respective embodiments.
In one embodiment of the present invention, a system for operating a device based on eye and voice commands is provided for operating a plurality of devices, such as a plurality of medical devices used in a hospital setting. Fig. 1 is a structural block diagram of the system; Fig. 2 is a schematic diagram of its working principle.
As shown in fig. 1 and 2, the system may include a mobile terminal 10 and a server 20 and an AR/VR device 30 for performing voice control on a plurality of devices to be operated (or called executing devices).
The mobile terminal 10 may have a voice collecting device 11, a positioning device 12 and a communication device 13. The voice collecting device 11 may be, for example, a lapel-clip, head-worn or neck-hung microphone (possibly combined with an earphone), i.e. a device worn on the user that stays at a fixed, short distance from the user's mouth/head, preferably using near-field sound collection with some ability to suppress distant noise. The positioning device 12 may be a high-precision indoor positioning device. The communication device 13 is used for communicating with the server 20 and may use various wireless communication modules such as a Wi-Fi module and/or a 4G module and/or a 5G module.
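Purely for illustration, one way the mobile terminal could package the collected voice, the user position and the UID before sending them to the server 20 is sketched below. The field names, the base64 audio encoding and the JSON framing are assumptions; the patent does not specify a wire format.

```python
import base64
import json

def build_payload(pcm_bytes, position, uid):
    """Bundle voice + position + UID into one message (illustrative only)."""
    return json.dumps({
        "uid": uid,                                       # user identifier
        "position": {"x": position[0], "y": position[1]},
        "audio": {
            "format": "pcm_s16le",                        # 16 kHz / 16-bit as in the example
            "sample_rate": 16000,
            "data": base64.b64encode(pcm_bytes).decode("ascii"),
        },
    })

print(build_payload(b"\x00\x01" * 4, (12.3, 4.5), "D001"))
```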
For example, AR glasses and/or VR glasses may be used as the AR/VR device 30. The AR/VR device 30 may include a display module 31 and an eye tracking module 32, the display module 31 may be configured to perform a multi-device display, and the eye tracking module 32 may be configured to track the user's eyes to detect the user's gaze point.
The server 20 may include a processor 21, a memory (not shown), and a first communication module 22 and a second communication module 23. The memory stores one or more programs, and the processor 21 may execute the one or more programs to operate a plurality of functional modules (or referred to as program modules).
The first communication module 22 is configured to communicate with the mobile terminal 10, the second communication module 23 is configured to communicate with a device to be operated (an execution device), and the first and second communication modules may adopt various wireless communication modules such as a wifi module and/or a 4G module and/or a 5G module.
The plurality of functional modules may include a voice processing module for processing voice data, which may be further divided into several sub-modules: a voice wake-up module, a voice recognition module and a result parsing module. The voice wake-up module is mainly used to activate the system through a voice command and to activate the other processing modules at the back end; the voice recognition module is mainly used to convert voice instructions and other user speech into text and to pass the basic recognition result to the result parsing module for voice command analysis; the three sub-modules are interrelated and act in conjunction.
The plurality of functional modules may further include a device selection module for screening the devices available to a user in order to determine the devices requiring voice control; this module may be further divided into two sub-modules: device selection module 1 and device selection module 2.
The execution device performs the relevant actions according to the instructions sent by the server and may additionally provide feedback.
In one embodiment of the invention:
the mobile terminal 10 is configured to collect user voice through the voice collecting device 11, determine a user position through the positioning device 12, and send the collected user voice, the user position, a user identifier UID and other data to the server 20 through the communication device 13;
the server 20 is configured to receive data sent by the mobile terminal 10, recognize a user voice, analyze a recognition result, obtain a wake-up word and a command, determine an operable device located near the user according to a user position if it is detected that the wake-up word is consistent with a wake-up word set by the UID, and send display content including information of the operable device to the AR/VR device 30;
the AR/VR device 30 is used for displaying the operable equipment in a multi-device mode through the display module, for viewing by the user wearing the AR/VR device; and for performing eye tracking on the user through the eye tracking module and sending the detected gaze point information of the user to the server 20;
the server 20 is further configured to determine a target device that the user wants to operate according to the display content of the AR/VR device 30 and the user's gaze point information, and then, in response to the user's voice, send the instruction to the target device to instruct the target device to perform a corresponding operation.
Further, in some embodiments, the AR/VR device 30 is further configured to collect a user perspective, and send the collected user perspective to the server 20; the server 20 is further configured to find a device located near the user and within the user view range as an operable device, in combination with the user view, when determining an operable device located near the user according to the user position.
Further, in some embodiments, the server 20 is further configured to perform voiceprint recognition on the user voice, and when the voiceprint of the user voice belongs to the UID, the recognition result is analyzed and a response operation is performed; and when the voiceprint of the user voice does not belong to the UID, the recognition result is not analyzed, and subsequent operation is not executed.
Further, in some embodiments, in the server:
device selection module 1, operating in the processor 21, for determining the devices located near the user based on the user location; and device selection module 2, configured to find, among the devices located near the user, the devices within the user's view range as operable devices according to the user perspective, to send display content containing information of the operable devices to the AR/VR apparatus 30 for multi-device display, and to determine, in combination with the gaze point information returned by the AR/VR apparatus 30, the target device that the user wants to operate.
The voice processing module (i.e. the voice wake-up module, the voice recognition module and the result analysis module, which act in conjunction) running in the processor 21 is configured to: identifying the user voice, including voiceprint identification, and judging whether the voiceprint of the user voice belongs to the UID; when the voiceprint of the user voice belongs to the UID, analyzing the recognition result of the user voice to obtain a wake-up word and an instruction, and if the wake-up word is detected to be consistent with the wake-up word set by the UID, activating the system; and sending the instruction to the target device in response to the user voice.
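The gating order described for the voice processing module (voiceprint first, then the wake-up word, then the command) can be sketched as below. The cosine-similarity voiceprint check, the threshold and the comma-separated command format are stand-ins chosen for illustration; the patent does not prescribe a particular voiceprint method.

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def process_utterance(text, speaker_embedding, uid, enrolled, wake_words,
                      threshold=0.7):
    """Return the parsed command, or None if the utterance is ignored."""
    ref = enrolled.get(uid)
    if ref is None or cosine(speaker_embedding, ref) < threshold:
        return None                      # voiceprint does not belong to the UID
    wake_word, _, command = text.partition(",")
    if wake_word.strip() != wake_words.get(uid):
        return None                      # not this user's wake-up word
    return command.strip()

print(process_utterance("xiaorui, raise the bed", [0.9, 0.1], "D001",
                        {"D001": [1.0, 0.0]}, {"D001": "xiaorui"}))
```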
Referring to fig. 3, an embodiment of the present invention further provides a method for operating a device based on eyeball and voice command, which is used to operate a plurality of devices. The method is implemented by a system as described above and may comprise the steps of:
31. the mobile terminal collects user voice, determines the user position, and sends the collected user voice, the user position, the User Identification (UID) and other data to the server.
32. The server receives the data sent by the mobile terminal and recognizes the user voice.
33. The server parses the recognition result to obtain the wake-up word and the instruction.
34. If the server detects that the wake-up word is consistent with the wake-up word set for the UID, the system is activated and the operable devices near the user are determined according to the user position.
35. The server transmits display content containing operable device information to the AR/VR device for multi-device display.
36. The AR/VR device performs eye tracking on the user to detect the user's gaze point and sends the gaze point information to the server. Here, fixation means aiming the fovea of the eye at the target stimulus; it is one of the three basic types of human eye movement, the others being saccades and smooth pursuit.
37. The server determines a target device which the user wants to operate according to the display content of the AR/VR device and the gazing point information of the user.
38. In response to the user voice, the server sends the instruction to the target device to instruct it to perform the corresponding operation.
Optionally, before step 33, the method further includes: performing voiceprint recognition on the user voice; when the voiceprint of the user voice belongs to the UID, step 33 is entered and the recognition result is parsed and responded to; otherwise, the recognition result is not parsed and no subsequent operation is performed.
Further, the method may also include: after the system is activated, the AR/VR device collects the user perspective and sends it to the server; then, in step 34, when determining the operable devices near the user according to the user position, the server may combine the user perspective to find the devices that are both near the user and within the user's view range as the operable devices.
The system and method for operating a device based on eyeball and voice command according to the embodiment of the present invention are briefly described above with reference to fig. 1 to 3.
In the following, a detailed description of the implementation procedure of the present invention is provided using a medical scenario in conjunction with the working principle shown in fig. 2, and specifically includes the following steps.
S1, information acquisition.
The mobile terminal collects the user's voice through a voice acquisition device such as a recording device, and obtains the user's position through the positioning device. The user voice, the user position and the user identifier (UID) are encoded and sent to the server. The user position matters for VR/AR because the visual system's displayed content depends on the user's location in the VR/AR interaction.
Optionally, VAD (Voice activity detection) may be performed at the mobile terminal, or VAD may be performed at the server.
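Either end could run a very simple VAD before anything else. The energy-threshold detector below is only a sketch of where such a check would sit; production systems normally use model-based detectors, and the 16 kHz / 16-bit frame format follows the example given later in this description.

```python
import array

def is_speech(frame_bytes, energy_threshold=500.0):
    """Crude energy-based VAD over one frame of signed 16-bit PCM."""
    samples = array.array("h", frame_bytes)
    if not samples:
        return False
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms > energy_threshold

silence = b"\x00\x00" * 320          # 20 ms of silence at 16 kHz
print(is_speech(silence))            # False
```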
Optionally, the mobile terminal may further collect information such as a user viewing angle and a fixation point of an eyeball through the AR/VR device.
S2, device selection module 1 performs position screening.
Position screening mainly determines which devices the user can activate; for example, within a virtual hospital ward, the devices near the user (doctor) are the ones likely to be voice-operated and are therefore selected.
This requires the position of each device, which is usually fixed (unless some devices are mobile) and may be pre-stored in the server, as well as the position of the user (doctor). Device selection module 1 screens out the devices near the user, for example within a certain range, according to the user position and a given rule; these devices constitute device selection list 1.
As shown in fig. 4, if the user (doctor) is near bed 19, all devices associated with bed 19 can be screened out, so that only the bed-19 devices can be operated by voice. Alternatively, the distance between the user (doctor) and each device can be computed directly, and all devices within a preset distance range can be screened out.
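Either screening rule from this step can be expressed in a few lines; the sketch below shows both a bed-association filter and a plain distance filter producing device selection list 1. Device names, bed numbers, coordinates and the 3-metre threshold are invented for illustration.

```python
import math

DEVICES = {
    "bed19_infusion": {"bed": 19, "pos": (5.0, 2.0)},
    "bed19_thermo":   {"bed": 19, "pos": (5.5, 2.2)},
    "bed17_bed":      {"bed": 17, "pos": (12.0, 2.0)},
}

def devices_for_bed(bed_no):
    """Rule A: every device bound to the bed the user is standing next to."""
    return [d for d, info in DEVICES.items() if info["bed"] == bed_no]

def devices_within(user_pos, max_dist=3.0):
    """Rule B: every device within a preset distance of the user."""
    return [d for d, info in DEVICES.items()
            if math.dist(user_pos, info["pos"]) <= max_dist]

print(devices_for_bed(19))         # device selection list 1 by bed
print(devices_within((5.2, 2.1)))  # device selection list 1 by distance
```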
S3, device selection module 2 performs further screening (view angle and gaze).
Based on the output of device selection module 1 and the user's viewing angle, this module determines the operable devices within the user's field of view and displays them on the display module in multi-device mode so that the user can see them; it then uses the eyeball gaze point information to determine, from the displayed devices, the device the user really wants to operate, i.e. the target device.
S3.1, according to the screening result output by device selection module 1 and the current user perspective, device selection module 2 determines the operable devices that are both near the user and within the user's view range. These operable devices are presented to the user on the display module in a multi-device display.
S3.2, using the VR/AR device, the user sees the several operable devices on the display module; these devices constitute device selection list 2, and the user gazes at the device to be operated.
S3.3, the VR/AR device tracks the user's eyes and determines the specific device being gazed at, i.e. the target device, so that the result parsing module can then be instructed to parse the speech recognition result.
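One plausible way to turn the reported gaze point into the target device is a simple hit test of the 2-D fixation coordinates against each displayed device's bounding box, as sketched below. The screen-space boxes and coordinates are illustrative; the actual output format of the AR/VR device's eye tracker is not specified in this disclosure.

```python
def gazed_device(gaze_xy, boxes):
    """boxes: {device_id: (x_min, y_min, x_max, y_max)} in screen space."""
    gx, gy = gaze_xy
    for device_id, (x0, y0, x1, y1) in boxes.items():
        if x0 <= gx <= x1 and y0 <= gy <= y1:
            return device_id
    return None                      # gaze fell outside every displayed device

display_list_2 = {"bed19_infusion": (100, 200, 220, 320),
                  "bed19_bed": (300, 180, 520, 400)}
print(gazed_device((350, 250), display_list_2))   # -> "bed19_bed"
```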
S4, voice wake-up.
The voice wake-up module and the voice recognition module may be interrelated with an eye tracking module of the VR/AR device and act in conjunction with the results parser. This step generally includes the following specific procedures.
S4.1 Voice activity detection (VAD): detect whether there is speech. This step can also be implemented in the terminal.
S4.2 Voice wake-up: if the speech is detected to contain the wake-up word (i.e. the keyword), the activation state of the system changes.
S4.2.1 first, the result parsing module needs to know the UID because each user uses its own wake word, and since the mobile terminal can be bound to the user, the result parsing module actually knows the UID of the known user when processing the audio.
S4.2.2 if the system is in an inactive state and detects that the user has spoken a wake up word for the user, the system enters an active state and the system will then respond to the user's instructions.
S4.2.3 if the system is inactive and no wake words are detected, the system will discard the speech recognition results.
S4.2.4 If the system is in the active state and the activation time has not yet been exceeded, the system will respond to the input.
S4.2.5 While the system is in the active state, the display module displays the devices that can be operated by the user.
S4.2.6 When the system is activated and the user's gaze is tracked onto a device, the system enters the designated-device state, and the corresponding instruction and information are transmitted to the designated target device.
S4.3 optionally, the server may add a voiceprint recognition module.
S4.3.1 When the system detects the wake-up word, it performs speaker voiceprint confirmation on the user voice corresponding to the wake-up word; if the speaker is confirmed, the system enters the active state, and if the speaker's voiceprint is not confirmed, the system remains in the inactive state.
S4.3.2 While the system is in the active state and keeps receiving voice input, voiceprint recognition is used to mark whether the speech belongs to the known user; if it does, the speech is parsed, and if not, it is not parsed.
S5, voice recognition.
S5.1 a speech recognition module, or called speech recognition decoder (decoder), for converting speech into text information, where the text information includes command information, some commands may also have parameters, information input contents, etc., and may include a wakeup word (in the case of a combination of wakeup and recognition).
S5.2 The speech recognition decoder involves an acoustic model (AM), a language model (LM) and a pronunciation dictionary; existing modeling techniques in speech recognition can be used here.
S6, result parsing.
The basic result of the speech recognition needs to be analyzed, that is, corresponding action is performed according to the result output by the recognizer.
Here, the instruction information of the user may start with an activation word, for example: "Wake up word + instruction".
For example: "Xiaorui, increase flow" (for the bed-19 infusion machine).
The wake-up word "Xiaorui" is associated with the user, not with the device. In this way, different users may each use their own wake-up word for the same device.
The device needs no special wake-up word, and the spoken command is almost exactly the same as the instruction the user (doctor) would normally give an assistant to operate the device.
S7, multi-device display.
After the user enters the activated state in step S6, the display module may display a virtual scene of the operable devices, as shown in fig. 4, overlaid on the original display content (in VR mode the original content is virtual content; in AR mode the real scene is shown). Some devices may be occluded by other objects, but in this view they still need to be shown explicitly, for example with a dashed outline. Generally more than one operable device is shown.
S8, eyeball tracking
After the multiple devices are displayed in S7, eye tracking detects which device the user gazes at, and the gazed target device enters the active state (note that this device-active state differs in meaning from the user's activated voice-command state).
S9, equipment action and feedback
S9.1 the server sends an instruction to the selected target device.
S9.2 the target device carries out relevant actions.
S9.3 Optionally, the target device provides relevant feedback when needed; the display module of the VR/AR device may also give feedback, as may user-side voice feedback.
In the following, the implementation flow of the embodiment of the present invention is illustrated with reference to a medical scenario example.
In one implementation example, the doctor wears an intelligent mobile terminal, namely a dedicated AR doctor terminal. It includes AR glasses for displaying augmented reality content and related medical information, a voice acquisition device, an indoor positioning device and a communication device. The terminal runs a hospital-installed program for encrypted communication and is connected to the hospital's private network through a 4G wireless communication system; the server is located in the hospital machine room, and each medical device is also connected to the hospital's private network over the encrypted network.
The voice collecting device may include: the collar-clip type microphone is used for collecting the voice of a doctor; and the earphone is used for enabling the doctor to hear the feedback sound of the mobile terminal.
The mobile terminal is also provided with an indoor positioning device, so that the doctor's position can be obtained in real time, accurate to the room (ward) and to a specific position within the room (for example, near certain hospital beds). If positioning is inaccurate, the doctor can also set the position actively, for example by reading the NFC tag/barcode on the patient's bed with a smartphone, thereby confirming that the doctor is operating the equipment of the bed-19 patient.
s1. information collection
The doctor (UID D001) can speak directly to the equipment, or to the patient and to other doctors and nurses. For example:
To the bed-19 infusion machine: "Xiaorui, speed up"
To the bed-19 hospital bed: "Xiaorui, raise the bed"
To the bed-19 body temperature collector: "Xiaorui, collect body temperature"
To the bed-19 hospital bed: "Xiaolan, raise the bed"
To the bed-17 hospital bed: "Xiaorui, raise the bed"
To another doctor: "Doctor Li, please look at this patient's medication ……"
Suppose another doctor nearby (UID D002) wants to operate the bed-19 body temperature collector and says "Xiaorui, collect body temperature", which is picked up by the D001 device and input into the system. Note that this doctor's own wake-up word is also "Xiaorui", the same as D001's.
The user voice (assuming 16kHz, 16bit PCM encoding) is collected by the voice collecting means, and the user voice and the user position and the user UID are transmitted to the server. Here, the UID of the user (doctor) is D001.
Assuming that Voice Activity Detection (VAD) is not performed at the mobile terminal, all voices collected by the mobile terminal are sent to the background server, recording is performed in the whole process at the server, and VAD is performed at the server.
s2. transmitting data
And s2.1, transmitting the user voice data and the user position information to a server of a hospital intranet data center through a 4G mobile communication network of the intelligent mobile terminal and an external communication Gateway (Gateway) of the hospital.
s2.2 The information required for the AR/VR display is likewise transmitted by the server to the intelligent terminal worn by the doctor.
s2.3 the server receives the data, which is typically real-time streaming data, while it stores it in real-time, assuming that the packet is 200 ms.
s3. device selection
s3.1 Device selection module 1 determines, according to the location information collected on the user side, the list of devices the user can operate, for example: device 1 to device 10;
s3.2 Device selection module 2 takes the output of module 1 (device 1 to device 10) together with the user's viewing direction and determines that the devices within the user's view range are device 1 to device 4; the display module displays device 1 to device 4.
s3.3 Seeing the multiple devices, the user decides which of device 1 to device 4 to gaze at.
For example, device 1 is an infusion machine, device 2 is a body temperature collector, device 3 is a hospital bed, and device 4 is a ventilator.
Assume the user finally gazes at device 3.
s3.4 The result parsing module is instructed to parse the speech recognition result and obtain an instruction. For example, device 3 is a hospital bed; its possible operation instructions and operation information are limited, the bed accepting only three instructions: raise, lower and lock.
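Once the target device is fixed, the parsed instruction can be validated against that device's limited command set, as in the sketch below. The command vocabularies and device identifiers are invented for this example.

```python
# Sketch: check the parsed instruction against the target device's
# limited command set (a hospital bed here only accepts raise/lower/lock).
ALLOWED = {
    "bed19_bed": {"raise", "lower", "lock"},
    "bed19_infusion": {"increase flow", "decrease flow", "stop"},
}

def resolve(device_id, instruction):
    if instruction in ALLOWED.get(device_id, set()):
        return f"{device_id}: {instruction}"
    return None                     # cannot be resolved for this device

print(resolve("bed19_bed", "raise"))          # accepted
print(resolve("bed19_bed", "increase flow"))  # None: not a bed command
```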
s4. Voice Wake-Up
The voice wake-up module and the voice recognition module are interrelated with the eye tracking module and act jointly with the result parsing module. This step generally includes the following specific procedures.
s4.1 An existing voice wake-up recognizer based on Hidden Markov Models (HMMs) is used here. The wake-up word of the user (UID D001) is "Xiaorui", and the system contains a wake-up word model and a background model; the probabilities of the two models are computed and compared in real time against a set threshold, and when the probability of the wake-up word model exceeds that of the background model by the threshold, the wake-up word is considered detected.
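The decision rule just described (keyword model score versus background model score plus a threshold) reduces to a one-line comparison; the sketch below uses placeholder scoring functions in place of real per-utterance HMM log-likelihoods, and the threshold value is arbitrary.

```python
# Sketch of the wake-word decision rule: declare detection when the
# keyword model outscores the background (filler) model by a margin.
# The scorers below are placeholders, not a real HMM decoder.

def detect_wake_word(frames, keyword_score, background_score, threshold=5.0):
    """True if log P(frames|keyword) - log P(frames|background) > threshold."""
    return keyword_score(frames) - background_score(frames) > threshold

fake_keyword = lambda frames: -40.0      # placeholder log-likelihoods
fake_background = lambda frames: -52.0
print(detect_wake_word([], fake_keyword, fake_background))   # True
```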
s4.2 Voice activity detection (VAD): if speech is detected, processing continues to the next step; if no valid speech is detected, nothing further is done.
s4.3 Voice wake-up: if the speech is detected to contain the wake-up word (keyword), the activation state of the system changes, ready to switch from the inactive state to the active state.
s4.3.1 The result parsing module first needs to know the user UID, because each user uses their own wake-up word; since the terminal device can be bound to the user, the parser in fact knows the UID of the known user when processing the audio.
s4.3.2 If the system is in the inactive state and the user is detected speaking that user's wake-up word, the system enters the active state and will then respond to the user's instructions.
s4.3.3 If the system is in the inactive state and no wake-up word is detected, the system discards the recognition result.
s4.3.4 If the system is in the active state and the activation time has not yet been exceeded, the system will respond to the input.
s4.3.5 The system is in the active state and the display module displays the devices that the user can operate. Four devices are available, marked in the AR glasses with dashed boxes: each device's location is framed with a dashed box, and brief device information such as the device name and status is displayed alongside.
s4.3.6 When the system is in the active state and the user's gaze is tracked onto a particular device, the designated-device state is entered, and the corresponding instruction and information are transmitted to that device.
For example:
For example, when the current user says "Xiaorui", the system enters a ready-to-activate state; voiceprint detection confirms that the user's UID is D001, and the system is confirmed to enter the active state.
At this point the system first obtains the screening result of device selection module 1; the information on the screened operable devices is passed to the display module, which derives the candidate device list 2 according to the viewing angle.
If the current user says "Xiaolan", the system does not enter the active state; "Xiaolan" is another user's wake-up word, so it does not activate the current user's system.
If the current user says "Doctor Li, please look at this patient's medication ……", the system is likewise not activated.
s4.4 optionally, a voiceprint recognition module may be added.
s4.4.1 When the system detects the wake-up word, it performs speaker voiceprint confirmation on the speech corresponding to the wake-up word; if the speaker is confirmed, the system enters the active state, and if the speaker's voiceprint is not confirmed, the system remains in the inactive state.
s4.4.2 While the system is in the active state and keeps receiving voice input, voiceprint recognition is used to mark whether the speech belongs to the known user; if it does, the speech is parsed, and if not, it is not parsed.
For example, another doctor nearby (UID D002) wants to operate the bed-19 body temperature collector and says "Xiaorui"; the speech is picked up by the device of user D001 and input into the system, but since the speaker's voiceprint is not that of D001, the system bound to D001 is not activated, even though that doctor's own wake-up word is also "Xiaorui".
s5. Speech recognition
The voice recognition and voice awakening can be combined into one, and the combined action is realized through the voice recognition module and the result analysis module. This step generally includes the following specific procedures.
s5.1 speech recognition module, i.e. recognition decoder (decoder), for converting speech into text information, where the text information includes instruction information, some instructions may also carry parameters, information input content, etc.
In this example, a decoder based on weighted finite-state transducers (WFST) may be used, and feature extraction may be performed using a deep neural network (DNN).
s5.2 The recognition decoder involves an acoustic model (AM), a language model (LM) and a pronunciation dictionary; the existing modeling techniques in speech recognition can be used here.
In this example, the speech processed by the recognizer is, for example:
For "Xiaorui, speed up", the recognizer outputs: "Xiaorui, speed up".
For "Xiaorui, raise the bed", the recognizer outputs: "Xiaorui, raise the bed".
For "Xiaorui, collect body temperature", the recognizer outputs: "Xiaorui, collect body temperature".
"Doctor Li, please look at this patient's medication ……" is not recognized in the current system, because the preceding voice wake-up did not put the system into the active state.
The nearby doctor (UID D002) who wants to operate the bed-19 body temperature collector and says "Xiaorui, collect body temperature" is likewise not recognized in the current system: the speech is picked up by the D001 device and input into the system, but the preceding voice wake-up did not put the system into the active state.
s6. result of analysis
The basic result of the speech recognition needs to be analyzed, that is, corresponding action is performed according to the result output by the recognizer.
s6.1 Here, the user's voice command starts with the wake-up word, i.e. "wake-up word + instruction"; after parsing is complete, it becomes "device + instruction".
For example, the user says "Xiaorui, increase flow" while gazing at the infusion machine (device 1), and this is parsed as "bed-19 infusion machine, increase flow".
The wake-up word "Xiaorui" is associated with the user, not with the device; in this way, different users may each use their own wake-up word for the same device.
The device needs no special wake-up word: the user looks at the bed-19 infusion machine while giving the voice command and says "increase flow", which is almost exactly the instruction the doctor would give an assistant to operate the device, and the doctor's hands never have to stop working.
Other examples:
"Xiaorui, raise": the user selects the bed-19 hospital bed, and this is parsed as "bed-19 hospital bed, raise".
"Xiaorui, collect body temperature": the user selects the bed-19 body temperature collector, and this is parsed as "bed-19 body temperature collector, collect".
"Xiaorui, raise": the user tries to select the bed-17 hospital bed, but device list 2 does not contain the bed-17 bed, so the system cannot resolve the command.
s6.2 If the system is in the active state and the activation time (e.g. 5 seconds) has not yet been exceeded, the system will respond to the input.
For example, after saying "Xiaorui", the user pauses for 4 seconds, then gazes at bed 19 and says "collect body temperature": the system enters the active state on receiving "Xiaorui" and is still active, so the command is handled. If the pause lasts 6 seconds, however, the system returns to the inactive state, the display stops showing the candidate devices 1 to 4, and the input "collect body temperature" is ignored.
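The 5-second activation window in this example can be modelled as a tiny timer-based state holder, as sketched below; the window length follows the example above, while the class and method names are invented.

```python
import time

class ActivationState:
    """Sketch of the activation window: active only for window_s seconds
    after the wake-up word; later input is silently ignored."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self.activated_at = None

    def wake(self):
        self.activated_at = time.monotonic()

    def is_active(self):
        return (self.activated_at is not None and
                time.monotonic() - self.activated_at <= self.window_s)

state = ActivationState()
state.wake()
print(state.is_active())          # True within the window, False after it
```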
s6.3 With the system in the active state, once parsing is complete, the corresponding instruction and information are transmitted to the relevant device.
s7. issue commands
According to the parsed user instruction and the determined target device, the server decides whether to send the instruction to the specified target device.
If no instruction needs to be sent, error information can be directly fed back to the user.
If the instruction needs to be sent, the specified instruction needs to be sent to the specified device.
For example:
"Xiaorui, speed up" + bed-19 infusion machine: the bed-19 infusion machine gives voice feedback "bed-19 infusion machine sped up to xxx ml per minute" and flashes;
"Xiaorui, raise" + bed-19 hospital bed: the bed-19 hospital bed gives voice feedback "bed 19 raised" and flashes;
"Xiaorui, collect body temperature" + bed-19 body temperature collector: the bed-19 body temperature collector gives voice feedback "bed-19 body temperature collection in progress ... body temperature 36.8 degrees, normal" and flashes;
For utterances that did not activate the system, the instruction is ignored and there is no feedback;
"Xiaorui, raise" + bed-17 hospital bed: the intelligent terminal (the worn earphone) feeds back the voice message "no instruction received" directly to the user and flashes.
Referring to fig. 5, an embodiment of the present invention further provides a server, including:
a receiving module 51, configured to receive a user voice, a user location, and a user identifier UID sent by a mobile terminal;
the voice processing module 52 is configured to recognize a user voice, analyze a recognition result, acquire a wakeup word and a command, and detect whether the wakeup word is consistent with the wakeup word set by the UID;
a location selection module 53, configured to determine, according to a user location, an operable device located near the user if the wakeup word is consistent with the wakeup word set by the UID;
a sending module 54, configured to send display content including information of an operable device to the AR/VR apparatus for multi-device display;
the receiving module 51 is further configured to receive the gazing point information of the user returned by the AR/VR device;
the position selection module 53 is further configured to determine a target device that a user wants to operate according to the display content of the AR/VR apparatus and the gaze point information of the user;
the sending module 54 is further configured to send the instruction to the target device in response to the user voice to instruct the target device to perform a corresponding operation.
In some embodiments, the receiving module 51 is further configured to receive user perspective information returned by the AR/VR device; the position selecting module 53 is further configured to find, when determining an operable device located near the user according to the user position, a device located near the user and within the user view range as an operable device in combination with the user view angle.
In some embodiments, the voice processing module 52 is further configured to perform voiceprint recognition on the user voice, and when the voiceprint of the user voice belongs to the UID, the voiceprint is analyzed and a response operation is performed on the recognition result, otherwise, the recognition result is not analyzed and no subsequent operation is performed.
To sum up, the embodiment of the present invention discloses a method, a system and a server for operating a device based on eyeball and voice instruction, and the above technical solutions show that the embodiment of the present invention has the following advantages:
1. the wake itself is not dependent on the device and the same device may have multiple wake words because, depending on the user-defined wake word, the wake word is tied to the user and not to the device.
2. Each authorized user can operate multiple devices without a wake-up word being defined for each device, which spares the user from memorizing per-device wake-up words, because each user's wake-up word is fixed and, in the general case, a single word.
3. Waking up is independent of the distance to the device. Conventionally, a voice acquisition device is installed on the equipment to be controlled; in this scheme, the mobile terminal carried by the user performs voice acquisition. Optionally, a voice feedback device may still be installed on the equipment to be voice-controlled. Because the acquisition microphone can be kept very close to the user (for example, by using a wearable device), the scheme avoids the problems of device-mounted microphones: the acquisition point is usually far from the user (the speaker) and may be occluded, so the acquisition quality is poor, and speech that is loud at close range cannot be picked up at a distance.
4. Meanwhile, each user carries their own mobile terminal as the voice acquisition device, which has the potential benefit of low mutual interference: even if two users are in the same room (at some distance apart) and speak their respective wake-up words simultaneously, each user's own device hears its owner loudly and the interfering speaker (the other user) faintly. Further, if the two users are close enough that user A's voice is picked up by user B's acquisition device, voiceprint recognition ensures that the system is not falsely triggered (even when the two users have set the same wake-up word).
5. The user looks at the devices displayed by the AR/VR apparatus, and eyeball tracking of the user determines the target device the user wants to operate, without selecting manually, by voice, or in other ways, which is fast and convenient to use.
6. The method is suitable for VR/AR scenarios, can wake up and manipulate multiple physical or virtual devices at the same time, and allows multiple devices to be operated hands-free.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; those of ordinary skill in the art will understand that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A system for operating a device based on eye and voice commands for operating a plurality of devices, the system comprising: the system comprises a mobile terminal, a server and an AR/VR device;
the mobile terminal is used for collecting user voice, determining the position of the user, and sending the collected user voice, the position of the user and the user identifier UID to the server;
the server is used for receiving the user voice, the user position and the user identification UID sent by the mobile terminal, identifying the user voice, analyzing the identification result, acquiring a wake-up word and an instruction, determining operable equipment near the user according to the user position if the wake-up word is detected to be consistent with the wake-up word set by the UID, and sending display content containing information of the operable equipment to the AR/VR device;
the AR/VR device is used for displaying the operable equipment in a multi-equipment mode, tracking eyeballs of the user and sending the detected information of the point of regard of the user to the server;
the server is further used for determining a target device which the user wants to operate according to the display content of the AR/VR device and the user's gaze point information, and then sending the instruction to the target device to instruct the target device to perform corresponding operation in response to the user voice.
2. The system of claim 1,
the AR/VR device is also used for collecting the user visual angle and sending the collected user visual angle to the server;
and the server is also used for finding out equipment which is positioned near the user and within the user visual angle range as the operable equipment by combining the user visual angle when the operable equipment positioned near the user is determined according to the position of the user.
3. The system of claim 2,
and the server is also used for carrying out voiceprint recognition on the user voice, and when the voiceprint of the user voice belongs to the UID, the recognition result is analyzed and the response operation is carried out.
4. A method for operating a device based on eye and voice commands for operating a plurality of devices, the method comprising:
the mobile terminal collects user voice, determines the user position and sends the collected user voice, the user position and the user identification UID to the server;
the server receives the user voice, the user position and the user identification UID sent by the mobile terminal and identifies the user voice;
the server analyzes the recognition result to obtain awakening words and instructions;
if the server detects that the awakening words are consistent with the awakening words set by the UID, determining operable equipment near the user according to the position of the user;
the server sends display content containing operable equipment information to the AR/VR device for multi-equipment display;
the AR/VR device performs eyeball tracking on the user to detect the point of regard of the user and sends the point of regard information of the user to the server;
the server determines target equipment which the user wants to operate according to the display content of the AR/VR device and the gaze point information of the user;
and the server responds to the voice of the user and sends the instruction to the target equipment to instruct the target equipment to execute corresponding operation.
5. The method of claim 4, further comprising:
the AR/VR device collects the user visual angle and sends the collected user visual angle to the server;
when determining the operable equipment near the user according to the position of the user, the server finds out the equipment near the user and within the user visual angle range as the operable equipment by combining the user visual angle.
6. The method of claim 4, further comprising:
and the server carries out voiceprint recognition on the user voice, and when the voiceprint of the user voice belongs to the UID, the recognition result is analyzed and response operation is carried out.
7. A server, comprising:
the receiving module is used for receiving user voice, user position and User Identification (UID) sent by the mobile terminal;
the voice processing module is used for identifying the voice of the user, analyzing the identification result, acquiring a wake-up word and a command, and detecting whether the wake-up word is consistent with the wake-up word set by the UID;
the position selection module is used for determining operable equipment near the user according to the position of the user if the awakening word is consistent with the awakening word set by the UID;
the sending module is used for sending the display content containing the operable equipment information to the AR/VR device for multi-equipment display;
the receiving module is further used for receiving the user's gaze point information returned by the AR/VR device;
the position selection module is further used for determining target equipment which the user wants to operate according to the display content of the AR/VR device and the user's gaze point information;
the sending module is further configured to send the instruction to the target device in response to a user voice to instruct the target device to perform a corresponding operation.
8. The server of claim 7,
the receiving module is further configured to receive user perspective information sent by the AR/VR device;
the position selection module is further used for finding out the equipment which is near the user and within the user visual angle range as the operable equipment by combining the user visual angle when the operable equipment near the user is determined according to the user position.
9. The server of claim 7,
and the voice processing module is also used for carrying out voiceprint recognition on the user voice, and when the voiceprint of the user voice belongs to the UID, the recognition result is analyzed and the response operation is carried out.
CN202010953494.0A 2020-09-11 2020-09-11 Method and system for operating equipment based on eyeball and voice instruction and server Pending CN112053689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010953494.0A CN112053689A (en) 2020-09-11 2020-09-11 Method and system for operating equipment based on eyeball and voice instruction and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010953494.0A CN112053689A (en) 2020-09-11 2020-09-11 Method and system for operating equipment based on eyeball and voice instruction and server

Publications (1)

Publication Number Publication Date
CN112053689A true CN112053689A (en) 2020-12-08

Family

ID=73610738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010953494.0A Pending CN112053689A (en) 2020-09-11 2020-09-11 Method and system for operating equipment based on eyeball and voice instruction and server

Country Status (1)

Country Link
CN (1) CN112053689A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096656A (en) * 2021-03-30 2021-07-09 深圳创维-Rgb电子有限公司 Terminal device awakening method and device and computer device
CN116030812A (en) * 2023-03-29 2023-04-28 广东海新智能厨房股份有限公司 Intelligent interconnection voice control method, device, equipment and medium for gas stove

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106774848A (en) * 2016-11-24 2017-05-31 京东方科技集团股份有限公司 Remote control equipment and remote control system
WO2017219195A1 (en) * 2016-06-20 2017-12-28 华为技术有限公司 Augmented reality displaying method and head-mounted display device
WO2018000200A1 (en) * 2016-06-28 2018-01-04 华为技术有限公司 Terminal for controlling electronic device and processing method therefor
CN107924237A (en) * 2015-09-02 2018-04-17 微软技术许可有限责任公司 The augmented reality control of computing device
CN108735209A (en) * 2018-04-28 2018-11-02 广东美的制冷设备有限公司 Wake up word binding method, smart machine and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107924237A (en) * 2015-09-02 2018-04-17 微软技术许可有限责任公司 The augmented reality control of computing device
WO2017219195A1 (en) * 2016-06-20 2017-12-28 华为技术有限公司 Augmented reality displaying method and head-mounted display device
CN107771342A (en) * 2016-06-20 2018-03-06 华为技术有限公司 A kind of augmented reality display methods and head-mounted display apparatus
WO2018000200A1 (en) * 2016-06-28 2018-01-04 华为技术有限公司 Terminal for controlling electronic device and processing method therefor
CN107801413A (en) * 2016-06-28 2018-03-13 华为技术有限公司 The terminal and its processing method being controlled to electronic equipment
CN106774848A (en) * 2016-11-24 2017-05-31 京东方科技集团股份有限公司 Remote control equipment and remote control system
CN108735209A (en) * 2018-04-28 2018-11-02 广东美的制冷设备有限公司 Wake up word binding method, smart machine and storage medium

Similar Documents

Publication Publication Date Title
US11163356B2 (en) Device-facing human-computer interaction method and system
EP3616050B1 (en) Apparatus and method for voice command context
CN111798850B (en) Method and system for operating equipment by voice and server
CN103561652B (en) Method and system for assisting patients
CN104410883B (en) The mobile wearable contactless interactive system of one kind and method
CN106157956A (en) The method and device of speech recognition
US20080289002A1 (en) Method and a System for Communication Between a User and a System
JPH0981309A (en) Input device
JP3844874B2 (en) Multimodal interface device and multimodal interface method
CN112053689A (en) Method and system for operating equipment based on eyeball and voice instruction and server
CN109120790A (en) Call control method, device, storage medium and wearable device
JP2007160473A (en) Interactive object identifying method in robot and robot
CN110223711A (en) Interactive voice based on microphone signal wakes up electronic equipment, method and medium
EP2504745B1 (en) Communication interface apparatus and method for multi-user
US20190130901A1 (en) Information processing device and information processing method
CN110111776A (en) Interactive voice based on microphone signal wakes up electronic equipment, method and medium
CN109240639A (en) Acquisition methods, device, storage medium and the terminal of audio data
CN112016367A (en) Emotion recognition system and method and electronic equipment
FI128000B (en) Speech recognition method and apparatus based on a wake-up word
CN106774915A (en) A kind of receiving and sending control method and wearable device of wearable device communication information
JP2020126195A (en) Voice interactive device, control device for voice interactive device and control program
KR101728707B1 (en) Method and program for controlling electronic device by wearable glass device
US20080147439A1 (en) User recognition/identification via speech for a personal health system
JP2012118679A (en) Information processor, word discrimination device, screen display operation device, word registration device and method and program related to the same
Goetze et al. Multimodal human-machine interaction for service robots in home-care environments

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination