CN112612358A - Human and large screen multi-mode natural interaction method based on visual recognition and voice recognition - Google Patents

Human and large screen multi-mode natural interaction method based on visual recognition and voice recognition

Info

Publication number
CN112612358A
Authority
CN
China
Prior art keywords
interaction
voice
information
recognition
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910946153.8A
Other languages
Chinese (zh)
Inventor
丁建华 (Ding Jianhua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201910946153.8A
Publication of CN112612358A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F 3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481: Interaction techniques based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04817: Interaction techniques using icons
    • G06F 3/0484: Interaction techniques for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/16: Sound input; sound output
    • G06F 3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition. The system comprises an image sensing module, an image recognition module and an image analysis module; a sound sensing module, a voice recognition module and a voice analysis module; a window interaction module, an interaction position module and an interaction command module; and a video signal source, a video interaction matrix and a large display screen. By sensing, recognising and analysing the motions of the user's fingers (or arms) and the user's voice, and combining them with the interaction start positions, action interaction instructions and voice interaction instructions preset in the system, the system obtains the interaction position and interaction instruction needed when the person interacts with the content displayed on the large screen, generates an interaction command, and drives the video signal source device and the video interaction matrix to output the video signals of the corresponding display content to the large screen. In this way multi-modal natural interaction between the person and complex 2D or 3D display content on the large screen is realised.

Description

Human and large screen multi-mode natural interaction method based on visual recognition and voice recognition
Technical Field
The invention relates to the technical field of natural interaction between a person and a large screen within human-computer interaction, and provides a multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition.
Background
Several methods currently exist on the market for interacting directly with the content displayed on a large screen, such as pen touch, pointer touch and remote-control interaction in front of the screen. Most of them require a dedicated interaction tool held in the hand, such as an interactive pen, pointer or remote controller, so fully natural interaction between a person and the large screen cannot be achieved. If the operator must stand close to the large screen to operate it, viewing the whole screen becomes very inconvenient when the screen is large, and there are many interaction blind areas that cannot be reached. With remote-control interaction the user can operate at a distance from the screen, but if the remote controller has only a few function keys, many complex interaction functions cannot be realised; if many function keys are added, the limited space on the controller makes operation inconvenient, which degrades the user experience and slows down the interaction. Somatosensory (motion-sensing) interaction is also used, but because current somatosensory technology is not mature enough, only a few simple gesture interactions can be realised and the interactive content on the large screen cannot be located, so the complex interaction requirements of a large screen with multiple signal sources and content formats cannot be met. Moreover, existing interaction methods can only interact with flat content displayed on the large screen and cannot provide three-dimensional interaction with 3D display content.
Disclosure of Invention
To solve these problems, the invention provides a multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition. Without any interaction tool, a user can interact naturally with complex display content on the large screen in 2D or 3D space using only the natural means of human communication such as finger and arm motions and voice.
The system workflow of the method is shown in Fig. 1 and mainly comprises the following steps: the optical sensing module acquires 2D or 3D visual information of the scene from real-time optical images; the image recognition module extracts the 2D or 3D position and motion information of the user's fingers and arms from the visual information; the image analysis module, using the signal-source cursor initial-position library and the action interaction instruction library preset in the system, extracts from this position and motion information the interaction position and the action interaction instruction generated when the person interacts with the content on the large screen; the sound sensing module acquires the sound of the scene through a microphone array; the voice recognition module extracts the user's speech and converts it into text; the voice analysis module extracts the user's voice interaction instruction from the text according to the voice interaction instruction library preset in the system; the window interaction module obtains from the action or voice interaction instruction the interaction window (a specific signal display window on the large screen), the window instruction (enlarging, shrinking, roaming, switching, opening or closing the window, and so on) and the initial position of the cursor; the interaction position module determines the interaction position of the cursor from the movement of the user's finger or arm and the initial cursor position on the large screen; the interaction command module generates interaction command information from the cursor's interaction position and the action or voice interaction instruction, which drives the video signal source device to change its display content or drives the video interaction matrix to change how the video signals are output; the video signal source outputs the video signals to be displayed; the video interaction matrix sends the video signals of the video sources to the large display screen according to the interaction command; and the large display screen displays the multiple video signals delivered by the video interaction matrix.
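For illustration only, the dataflow through these modules can be sketched as below. The data structures, field names, window numbering and screen coordinates are assumptions made for this sketch, not values prescribed by the method.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical messages passed along the workflow described above.

@dataclass
class VisualObservation:
    fingertip_xyz: Tuple[float, float, float]  # 3D fingertip position from the image recognition module (metres)
    motion_label: str                          # classified finger/arm motion, e.g. "wave_right_twice"

@dataclass
class SpeechObservation:
    text: str                                  # transcript produced by the voice recognition module

@dataclass
class InteractionCommand:
    window_id: Optional[int]                   # target signal display window on the large screen
    cursor_xy: Optional[Tuple[int, int]]       # interaction position in screen pixels
    instruction: Optional[str]                 # e.g. "zoom_in", "operate window 2"

def run_cycle(visual: VisualObservation, speech: SpeechObservation) -> InteractionCommand:
    """One pass through the workflow: analyse both channels, resolve the target
    window and cursor position, and emit a command for the video signal source
    or the video interaction matrix (all heavily simplified here)."""
    instruction = speech.text.strip().lower() or visual.motion_label   # image/voice analysis modules
    window_id = 2 if "window 2" in instruction else None               # window interaction module
    cursor_xy = (2880, 540) if window_id == 2 else None                # interaction position module
    return InteractionCommand(window_id, cursor_xy, instruction)       # interaction command module

print(run_cycle(VisualObservation((0.1, 0.0, 1.4), "wave_right_twice"),
                SpeechObservation("operate window 2")))
```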
By combining visual recognition with voice recognition and designing several tailored schemes around the characteristics of the two interaction modes, the invention effectively overcomes the shortcomings of existing simple somatosensory interaction, namely the inability to locate the interaction position, the limited functionality and the inability to interact with complex content. A user standing away from the screen, carrying no interaction tool at all, can interact naturally and bare-handed with the content of the multiple signal sources displayed on the large screen.
In particular, the invention overcomes the shortcomings of existing non-natural interaction methods: multi-modal natural interaction with complex display content on the large screen, in 2D or 3D space, is achieved using only natural human communication such as finger and arm motions and voice.
The optical image sensing module and the image recognition module are installed in front of the user; they optically sense the user's body motions and, through the built-in image recognition function, recognise and extract the position and motion information of the user's fingers and arms. The voice sensing module and voice recognition module installed near the user acquire the user's speech and recognise the voice interaction instructions it contains. This information is sent by wire or wirelessly to a natural interaction server, where the interaction software analyses and processes it to generate interaction command information. The commands are then sent to the video interaction matrix or the corresponding video signal source device, changing the video content displayed on the large screen and thus responding to the user's interaction.
The large display screen can be a display of any size and shape, such as an LCD, LED, DLP or projection screen, and can also be a spliced (tiled) screen; each display has a video signal input port, which can be of any type, including but not limited to HDMI, DVI and VGA.
The video signal source device is a terminal device capable of generating video signals, such as a computer, video camera, tablet (PAD) or mobile phone, connected to the video interaction matrix by wire or wirelessly.
Preferably, the optical sensing and recognition module employs the Azure Kinect developed by Microsoft Corporation;
preferably, the voice sensing module and the voice recognition module adopt voice sensing and recognition products developed by iFlytek or by Baidu;
preferably, the large display screen is a spliced large screen;
preferably, the video signal source device is a computer;
preferably, the natural interaction server is a computer;
preferably, the image analysis module, the voice analysis module, the window interaction module, the interaction position module and the interaction command module are realized by different software function modules in a natural interaction server;
preferably, the video interaction matrix is a splicing controller;
preferably, the video signal source device (a computer) is connected to the splicing controller through an HDMI cable;
preferably, the video output signal output by the splicing controller is connected to a corresponding video input port of the spliced large screen.
The optical sensing and recognition module and the voice sensing and recognition module can each be connected to the natural interaction server by wire or wirelessly for data communication; preferably, both are connected to the natural interaction server by wire.
The user stands in an area that can be effectively sensed by the optical sensing module and the sound sensing module and interacts with the content displayed on the large screen; the specific interaction can include making finger or gesture movements in the air, or speaking preset interaction instructions.
The optical sensing module acquires 2D or 3D visual information of the scene from real-time images.
The image recognition module extracts the 2D or 3D position and motion information of the user's fingers and arms from the visual information through a software algorithm. The optical sensing and recognition module adopted by the invention combines a depth sensor with a high-definition camera and, using a model of the human skeleton and joint structure, obtains the three-dimensional position coordinates and motion trajectories of the human skeleton, including the fingers and arms. It can therefore interact not only with two-dimensional flat display content but also with three-dimensional display images, for example three-dimensional holographic projections.
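As an illustration of this step, the sketch below turns per-frame skeleton joints into displacement (motion) information. The `read_skeleton_frame` function is a hypothetical placeholder standing in for the vendor body-tracking SDK; it is not the real Azure Kinect API and returns a fixed frame so the sketch runs without hardware.

```python
from collections import deque
from typing import Deque, Dict, Tuple

Joint = Tuple[float, float, float]   # 3D joint position in metres, camera coordinates

def read_skeleton_frame() -> Dict[str, Joint]:
    """Placeholder for the body-tracking SDK: returns named 3D joints of the
    tracked user. A fixed frame is returned here for demonstration."""
    return {"hand_tip_right": (0.12, -0.05, 1.40), "wrist_right": (0.10, -0.10, 1.45)}

class MotionTracker:
    """Keeps a short history of one joint so the image analysis stage can measure
    displacements and recognise motions such as 'wave right twice'."""
    def __init__(self, history: int = 30):
        self.trace: Deque[Joint] = deque(maxlen=history)

    def update(self, joint: Joint) -> Tuple[float, float, float]:
        """Append the new sample and return the displacement since the previous frame."""
        if self.trace:
            px, py, pz = self.trace[-1]
            delta = (joint[0] - px, joint[1] - py, joint[2] - pz)
        else:
            delta = (0.0, 0.0, 0.0)
        self.trace.append(joint)
        return delta

tracker = MotionTracker()
frame = read_skeleton_frame()
print(tracker.update(frame["hand_tip_right"]))   # (0.0, 0.0, 0.0) on the first frame
```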
The image analysis module uses a software algorithm to compare the 2D or 3D position and motion information of the user's fingers and arms with the data in the preset window database 1, the cursor initial-position database and the action interaction instruction database; it extracts the interaction position and action interaction instruction generated when the person interacts with the content on the large screen and passes them to the window interaction module software. Window database 1 stores the preset correspondence between signal-source windows and finger or arm actions; the cursor initial-position database stores the preset geometric position of the initial cursor inside each signal-source display window on the large screen; the action interaction instruction database stores the preset correspondence between finger and arm actions and action interaction instructions.
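A minimal sketch of these three preset libraries and the matching step follows; every entry (gesture names, window numbers, screen coordinates, instruction names) is an invented example, not a value taken from the invention.

```python
from typing import Dict, Optional, Tuple

# Window database 1: finger/arm action -> signal-source window (illustrative entries).
WINDOW_DB_1: Dict[str, int] = {"wave_right_twice": 2, "wave_left_twice": 1}
# Cursor initial-position database: window -> geometric centre of that window, in screen pixels.
CURSOR_INITIAL_POSITION: Dict[int, Tuple[int, int]] = {1: (960, 540), 2: (2880, 540)}
# Action interaction instruction database: finger/arm action -> interaction instruction.
ACTION_INSTRUCTION_DB: Dict[str, str] = {"pinch_out": "zoom_in",
                                         "pinch_in": "zoom_out",
                                         "raise_left_palm": "enter_voice_mode"}

def analyse_motion(motion_label: str) -> Tuple[Optional[int], Optional[Tuple[int, int]], Optional[str]]:
    """Map a recognised finger/arm motion to (target window, initial cursor position,
    action interaction instruction), as the image analysis module does."""
    window = WINDOW_DB_1.get(motion_label)
    cursor = CURSOR_INITIAL_POSITION.get(window) if window is not None else None
    instruction = ACTION_INSTRUCTION_DB.get(motion_label)
    return window, cursor, instruction

print(analyse_motion("wave_right_twice"))   # (2, (2880, 540), None)
```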
The sound sensing module acquires the sound of the scene through a microphone array arranged around the user.
The voice recognition module extracts the user's speech from the acquired scene sound through a software algorithm and converts it into text.
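As a stand-in for the iFlytek or Baidu products named elsewhere in this document, the sketch below uses the open-source SpeechRecognition package with Google's free web recogniser to show only the capture-and-transcribe step; it assumes a working microphone and network access and is not the interface of the products actually used.

```python
# pip install SpeechRecognition pyaudio
import speech_recognition as sr

def transcribe_once(language: str = "zh-CN") -> str:
    """Capture one utterance from the microphone and return it as text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)  # calibrate to room noise
        audio = recognizer.listen(source)                          # record until silence
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:                                    # speech was unintelligible
        return ""

if __name__ == "__main__":
    print(transcribe_once())
```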
The voice analysis module compares the recognised text with the information in the preset window database 2, the cursor initial-position database and the voice interaction instruction database through a software algorithm and extracts the user's voice interaction instruction. Window database 2 stores the preset correspondence between signal-source windows and spoken text; the cursor initial-position database stores the preset geometric position of the initial cursor inside each signal-source display window on the large screen; the voice interaction instruction database stores the preset correspondence between spoken text and interaction instructions.
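A corresponding sketch of the speech-channel matching step is given below; the phrase-to-window and phrase-to-instruction tables are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

# Window database 2: spoken phrase -> signal-source window (illustrative entries).
WINDOW_DB_2: Dict[str, int] = {"window 1": 1, "window 2": 2}
# Voice interaction instruction database: spoken phrase -> interaction instruction.
VOICE_INSTRUCTION_DB: Dict[str, str] = {"operate": "select_window",
                                        "zoom in": "zoom_in",
                                        "enter text input mode": "enter_text_mode"}

def analyse_speech(transcript: str) -> Tuple[Optional[int], Optional[str]]:
    """Extract (target window, voice interaction instruction) from the recognised
    text, as the voice analysis module does with its preset libraries."""
    text = transcript.lower()
    window = next((wid for phrase, wid in WINDOW_DB_2.items() if phrase in text), None)
    instruction = next((instr for phrase, instr in VOICE_INSTRUCTION_DB.items() if phrase in text), None)
    return window, instruction

print(analyse_speech("Operate window 2"))   # (2, 'select_window')
```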
The computer software corresponding to the window interaction module determines, from the interaction position and action interaction instruction output by the image analysis module or the voice interaction instruction output by the voice analysis module, the interaction window (a specific signal display window on the large screen), the window instruction (enlarging, shrinking, roaming, switching, opening or closing the window, and so on) and the initial position of the cursor.
The computer software corresponding to the interaction position module determines the interaction position of the cursor from the movement of the user's finger or arm and the initial cursor position on the large screen. How the interaction position is determined is defined jointly by the preset voice or action interaction instructions and the finger or arm motions. Specifically, a preset voice or action instruction selects the signal-source window for each interaction: for example, when the user says "operate window 2", the natural interaction server outputs an instruction that highlights display window 2 on the large screen and moves the cursor to the geometric centre of window 2. Voice input can also drive a variety of cursor movements, for example "move the cursor to the upper left corner, then to the left ...", "move the cursor to a certain display content", "open a certain presentation document" or "search a certain knowledge point". The same functions can be triggered by preset actions, for example waving the arm twice to the right to mean "operate window 2"; interaction instructions corresponding to gestures or finger actions can likewise be defined for left and right mouse-button actions, setting the interaction position, zooming and moving windows, playing documents, and so on. Gestures and voice can also be combined: for example, raising the left palm starts the voice instruction recognition mode, and then saying "operate window 2" performs that function; afterwards the cursor moves in space according to the displacement of the finger, as sketched below.
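The cursor-following behaviour described above can be sketched as follows; the gain that converts finger displacement in metres into screen pixels, and the screen resolution, are assumed tuning values.

```python
from typing import Tuple

class CursorController:
    """Moves the cursor from its initial position (the geometric centre of the
    selected window) according to the user's finger displacement."""
    def __init__(self, initial_xy: Tuple[int, int], gain_px_per_m: float = 4000.0,
                 screen_wh: Tuple[int, int] = (3840, 1080)):
        self.x, self.y = initial_xy
        self.gain = gain_px_per_m
        self.w, self.h = screen_wh

    def on_finger_displacement(self, dx_m: float, dy_m: float) -> Tuple[int, int]:
        """dx_m / dy_m: fingertip displacement since the last frame, in metres;
        camera x maps to screen x, camera y (up) is inverted to screen y (down)."""
        self.x = min(max(int(self.x + dx_m * self.gain), 0), self.w - 1)
        self.y = min(max(int(self.y - dy_m * self.gain), 0), self.h - 1)
        return self.x, self.y

cursor = CursorController(initial_xy=(2880, 540))   # geometric centre of window 2
print(cursor.on_finger_displacement(0.05, 0.02))    # finger moved 5 cm right and 2 cm up
```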
The computer software corresponding to the interaction command module generates interaction command information from the cursor's interaction position and the action or voice interaction instruction, which drives the video signal source device to change its display content or drives the video interaction matrix to change how the video signals are output.
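Packing and dispatching such a command might look like the sketch below. Real splicing controllers and signal source devices each use vendor-specific control protocols; the JSON message format, IP address and port here are invented purely for illustration.

```python
import json
import socket
from typing import Tuple

def build_command(window_id: int, cursor_xy: Tuple[int, int], instruction: str) -> bytes:
    """Pack the interaction command into a newline-terminated JSON message
    (an invented wire format, not a real controller protocol)."""
    payload = {"window": window_id, "cursor": list(cursor_xy), "instruction": instruction}
    return json.dumps(payload).encode("utf-8") + b"\n"

def send_to_matrix(command: bytes, host: str = "192.168.1.50", port: int = 6000) -> None:
    """Send the command to the video interaction matrix over the LAN
    (address and port are placeholders)."""
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall(command)

cmd = build_command(window_id=2, cursor_xy=(2880, 540), instruction="zoom_in")
print(cmd)
# send_to_matrix(cmd)   # left commented out: requires a reachable controller on the network
```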
According to the interaction command, the video signal source outputs the corresponding video signal to the video interaction matrix.
According to the interaction command, the video interaction matrix outputs the corresponding video signal to the large screen.
The display module then shows the corresponding video content on the large screen.
The system can also enter a voice text input mode through the voice sensing and recognition module. For example, when the user says "enter text input mode", the system enters that mode and a text box pops up at the cursor position on the large screen; what the user then says appears in the text box as text. When the user has finished, he or she can say, for example, "send to a certain member", and the text is sent to that member over a wired or wireless network according to an address book preset in the system; alternatively, a communication group can first be created and the text sent to every member of the group, much like a WeChat group.
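A minimal state-machine sketch of this text-input mode is shown below; the trigger phrases, the address book and the string commands returned to the display side are illustrative assumptions.

```python
from typing import Dict, List

class VoiceTextSession:
    """Tracks whether the voice text input mode is active, buffers dictated text,
    and sends it to a member from a preset address book when asked to."""
    def __init__(self, address_book: Dict[str, str]):
        self.active = False
        self.buffer: List[str] = []
        self.address_book = address_book            # member name -> network address

    def on_transcript(self, text: str) -> str:
        t = text.strip().lower()
        if not self.active and "enter text input mode" in t:
            self.active = True
            return "POP_UP_TEXT_BOX_AT_CURSOR"      # shown at the cursor position on the large screen
        if self.active and t.startswith("send to "):
            member = t[len("send to "):]
            address = self.address_book.get(member, "<unknown>")
            message, self.buffer, self.active = " ".join(self.buffer), [], False
            return f"SEND '{message}' TO {member} AT {address}"
        if self.active:
            self.buffer.append(text.strip())
            return "APPEND_TO_TEXT_BOX"
        return "IGNORE"

session = VoiceTextSession({"zhang san": "10.0.0.21"})
for utterance in ["enter text input mode", "Meeting moved to 3 pm", "send to zhang san"]:
    print(session.on_transcript(utterance))
```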
Because the optical sensing and recognition module adopted in this scheme combines a depth sensor with a high-definition camera, the three-dimensional position coordinates and motion trajectories of the human skeleton, including fingers and arms, can be obtained from a model of the human skeleton and joint structure. The system can therefore interact not only with two-dimensional flat display content but also with three-dimensional display images, such as three-dimensional holographic projections.
In this way, multi-modal natural interaction between a person and a large screen based on visual recognition and voice recognition is achieved.
The advantage of the invention is that it provides a multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, which overcomes the defects and shortcomings of existing non-natural interaction methods: using only natural human communication such as finger and arm motions and voice, multi-modal natural interaction in 2D or 3D space with complex display content on the large screen can be achieved.
The above embodiments are all preferred embodiments of the present invention, and therefore do not limit the scope of the present invention. Any equivalent structural and equivalent procedural changes made to the present disclosure without departing from the spirit and scope of the present disclosure are within the scope of the present disclosure as claimed.
Drawings
FIG. 1 is a flow chart of a human-large screen multi-modal natural interaction method based on visual recognition and speech recognition.
FIG. 2 is a diagram of a preferred system architecture for a human-large screen multimodal natural interaction method based on visual recognition and speech recognition.
Detailed Description
The following detailed description of a preferred embodiment of the invention, taken in conjunction with Fig. 2, will make the advantages and features of the invention easier to understand for those skilled in the art and more clearly define the scope of protection of the invention.
A preferred system for the method mainly consists of the following parts: Microsoft's Azure Kinect product, used to acquire optical image information of the user and to recognise the position and motion information of the user's palm; a voice sensing and recognition product from iFlytek or Baidu, used to acquire the operator's sound and recognise the speech it contains; the natural interaction server, which receives by wire the interaction information sent by the Azure Kinect product and by the iFlytek or Baidu voice sensing and recognition product, analyses it to generate interaction position and interaction instruction information, combines these into interaction command information, and sends the commands to the corresponding video signal source device or the video interaction matrix; the video signal source devices, used to output the video signals to be displayed; the video interaction matrix, which sends the video signal of the corresponding video source to the large display screen according to the interaction command; the large display screen, which displays the multiple video signals delivered by the video interaction matrix; and the switch, which transmits the interaction instructions.
The specific implementation steps are as follows.
Step one: connect the devices as shown in Fig. 2 and start them so that they are in a normal working state. At this point several signal windows are displayed on the large screen; the content of each signal window is generated by a video signal source device and delivered to the large screen through the video interaction matrix.
Step two: the user stands in front of the large screen, inside the area that the Azure Kinect product and the iFlytek or Baidu voice sensing and recognition product can effectively sense, and interacts with the content on the large screen. The interaction can include making finger or gesture movements in the air or speaking preset interaction instructions. When the range of the user's movements during interaction is large, the sensing range can be extended by adding more Azure Kinect products and voice sensing and recognition products.
Step three: when the user interacts with the content on the large screen using finger or arm actions, the Azure Kinect products deployed around the user act as both the optical sensing module and the image recognition module. The product captures three-dimensional optical images of the user's motions through its built-in depth sensor and high-definition camera and, using its built-in software and a model of the human skeleton and joint structure, obtains the three-dimensional position coordinates and motion trajectories of the user's skeleton, including fingers and arms. It can thus interact not only with two-dimensional flat display content but also with three-dimensional display images, such as three-dimensional holographic projections.
Step four: the image analysis module software in the natural interaction server compares the three-dimensional position coordinates and motion trajectories of the user's fingers and arms with the data in the preset window database 1, the cursor initial-position database and the action interaction instruction database; it extracts the interaction position and action interaction instruction generated when the person interacts with the content on the large screen and passes them to the window interaction module software, which determines the large-screen signal-source window to be interacted with, highlights that window, and places the cursor at the geometric centre of the window (see the sketch after this step). Window database 1 stores the preset correspondence between signal-source windows and finger or arm actions; the cursor initial-position database stores the preset geometric position of the initial cursor inside each signal-source display window; the action interaction instruction database stores the preset correspondence between finger and arm actions and action interaction instructions.
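For illustration, computing the initial cursor position (the geometric centre of the selected window) and issuing a highlight request could look like the sketch below; the window layout and screen resolution are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class WindowRect:
    x: int   # left edge on the large screen, pixels
    y: int   # top edge, pixels
    w: int   # width, pixels
    h: int   # height, pixels

    def centre(self) -> Tuple[int, int]:
        return self.x + self.w // 2, self.y + self.h // 2

# Illustrative layout of two signal-source windows on a 3840x1080 spliced screen.
WINDOW_LAYOUT: Dict[int, WindowRect] = {
    1: WindowRect(0, 0, 1920, 1080),
    2: WindowRect(1920, 0, 1920, 1080),
}

def select_window(window_id: int) -> Tuple[Tuple[int, int], str]:
    """Return the initial cursor position (geometric centre of the chosen window)
    and a highlight request for the large screen, as described in step four."""
    rect = WINDOW_LAYOUT[window_id]
    return rect.centre(), f"HIGHLIGHT_WINDOW {window_id}"

print(select_window(2))   # ((2880, 540), 'HIGHLIGHT_WINDOW 2')
```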
Step five: when the user interacts with the large-screen content by voice, the microphone array in the iFlytek or Baidu voice sensing and recognition product arranged around the user first acquires the sound of the scene, and the voice recognition module built into the product extracts the user's speech from the acquired sound through a software algorithm and converts it into text.
Step six: the voice analysis module software in the natural interaction server compares the recognised text with the information in the preset window database 2, the cursor initial-position database and the voice interaction instruction database and extracts the user's voice interaction instruction. Window database 2 stores the preset correspondence between signal-source windows and spoken text; the cursor initial-position database stores the preset geometric position of the initial cursor inside each signal-source display window; the voice interaction instruction database stores the preset correspondence between spoken text and interaction instructions.
Step seven: determining the interaction window. When interacting with the large-screen content, the user usually first has to tell the system which signal display window on the large screen is to be operated. The user can do this either by making the corresponding preset gesture or by speaking the corresponding preset voice instruction, thereby selecting the signal display window to be interacted with. The detailed behaviour is as described in steps three to six.
Step eight: the computer software corresponding to the window interaction module determines, from the interaction position and action interaction instruction output by the image analysis module or the voice interaction instruction output by the voice analysis module, the interaction window (a specific signal display window on the large screen), the window instruction (enlarging, shrinking, roaming, switching, opening or closing the window, and so on) and the initial position of the cursor. Preferably, when a signal window is being operated its outline is highlighted, and the initial position of the cursor is generally the geometric centre of the interaction window.
Step nine: determining the interaction position. The computer software corresponding to the interaction position module determines the interaction position of the cursor from the movement of the user's finger or arm and the initial cursor position on the large screen. How the interaction position is determined is defined jointly by the preset action interaction instructions (see the action interaction instruction library) or voice interaction instructions (see the voice interaction instruction library) and the finger or arm motions. Specifically, a preset voice or action instruction selects the signal-source window for each interaction: for example, when the user says "operate window 2", the natural interaction server outputs an instruction that highlights display window 2 on the large screen and moves the cursor to the geometric centre of window 2. Voice input can also drive a variety of cursor movements, for example "move the cursor to the upper left corner, then to the left ...", "move the cursor to a certain display content", "open a certain presentation document" or "search a certain knowledge point". The same functions can be triggered by preset actions, for example waving the arm twice to the right to mean "operate window 2"; interaction instructions corresponding to gestures or finger actions can likewise be defined for left and right mouse-button actions, setting the interaction position, zooming and moving windows, playing documents, and so on. Gestures and voice can also be combined: for example, raising the left palm starts the voice instruction recognition mode, and then saying "operate window 2" performs that function; afterwards the cursor moves in space according to the displacement of the finger.
Step ten: forming the interaction command. The computer software corresponding to the interaction command module in the natural interaction server generates interaction command information from the cursor's interaction position and the action or voice interaction instruction, which drives the video signal source device to change its display content or drives the video interaction matrix to change how the video signals are output.
Step eleven: according to the interaction command, the video signal source outputs the corresponding video signal to the video interaction matrix.
Step twelve: according to the interaction command, the video interaction matrix outputs the corresponding video signal to the large screen.
Step thirteen: the display module shows the corresponding video content on the large screen.
Step fourteen: the system can also enter a voice text input mode through the voice sensing and recognition module. For example, when the user says "enter text input mode", the system enters that mode and a text box pops up at the cursor position on the large screen; what the user then says appears in the text box as text. When the user has finished, he or she can say, for example, "send to a certain member", and the text is sent to that member over a wired or wireless network according to an address book preset in the system; alternatively, a communication group can first be created and the text sent to every member of the group, much like a WeChat group.
Step fifteen: because the optical sensing and recognition module adopted in this scheme combines a depth sensor with a high-definition camera, the three-dimensional position coordinates and motion trajectories of the human skeleton, including fingers and arms, can be obtained from a model of the human skeleton and joint structure, so the system can interact not only with two-dimensional flat display content but also with three-dimensional display images, such as three-dimensional holographic projections.
In this way, multi-modal natural interaction between a person and a large screen based on visual recognition and voice recognition is achieved.
The invention provides a multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, which overcomes the various defects and shortcomings of existing non-natural interaction methods: without any interaction tool, a user can achieve multi-modal natural interaction in 2D or 3D space with complex display content on the large screen using only natural human communication such as finger and arm motions and voice.
The above embodiments are all preferred embodiments of the present invention, and therefore do not limit the scope of the present invention. Any equivalent structural and equivalent procedural changes made to the present disclosure without departing from the spirit and scope of the present disclosure are within the scope of the present disclosure as claimed.

Claims (9)

1. A multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, characterized in that the system workflow of the method mainly comprises: the optical sensing module acquires 2D or 3D visual information of the scene from real-time optical images; the image recognition module extracts the 2D or 3D position and motion information of the user's fingers and arms from the visual information; the image analysis module, using the signal-source cursor initial-position library and the action interaction instruction library preset in the system, extracts from this position and motion information the interaction position and the action interaction instruction generated when the person interacts with the content on the large screen; the sound sensing module acquires the sound of the scene through a microphone array; the voice recognition module extracts the user's speech and converts it into text; the voice analysis module extracts the user's voice interaction instruction from the text according to the voice interaction instruction library preset in the system; the window interaction module obtains from the action or voice interaction instruction the interaction window (a specific signal display window on the large screen), the window instruction (enlarging, shrinking, roaming, switching, opening or closing the window, and so on) and the initial position of the cursor; the interaction position module determines the interaction position of the cursor from the movement of the user's finger or arm and the initial cursor position on the large screen; the interaction command module generates interaction command information from the cursor's interaction position and the action or voice interaction instruction, which drives the video signal source device to change its display content or drives the video interaction matrix to change how the video signals are output; the video signal source outputs the video signals to be displayed; the video interaction matrix sends the video signals of the video sources to the large display screen according to the interaction command; and the large display screen displays the multiple video signals delivered by the video interaction matrix.
2. The multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, characterized in that, preferably, the optical sensing module and the image recognition module adopt the Azure Kinect developed by Microsoft Corporation; the product is an enterprise solution combining a depth sensor, a high-definition camera and a spatial microphone array, and tracks the moving human body in 3D, providing complete, clear and uniquely identified multi-skeleton body tracking, so that the motions of the fingers and arms can be tracked accurately, and the user's finger position information and arm motion information are sent to the lower computer by wire or wirelessly.
3. The multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, characterized in that, preferably, the voice sensing module and the voice recognition module adopt voice sensing and recognition products developed by iFlytek or Baidu; these products acquire the user's voice signal through a microphone array, recognise it, convert it into text and send the text to the lower computer by wire or wirelessly.
4. The multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, characterized in that, preferably, the image analysis module, the voice analysis module, the window interaction module, the interaction position module and the interaction command module are realised by different software function modules in a natural interaction server, and the natural interaction server is a computer.
5. The multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, characterized in that the computer software corresponding to the image analysis module extracts, according to the signal-source cursor initial-position library and the action interaction instruction library preset in the system, the interaction position and the action interaction instruction generated when the person interacts with the content on the large screen from the 2D or 3D position and motion information of the user's fingers and arms; the computer software corresponding to the voice analysis module extracts the user's voice interaction instruction from the recognised text according to the voice interaction instruction library preset in the system; the computer software corresponding to the window interaction module obtains from the action or voice interaction instruction the interaction window (a specific signal display window on the large screen), the window instruction (enlarging, shrinking, roaming, switching, opening or closing the window, and so on) and the initial position of the cursor; the computer software corresponding to the interaction position module determines the interaction position of the cursor from the movement of the user's finger or arm and the initial cursor position on the large screen; and the computer software corresponding to the interaction command module generates interaction command information from the cursor's interaction position and the action or voice interaction instruction, which drives the video signal source device to change its display content or drives the video interaction matrix to change how the video signals are output.
6. The multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, characterized in that the interaction position of the user on the large screen is determined jointly by the preset voice or action interaction instructions and the finger or arm motions; specifically, a preset voice or action instruction selects the signal-source window for each interaction: for example, when the user says "operate window 2", the natural interaction server outputs an instruction that highlights display window 2 on the large screen and moves the cursor to the geometric centre of window 2; voice input can also drive a variety of cursor movements, for example "move the cursor to the upper left corner, then to the left ...", "move the cursor to a certain display content", "open a certain presentation document" or "search a certain knowledge point"; the same functions can be triggered by preset actions, for example waving the arm twice to the right to mean "operate window 2"; interaction instructions corresponding to gestures or finger actions can likewise be defined for left and right mouse-button actions, setting the interaction position, zooming and moving windows, playing documents and so on; gestures and voice can also be combined, for example raising the left palm starts the voice instruction recognition mode, and then saying "operate window 2" performs that function, after which the cursor moves in space according to the displacement of the finger.
7. The multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, characterized in that, because the optical sensing and recognition module combines a depth sensor with a high-definition camera, the three-dimensional position coordinates and motion trajectories of the human skeleton, including fingers and arms, are obtained from a model of the human skeleton and joint structure, so that the method can interact not only with two-dimensional flat display content but also with three-dimensional display images, such as three-dimensional holographic projections.
8. The multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition according to claim 1, characterized in that a preferred system for the method mainly consists of: Microsoft's Azure Kinect product, used to acquire optical image information of the user and to recognise the position and motion information of the user's palm; the iFlytek or Baidu voice sensing and recognition product, used to acquire the operator's sound and recognise the speech it contains; the natural interaction server, which receives by wire or wirelessly the interaction information sent by the Azure Kinect product and by the iFlytek or Baidu voice sensing and recognition product, analyses it to generate interaction position and interaction instruction information, combines these into interaction command information, and sends the commands to the corresponding video signal source device or the video interaction matrix; the video signal source devices, used to output the video signals to be displayed; the video interaction matrix, which sends the video signal of the corresponding video source to the large display screen according to the interaction command; the large display screen, which displays the multiple video signals delivered by the video interaction matrix; and the switch, which transmits the interaction instructions.
9. The multi-modal natural interaction method between a person and a large screen based on visual recognition and voice recognition, characterized in that the system can enter a voice text input mode through the voice sensing and recognition module: for example, when the user says "enter text input mode", the system enters that mode and a text box pops up at the cursor position on the large screen; what the user then says appears in the text box as text; when the user has finished, he or she can say, for example, "send to a certain member", and the text is sent to that member over a wired or wireless network according to an address book preset in the system, or a communication group can first be created and the text sent to every member of the group, much like a WeChat group.
CN201910946153.8A 2019-10-03 2019-10-03 Human and large screen multi-mode natural interaction method based on visual recognition and voice recognition Pending CN112612358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910946153.8A CN112612358A (en) 2019-10-03 2019-10-03 Human and large screen multi-mode natural interaction method based on visual recognition and voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910946153.8A CN112612358A (en) 2019-10-03 2019-10-03 Human and large screen multi-mode natural interaction method based on visual recognition and voice recognition

Publications (1)

Publication Number Publication Date
CN112612358A 2021-04-06

Family

ID=75224478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910946153.8A Pending CN112612358A (en) 2019-10-03 2019-10-03 Human and large screen multi-mode natural interaction method based on visual recognition and voice recognition

Country Status (1)

Country Link
CN (1) CN112612358A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115291764A (en) * 2022-09-13 2022-11-04 苏州域光科技有限公司 Data interaction method and device, intelligent equipment, system and readable storage medium


Similar Documents

Publication Publication Date Title
US11366516B2 (en) Visibility improvement method based on eye tracking, machine-readable storage medium and electronic device
JP2023016806A (en) sensory eyewear
EP4172726A1 (en) Augmented reality experiences using speech and text captions
US9383816B2 (en) Text selection using HMD head-tracker and voice-command
EP2919104B1 (en) Information processing device, information processing method, and computer-readable recording medium
US9165381B2 (en) Augmented books in a mixed reality environment
US9904360B2 (en) Head tracking based gesture control techniques for head mounted displays
US9500867B2 (en) Head-tracking based selection technique for head mounted displays (HMD)
US20140129207A1 (en) Augmented Reality Language Translation
CN102779000B (en) User interaction system and method
CN108874126B (en) Interaction method and system based on virtual reality equipment
WO2012119371A1 (en) User interaction system and method
KR20150023293A (en) Headset computer (hsc) as auxiliary display with asr and ht input
KR20170014353A (en) Apparatus and method for screen navigation based on voice
KR20190133080A (en) Touch free interface for augmented reality systems
US20120229509A1 (en) System and method for user interaction
US20240144611A1 (en) Augmented reality eyewear with speech bubbles and translation
CN109582123B (en) Information processing apparatus, information processing system, and information processing method
JP2017508193A (en) User configurable speech commands
JP2016508271A (en) Controllable headset computer display
CN112612358A (en) Human and large screen multi-mode natural interaction method based on visual recognition and voice recognition
KR20220057388A (en) Terminal for providing virtual augmented reality and control method thereof
US20150220506A1 (en) Remote Document Annotation
CN107908385B (en) Holographic-based multi-mode interaction system and method
US20240077983A1 (en) Interaction recording tools for creating interactive ar stories

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210406