CN111103982A - Data processing method, device and system based on somatosensory interaction - Google Patents

Data processing method, device and system based on somatosensory interaction

Info

Publication number
CN111103982A
CN111103982A (application CN201911386166.0A)
Authority
CN
China
Prior art keywords
interaction
user
intention
video
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911386166.0A
Other languages
Chinese (zh)
Inventor
谈丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Paper Juechi Intelligent Technology Co Ltd
Original Assignee
Shanghai Paper Juechi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Paper Juechi Intelligent Technology Co Ltd
Priority to CN201911386166.0A
Publication of CN111103982A
Legal status: Pending

Classifications

    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F2203/012 Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The application discloses a data processing method, device and system based on somatosensory interaction. The method comprises the steps of: receiving voice input in the environment and identifying the interactive function the user has selected to execute, wherein the interactive function is determined through voice recognition; receiving gesture input in the video and determining an action intention based on the interactive function, wherein the action intention is determined through multi-modal action recognition; continuing to receive video input and extracting effective information in the action area based on the action intention, wherein the effective information refers to the user intention obtained through image retrieval and language and character recognition; and feeding back to the user the interaction result matched according to the effective information. This solves the technical problem of the poor interaction effect of desktop systems. Through the method and device, a user can carry out touch-style somatosensory interaction in the real world, and the interaction mode is more efficient and natural.

Description

Data processing method, device and system based on somatosensory interaction
Technical Field
The application relates to the field of intelligent hardware, in particular to a data processing method, a device and a system based on somatosensory interaction.
Background
With the popularization of smart phones and tablet computers, interaction that relies only on moving a fingertip increasingly fails to engage the user's other senses, and may even weaken them, so that perception and operating ability gradually deteriorate.
The inventor has found that existing desktop systems cannot ensure that a user interacts with the real world while receiving feedback on the interaction experience in real time.
For the problem of the poor interaction effect of desktop systems in the related art, no effective solution has yet been proposed.
Disclosure of Invention
The application mainly aims to provide a data processing method, device and system based on somatosensory interaction so as to solve the problem that the interaction effect of a desktop system is poor.
In order to achieve the above object, according to an aspect of the present application, there is provided a data processing method based on somatosensory interaction.
The data processing method based on the somatosensory interaction comprises the following steps: receiving voice input in an environment, and identifying an interactive function selected and executed by a user, wherein the interactive function is determined by voice recognition; receiving gesture input in the video, and determining an action intention based on the interactive function, wherein the action intention is determined through multi-modal action recognition; continuing to receive video input, and extracting effective information in the action area based on the action intention, wherein the effective information refers to user intention obtained through image retrieval and language and character recognition; and feeding back the interaction result matched according to the effective information to the user.
Further, receiving speech input in the environment, and recognizing the interactive function selected to be performed by the user comprises:
obtaining a voice instruction through voice input in an environment;
and identifying the interactive function of learning, reading or game selected and executed by the user according to the voice instruction.
Further, receiving gesture input in the video, and determining the action intention based on the interaction function includes:
receiving gesture input in the video to obtain the user intention, and determining whether to recognize a word, recognize a sentence, or solve a problem based on the learning interaction function;
and/or receiving gesture input in the video to obtain the user intention, and determining whether to perform point-reading of a drawing area or accompanied reading of teaching materials based on the reading interaction function;
and/or receiving gesture input in the video to obtain the user intention, and determining whether to perform card recognition, line recognition, or module recognition based on the game interaction function.
Further, continuing to receive video input and extracting effective information in the action area based on the action intention comprises: continuing to receive the video input, and extracting a gesture motion trajectory and a gesture position in the action area based on the action intention, wherein the gesture motion trajectory comprises: at least one of a single-tap gesture trajectory, a sweeping gesture trajectory, or a sweeping circle trajectory.
Further, the step of feeding back the interaction result matched according to the effective information to the user comprises: and displaying the interaction result of learning, reading or game matched according to the effective information to the user.
In order to achieve the above object, according to another aspect of the present application, there is provided a data processing apparatus based on somatosensory interaction.
According to the application, the data processing device based on the somatosensory interaction comprises: the first interactive module is used for receiving voice input in the environment and identifying an interactive function selected and executed by a user, wherein the interactive function is determined by voice recognition; the second interaction module is used for receiving gesture input in the video and determining action intention based on the interaction function, wherein the action intention is determined through multi-modal action recognition; the third interaction module is used for continuously receiving video input and extracting effective information in the action area based on the action intention, wherein the effective information refers to the user intention obtained through image retrieval and language and character recognition; and the feedback module is used for feeding back the interaction result matched according to the effective information to the user.
Further, the first interaction module is used for:
obtaining a voice instruction through voice input in an environment;
and identifying the interactive function of learning, reading or game selected and executed by the user according to the voice instruction.
Further, the second interaction module is used for:
receiving gesture input in the video to obtain the user intention, and determining whether to recognize a word, recognize a sentence, or solve a problem based on the learning interaction function;
and/or receiving gesture input in the video to obtain the user intention, and determining whether to perform point-reading of a drawing area or accompanied reading of teaching materials based on the reading interaction function;
and/or receiving gesture input in the video to obtain the user intention, and determining whether to perform card recognition, line recognition, or module recognition based on the game interaction function.
Further, the third interaction module is configured to:
continue to receive the video input, and extract a gesture motion trajectory and a gesture position in the action area based on the action intention, wherein the gesture motion trajectory comprises: at least one of a single-tap gesture trajectory, a sweeping gesture trajectory, or a sweeping circle trajectory.
In order to achieve the above object, according to another aspect of the present application, there is provided a desktop system based on somatosensory interaction, including: an intelligent terminal and a somatosensory interaction device, so that a user in a real scene performs somatosensory interaction with the intelligent terminal through the somatosensory interaction device; the somatosensory interaction device is used for providing one or more interactive props with which somatosensory interaction can be carried out; the intelligent terminal device includes: an image acquisition device, a voice acquisition device, a display device and a voice broadcasting device, wherein the image acquisition device is used for monitoring image information within the desktop range; the voice acquisition device is used for monitoring voice information triggered in the desktop scene; the display device is used for displaying visual information and outputting video information; and the voice broadcasting device is used for outputting audio information.
In the data processing method, the data processing device and the data processing system based on somatosensory interaction in the embodiment of the application, a mode of receiving voice input in an environment and identifying an interactive function selected to be executed by a user is adopted, gesture input in a video is received, action intention is determined based on the interactive function, the video input is continuously received, effective information in an action area is extracted based on the action intention, the purpose of feeding back an interactive result matched with the effective information to the user is achieved, the technical effect that the user carries out somatosensory interaction in a real scene is achieved, and the technical problem that the interactive effect of a desktop system is poor is solved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a data processing method based on somatosensory interaction according to an embodiment of the application;
FIG. 2 is a data processing device based on somatosensory interaction according to an embodiment of the application;
FIG. 3 is a desktop system based on somatosensory interaction according to an embodiment of the application;
FIG. 4 is a schematic diagram of a desktop system entity based on somatosensory interaction according to an embodiment of the application;
FIG. 5 is a schematic diagram of a desktop system entity based on somatosensory interaction according to another embodiment of the application;
fig. 6 is a schematic diagram of a desktop system entity based on somatosensory interaction according to still another embodiment of the application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the present application and its embodiments, and are not used to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.
Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.
Furthermore, the terms "mounted," "disposed," "provided," "connected," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements or components. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, the method includes steps S101 to S104 as follows:
step S101, receiving voice input in the environment, identifying the interactive function selected and executed by the user,
wherein the interactive function is determined by speech recognition.
When the somatosensory interaction function is started, this serves as the first level of interaction. At this level, voice data input is received and speech recognition techniques are employed to identify the interactive function the user has selected to execute.
For example, when the user issues a voice command such as "I want to learn" or "enter learning mode", the learning function module is entered.
For another example, the user may also use other predefined voice commands to enter the reading module or the game module.
It should be noted that, in the embodiment of the present application, a specific manner for identifying the corresponding interactive function selected by the user to be executed is not particularly limited, as long as the interaction requirement of the first hierarchy can be satisfied.
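As a concrete illustration of this first level, the sketch below maps the text returned by a speech recognizer to one of the function modules. It is only a minimal sketch: the command phrases, the module names, and the select_interactive_function helper are assumptions chosen for illustration and are not prescribed by the application.

```python
# Minimal sketch of first-level interaction: voice command text -> function module.
# The phrases and module names below are illustrative assumptions.
COMMAND_TO_MODULE = {
    "i want to learn": "learning",
    "enter learning mode": "learning",
    "i want to read": "reading",
    "let's play a game": "game",
}

def select_interactive_function(transcript: str) -> str | None:
    """Return the function module selected by the recognized voice command, if any."""
    return COMMAND_TO_MODULE.get(transcript.strip().lower())

# Example: the spoken command "Enter learning mode" selects the learning module.
assert select_interactive_function("Enter learning mode") == "learning"
```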
Step S102, receiving gesture input in a video, and determining an action intention based on the interaction function, wherein the action intention is determined through multi-modal action recognition;
and when the user selects one preset function module, entering the second level interaction. Gesture data input information in the video is received, and the specific action intention of the next step can be determined based on interactive functions such as learning, reading or games.
Specifically, the user interacts through gestures, and the collected gesture video data is analyzed using multi-modal motion recognition technology to determine the user's specific intention.
For example, as shown in fig. 4, after entering the learning module, if the user places a book below the camera and taps a finger below a certain word in the book, the user intends to query the meaning of that word; this intention is obtained by analyzing the collected video data of the finger tapping the book.
For another example, as shown in fig. 5, when the user draws a line under the text of a sentence in the book, the user intends to explain the sentence.
For another example, as shown in FIG. 6, when the user draws a circle around a certain problem in the book, the user intends to have the problem solved.
It should be noted that the same gesture may represent different intentions in different function modules, and this is not specifically limited in the embodiments of the present application. For example, in the learning function module, drawing a line represents the intention to translate a sentence; in the reading module, the same line may represent the intention to read the sentence aloud; and in the game mode, the drawn line may represent the intention to identify a card.
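The following sketch illustrates how the same gesture can be mapped to different intentions depending on the active function module, as described above. The gesture labels and intention names are assumptions chosen for illustration, not terms defined by the application.

```python
# Sketch of second-level interaction: (active module, recognized gesture) -> action intention.
# Gesture labels and intention names are illustrative assumptions.
GESTURE_INTENTS = {
    ("learning", "tap"):    "look_up_word",
    ("learning", "line"):   "translate_sentence",
    ("learning", "circle"): "solve_problem",
    ("reading", "tap"):     "point_read",
    ("reading", "line"):    "read_sentence_aloud",
    ("game", "line"):       "identify_card",
}

def resolve_intent(module: str, gesture: str) -> str | None:
    """Return the action intention for a recognized gesture in the given module."""
    return GESTURE_INTENTS.get((module, gesture))

# The same drawn line means "translate" while learning but "read aloud" while reading.
assert resolve_intent("learning", "line") == "translate_sentence"
assert resolve_intent("reading", "line") == "read_sentence_aloud"
```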
Step S103, continuing to receive video input, and extracting effective information in the action area based on the action intention, wherein the effective information refers to the user intention obtained through image retrieval and language and character recognition.
Once the user's intention has been obtained by analyzing the gesture image data, image and video data continue to be received, which is regarded as entering the third level of interaction. Effective information in the action area is extracted based on the action intention, and information retrieval and feedback are completed by combining large-scale image retrieval technology with mixed multi-language character recognition technology.
For example, when a user double-taps an icon in a book with a finger, the system understands through multi-modal motion recognition technology that the user intends to play the audio corresponding to that icon; it then analyzes the position of the finger, extracts the image data of the icon, and matches that image data against all image data in an image library using large-scale image retrieval technology.
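A minimal sketch of this third level is given below: it crops the image region around the detected fingertip and matches the crop against a pre-computed feature library by cosine similarity. The fixed crop size, the similarity threshold, and the assumption that features have already been extracted are all illustrative choices; the application itself refers to large-scale image retrieval and multi-language character recognition only in general terms.

```python
# Sketch of third-level interaction: crop the action region around the fingertip
# and retrieve the closest entry from an image library by cosine similarity.
import numpy as np

def crop_action_region(frame: np.ndarray, finger_xy: tuple[int, int],
                       half_size: int = 64) -> np.ndarray:
    """Crop a square patch centred on the detected fingertip position."""
    x, y = finger_xy
    h, w = frame.shape[:2]
    x0, x1 = max(0, x - half_size), min(w, x + half_size)
    y0, y1 = max(0, y - half_size), min(h, y + half_size)
    return frame[y0:y1, x0:x1]

def retrieve_best_match(query_feat: np.ndarray, library_feats: np.ndarray,
                        threshold: float = 0.8) -> int | None:
    """Return the index of the most similar library image, or None if matching fails."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-9)
    lib = library_feats / (np.linalg.norm(library_feats, axis=1, keepdims=True) + 1e-9)
    sims = lib @ q
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None
```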
And step S104, feeding back the interaction result matched according to the effective information to the user.
When the matching is successful, the interaction result matched according to the effective information is fed back to the user, that is, the audio data corresponding to the icon is played.
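The feedback step can then be as simple as the sketch below, which plays the audio clip associated with the matched library entry or asks the user to retry. The play_audio callback stands in for whatever output backend the voice broadcasting device uses; it is an assumption for illustration.

```python
# Sketch of the feedback step: play the matched icon's audio, or report failure.
from typing import Callable, Optional

def feed_back_result(match_index: Optional[int], audio_clips: list[bytes],
                     play_audio: Callable[[bytes], None]) -> None:
    """Play the audio clip of the matched icon, or report that matching failed."""
    if match_index is None:
        print("No matching image found; please try again.")
        return
    play_audio(audio_clips[match_index])
```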
The above method may include three levels. At a first level, function selection is performed through a voice recognition technology; the second level, the intention analysis is completed through multi-modal motion recognition; and in the third level, information retrieval and feedback are completed through a large-scale image retrieval technology and a mixed multi-language character recognition technology.
From the above description, it can be seen that the following technical effects are achieved by the present application:
in the embodiment of the application, the mode of receiving voice input in an environment and identifying the interactive function selected and executed by a user is adopted, gesture input in a video is received, the action intention is determined based on the interactive function, the video input is continuously received, effective information in an action area is extracted based on the action intention, the purpose of feeding back the interactive result matched with the effective information to the user is achieved, the technical effect that the user can carry out body sensing interaction in a real scene is achieved, and the technical problem that the interactive effect of a desktop system is poor is solved.
According to the embodiment of the application, as a preferred feature in the embodiment, the receiving the voice input in the environment and the recognizing the interactive function selected to be executed by the user include: obtaining a voice instruction through voice input in an environment; and identifying the interactive function of learning, reading or game selected and executed by the user according to the voice instruction.
The voice instruction is obtained by collecting voice input in the environment, and the interactive function of learning, reading or games selected and executed by the user can be identified through the voice instruction.
According to the embodiment of the application, as a preferred feature in the embodiment, the receiving gesture input in the video and determining an action intention based on the interaction function includes: receiving gesture input in the video to obtain the user intention, and determining whether to recognize a word, recognize a sentence, or solve a problem based on the learning interaction function; and/or receiving gesture input in the video to obtain the user intention, and determining whether to perform point-reading of a drawing area or accompanied reading of teaching materials based on the reading interaction function; and/or receiving gesture input in the video to obtain the user intention, and determining whether to perform card recognition, line recognition, or module recognition based on the game interaction function.
Through the input of gesture video data, receiving gesture input in the video yields the user intention, and whether to recognize a word, recognize a sentence, or solve a problem is determined based on the learning interaction function;
likewise, receiving gesture input in the video yields the user intention, and point-reading of a drawing area or accompanied reading of teaching materials is determined based on the reading interaction function;
and receiving gesture input in the video yields the user intention, and card recognition, line recognition, or module recognition is determined based on the game interaction function.
According to the embodiment of the present application, as a preference in the embodiment, the continuing to receive video input and extracting effective information in the action area based on the action intention includes: continuing to receive the video input, and extracting a gesture motion trajectory and a gesture position in the action area based on the action intention, wherein the gesture motion trajectory comprises: at least one of a single-tap gesture trajectory, a sweeping gesture trajectory, or a sweeping circle trajectory.
By continuing to receive video input, the single-tap gesture trajectory, sweeping gesture trajectory, or sweeping circle trajectory, together with the gesture position in the action area, is extracted based on the action intention.
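The sketch below shows one simple way to distinguish the three kinds of gesture motion trajectory named above, using only the path length and end-to-end displacement of the tracked fingertip. The thresholds and the closed-path heuristic are assumptions for illustration; the application does not specify how the trajectories are classified.

```python
# Sketch: classify a fingertip trajectory as a single tap, a sweep (line), or a circle.
import math

def classify_trajectory(points: list[tuple[float, float]],
                        tap_radius: float = 10.0,
                        closure_ratio: float = 0.25) -> str:
    """Return 'tap', 'sweep', or 'circle' for a list of fingertip positions."""
    if len(points) < 2:
        return "tap"
    path = sum(math.dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    displacement = math.dist(points[0], points[-1])
    if path < tap_radius:
        return "tap"      # barely any movement: a single tap
    if displacement < closure_ratio * path:
        return "circle"   # ends near where it started: a closed loop
    return "sweep"        # otherwise: a line swept under text
```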
According to the embodiment of the present application, as a preference in the embodiment, the feeding back of the interaction result matched according to the effective information to the user includes: displaying to the user the interaction result of learning, reading, or a game matched according to the effective information.
Interaction with the machine is achieved through body movements, so that the machine can understand the user's behaviour and intentions, thereby helping the user obtain the required information. The interaction result of learning, reading, or a game, matched according to the effective information, is displayed to the user.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
According to an embodiment of the present application, there is also provided a data processing apparatus based on somatosensory interaction for implementing the above method, as shown in fig. 2, the apparatus includes: a first interactive module 10, configured to receive a voice input in an environment, and recognize an interactive function selected to be executed by a user, where the interactive function is determined by voice recognition; the second interaction module 11 is used for receiving gesture input in the video and determining action intention based on the interaction function, wherein the action intention is determined through multi-modal action recognition; the third interaction module 12 is used for continuously receiving the video input and extracting effective information in the action area based on the action intention, wherein the effective information refers to the user intention obtained through image retrieval and language and character recognition; and the feedback module 13 is configured to feed back the interaction result matched according to the effective information to the user.
In the first interaction module 10 of the embodiment of the application, when the somatosensory interaction function is started, this serves as the first level of interaction. At this level, voice data input is received and speech recognition techniques are employed to identify the interactive function the user has selected to execute.
For example, when the user issues a voice command such as "I want to learn" or "enter learning mode", the learning function module is entered.
For another example, the user may also use other predefined voice commands to enter the reading module or the game module.
It should be noted that, in the embodiment of the present application, a specific manner for identifying the corresponding interactive function selected by the user to be executed is not particularly limited, as long as the interaction requirement of the first hierarchy can be satisfied.
In the second interaction module 11 of the embodiment of the application, when the user selects a preset function module, the second level of interaction is entered. Gesture data input in the video is received, and the specific action intention of the next step can be determined based on the interactive function, such as learning, reading, or a game.
Specifically, the user interacts through gestures, and the collected gesture video data is analyzed using multi-modal motion recognition technology to determine the user's specific intention.
For example, after entering the learning module, when a user places a book below the camera and taps a certain word in the book with a finger, as shown in fig. 4, the user intends to query the meaning of that word, and this intention is obtained by analyzing the collected video data of the finger tapping the book.
For another example, as shown in fig. 5, when the user draws a line under the text of a sentence in the book, the user intends to explain the sentence.
For another example, as shown in FIG. 6, when the user draws a circle around a certain problem in the book, the user intends to have the problem solved.
It should be noted that the same gesture may represent different intentions in different function modules, and this is not specifically limited in the embodiments of the present application. For example, in the learning function module, drawing a line represents the intention to translate a sentence; in the reading module, the same line may represent the intention to read the sentence aloud; and in the game mode, the drawn line may represent the intention to identify a card.
In the third interaction module 12 of the embodiment of the present application, once the user's intention has been obtained by analyzing the gesture image data, image and video data continue to be received, which is regarded as entering the third level of interaction. Effective information in the action area is extracted based on the action intention, and information retrieval and feedback are completed by combining large-scale image retrieval technology with mixed multi-language character recognition technology.
For example, when a user double-taps an icon in a book with a finger, the system understands through multi-modal motion recognition technology that the user intends to play the audio corresponding to that icon; it then analyzes the position of the finger, extracts the image data of the icon, and matches that image data against all image data in an image library using large-scale image retrieval technology.
In the feedback module 13 of the embodiment of the present application, when the matching is successful, the interaction result matched according to the effective information is fed back to the user, that is, the audio data corresponding to the icon is played.
The module described above may include three levels. At a first level, function selection is performed through a voice recognition technology; the second level, the intention analysis is completed through multi-modal motion recognition; and in the third level, information retrieval and feedback are completed through a large-scale image retrieval technology and a mixed multi-language character recognition technology.
According to the embodiment of the present application, as a preferred option in the embodiment, the first interaction module 10 is configured to obtain a voice instruction through voice input in the environment, and to identify the interactive function of learning, reading or game selected and executed by the user according to the voice instruction.
According to the embodiment of the application, as a preference in the embodiment, the second interaction module 11 is configured to receive gesture input in the video to obtain the user intention and determine whether to recognize a word, recognize a sentence, or solve a problem based on the learning interaction function; and/or receive gesture input in the video to obtain the user intention and determine whether to perform point-reading of a drawing area or accompanied reading of teaching materials based on the reading interaction function; and/or receive gesture input in the video to obtain the user intention and determine whether to perform card recognition, line recognition, or module recognition based on the game interaction function.
According to the embodiment of the present application, as a preferred feature in the embodiment, the third interaction module 12 is configured to continue to receive the video input and extract a gesture motion trajectory and a gesture position in the action area based on the action intention, wherein the gesture motion trajectory comprises: at least one of a single-tap gesture trajectory, a sweeping gesture trajectory, or a sweeping circle trajectory.
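The sketch below shows one way the four modules of the apparatus could be wired together into a single pipeline. The callables stand for the first, second, and third interaction modules and the feedback module described above; their internals are omitted and the signatures are assumptions for illustration.

```python
# Sketch of the apparatus: four interaction modules chained into one pipeline.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SomatosensoryInteractionApparatus:
    first_interaction: Callable[[str], str]        # recognized speech -> interactive function
    second_interaction: Callable[[str, Any], str]  # (function, video) -> action intention
    third_interaction: Callable[[str, Any], Any]   # (intention, video) -> effective information
    feedback: Callable[[Any], None]                # effective information -> feedback to the user

    def run(self, transcript: str, video: Any) -> None:
        """Drive one full interaction: function selection, intention, retrieval, feedback."""
        function = self.first_interaction(transcript)
        intention = self.second_interaction(function, video)
        info = self.third_interaction(intention, video)
        self.feedback(info)
```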
In another embodiment of the present application, as shown in fig. 3, there is also provided a desktop system based on somatosensory interaction, including: an intelligent terminal 10 and a somatosensory interaction device 20, so that a user in a real scene performs somatosensory interaction with the intelligent terminal 10 through the somatosensory interaction device 20; the somatosensory interaction device 20 is used for providing one or more interactive props with which somatosensory interaction can be carried out; the intelligent terminal device 10 includes: an image acquisition device 102, a voice acquisition device 103, a display device 104 and a voice broadcasting device 105, wherein the image acquisition device 102 is used for monitoring image information within the desktop range; the voice acquisition device 103 is used for monitoring voice information triggered in the desktop scene; the display device 104 is used for displaying visual information and outputting video information; and the voice broadcasting device 105 is used for outputting audio information.
Specifically, the image acquisition device 102 is disposed in the intelligent terminal 10 and is used for monitoring image information within the desktop range. The desktop range refers to the carrier on which somatosensory interaction is implemented, and is not particularly limited in the embodiments of the present application. The voice acquisition device 103 is disposed in the intelligent terminal 10 and is used for monitoring voice information triggered in the desktop scene; the triggered voice information mainly refers to human speech in the environment, with the corresponding noise interference removed. The display device 104 is used for displaying visual information and outputting video information, and may be presented via a display screen or by projection. The voice broadcasting device 105 is used for outputting audio information so as to interact with or respond to the user.
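As a structural sketch of the intelligent terminal, the abstract class below groups the four hardware units named in the text. The method names are assumptions; in a real build each would be backed by a concrete driver (camera, microphone, screen or projector, speaker).

```python
# Sketch of the intelligent terminal's four hardware units as an abstract interface.
from abc import ABC, abstractmethod

class IntelligentTerminal(ABC):
    @abstractmethod
    def capture_image(self) -> bytes:
        """Image acquisition device: monitor image information within the desktop range."""

    @abstractmethod
    def capture_voice(self) -> bytes:
        """Voice acquisition device: monitor voice information triggered in the desktop scene."""

    @abstractmethod
    def display(self, visual: bytes) -> None:
        """Display device: display visual information and output video information."""

    @abstractmethod
    def broadcast(self, audio: bytes) -> None:
        """Voice broadcasting device: output audio information."""
```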
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A data processing method based on somatosensory interaction is characterized by comprising the following steps:
receiving voice input in an environment, and identifying an interactive function selected and executed by a user, wherein the interactive function is determined by voice recognition;
receiving gesture input in the video, and determining an action intention based on the interactive function, wherein the action intention is determined through multi-modal action recognition;
continuing to receive video input, and extracting effective information in the action area based on the action intention, wherein the effective information refers to user intention obtained through image retrieval and language and character recognition;
and feeding back the interaction result matched according to the effective information to the user.
2. The data processing method based on somatosensory interaction according to claim 1, wherein receiving voice input in the environment and identifying the interactive function selected to be executed by the user comprises:
obtaining a voice instruction through voice input in an environment;
and identifying the interactive function of learning, reading or game selected and executed by the user according to the voice instruction.
3. The method for data processing based on somatosensory interaction according to claim 1, wherein receiving gesture input in a video and determining action intention based on the interaction function comprises:
receiving gesture input in the video to obtain the user intention, and determining whether to recognize a word, recognize a sentence, or solve a problem based on the learning interaction function;
and/or receiving gesture input in the video to obtain the user intention, and determining whether to perform point-reading of a drawing area or accompanied reading of teaching materials based on the reading interaction function;
and/or receiving gesture input in the video to obtain the user intention, and determining whether to perform card recognition, line recognition, or module recognition based on the game interaction function.
4. The somatosensory interaction-based data processing method according to claim 1, wherein continuing to receive video input and extracting valid information in an action area based on the action intention comprises:
continuing to receive the video input, and extracting a gesture motion track and a gesture position in the action area based on the action intention, wherein the gesture motion track comprises: at least one of a single-tap gesture trajectory, a sweeping gesture trajectory, or a sweeping circle trajectory.
5. The data processing method based on somatosensory interaction of claim 1, wherein feeding back the interaction result matched according to the effective information to the user comprises:
and displaying the interaction result of learning, reading or game matched according to the effective information to the user.
6. A data processing device based on somatosensory interaction is characterized by comprising:
the first interactive module is used for receiving voice input in the environment and identifying an interactive function selected and executed by a user, wherein the interactive function is determined by voice recognition;
the second interaction module is used for receiving gesture input in the video and determining action intention based on the interaction function, wherein the action intention is determined through multi-modal action recognition;
the third interaction module is used for continuously receiving video input and extracting effective information in the action area based on the action intention, wherein the effective information refers to the user intention obtained through image retrieval and language and character recognition;
and the feedback module is used for feeding back the interaction result matched according to the effective information to the user.
7. The somatosensory interaction-based data processing device according to claim 6, wherein the first interaction module is configured to:
obtain a voice instruction through voice input in the environment;
and identify the interactive function of learning, reading or game selected and executed by the user according to the voice instruction.
8. The somatosensory interaction-based data processing device according to claim 6, wherein the second interaction module is configured to:
receive gesture input in the video to obtain the user intention, and determine whether to recognize a word, recognize a sentence, or solve a problem based on the learning interaction function;
and/or receive gesture input in the video to obtain the user intention, and determine whether to perform point-reading of a drawing area or accompanied reading of teaching materials based on the reading interaction function;
and/or receive gesture input in the video to obtain the user intention, and determine whether to perform card recognition, line recognition, or module recognition based on the game interaction function.
9. The somatosensory interaction-based data processing device according to claim 6, wherein the third interaction module is configured to:
continue to receive the video input, and extract a gesture motion trajectory and a gesture position in the action area based on the action intention, wherein the gesture motion trajectory comprises: at least one of a single-tap gesture trajectory, a sweeping gesture trajectory, or a sweeping circle trajectory.
10. A desktop system based on somatosensory interaction, characterized by comprising: an intelligent terminal and a somatosensory interaction device, so that a user in a real scene performs somatosensory interaction with the intelligent terminal through the somatosensory interaction device, the somatosensory interaction device being used for providing one or more interactive props with which somatosensory interaction can be carried out, wherein the intelligent terminal device comprises: an image acquisition device, a voice acquisition device, a display device and a voice broadcasting device,
the image acquisition device is used for monitoring image information in a desktop range;
the voice acquisition device is used for monitoring voice information triggered in a desktop scene;
the display device is used for displaying the visual information and outputting the video information;
the voice broadcasting device is used for outputting audio information.
CN201911386166.0A 2019-12-26 2019-12-26 Data processing method, device and system based on somatosensory interaction Pending CN111103982A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911386166.0A CN111103982A (en) 2019-12-26 2019-12-26 Data processing method, device and system based on somatosensory interaction

Publications (1)

Publication Number Publication Date
CN111103982A true CN111103982A (en) 2020-05-05

Family

ID=70423838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911386166.0A Pending CN111103982A (en) 2019-12-26 2019-12-26 Data processing method, device and system based on somatosensory interaction

Country Status (1)

Country Link
CN (1) CN111103982A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536558A (en) * 2014-10-29 2015-04-22 三星电子(中国)研发中心 Intelligent ring and method for controlling intelligent equipment
CN105138949A (en) * 2015-07-07 2015-12-09 济南大学 Gesture control method based on flexible mapping between multiple gestures and semantics
CN108932053A (en) * 2018-05-21 2018-12-04 腾讯科技(深圳)有限公司 Drawing practice, device, storage medium and computer equipment based on gesture
CN109522835A (en) * 2018-11-13 2019-03-26 北京光年无限科技有限公司 Children's book based on intelligent robot is read and exchange method and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625094A (en) * 2020-05-25 2020-09-04 北京百度网讯科技有限公司 Interaction method and device for intelligent rearview mirror, electronic equipment and storage medium
CN111625094B (en) * 2020-05-25 2023-07-14 阿波罗智联(北京)科技有限公司 Interaction method and device of intelligent rearview mirror, electronic equipment and storage medium
CN111966212A (en) * 2020-06-29 2020-11-20 百度在线网络技术(北京)有限公司 Multi-mode-based interaction method and device, storage medium and smart screen device
CN112233505A (en) * 2020-09-29 2021-01-15 浩辰科技(深圳)有限公司 Novel blind child interactive learning system
CN113031762A (en) * 2021-03-05 2021-06-25 马鞍山状元郎电子科技有限公司 Inductive interactive education method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination