CN111967334B - Human body intention identification method, system and storage medium - Google Patents

Human body intention identification method, system and storage medium Download PDF

Info

Publication number
CN111967334B
Authority
CN
China
Prior art keywords
generate
text
data characteristics
fixation point
source data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010699862.3A
Other languages
Chinese (zh)
Other versions
CN111967334A (en)
Inventor
闫野
吴竞寒
印二威
谢良
邓宝松
范晓丽
罗治国
闫慧炯
杨超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center, National Defense Technology Innovation Institute PLA Academy of Military Science filed Critical Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Priority to CN202010699862.3A priority Critical patent/CN111967334B/en
Publication of CN111967334A publication Critical patent/CN111967334A/en
Application granted granted Critical
Publication of CN111967334B publication Critical patent/CN111967334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/02Preprocessing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08Feature extraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12Classification; Matching

Abstract

The invention discloses a human body intention identification method, which comprises the following steps: collecting current human body characteristic signals in real time; generating multi-source data characteristics corresponding to the current human body and fixation point coordinates selected by eyes based on the characteristic signals; recognizing the multi-source data characteristics and the fixation point coordinates selected by eyes, and generating a voice text corresponding to the multi-source data characteristics and a scene image description text corresponding to the fixation point coordinates; performing entity extraction on the voice text and the scene image description text to generate entity fragments corresponding to the voice text and the scene image description text; processing the entity fragment by adopting a coreference resolution algorithm to generate a target object; and generating a human intention recognition result based on the voice text, the scene image description text and the target object. Therefore, by adopting the embodiment of the application, the recognition result is obtained after the mouth-eye cooperative interaction information aiming at the specific scene is processed, so that the accuracy of recognizing the human body intention by the machine is improved.

Description

Human body intention identification method, system and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a human body intention identification method, a human body intention identification system and a storage medium.
Background
In recent years, with the rapid development of novel wearable sensing technology and artificial intelligence technology, human-computer natural interaction research based on physiological interaction media such as the eyes and mouth, for example voice and eye-movement interaction, has been carried out by a large number of scholars and research institutions. In fact, interaction between people is a process in which the mouth and the eyes work together, and the complementary nature of such multivariate media information makes semantic expression between people more efficient and smooth.
In the prior art, the interaction between a person and an operating device (such as a head-mounted display device, a computer, a mobile phone, and other everyday devices) is mainly performed through manual operation. For example, when a person interacts with a head-mounted display device, physical keys may be used to perform operations such as increasing the volume, playing or pausing; when a person interacts with a computer, the user needs to manually operate a keyboard or a specific identifier to start playback or open the computer. Because such interaction modes are of low intelligence, they waste time and reduce the efficiency of human-computer interaction.
Therefore, how to establish consensus between human and machine and improve the efficiency with which a machine recognizes human intention is a difficult problem that academia urgently needs to break through.
Disclosure of Invention
The embodiment of the application provides a human body intention identification method, a human body intention identification system and a storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present application provides a human intention identification method, where the method includes:
collecting current human body characteristic signals in real time;
generating multi-source data characteristics corresponding to the current human body and fixation point coordinates selected by eyes based on the characteristic signals;
recognizing the multi-source data characteristics and the fixation point coordinates selected by eyes, and generating a voice text corresponding to the multi-source data characteristics and a scene image description text corresponding to the fixation point coordinates;
entity extraction is carried out on the voice text and the scene image description text, and entity fragments corresponding to the voice text and the scene image description text are generated;
processing the entity fragment by adopting a coreference resolution algorithm to generate a target object;
and generating a human intention recognition result based on the voice text, the scene image description text and the target object.
Optionally, after generating the human body intention recognition result, the method further includes:
and displaying the human body intention recognition result and sending the human body intention recognition result to external equipment to control the external equipment to execute functions.
Optionally, the feature signals include an audio signal, a lip image signal, a facial myoelectric signal, and an eye image signal;
the generating of the multi-source data characteristics corresponding to the current human body and the fixation point coordinates selected by the eyes based on the characteristic signals comprises:
respectively carrying out data preprocessing on the audio signal, the lip image signal and the facial myoelectric signal to generate multisource data characteristics corresponding to the current human body;
and extracting the fixation point coordinates of the eye image signals to generate fixation point coordinates selected by eyes corresponding to the current human body.
Optionally, the respectively performing data preprocessing on the audio signal, the lip image signal, and the facial myoelectric signal to generate a multi-source data feature corresponding to the current human body includes:
performing framing and windowing processing on the audio signal to generate audio signal data characteristics;
extracting a Mel cepstrum coefficient of the facial electromyographic signals to generate facial electromyographic signal data characteristics;
performing gray scale image conversion on the lip image signals, and filtering by using a filter to generate lip image signal data characteristics;
and determining the audio signal data characteristics, the facial myoelectric signal data characteristics and the lip image signal data characteristics as the multi-source data characteristics corresponding to the current human body.
Optionally, the performing fixation point coordinate extraction on the eye image signal to generate a fixation point coordinate selected by an eye corresponding to the current human body includes:
and inputting the eye image signal into a pre-trained fixation point mapping model to generate fixation point coordinates selected by eyes corresponding to the current human body.
Optionally, the recognizing the multi-source data features and the gaze point coordinates selected by the eyes, and generating a voice text corresponding to the multi-source data features and a scene image description text corresponding to the gaze point coordinates include:
carrying out dense coding on the multi-source data characteristics to generate coded multi-source data characteristics;
inputting the coded multi-source data characteristics into a pre-trained Bert network model to generate voice information corresponding to the multi-source data characteristics;
performing text synthesis on the voice information corresponding to the multi-source data characteristics by using an n-gram language model with a beam search algorithm to generate a voice text corresponding to the multi-source data characteristics;
and coding the fixation point coordinate selected by the eyes to generate a scene image description text corresponding to the fixation point coordinate.
Optionally, the encoding the gaze point coordinate selected by the eye to generate a scene image description text corresponding to the gaze point coordinate includes:
generating a scene image selected by the eyes according to the fixation point coordinates selected by the eyes;
sequentially carrying out image segmentation, target detection and coordinate information identification on the scene image by using a Fast R-CNN algorithm of ResNet101 to generate coding information;
and performing coding modeling based on the coding information to generate a scene image description text corresponding to the gazing point coordinate.
Optionally, the generating a human intention recognition result based on the voice text, the scene image description text, and the target object includes:
performing text semantic analysis on the voice text, the scene image description text and the target object to generate a text code;
associating the text codes with predefined tuples to generate executable instantiation tuples;
generating a semantic analysis result and a representation result according to the instantiation tuple;
and determining the semantic analysis result and the characterization result as human intention recognition results.
In a second aspect, an embodiment of the present application provides a human intention recognition system, including:
the signal acquisition module is used for acquiring the characteristic signals of the current human body in real time;
the data generation module is used for generating multi-source data characteristics corresponding to the current human body and fixation point coordinates selected by eyes based on the characteristic signals;
the text generation module is used for identifying the multi-source data characteristics and the fixation point coordinates selected by eyes and generating a voice text corresponding to the multi-source data characteristics and a scene image description text corresponding to the fixation point coordinates;
the entity extraction module is used for carrying out entity extraction on the voice text and the scene image description text to generate entity fragments corresponding to the voice text and the scene image description text;
the target object generation module is used for processing the entity fragments by adopting a coreference resolution algorithm to generate a target object;
and the recognition result generation module is used for generating a human body intention recognition result based on the voice text, the scene image description text and the target object.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
in the embodiment of the application, a human body intention recognition system firstly collects a characteristic signal of a current human body in real time, then generates a multi-source data characteristic corresponding to the current human body and a fixation point coordinate selected by eyes based on the characteristic signal, then recognizes the multi-source data characteristic and the fixation point coordinate selected by the eyes, generates a voice text corresponding to the multi-source data characteristic and a scene image description text corresponding to the fixation point coordinate, then performs entity extraction on the voice text and the scene image description text, generates an entity segment corresponding to the voice text and the scene image description text, processes the entity segment by adopting a coreference resolution algorithm, generates a target object, and finally generates a human body intention recognition result based on the voice text, the scene image description text and the target object. According to the method and the device, the voice characteristic signals and the facial characteristic signals of the human body are collected in real time to carry out fusion interaction, so that redundancy and ambiguity of multi-mode interaction information are effectively overcome, semantics in the information are enriched, consensus among human and machines is established, and the efficiency of recognizing human intentions by a machine is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic flowchart of a human body intention identification method provided in an embodiment of the present application;
fig. 2 is a schematic process diagram of human body intention recognition provided by an embodiment of the present application;
FIG. 3 is a flowchart of a framework for human intent recognition provided by an embodiment of the present application;
FIG. 4 is a system diagram of a human intent recognition system according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., A and/or B, which may mean: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Up to now, the interaction between a person and an operating device (e.g. a head-mounted display device, a computer, a mobile phone, etc.) has mainly been performed through manual operation. For example, when a person interacts with a head-mounted display device, physical keys may be used to perform operations such as increasing the volume, playing or pausing; when a person interacts with a computer, the user needs to manually operate a keyboard or a specific identifier to start playback or open the computer. Because such interaction modes are of low intelligence, they waste time and reduce the efficiency of human-computer interaction. Therefore, how to establish consensus between human and machine and improve the efficiency with which a machine recognizes human intention is a difficult problem that academia urgently needs to break through. Accordingly, the present application provides a human intention recognition method, system and storage medium to solve the above-mentioned problems of the related art. In the technical scheme provided by the application, the voice characteristic signals and the facial characteristic signals of the human body are collected in real time for fusion and interaction, so that redundancy and ambiguity of multi-modal interaction information are effectively overcome, the semantics in the information are enriched, consensus between human and machine is established, and the efficiency with which a machine recognizes human intention is improved.
The human intention recognition method provided by the embodiment of the present application will be described in detail below with reference to fig. 1 to 3. The method may be implemented in dependence on a computer program, operable on a human intent recognition system based on the von neumann architecture. The computer program may be integrated into the application or may run as a separate tool-like application.
Referring to fig. 1, a flow chart of a human body intention identification method is provided in an embodiment of the present application. As shown in fig. 1, the method of the embodiment of the present application may include the steps of:
s101, collecting characteristic signals of a current human body in real time;
the current human body is a user who carries out human-computer interaction at the current moment, and the characteristic signals are audio signals, lip image signals, face electromyogram signals and eye image signals generated by the current user.
In a possible implementation manner, when a user performs human-computer interaction, the user performs language expression according to the intention of the user, and at this time, the human intention recognition device synchronously acquires an audio signal input by voice, a lip image signal of the user, a facial myoelectric signal of the user and eye image signal data.
S102, generating multi-source data characteristics corresponding to the current human body and fixation point coordinates selected by eyes based on the characteristic signals;
in the embodiment of the application, the human body intention recognition device performs data preprocessing on an audio signal, a lip image signal and a facial myoelectric signal which are synchronously acquired at first to generate a voice signal characteristic (multi-source data characteristic) corresponding to the current human body, and then performs fixation point coordinate extraction on the eye image signal to generate fixation point coordinates selected by eyes corresponding to the current human body.
In a possible implementation manner, firstly, the audio signal is subjected to framing and windowing processing to generate audio signal data characteristics, then Mel Frequency Cepstrum Coefficient (MFCC) of the facial electromyogram signal is extracted to generate facial electromyogram signal data characteristics, then the lip image signal is subjected to gray scale map conversion, filtering is performed by using a filter to generate lip image signal data characteristics, then the audio signal data characteristics, the facial electromyogram signal data characteristics and the lip image signal data characteristics are determined as voice signal characteristics (multi-source data characteristics) corresponding to the current human body, and finally the eye image signal is input into a pre-trained fixation point mapping model to generate fixation point coordinates selected by eyes corresponding to the current human body.
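As a concrete illustration of this preprocessing chain, the sketch below frames and windows the audio signal, extracts Mel-cepstral coefficients from the facial electromyogram channel, and converts a lip image into a filtered grayscale map. It is a minimal sketch only: the use of librosa and OpenCV, the frame length, hop size, sampling rate, window type and Gaussian kernel size are illustrative assumptions and are not values specified by the patent.

```python
import numpy as np
import cv2
import librosa

def frame_and_window(audio, frame_len=400, hop=160):
    """Split a 1-D audio signal into overlapping frames and apply a Hamming window."""
    assert len(audio) >= frame_len, "audio shorter than one frame"
    n_frames = 1 + (len(audio) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([audio[i * hop:i * hop + frame_len] * win for i in range(n_frames)])

def emg_mfcc(emg, sr=1000, n_mfcc=13):
    """Mel-cepstral coefficients of the facial EMG channel, treated as a 1-D signal."""
    return librosa.feature.mfcc(y=np.asarray(emg, dtype=float), sr=sr, n_mfcc=n_mfcc).T

def lip_feature(lip_frame_bgr):
    """Grayscale conversion of a lip image followed by Gaussian filtering."""
    gray = cv2.cvtColor(lip_frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (5, 5), 0)

def multi_source_features(audio, emg, lip_frame_bgr):
    """Bundle the three feature groups as the multi-source data features."""
    return {
        "audio": frame_and_window(audio),   # (n_frames, frame_len)
        "emg": emg_mfcc(emg),               # (n_windows, n_mfcc)
        "lip": lip_feature(lip_frame_bgr),  # (H, W) filtered grayscale map
    }
```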
Furthermore, when the fixation point coordinates are predicted from the eye image signals, an eye-movement fixation point mapping model based on a Deep Convolutional Neural Network (DCNN) is used for the prediction. The deep convolutional neural network is formed by stacking a plurality of convolutional layers, pooling layers and fully-connected layers; its input is the eye image data and its output is the fixation point coordinate information. Each convolutional layer contains a plurality of convolution kernels, and each element of a convolution kernel corresponds to a weight coefficient and a bias value (bias vector), analogous to a neuron of a feedforward neural network. Each neuron in a convolutional layer is connected to a plurality of neurons in a region of the previous layer close to its position, and the size of that region depends on the size of the convolution kernel. During operation, the convolution kernel regularly sweeps over the input feature map, performs an element-wise matrix multiplication and summation over the receptive field, and superposes the bias value; the formulas used in this calculation are as follows:
$$Z^{l+1}(i,j)=\left[Z^{l}\otimes w^{l+1}\right](i,j)+b=\sum_{k=1}^{K_{l}}\sum_{x=1}^{f}\sum_{y=1}^{f}\left[Z_{k}^{l}\left(s_{0}i+x,\,s_{0}j+y\right)\,w_{k}^{l+1}(x,y)\right]+b$$

$$L_{l+1}=\frac{L_{l}+2p-f}{s_{0}}+1$$

wherein b is the bias value, $Z^{l}$ and $Z^{l+1}$ represent the convolutional input and output of the (l+1)-th layer, also called feature maps, and $L_{l+1}$ is the size of $Z^{l+1}$, the length and width of the feature map being assumed to be equal. $Z(i,j)$ corresponds to a pixel of the feature map, $K$ is the number of channels of the feature map, and $f$, $s_{0}$ and $p$ are the convolutional layer parameters, corresponding to the convolution kernel size, the convolution stride and the number of padding layers. The representation of the pooling layer is:

$$A_{k}^{l}(i,j)=\left[\sum_{x=1}^{f}\sum_{y=1}^{f}A_{k}^{l}\left(s_{0}i+x,\,s_{0}j+y\right)^{p}\right]^{\frac{1}{p}}$$

wherein the stride $s_{0}$ and the pixel $(i,j)$ have the same meaning as in the convolutional layer, and $p$ is a pre-specified parameter; when $p=1$ it is called average pooling, and when $p\to\infty$ it is called max pooling.
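The convolutional and pooling operations above can be stacked into a small gaze-point regression network. The following sketch maps a single-channel eye image to (x, y) fixation point coordinates; the layer sizes, the 64x64 input resolution and the overall architecture are illustrative assumptions rather than the patent's actual model.

```python
import torch
import torch.nn as nn

class GazePointDCNN(nn.Module):
    """Minimal DCNN that regresses (x, y) fixation point coordinates from an eye image."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),  # f=5, s0=1, p=2
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # p -> infinity pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AvgPool2d(2),                                       # p = 1 pooling
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128),
            nn.ReLU(),
            nn.Linear(128, 2),          # (x, y) fixation point coordinates
        )

    def forward(self, eye_image):
        return self.regressor(self.features(eye_image))

# Usage: a batch of 64x64 grayscale eye images -> gaze coordinates of shape (batch, 2).
model = GazePointDCNN()
coords = model(torch.randn(8, 1, 64, 64))
```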
S103, recognizing the multi-source data characteristics and the fixation point coordinates selected by eyes, and generating a voice text corresponding to the multi-source data characteristics and a scene image description text corresponding to the fixation point coordinates;
in the embodiment of the application, firstly, a pre-trained voice recognition model and voice signal characteristics are utilized to determine voice information, then the voice information is synthesized into a voice text, a fixation point coordinate is obtained according to an eye image signal, the fixation point coordinate information is coded, and finally, a multi-angle description text of a scene image selected by eyes is obtained.
In a possible implementation mode, the human body intention recognition device carries out dense coding on multi-source data characteristics to generate coded multi-source data characteristics, then inputs the coded multi-source data characteristics into a Bert network model trained in advance to generate voice information corresponding to the multi-source data characteristics, then carries out text synthesis on the voice information corresponding to the multi-source data characteristics by using an n-gram language model with a beam search algorithm to generate voice texts corresponding to the multi-source data characteristics, and finally codes the fixation point coordinates selected by eyes to generate scene image description texts corresponding to the fixation point coordinates.
In the generation of the scene image description text, firstly, a scene image selected by eyes is generated according to fixation point coordinates selected by the eyes, then image segmentation, target detection and coordinate information identification are sequentially carried out on the scene image by utilizing the Fast R-CNN algorithm of ResNet101 to generate coding information, and finally, coding modeling is carried out based on the coding information to generate the scene image description text corresponding to the fixation point coordinates.
Specifically, when synthesizing the voice text and the scene image description text, the multi-source data characteristics in the spatial domain and the time domain are first subjected to dense coding based on front-back association; the coded features are then input into a BERT (Bidirectional Encoder Representations from Transformers) network model, where a cross-modal, multi-level attention mechanism is used so that different modalities interact and are cooperatively output during decoding to obtain cooperative voice information of the different modalities; finally, an n-gram language model based on a beam search algorithm is used to synthesize the multi-source cooperative voice information into text.
For example, suppose there is a sentence $S=(w_{1},w_{2},w_{3},\ldots,w_{n})$ consisting of n words, where each word $w_{n}$ depends on the influence of all the words from the first word up to the word immediately preceding it. The probability of the sentence S occurring is then:

$$p(S)=p(w_{1}w_{2}w_{3}\cdots w_{n})=p(w_{1})\,p(w_{2}\mid w_{1})\cdots p(w_{n}\mid w_{n-1}\cdots w_{2}w_{1})$$
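A minimal sketch of this chain-rule factorisation is given below. An n-gram model approximates the full history with the preceding n-1 words; the sketch uses bigrams (n = 2) with add-one smoothing over a toy corpus, all of which are illustrative assumptions.

```python
from collections import Counter

corpus = [["open", "the", "door"], ["close", "the", "door"], ["open", "the", "window"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
vocab = len(unigrams)
total = sum(unigrams.values())

def p_sentence(words):
    """p(S) = p(w1) * prod_i p(w_i | w_{i-1}), bigram approximation with add-one smoothing."""
    prob = (unigrams[words[0]] + 1) / (total + vocab)
    for prev, cur in zip(words, words[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
    return prob

print(p_sentence(["open", "the", "door"]))   # seen word order -> higher probability
print(p_sentence(["door", "the", "open"]))   # unseen word order -> lower probability
```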
and finally, predicting according to the eye image signal to obtain a fixation point coordinate, and encoding fixation point coordinate information to generate a multi-angle description text of the scene image selected by eye movement.
Further, when the fixation point coordinate information is coded to generate a multi-angle description text of the scene image selected by eyes, the scene image selected by eyes is generated according to the fixation point coordinate selected by eyes, then image segmentation, target detection and coordinate information identification are sequentially carried out on the scene image by utilizing the Fast R-CNN algorithm of ResNet101 to generate coded information, and finally, coding modeling is carried out based on the coded information to generate the scene image description text corresponding to the fixation point coordinate.
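The sketch below illustrates one way such encoding information could be produced from the gaze-selected region: crop the scene around the fixation point, run an object detector, and render the detected categories and coordinates as a rough description. It uses torchvision's pre-trained Faster R-CNN with a ResNet-50 backbone as a stand-in for the ResNet101-based Fast R-CNN named in the text, and the crop size, score threshold and textual rendering are illustrative assumptions.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]

def describe_gaze_region(scene, gaze_xy, half=112, score_thr=0.6):
    """Crop the scene around the fixation point, detect objects, and render a rough description."""
    x, y = int(gaze_xy[0]), int(gaze_xy[1])
    _, h, w = scene.shape                      # scene: float tensor (3, H, W), values in [0, 1]
    x0, y0 = max(0, x - half), max(0, y - half)
    x1, y1 = min(w, x + half), min(h, y + half)
    crop = scene[:, y0:y1, x0:x1]
    with torch.no_grad():
        det = detector([crop])[0]
    parts = []
    for box, label, score in zip(det["boxes"], det["labels"], det["scores"]):
        if score >= score_thr:
            bx0, by0, bx1, by1 = (int(v) for v in box.tolist())
            parts.append(f"{categories[int(label)]} at ({x0 + bx0},{y0 + by0})-({x0 + bx1},{y0 + by1})")
    return "; ".join(parts) if parts else "no salient object near the fixation point"

# Usage (scene_tensor is a normalised RGB image tensor):
# text = describe_gaze_region(scene_tensor, gaze_xy=(320, 240))
```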
S104, performing entity extraction on the voice text and the scene image description text to generate entity fragments corresponding to the voice text and the scene image description text;
in a possible implementation manner, the speech text and the scene image description text are input into a pre-trained BERT (Bidirectional Encoder retrieval from Transformer) network model for entity extraction, so as to obtain an entity fragment.
For example, as shown in fig. 2, the human intention recognition apparatus acquires voice audio data, lip image data, facial myoelectricity data, and eye image data of a user through a data acquisition module, then sends the acquired signal data to a data processing module to perform multi-source voice data to text conversion and eye movement selection object image data to text conversion, generates a voice text and a scene image text, inputs the voice text and the scene image text into an entity representation module to perform coreference resolution and semantic analysis and representation, finally generates a human intention recognition result, and finally sends the human intention recognition result to an interaction module to perform interaction.
S105, processing the entity fragment by adopting a coreference resolution algorithm to generate a target object;
in the embodiment of the application, the target object is generated by processing based on an entity coreference resolution algorithm of the menton-Pair, and the menton-Pair model reconstructs the coreference resolution problem into a classification task: a classifier is trained to determine whether a pair of entities is co-referred. In other words, resolving entity mj can be viewed as finding entity mi to maximize the probability of random variable L, i.e.: argmax _ mi P (L | mj, mi), argmax _ mi P is the probability P of the maximized random variable, L is the random variable, mj is the resolving entity, mi is the seeking entity.
And S106, generating a human body intention recognition result based on the voice text, the scene image description text and the target object.
In a possible implementation manner, the human body intention recognition device performs text semantic analysis on a voice text, a scene image description text and a target object to generate a text code, associates the text code with a predefined tuple to generate an executable instantiation tuple, generates a semantic analysis result and a representation result according to the instantiation tuple, and determines the semantic analysis result and the representation result as a human body intention recognition result.
Further, when the human intention recognition device performs text semantic analysis on the voice text, the scene image description text and the target object, generating the text code specifically includes: firstly obtaining syntactic information by using dependency syntactic analysis based on graph neural network learning; then obtaining part-of-speech information through integration in an Embedding manner; then encoding and representing the text with BERT; and finally integrating the part-of-speech information to obtain the coded representation of the text.
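The sketch below shows one way the BERT representation could be combined with part-of-speech information through an embedding layer, as described above; the graph-neural-network dependency analysis is omitted. The HuggingFace model name, the POS tag inventory, the embedding size and the simple concatenation are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class PosAwareTextEncoder(nn.Module):
    """Encodes a text with BERT and concatenates a learned POS embedding for each wordpiece."""
    def __init__(self, pos_tags=("NOUN", "VERB", "ADJ", "PRON", "OTHER"), pos_dim=16):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.pos2id = {t: i for i, t in enumerate(pos_tags)}
        self.pos_emb = nn.Embedding(len(pos_tags), pos_dim)

    def forward(self, text, token_pos):
        """token_pos: one POS tag per wordpiece (including [CLS]/[SEP]), aligned by the caller."""
        enc = self.tokenizer(text, return_tensors="pt")
        hidden = self.bert(**enc).last_hidden_state              # (1, seq_len, 768)
        ids = [self.pos2id.get(t, self.pos2id["OTHER"]) for t in token_pos]
        pos = self.pos_emb(torch.tensor([ids]))                   # (1, seq_len, pos_dim)
        return torch.cat([hidden, pos], dim=-1)                   # (1, seq_len, 768 + pos_dim)
```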
Further, when the human intention recognition device associates the coded text with the predefined tuples to generate executable instantiated tuples, it specifically: firstly uses general predefined tuples to map the coded representation of the text onto the predefined tuples, then uses a biaffine relation classification algorithm to associate all the partially filled tuples, and then obtains, according to the specific target object and the associated filled tuples, an executable instantiated tuple sequence that finally reflects the deep semantics of the whole sentence. The dependency arc score between two tuples is computed by the biaffine relation classification algorithm as:
$$s_{ij}^{arc}=\left(h_{i}^{core}\right)^{\top}W\,h_{j}^{dep}+V^{\top}h_{j}^{dep}$$

wherein $h_{i}^{core}$ is the representation of tuple i as the core tuple, $h_{j}^{dep}$ is the representation of tuple j as the dependency tuple, and W and V are the weight matrix and weight vector used to compute the dependency arc score between the two tuples. The scores of the various dependency relationship types on a given tuple dependency arc are calculated by the biaffine relation classification algorithm as:

$$s_{ij}^{label}=\left(h_{i}^{core}\right)^{\top}W_{1}\,h_{j}^{dep}+W_{2}\left(h_{i}^{core}\oplus h_{j}^{dep}\right)+b$$

wherein $W_{1}$ and $W_{2}$ are the weight matrices used to compute the dependency type scores on the tuple dependency arc, and b is the bias.
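A minimal PyTorch sketch of a biaffine scorer of this general form is given below, where h_core and h_dep stand for the vector representations of the core tuple and the dependency tuple; the hidden size, the number of relation types and the exact parameterisation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Biaffine arc scoring plus relation-type scoring between a core tuple and a dependency tuple."""
    def __init__(self, hidden=256, n_labels=10):
        super().__init__()
        self.W = nn.Parameter(torch.randn(hidden, hidden) * 0.01)            # bilinear arc term
        self.V = nn.Parameter(torch.randn(hidden) * 0.01)                    # linear arc term
        self.W1 = nn.Parameter(torch.randn(n_labels, hidden, hidden) * 0.01) # bilinear label term
        self.W2 = nn.Linear(2 * hidden, n_labels)                            # linear label term, includes bias b

    def arc_score(self, h_core, h_dep):
        # s_arc = h_core^T W h_dep + V^T h_dep
        return h_core @ self.W @ h_dep + self.V @ h_dep

    def label_scores(self, h_core, h_dep):
        # s_label[k] = h_core^T W1[k] h_dep + W2 [h_core ; h_dep] + b
        bilinear = torch.einsum("i,kij,j->k", h_core, self.W1, h_dep)
        return bilinear + self.W2(torch.cat([h_core, h_dep]))

# Usage: score one candidate (core tuple, dependency tuple) pair.
scorer = BiaffineScorer()
h_i, h_j = torch.randn(256), torch.randn(256)
print(scorer.arc_score(h_i, h_j).item(), scorer.label_scores(h_i, h_j).shape)  # scalar, (10,)
```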
Furthermore, the semantic analysis and characterization also include training the model under the guidance of an annotated corpus of a certain scale.
For example, as shown in fig. 3, when generating a speech text, feature extraction is performed on the user speech signal, myoelectric signal and lip image signal, then dense coding is performed on the extracted features, then the densely coded feature signals are input into a BERT model to generate speech information, and then the speech information is processed by using an n-gram language model with a beam search algorithm to generate the speech text.
When a scene image description text is generated, firstly, eye image signal data of a user is extracted, then, the eye image signal data is input into a pre-trained eye movement and fixation point mapping model of a Deep Convolutional Neural Network (DCNN), fixation point coordinates selected by eyes corresponding to a current human body are generated, then, a scene image selected by the eyes is generated according to the fixation point coordinates selected by the eyes, image segmentation, target detection and coordinate information identification are sequentially carried out on the scene image by utilizing a Fast R-CNN algorithm of ResNet101, coding information is generated, and finally, coding modeling is carried out based on the coding information, and the scene image description text corresponding to the fixation point coordinates is generated.
After the voice text and the scene image description text are generated, syntactic information is obtained by using dependency syntactic analysis based on graph neural network learning, part-of-speech information is then obtained through integration in an Embedding manner, the text is then encoded and represented with BERT, and the coded representation of the text is finally obtained by integrating the part-of-speech information; finally, an executable instantiated tuple sequence reflecting the deep semantics of the whole sentence is obtained according to the specific target object and the associated filled tuples. The dependency arc score between two tuples is calculated through the biaffine relation classification algorithm, and finally semantic analysis and characterization are performed.
In the embodiment of the application, a human body intention recognition system firstly collects a characteristic signal of a current human body in real time, then generates a multi-source data characteristic corresponding to the current human body and a fixation point coordinate selected by eyes based on the characteristic signal, then recognizes the multi-source data characteristic and the fixation point coordinate selected by the eyes, generates a voice text corresponding to the multi-source data characteristic and a scene image description text corresponding to the fixation point coordinate, then performs entity extraction on the voice text and the scene image description text, generates an entity segment corresponding to the voice text and the scene image description text, processes the entity segment by adopting a coreference resolution algorithm, generates a target object, and finally generates a human body intention recognition result based on the voice text, the scene image description text and the target object. According to the method and the device, the voice characteristic signals and the facial characteristic signals of the human body are collected in real time to carry out fusion interaction, so that redundancy and ambiguity of multi-mode interaction information are effectively overcome, semantics in the information are enriched, consensus among human and machines is established, and the efficiency of recognizing human intentions by a machine is improved.
The following are embodiments of systems of the present invention that may be used to perform embodiments of methods of the present invention. For details which are not disclosed in the embodiments of the system of the present invention, reference is made to the embodiments of the method of the present invention.
Referring to fig. 4, a schematic structural diagram of a human intention recognition system according to an exemplary embodiment of the invention is shown. The human intent recognition system may be implemented as all or part of an electronic device, in software, hardware, or a combination of both. The system 1 includes a signal acquisition module 10, a data generation module 20, a text generation module 30, an entity extraction module 40, a target object generation module 50, and a recognition result generation module 60.
The signal acquisition module 10 is used for acquiring the characteristic signals of the current human body in real time;
a data generating module 20, configured to generate, based on the feature signal, a multi-source data feature corresponding to the current human body and a fixation point coordinate selected by an eye;
a text generating module 30, configured to identify the multi-source data features and the gaze point coordinates selected by the eyes, and generate a voice text corresponding to the multi-source data features and a scene image description text corresponding to the gaze point coordinates;
an entity extraction module 40, configured to perform entity extraction on the voice text and the scene image description text, and generate entity segments corresponding to the voice text and the scene image description text;
a target object generation module 50, configured to process the entity fragment by using a coreference resolution algorithm, so as to generate a target object;
and a recognition result generating module 60, configured to generate a human intent recognition result based on the voice text, the scene image description text, and the target object.
It should be noted that, when the human intention identifying system provided in the foregoing embodiment executes the human intention identifying method, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed to different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the human body intention recognition system provided by the above embodiment and the human body intention recognition method embodiment belong to the same concept, and the detailed implementation process thereof is referred to the method embodiment, which is not described herein again.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the embodiment of the application, a human intention recognition system firstly collects characteristic signals of a current human body in real time, then generates multi-source data characteristics corresponding to the current human body and fixation point coordinates selected by eyes based on the characteristic signals, then recognizes the multi-source data characteristics and the fixation point coordinates selected by the eyes, generates a voice text corresponding to the multi-source data characteristics and a scene image description text corresponding to the fixation point coordinates, then performs entity extraction on the voice text and the scene image description text, generates entity fragments corresponding to the voice text and the scene image description text, processes the entity fragments by adopting a coreference resolution algorithm, generates a target object, and finally generates a human intention recognition result based on the voice text, the scene image description text and the target object. According to the method and the device, the voice characteristic signals and the facial characteristic signals of the human body are collected in real time to carry out fusion interaction, so that redundancy and ambiguity of multi-mode interaction information are effectively overcome, semantics in the information are enriched, consensus among human and machines is established, and the efficiency of recognizing human intentions by a machine is improved.
The present invention also provides a computer readable medium, on which program instructions are stored, which when executed by a processor implement the human intent recognition method provided by the various method embodiments described above.
The present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the human intent recognition method described in the various method embodiments above.
Please refer to fig. 5, which is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 1000 may include: at least one processor 1001, at least one network interface 1004, a user interface 1003, memory 1005, at least one communication bus 1002.
The communication bus 1002 is used to implement connection communication among these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
Processor 1001 may include one or more processing cores, among other things. The processor 1001 interfaces various components throughout the electronic device 1000 using various interfaces and lines to perform various functions of the electronic device 1000 and to process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 1005 and invoking data stored in the memory 1005. Alternatively, the processor 1001 may be implemented in at least one hardware form of Digital Signal Processing (DSP), field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1001 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing the content required to be displayed by the display screen; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 1001, but may be implemented by a single chip.
The Memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1005 includes a non-transitory computer-readable medium. The memory 1005 may be used to store an instruction, a program, code, a set of codes, or a set of instructions. The memory 1005 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the storage data area may store data and the like referred to in the above respective method embodiments. The memory 1005 may optionally be at least one memory system located remotely from the processor 1001. As shown in fig. 5, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a human intention recognition application.
In the electronic device 1000 shown in fig. 5, the user interface 1003 is mainly used as an interface for providing input for a user, and acquiring data input by the user; and the processor 1001 may be configured to invoke the human intent recognition application stored in the memory 1005 and specifically perform the following operations:
collecting current human body characteristic signals in real time;
generating multi-source data characteristics corresponding to the current human body and fixation point coordinates selected by eyes based on the characteristic signals;
recognizing the multi-source data characteristics and the fixation point coordinates selected by eyes, and generating a voice text corresponding to the multi-source data characteristics and a scene image description text corresponding to the fixation point coordinates;
performing entity extraction on the voice text and the scene image description text to generate entity fragments corresponding to the voice text and the scene image description text;
processing the entity fragment by adopting a coreference resolution algorithm to generate a target object;
and generating a human intention recognition result based on the voice text, the scene image description text and the target object.
In one embodiment, the processor 1001, after performing the generating of the human intention recognition result, further performs the following operations:
and displaying the human body intention recognition result and sending the human body intention recognition result to external equipment to control the external equipment to execute functions.
In an embodiment, when the processor 1001 performs the generation of the multi-source data feature corresponding to the current human body and the fixation point coordinate selected by the eye based on the feature signal, specifically performs the following operations:
respectively carrying out data preprocessing on the audio signal, the lip image signal and the facial myoelectric signal to generate multisource data characteristics corresponding to the current human body;
and extracting the fixation point coordinates of the eye image signals to generate fixation point coordinates selected by eyes corresponding to the current human body.
In an embodiment, when the processor 1001 performs the data preprocessing on the audio signal, the lip image signal, and the facial myoelectric signal, respectively, to generate the multi-source data feature corresponding to the current human body, the following operations are specifically performed:
performing framing and windowing processing on the audio signal to generate audio signal data characteristics;
extracting a Mel cepstrum coefficient of the facial electromyographic signals to generate facial electromyographic signal data characteristics;
converting a grey scale map of the lip image signal, and filtering by using a filter to generate lip image signal data characteristics;
and determining the audio signal data characteristics, the facial myoelectric signal data characteristics and the lip image signal data characteristics as the multi-source data characteristics corresponding to the current human body.
In an embodiment, when performing the gaze point coordinate extraction on the eye image signal to generate the gaze point coordinate selected by the eye corresponding to the current human body, the processor 1001 specifically performs the following operations:
and inputting the eye image signal into a pre-trained fixation point mapping model to generate fixation point coordinates selected by eyes corresponding to the current human body.
In an embodiment, when the processor 1001 performs the recognition of the multi-source data feature and the gaze point coordinate selected by the eye, and generates the voice text corresponding to the multi-source data feature and the scene image description text corresponding to the gaze point coordinate, the following operations are specifically performed:
carrying out dense coding on the multi-source data characteristics to generate coded multi-source data characteristics;
inputting the coded multi-source data characteristics into a pre-trained Bert network model to generate voice information corresponding to the multi-source data characteristics;
performing text synthesis on the voice information corresponding to the multi-source data characteristics by using an n-gram language model with a beam search algorithm to generate a voice text corresponding to the multi-source data characteristics;
and coding the fixation point coordinate selected by the eyes to generate a scene image description text corresponding to the fixation point coordinate.
In an embodiment, when the processor 1001 performs the encoding of the gaze point coordinates selected by the eye to generate the scene image description text corresponding to the gaze point coordinates, the following operations are specifically performed:
generating a scene image selected by the eyes according to the fixation point coordinates selected by the eyes;
sequentially carrying out image segmentation, target detection and coordinate information identification on the scene image by using a Fast R-CNN algorithm of ResNet101 to generate coding information;
and performing coding modeling based on the coding information to generate a scene image description text corresponding to the gazing point coordinate.
In one embodiment, when the processor 1001 executes the generation of the human intention recognition result based on the voice text, the scene image description text, and the target object, the following operations are specifically performed:
performing text semantic analysis on the voice text, the scene image description text and the target object to generate a text code;
associating the text codes with predefined tuples to generate executable instantiation tuples;
generating a semantic analysis result and a representation result according to the instantiation tuple;
and determining the semantic analysis result and the characterization result as human intention recognition results.
In the embodiment of the application, a human body intention recognition system firstly collects a characteristic signal of a current human body in real time, then generates a multi-source data characteristic corresponding to the current human body and a fixation point coordinate selected by eyes based on the characteristic signal, then recognizes the multi-source data characteristic and the fixation point coordinate selected by the eyes, generates a voice text corresponding to the multi-source data characteristic and a scene image description text corresponding to the fixation point coordinate, then performs entity extraction on the voice text and the scene image description text, generates an entity segment corresponding to the voice text and the scene image description text, processes the entity segment by adopting a coreference resolution algorithm, generates a target object, and finally generates a human body intention recognition result based on the voice text, the scene image description text and the target object. According to the method and the device, the voice characteristic signals and the facial characteristic signals of the human body are collected in real time to carry out fusion interaction, so that redundancy and ambiguity of multi-mode interaction information are effectively overcome, semantics in the information are enriched, consensus among human and machines is established, and the efficiency of recognizing human intentions by a machine is improved.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims (7)

1. A human intent recognition method, the method comprising:
collecting current human body characteristic signals in real time;
generating multi-source data characteristics corresponding to the current human body and fixation point coordinates selected by eyes based on the characteristic signals;
recognizing the multi-source data characteristics and the fixation point coordinates selected by eyes, and generating a voice text corresponding to the multi-source data characteristics and a scene image description text corresponding to the fixation point coordinates; wherein,
the recognizing the multi-source data characteristics and the fixation point coordinates selected by eyes, and generating the voice text corresponding to the multi-source data characteristics and the scene image description text corresponding to the fixation point coordinates comprise:
carrying out dense coding on the multi-source data characteristics to generate coded multi-source data characteristics;
inputting the coded multi-source data characteristics into a pre-trained Bert network model to generate voice information corresponding to the multi-source data characteristics;
performing text synthesis on the voice information corresponding to the multi-source data characteristics by using an n-gram language model with a beam search algorithm to generate a voice text corresponding to the multi-source data characteristics;
coding the fixation point coordinate selected by the eyes to generate a scene image description text corresponding to the fixation point coordinate; wherein,
the encoding the fixation point coordinate selected by the eye to generate a scene image description text corresponding to the fixation point coordinate includes:
generating a scene image selected by the eyes according to the fixation point coordinates selected by the eyes;
sequentially carrying out image segmentation, target detection and coordinate information identification on the scene image by using a Fast R-CNN algorithm of ResNet101 to generate coding information;
performing coding modeling based on the coding information, and generating a scene image description text corresponding to the fixation point coordinates;
entity extraction is carried out on the voice text and the scene image description text, and entity fragments corresponding to the voice text and the scene image description text are generated;
processing the entity fragment by adopting a coreference resolution algorithm to generate a target object;
generating a human body intention recognition result based on the voice text, the scene image description text and the target object; wherein,
the generating a human body intention recognition result based on the voice text, the scene image description text and the target object comprises the following steps:
performing text semantic analysis on the voice text, the scene image description text and the target object to generate a text code;
associating the text codes with predefined tuples to generate executable instantiation tuples;
generating a semantic analysis result and a characterization result according to the instantiation tuple;
and determining the semantic analysis result and the characterization result as the human body intention recognition result.
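For illustration, here is a minimal sketch of the final steps recited in claim 1 above: binding parsed text to a predefined, executable tuple. The (action, object, location) schema and the keyword matching used here are assumptions for the example, not the patent's actual semantic analysis.

```python
# Minimal sketch of associating a text "code" with a predefined tuple to make it
# executable. The tuple schema and keyword matching are illustrative assumptions.
from typing import NamedTuple, Optional


class IntentTuple(NamedTuple):      # predefined tuple schema (assumed)
    action: str
    target_object: str
    location: Optional[str]


KNOWN_ACTIONS = {"grab", "move", "open", "point"}


def instantiate_tuple(voice_text: str, scene_text: str, target: str) -> IntentTuple:
    """Fill the predefined tuple from the parsed texts and the resolved target object."""
    tokens = voice_text.lower().split()
    action = next((t for t in tokens if t in KNOWN_ACTIONS), "unknown")
    # Use the scene description to fill the optional location slot if possible.
    combined = (voice_text + " " + scene_text).lower()
    location = "table" if "table" in combined else None
    return IntentTuple(action=action, target_object=target, location=location)


result = instantiate_tuple(
    voice_text="grab the red cup on the table",
    scene_text="a red cup next to a laptop",
    target="red cup",
)
print(result)  # IntentTuple(action='grab', target_object='red cup', location='table')
```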
2. The method according to claim 1, further comprising, after generating the human body intention recognition result:
displaying the human body intention recognition result, and sending the human body intention recognition result to external equipment to control the external equipment to execute a corresponding function.
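As a non-authoritative example of the step recited in claim 2, the sketch below serializes a recognition result as JSON and pushes it to an external device over TCP; the message format, host address and port are assumptions made for this example.

```python
# Illustrative only: push the human body intention recognition result to an
# external device as a JSON message over TCP. Host, port and message format
# are assumptions.
import json
import socket


def send_intent(result: dict, host: str, port: int = 9000) -> None:
    """Serialize the recognition result and send it to the controlled device."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=2.0) as conn:
        conn.sendall(payload)


# Example usage (requires a device listening on the given host/port):
# send_intent({"action": "grab", "object": "red cup"}, host="192.168.1.50")
```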
3. The method according to claim 1 or 2, wherein the characteristic signals include an audio signal, a lip image signal, a facial electromyographic signal and an eye image signal;
the generating of the multi-source data characteristics corresponding to the current human body and the fixation point coordinates selected by the eyes based on the characteristic signals comprises:
respectively performing data preprocessing on the audio signal, the lip image signal and the facial electromyographic signal to generate the multi-source data characteristics corresponding to the current human body;
and extracting fixation point coordinates from the eye image signal to generate the fixation point coordinates selected by the eyes corresponding to the current human body.
4. The method according to claim 3, wherein the performing data preprocessing on the audio signal, the lip image signal and the facial electromyographic signal to generate the multi-source data characteristics corresponding to the current human body comprises:
performing framing and windowing processing on the audio signal to generate audio signal data characteristics;
extracting Mel cepstrum coefficients of the facial electromyographic signal to generate facial electromyographic signal data characteristics;
converting the lip image signal into a gray-scale image and filtering it with a filter to generate lip image signal data characteristics;
and determining the audio signal data characteristics, the facial electromyographic signal data characteristics and the lip image signal data characteristics as the multi-source data characteristics corresponding to the current human body.
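A minimal sketch of the per-modality preprocessing recited in claim 4, assuming librosa and OpenCV as stand-in tools; the frame length, window type, filter size and Mel cepstrum settings are illustrative assumptions rather than values taken from the patent.

```python
# Sketch of per-modality preprocessing: audio framing/windowing, Mel cepstrum
# coefficients of the facial EMG channel, and gray-scale conversion plus
# Gaussian filtering of the lip image. All parameter values are assumptions.
import numpy as np
import librosa
import cv2


def frame_and_window(audio: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split the audio signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([audio[i * hop:i * hop + frame_len] * window for i in range(n_frames)])


def emg_mel_cepstrum(emg: np.ndarray, sr: int = 1000, n_mfcc: int = 13) -> np.ndarray:
    """Mel cepstrum coefficients of the facial EMG signal, treated as a 1-D time series."""
    return librosa.feature.mfcc(y=emg.astype(np.float32), sr=sr, n_mfcc=n_mfcc, n_mels=40)


def lip_gray_filtered(lip_bgr: np.ndarray) -> np.ndarray:
    """Convert a lip image to gray scale and smooth it with a Gaussian filter."""
    gray = cv2.cvtColor(lip_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (5, 5), 0)


# Toy inputs standing in for real captured signals.
audio_feat = frame_and_window(np.random.randn(16000))
emg_feat = emg_mel_cepstrum(np.random.randn(4000))
lip_feat = lip_gray_filtered(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
multi_source = {"audio": audio_feat, "emg": emg_feat, "lip": lip_feat}
```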
5. The method according to claim 3, wherein the extracting fixation point coordinates from the eye image signal to generate the fixation point coordinates selected by the eyes corresponding to the current human body comprises:
inputting the eye image signal into a pre-trained fixation point mapping model to generate the fixation point coordinates selected by the eyes corresponding to the current human body.
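Claim 5 only requires a pre-trained fixation point mapping model. As one hypothetical realization, the sketch below defines a small convolutional regressor that maps a grayscale eye image to normalized (x, y) fixation coordinates; the architecture and input size are assumptions.

```python
# A minimal sketch of a fixation point mapping model: a small CNN that regresses
# normalized (x, y) coordinates from a 64x64 grayscale eye image. Architecture
# and input size are illustrative assumptions.
import torch
import torch.nn as nn


class FixationPointMapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, 2),   # (x, y) fixation point
            nn.Sigmoid(),        # normalized to [0, 1] screen coordinates
        )

    def forward(self, eye_image: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.features(eye_image))


model = FixationPointMapper().eval()   # weights would come from pre-training
with torch.no_grad():
    eye = torch.rand(1, 1, 64, 64)     # one grayscale 64x64 eye image
    x, y = model(eye).squeeze(0).tolist()
print(f"fixation point (normalized): ({x:.2f}, {y:.2f})")
```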
6. A human intent recognition system, the system comprising:
the signal acquisition module is used for acquiring the characteristic signals of the current human body in real time;
the data generation module is used for generating multi-source data characteristics corresponding to the current human body and fixation point coordinates selected by eyes based on the characteristic signals;
the text generation module is used for recognizing the multi-source data characteristics and the fixation point coordinates selected by eyes, and generating a voice text corresponding to the multi-source data characteristics and a scene image description text corresponding to the fixation point coordinates; wherein,
the text generation module is specifically configured to:
carrying out dense coding on the multi-source data characteristics to generate coded multi-source data characteristics;
inputting the coded multi-source data characteristics into a pre-trained BERT network model to generate voice information corresponding to the multi-source data characteristics;
performing text synthesis on the voice information corresponding to the multi-source data characteristics by using an n-gram language model with a beam search algorithm to generate a voice text corresponding to the multi-source data characteristics;
coding the fixation point coordinate selected by the eyes to generate a scene image description text corresponding to the fixation point coordinate; wherein,
the encoding the fixation point coordinate selected by the eye to generate a scene image description text corresponding to the fixation point coordinate includes:
generating a scene image selected by the eyes according to the fixation point coordinates selected by the eyes;
sequentially carrying out image segmentation, target detection and coordinate information identification on the scene image by using a ResNet101-based Fast R-CNN algorithm to generate coding information;
performing coding modeling based on the coding information to generate a scene image description text corresponding to the fixation point coordinates;
the entity extraction module is used for performing entity extraction on the voice text and the scene image description text to generate entity fragments corresponding to the voice text and the scene image description text;
the target object generation module is used for processing the entity fragments by adopting a coreference resolution algorithm to generate a target object;
the recognition result generation module is used for generating a human body intention recognition result based on the voice text, the scene image description text and the target object; wherein,
the identification result generation module is specifically configured to:
performing text semantic analysis on the voice text, the scene image description text and the target object to generate a text code;
associating the text codes with predefined tuples to generate executable instantiation tuples;
generating a semantic analysis result and a characterization result according to the instantiation tuple;
and determining the semantic analysis result and the characterization result as the human body intention recognition result.
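For the gaze-driven scene description branch recited in claims 1 and 6, the following sketch crops a region around the fixation point, runs an off-the-shelf detector and emits a template description. torchvision's Faster R-CNN (ResNet-50 FPN) is used purely as a stand-in for the ResNet101-based Fast R-CNN named in the claims, and the label map, score threshold and crop size are assumptions.

```python
# Illustrative sketch: crop the scene around the fixation point, detect objects
# with a stand-in detector, and build a template scene description text.
# fasterrcnn_resnet50_fpn is NOT the network recited in the claims; it is used
# only to show the data flow (requires torchvision >= 0.13 for weights=...).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

STAND_IN_LABELS = {47: "cup", 73: "laptop"}   # tiny hypothetical label map


def crop_around_fixation(frame: torch.Tensor, fx: float, fy: float, size: int = 224) -> torch.Tensor:
    """Crop a size x size region of the scene frame centred on the normalized fixation point."""
    _, h, w = frame.shape
    cx, cy = int(fx * w), int(fy * h)
    x0 = max(0, min(w - size, cx - size // 2))
    y0 = max(0, min(h - size, cy - size // 2))
    return frame[:, y0:y0 + size, x0:x0 + size]


def describe_region(region: torch.Tensor, detector) -> str:
    """Run detection on the gaze-selected region and emit a template description."""
    with torch.no_grad():
        detections = detector([region])[0]
    names = [STAND_IN_LABELS.get(int(label), "object")
             for label, score in zip(detections["labels"], detections["scores"])
             if float(score) > 0.5]
    if not names:
        return "no confident objects near the fixation point"
    return "the user is looking at " + ", ".join(names)


detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # downloads COCO weights
frame = torch.rand(3, 480, 640)                                # stand-in scene image in [0, 1]
region = crop_around_fixation(frame, fx=0.6, fy=0.4)
print(describe_region(region, detector))
```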
7. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to perform the method steps according to any one of claims 1 to 5.
CN202010699862.3A 2020-07-20 2020-07-20 Human body intention identification method, system and storage medium Active CN111967334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010699862.3A CN111967334B (en) 2020-07-20 2020-07-20 Human body intention identification method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010699862.3A CN111967334B (en) 2020-07-20 2020-07-20 Human body intention identification method, system and storage medium

Publications (2)

Publication Number Publication Date
CN111967334A CN111967334A (en) 2020-11-20
CN111967334B true CN111967334B (en) 2023-04-07

Family

ID=73361789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010699862.3A Active CN111967334B (en) 2020-07-20 2020-07-20 Human body intention identification method, system and storage medium

Country Status (1)

Country Link
CN (1) CN111967334B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642336B (en) * 2021-08-27 2024-03-08 青岛全掌柜科技有限公司 SaaS-based insurance automatic question-answering method and system
CN115237255B (en) * 2022-07-29 2023-10-31 天津大学 Natural image co-pointing target positioning system and method based on eye movement and voice
CN116185182B (en) * 2022-12-30 2023-10-03 天津大学 Controllable image description generation system and method for fusing eye movement attention

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101092820B1 (en) * 2009-09-22 2011-12-12 현대자동차주식회사 Lipreading and Voice recognition combination multimodal interface system
CN109961780B (en) * 2017-12-22 2024-02-02 深圳市优必选科技有限公司 A man-machine interaction method a device(s) Server and storage medium
CN110099246A (en) * 2019-02-18 2019-08-06 深度好奇(北京)科技有限公司 Monitoring and scheduling method, apparatus, computer equipment and storage medium
CN110633475A (en) * 2019-09-27 2019-12-31 安徽咪鼠科技有限公司 Natural language understanding method, device and system based on computer scene and storage medium
CN111026873B (en) * 2019-10-24 2023-06-20 中国人民解放军军事科学院国防科技创新研究院 Unmanned vehicle and navigation method and device thereof

Also Published As

Publication number Publication date
CN111967334A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
Adaloglou et al. A comprehensive study on deep learning-based methods for sign language recognition
CN111967334B (en) Human body intention identification method, system and storage medium
CN108227932A (en) Interaction is intended to determine method and device, computer equipment and storage medium
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
CN112487182A (en) Training method of text processing model, and text processing method and device
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN112668671A (en) Method and device for acquiring pre-training model
CN108334583A (en) Affective interaction method and device, computer readable storage medium, computer equipment
CN111951805A (en) Text data processing method and device
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN110110169A (en) Man-machine interaction method and human-computer interaction device
CN113421547B (en) Voice processing method and related equipment
US11900518B2 (en) Interactive systems and methods
WO2023284435A1 (en) Method and apparatus for generating animation
CN115169507B (en) Brain-like multi-mode emotion recognition network, recognition method and emotion robot
CN113380271B (en) Emotion recognition method, system, device and medium
CN114882862A (en) Voice processing method and related equipment
CN115937369A (en) Expression animation generation method and system, electronic equipment and storage medium
CN116993876B (en) Method, device, electronic equipment and storage medium for generating digital human image
Dweik et al. Read my lips: Artificial intelligence word-level arabic lipreading system
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN116665275A (en) Facial expression synthesis and interaction control method based on text-to-Chinese pinyin
Balayn et al. Data-driven development of virtual sign language communication agents
CN114510942A (en) Method for acquiring entity words, and method, device and equipment for training model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant