CN108986801B

CN108986801B - Man-machine interaction method and device and man-machine interaction terminal

Info

Publication number: CN108986801B
Application number: CN201710408396.7A
Authority: CN
Inventors: 杜广龙
Original assignee: South China University of Technology SCUT; Tencent Technology Shenzhen Co Ltd
Current assignee: South China University of Technology SCUT; Tencent Technology Shenzhen Co Ltd
Priority date: 2017-06-02
Filing date: 2017-06-02
Publication date: 2020-06-05
Anticipated expiration: 2037-06-02
Also published as: WO2018219198A1; CN108986801A

Abstract

The embodiment of the invention provides a man-machine interaction method, a man-machine interaction device and a man-machine interaction terminal, wherein the method comprises the following steps: acquiring control information conveyed by a user, wherein the control information comprises voice information; extracting text features of the voice information; determining a text feature vector corresponding to the text feature; determining a voice sample matched with the text feature vector according to a pre-trained voice classification model; the voice classification model represents the attribution probability of a text feature vector and a corresponding voice sample; taking the voice control instruction corresponding to the determined voice sample as the voice control instruction of the voice information; and generating a target control instruction according to the voice control instruction. The embodiment of the invention can improve the naturalness and intelligence of the human-computer interaction and reduce the user threshold of the human-computer interaction, thereby providing powerful support for the popularization of the human-computer interaction.

Description

Man-machine interaction method and device and man-machine interaction terminal

Technical Field

The invention relates to the technical field of human-computer interaction, in particular to a human-computer interaction method, a human-computer interaction device and a human-computer interaction terminal.

Background

Human-computer interaction refers to a technology for enabling a user to communicate with a machine, so that the machine can understand the intention of the user; specifically, through human-computer interaction, a user may enable the machine to perform work intended by the user by communicating control information to the machine. Human-computer interaction has wide application in a plurality of fields, relates to aspects such as mobile phone control, automobile automatic driving and the like, and particularly, along with the development of robot (such as a server robot) technology, how to better apply the human-computer interaction technology in the aspect of robot control becomes a key point for improving the robot technology.

The inventor of the invention finds that the problem to be solved by the existing human-computer interaction technology is how to improve the naturalness and intelligence of human-computer interaction, so that the user threshold of human-computer interaction is reduced, and the human-computer interaction technology can be widely popularized.

Disclosure of Invention

In view of this, embodiments of the present invention provide a human-computer interaction method, a human-computer interaction device, and a human-computer interaction terminal, so as to improve naturality and intelligence of human-computer interaction, reduce a user threshold of human-computer interaction, and provide a strong support for popularization of human-computer interaction.

In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:

a human-computer interaction method, comprising:

acquiring control information conveyed by a user, wherein the control information comprises voice information;

extracting text features of the voice information;

determining a text feature vector corresponding to the text feature;

determining a voice sample matched with the text feature vector according to a pre-trained voice classification model; the voice classification model represents the attribution probability of a text feature vector and a corresponding voice sample;

taking the voice control instruction corresponding to the determined voice sample as the voice control instruction of the voice information;

and generating a target control instruction according to the voice control instruction.

An embodiment of the present invention further provides a human-computer interaction device, including:

the control information acquisition module is used for acquiring control information conveyed by a user, and the control information comprises voice information;

the text feature extraction module is used for extracting text features of the voice information;

the text feature vector determining module is used for determining a text feature vector corresponding to the text feature;

the voice sample determining module is used for determining a voice sample matched with the text feature vector according to a pre-trained voice classification model; the voice classification model represents the attribution probability of a text feature vector and a corresponding voice sample;

the voice instruction determining module is used for taking the voice control instruction corresponding to the determined voice sample as the voice control instruction of the voice information;

and the target instruction generating module is used for generating a target control instruction according to the voice control instruction.

An embodiment of the present invention further provides a human-computer interaction terminal, including: at least one memory and at least one processor;

the memory stores a program, and the processor calls the program; the program is for:

extracting text features of the voice information;

determining a text feature vector corresponding to the text feature;

Based on the technical scheme, the man-machine interaction method provided by the embodiment of the invention can be used for extracting text features of voice information in control information conveyed by a user and determining corresponding text feature vectors; thus, according to the pre-trained speech classification model, the speech sample matched with the text feature vector can be determined; and then the voice control instruction corresponding to the determined voice sample is used as the voice control instruction of the voice information, and the target control instruction is generated through the voice control instruction, so that the generation of the target control instruction for the machine in the man-machine interaction process is realized.

The pre-trained voice classification model can accurately define the probability that each text feature vector belongs to the voice sample with possible intention, so that the corresponding relation between the voice sample and the text feature vector is more accurate; therefore, by means of the embodiment of the invention, a user can carry out human-computer interaction in a human-to-human communication mode, after the user transmits the voice information to the human-computer interaction terminal through natural voice information, the human-computer interaction terminal can accurately identify the voice sample matched with the voice information transmitted by the user by using the voice classification model, and therefore, the voice control instruction of the voice information intention transmitted by the user is identified through the matched voice sample. By utilizing the embodiment of the invention, the way of transmitting the voice information by the user can be more natural, and the human-computer interaction terminal can accurately match the voice sample of the voice information of the user through the voice classification model, so as to realize the accurate determination of the voice control instruction of the voice information intention of the user, thereby improving the naturalness and intelligence of the human-computer interaction, reducing the communication threshold of the human-computer interaction of the user and providing a powerful support for the popularization of the human-computer interaction.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a block diagram of a human-computer interaction system according to an embodiment of the present invention;

FIG. 2 is another block diagram of a human-computer interaction system according to an embodiment of the present invention;

FIG. 3 is a block diagram of a human-machine interaction terminal;

FIG. 4 is a flowchart of a method for constructing a speech classification model according to an embodiment of the present invention;

FIG. 5 is a flowchart of a human-computer interaction method according to an embodiment of the present invention;

FIG. 6 is an exemplary diagram of human interaction;

FIG. 7 is another flowchart of a human-computer interaction method according to an embodiment of the present invention;

FIG. 8 is a flow chart of a method for improving the gesture feature of particle filter processing;

fig. 9 is a flowchart of a target object identification method according to an embodiment of the present invention;

FIG. 10 is a block diagram of a human-computer interaction device according to an embodiment of the present invention;

FIG. 11 is a block diagram of another exemplary embodiment of a human-computer interaction device;

FIG. 12 is a block diagram of another exemplary embodiment of a human-computer interaction device;

FIG. 13 is a block diagram of another exemplary embodiment of a human-computer interaction device;

FIG. 14 is a block diagram of another exemplary embodiment of a human-computer interaction device;

fig. 15 is still another block diagram of a human-computer interaction device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The man-machine interaction method provided by the embodiment of the invention can be applied to the aspects of robot control, mobile phone control, automatic driving and the like; for convenience of explanation, the following description will mainly describe a human-computer interaction method provided by the embodiment of the present invention in terms of service robot control; of course, the use principle of the human-computer interaction method provided by the embodiment of the invention in the aspects of mobile phone control, automatic driving and the like is consistent with the use principle in the aspect of service robot control, and can be referred to each other.

It should be introduced that the service robot is one of the robots, the service robot can be divided into a professional field service robot and a personal and family service robot, and the service robot has a wide application range, and is mainly used for maintenance, repair, transportation, cleaning, security, rescue, monitoring and the like.

Optionally, fig. 1 is a block diagram of an optional structure of a human-computer interaction system provided in an embodiment of the present invention, and referring to fig. 1, the human-computer interaction system may include: a man-machine interaction terminal 10 and a service robot 11; the human-computer interaction terminal 10 and the service robot 11 can realize information interaction through the Internet;

based on the human-computer interaction system shown in fig. 1, a user can transmit control information to a human-computer interaction terminal, the human-computer interaction terminal can transmit the control instruction to a service robot through the internet after understanding a control instruction corresponding to the control information transmitted by the user, and the service robot executes the control instruction to complete the work intended by the user;

optionally, the mode of the user for communicating the control information to the human-computer interaction terminal may be voice; voice combined gestures and the like are also possible;

further, the service robot can transmit the state information of the robot and/or the environment information based on visual perception to the human-computer interaction terminal through the internet, and the human-computer interaction terminal displays the state information of the robot and/or the environment information around the service robot (which can be displayed through a display screen of the human-computer interaction terminal) to a user, so that the user can better convey control information.

The human-computer interaction system shown in fig. 1 can transmit information between the human-computer interaction terminal and the service robot through the internet, so as to realize remote control of the service robot by a user; certainly, fig. 1 shows only an optional structure of the human-computer interaction system, and optionally, the embodiment of the present invention does not exclude the case that a human-computer interaction terminal is built in the service robot, as shown in fig. 2, so that the human-computer interaction terminal can control the service robot to operate through local communication (in the form of local wired or local area network wireless, etc.); the human-computer interaction system shown in fig. 2 may be similar to the human-computer interaction system shown in fig. 1, except that the communication mode is changed from communication through the internet to local communication.

Optionally, the human-computer interaction terminal may be regarded as an interaction platform between the service robot and the user, and a control terminal for controlling the service robot; the man-machine interaction terminal can be arranged separately from the service robot, information interaction is realized through the internet, or the man-machine interaction terminal can be arranged in the service robot, and the man-machine interaction terminal can transmit a corresponding control instruction to the service robot after the control instruction corresponding to the control information of the user is understood, so that control components (such as a motor, a motor and the like) of the service robot are controlled, and the work intended by the user is completed.

In the embodiment of the invention, the human-computer interaction terminal can be loaded with corresponding programs to realize the human-computer interaction method provided by the embodiment of the invention; the program can be stored by a memory of the human-computer interaction terminal and called and executed by a processor of the human-computer interaction terminal; alternatively, fig. 3 shows an alternative structure of the human-computer interaction terminal, and referring to fig. 3, the human-computer interaction terminal may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

in the embodiment of the present invention, the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 complete mutual communication through the communication bus 4; it is clear that the communication connection illustration of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 shown in fig. 3 is only optional;

the processor 1 may be a central processing unit CPU or an application specific Integrated circuit asic or one or more Integrated circuits configured to implement embodiments of the present invention.

The memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (non-volatile memory), such as at least one disk memory.

The memory 3 stores programs, and the processor 1 calls the programs stored in the memory 3 to realize the human-computer interaction method provided by the embodiment of the invention.

The voice is used as a way for conveying control information commonly used by the user, and the man-machine interaction method provided by the embodiment of the invention is introduced below under the condition that the control information conveyed to the man-machine interaction terminal by the user comprises the voice. The man-machine interaction method described below is applicable to service robot control, mobile phone control, automatic driving, and the like.

In order to improve the naturalness and intelligence of human-computer interaction, it is necessary to enable the service robot to more accurately and quickly understand the intention of the voice of the user, so that the embodiment of the invention considers and constructs an accurate and efficient voice classification model, thereby more accurately and quickly identifying the voice sample corresponding to the voice transmitted by the user, and determining the voice control instruction intended by the voice transmitted by the user according to the voice control instruction corresponding to the voice sample.

Fig. 4 is a flowchart of a method for constructing a speech classification model according to an embodiment of the present invention, where the method for constructing a speech classification model can be implemented by a background server, a trained speech classification model can be imported into a human-computer interaction terminal, and the human-computer interaction terminal identifies a speech sample corresponding to a user's speech; of course, the construction of the voice classification model can also be realized by a human-computer interaction terminal;

referring to fig. 4, the method may include:

step S100, a training corpus is obtained, wherein the training corpus records voice samples of each voice control instruction, and one voice control instruction corresponds to at least one voice sample.

The training corpus records voice samples of all voice control instructions collected in advance in the embodiment of the invention, and one voice control instruction in the training corpus corresponds to at least one voice sample; through the voice samples of the voice control instructions, a voice classification model can be trained by utilizing a machine learning algorithm.

Alternatively, the voice sample may be a natural language of the user, and the voice control instruction may be a control instruction that is converted from a natural voice and that can be understood by the service robot.

And step S110, extracting text features of each voice sample to obtain a plurality of text features.

The embodiment of the invention can extract the text features of each voice sample, and the text features extracted by one voice sample can be at least one, so that a plurality of text features can be obtained by extracting the text features of each voice sample;

optionally, under the condition that text features extracted from different voice samples may be repeated, the repeated text features may be deduplicated, so that the multiple obtained text features do not have repeated text features;

optionally, the text features of the voice sample may be text features in the form of keywords extracted from converted characters after the voice sample is subjected to character conversion.

And step S120, respectively carrying out feature vector weighting on each text feature to obtain a text feature vector of each text feature.

Optionally, in the embodiment of the present invention, feature vector weighting may be performed on each text feature by using TF-IDF (term frequency-inverse document frequency, a technique for information retrieval), so as to obtain a text feature vector corresponding to each text feature and obtain a plurality of text feature vectors.

It should be noted that TF-IDF is a statistical method for evaluating the importance of words to the files in the corpus; the importance of a word increases in direct proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the number of times it appears in a corpus;

optionally, for a text feature, the embodiment of the present invention may determine the number of occurrences of words of the text feature in a corresponding speech sample (the speech sample corresponding to the text feature may be considered as the speech sample from which the text feature is extracted) and the number of occurrences in a training corpus, so as to determine the importance degree of the text feature in the corresponding speech sample according to the number of occurrences of the words of the text feature in the corresponding speech sample and the training corpus, and determine a text feature vector corresponding to the text feature according to the importance degree; the importance degree is in direct proportion to the occurrence frequency of the words of the text characteristics in the voice sample, and is in inverse proportion to the occurrence frequency of the words of the text characteristics in the corpus;

optionally, when the text feature vector is obtained, if there are n words, the n-dimensional text feature vector may be correspondingly obtained.

And S130, modeling each text feature vector and the attribution probability of the corresponding voice sample according to a machine learning algorithm to obtain a voice classification model.

Optionally, the number of the voice samples corresponding to one text feature vector may be understood as at least one voice sample that is intended to be expressed by the text feature vector; the attribution probability of the text feature vector and the corresponding voice sample can be regarded as the probability that the text feature vector belongs to the corresponding voice sample;

optionally, because text features extracted from different speech samples may be the same, a text feature may correspond to at least one speech sample, and correspondingly, a text feature vector of a text feature may also correspond to at least one speech sample; the text feature vector of a text feature can represent the importance degree of the text feature in the corresponding voice sample, so that the embodiment of the invention can determine the attribution probability of the text feature vector and each corresponding voice sample according to the importance degree represented by the text feature vector;

modeling each text feature vector and the attribution probability of each text feature vector and each corresponding voice sample by utilizing a machine learning algorithm to obtain a voice classification model; alternatively, the speech classification model may represent the probability of attribution of a text feature vector to a corresponding speech sample.

Through the text feature vectors and the attribution probabilities of the text feature vectors and the corresponding voice samples, the probability that each text feature vector belongs to a voice sample with a possible intention can be accurately defined, so that the corresponding relation between the voice sample and the text feature vectors is more accurate; a voice classification model is obtained through training, the voice sample to which the voice classification model belongs can be accurately determined through the text feature vector of the natural language, and accurate recognition of the voice sample corresponding to the natural language conveyed by a user is realized; therefore, the subsequent voice control instruction of the determined voice sample is used as the voice control instruction of the natural language, the voice control instruction matched with the natural language conveyed by the user can be accurately determined, the recognition accuracy of the voice control instruction of the service robot for the natural language intention of the user is improved, and the possibility is provided for improving the intelligence and the naturalness of human-computer interaction.

Optionally, in the embodiment of the present invention, a maximum entropy algorithm may be used to model the attribution probability of each text feature vector and the corresponding voice control instruction, so as to obtain a maximum entropy classification model (a form of the voice classification model) with uniform probability distribution, where the maximum entropy classification model represents the attribution probability of the text feature vector and the corresponding voice sample;

optionally, during the concrete modeling, the embodiment of the present invention may be implemented by using the following formula;

wherein f is_i(x, y) is a feature function of the ith text feature vector, n is the number of the feature functions, the numerical value of n is consistent with that of the text feature vector, and if the ith text feature vector and the corresponding voice sample appear in the same collected natural language, f is considered to be_i(x, y) is 1, otherwise, f is considered to be_i(x, y) is 0, lambda_iIs f_i(x, y) corresponding weight, lambda is Lagrange multiplier, Z (x) is set normalization factor, p^*Is the representing parameter of the maximum entropy classification model.

Optionally, when modeling is performed by using a maximum entropy algorithm, since the modeling process is realized by using known information, the known information is accorded as much as possible, and no assumption is made on the unknown information, various related or unrelated probabilities can be comprehensively observed, and the performance of the method is superior to that of other machine learning algorithms such as bayes and the like when the method is applied to text feature vector classification; the embodiment of the present invention may preferably use a maximum entropy algorithm to establish the speech classification model in the form of the maximum entropy classification model, but this is only a preferred solution, and the embodiment of the present invention does not exclude other machine learning algorithms such as bayes.

After the voice classification model is obtained through training, the voice classification model can be used for processing the control information which is transmitted to the man-machine interaction terminal by the user and contains voice, so that a voice sample matched with the voice transmitted by the user is recognized, and a voice control instruction corresponding to the voice sample is used as a voice control instruction corresponding to the voice transmitted by the user.

Fig. 5 is a flowchart of a human-computer interaction method provided in an embodiment of the present invention, where the method is applicable to a human-computer interaction terminal, and referring to fig. 5, the method may include:

step S200, acquiring control information conveyed by a user, wherein the control information comprises voice information.

Optionally, the human-computer interaction terminal may obtain control information conveyed by a user through a setting detector, where the control information may include voice information conveyed by the user or may include gesture information conveyed by the user; this embodiment discusses a case where the control information includes voice information, which will be described later for a case where gesture information is included;

optionally, the detector may be in the form of a voice detector such as a microphone, a non-contact image detector such as a stereo camera or an infrared imager, or the like; the form of the detector may be set according to the type of the control information, and is not limited fixedly.

And step S210, extracting the text features of the voice information.

Optionally, in the embodiment of the present invention, the voice information may be subjected to text conversion, and corresponding text features are extracted from the converted text, so as to obtain the text features of the voice information.

And S220, determining a text feature vector corresponding to the text feature.

Optionally, in the embodiment of the present invention, a text feature vector corresponding to the text feature may be determined through TF-IDF; optionally, when determining the text feature vector corresponding to the text feature, the embodiment of the present invention may determine the text feature vector corresponding to the text feature by combining the speech information and the training corpus.

And step S230, determining a voice sample matched with the text feature vector according to the pre-trained voice classification model.

Optionally, because the pre-trained speech classification model may represent the attribution probabilities of the text feature vectors and the corresponding speech samples, through the speech classification model, the embodiment of the present invention may determine the speech samples to which the text feature vectors may belong and the attribution probabilities of the speech samples that may belong, so that the speech sample with the highest attribution probability may be selected as the speech sample matched with the text feature vectors.

Step S240, using the voice control command corresponding to the determined voice sample as the voice control command of the voice information.

And step S250, generating a target control instruction according to the voice control instruction.

The target control instruction can be a final control instruction which is generated by the man-machine interaction terminal and aims at the service robot, and on the basis of independently using voice control, the voice control instruction can be directly used as the target control instruction; and under the condition that the user combines the gesture, the gesture control instruction corresponding to the gesture of the user is also used as a parameter of the target control instruction, so that the target control instruction is generated by combining the voice control instruction expressed by the voice of the user and the gesture control instruction expressed by the gesture of the user.

Of course, when a target object (which may be considered as an object operated by the service robot based on user control) in an environment scene where the service robot is located needs to be controlled (which is only an optional control situation), the target object may also be identified in combination with the environment scene, so that the service robot performs a control operation corresponding to a target control instruction for the identified target object.

The man-machine interaction method provided by the embodiment of the invention can be used for extracting the text characteristic of the voice information in the control information conveyed by a user and determining the corresponding text characteristic vector; thus, according to the pre-trained speech classification model, the speech sample matched with the text feature vector can be determined; and then the voice control instruction corresponding to the determined voice sample is used as the voice control instruction of the voice information, and the target control instruction is generated through the voice control instruction, so that the generation of the target control instruction for the machine in the man-machine interaction process is realized.

An example of performing human-computer interaction by using voice information according to an embodiment of the present invention may be shown in fig. 6, where a user speaks a voice for performing service robot control to a human-computer interaction terminal; after acquiring voice transmitted by a user, a human-computer interaction terminal converts the voice into characters, extracts text features of the characters and determines text feature vectors corresponding to the text features, and determines a voice sample matched with the text feature vectors through a maximum entropy classification model so as to determine a voice control instruction corresponding to the voice sample; the man-machine interaction terminal transmits the voice control instruction to the service robot through the Internet, and the service robot executes the voice control instruction to realize the remote control operation of the service robot by a user. Of course, the human-computer interaction terminal in fig. 6 may be built in the service robot.

The embodiment of the invention can also realize human-computer interaction by combining the user gesture, and in the process of realizing human-computer interaction by combining the user gesture, the human-computer interaction terminal needs to understand the gesture control instruction corresponding to the user gesture, so that the gesture control instruction corresponding to the user gesture is more accurately determined, and the recognition of the user gesture needs to be optimized, so that the accuracy of the recognition of the user gesture is improved, and the accuracy of the determination of the gesture control instruction is assisted to be improved.

In the aspect of improving the accuracy of the gesture recognition of the user, the embodiment of the invention can be realized by improving the recognition accuracy of the gesture position and the recognition accuracy of the gesture posture, so that the determined gesture position and the gesture posture are fused, and the improvement of the recognition accuracy of the gesture of the user is realized.

Correspondingly, in the method shown in fig. 5, the control information conveyed by the user may further include gesture information; optionally, fig. 7 shows another flowchart of a human-computer interaction method provided in an embodiment of the present invention, where the method may be implemented by a human-computer interaction terminal, and referring to fig. 7, the method may include:

step S300, acquiring control information conveyed by a user, wherein the control information comprises voice information and gesture information; the gesture information includes: a gesture location feature and a gesture pose feature.

Optionally, the gesture information may be original gesture feature information represented by a plurality of consecutive frames of user gesture images (such as a sequence of user gesture images), and the original gesture feature information may be extracted from the plurality of consecutive frames of user gesture images, and the original gesture feature information may be represented by a gesture position feature and a gesture posture feature; optionally, the gesture position features are features related to the gesture position, such as coordinates, speed, acceleration and the like of the human hand in three XYZ axes, and the gesture posture features are rotation angles of the human hand about each axis of a coordinate system of the three XYZ axes;

optionally, the user gesture image may be acquired by a non-contact image detector such as a stereo camera or an infrared imager; for example, a stereoscopic vision camera or an infrared imaging sensor can detect and identify hands in real time, and related sensor hardware comprises a binocular vision camera, a Kinect somatosensory sensor and a Leap Motion sensor; taking a Leap Motion sensor as an example, a hand is placed in a detection area, the sensor can acquire a three-dimensional gesture image at a high frequency, and return a rectangular coordinate (a form of gesture position characteristics) of the hand to a Leap Motion base coordinate and a rotation angle (a form of gesture posture characteristics) of a palm relative to a three-coordinate system, so as to obtain gesture information represented by the gesture position characteristics and the gesture posture characteristics;

optionally, the gesture image of the embodiment of the present invention may be in a three-dimensional form, so that the interaction intention of the user may be recognized and converted into an interaction instruction by capturing the three-dimensional gesture of the human hand; different from the traditional two-dimensional gesture interaction, the three-dimensional gesture data has the advantages of rich semantic expression, visual mapping and the like.

And S310, extracting text features of the voice information.

And step S320, determining a text feature vector corresponding to the text feature.

And S330, determining a voice sample matched with the text feature vector according to the pre-trained voice classification model.

And step S340, taking the voice control instruction corresponding to the determined voice sample as the voice control instruction of the voice information.

Alternatively, the processing of steps S310 to S340 may refer to steps S210 to S240 shown in fig. 5; in the method shown in fig. 7, there is also parallel processing of the user gesture images, as follows.

S350, processing the gesture position characteristics according to adaptive interval Kalman filtering to obtain target gesture position characteristics; and processing the gesture attitude characteristics according to the improved particle filtering to obtain target gesture attitude characteristics.

Optionally, while filtering with the measurement data, Adaptive Kalman filtering (Adaptive Kalman Filter) may continuously determine whether the system dynamics changes by the filtering itself, and estimate and correct the model parameters and the noise statistical characteristics to improve the filtering design and reduce the actual error of the filtering; the original gesture position characteristics extracted from the user gesture image are processed through the adaptive interval Kalman filtering, so that the influence of noise of a detector and human hand muscle jitter on the user gesture can be filtered, and the accuracy of the processed gesture position characteristics (namely target gesture position characteristics) is improved;

by improving the quaternion component of the original gesture attitude feature extracted from the user gesture image by Particle filter (Improved Particle filter), the quaternion component of the processed gesture attitude feature (namely the target gesture attitude feature) can be more approximate to the real quaternion component, so that the accuracy of the processed gesture attitude feature is Improved.

And S360, fusing the target gesture position characteristics and the target gesture posture characteristics to determine gesture characteristics of the user.

According to the embodiment of the invention, the target gesture position characteristics processed by adopting the adaptive interval Kalman filtering can be fused with the target gesture characteristics processed by adopting the improved particle filtering, so that the gesture characteristics of a user can be determined; according to the embodiment of the invention, through adaptive interval Kalman filtering and improved particle filtering, the space-time correlation of the gesture position and the gesture can be restrained, so that the non-stability and the ambiguity of three-dimensional gesture data are eliminated as much as possible.

And step S370, determining a gesture control instruction corresponding to the gesture feature.

Optionally, a gesture control instruction library may be set in the embodiment of the present invention, and the gesture characteristics (which may be in a three-dimensional form) corresponding to each gesture control instruction are recorded by the gesture control instruction library, so that after the gesture characteristics of the user gesture image are determined, the gesture control instruction corresponding to the gesture characteristics of the user gesture image may be determined by the gesture characteristics corresponding to each gesture control instruction recorded by the gesture control instruction library in the embodiment of the present invention.

And S380, generating a target control instruction according to the voice control instruction and the gesture control instruction.

Alternatively, steps S350 to S370, and steps S310 to S340 may be parallel, and are respectively directed to the processing of the control information in the form of the user gesture image and the control information in the form of the user voice, and there may be no obvious front-back order between steps S350 to S370, and steps S310 to S340.

Alternatively, the target control instruction for the service robot can be generally described in the form of a control vector consisting of four variables (C)_dir,C_opt,C_vol,C_unit) In which C is_dirTo manipulate the orientation key, C_optAnd C_volFor a stack of operation descriptions, respectively an operation key and an operation value, C_unitFor the unit of operation, the four variables can be referred to as voice control variables; under the general condition, the four variables can be defined through a voice control instruction; correspondingly, when a target control instruction is generated according to a voice control instruction, the embodiment of the present invention may determine a voice control variable corresponding to the voice control instruction, where the voice control variable includes: the method comprises the steps that an operation direction keyword, an operation value corresponding to the operation keyword and an operation unit are indicated by a voice control instruction; therefore, the target control instruction is described by the control vector formed by the voice control variable, and the generation of the target control instruction is realized.

In the case of combining voice and gesture control, the embodiment of the invention can add a new variable C_handNamely, the target control instruction can be modified to be described in the form of a control vector consisting of the following five variables:

(C_dir,C_opt,C_hand,C_val,C_unit)；

and in case of no need of gesture control, C can be considered_hand＝NULL。

Correspondingly, when a target control instruction is generated according to the voice control instruction and the gesture control instruction, the embodiment of the invention can determine the voice control variable corresponding to the voice control instruction, wherein the voice control variable comprises: the method comprises the steps that an operation direction keyword, an operation value corresponding to the operation keyword and an operation unit are indicated by a voice control instruction; simultaneously determining a gesture control variable corresponding to the gesture control instruction; thereby forming a control vector (C) describing the target control instruction in combination with the speech control variable and the gesture control variable_dir,C_opt,C_hand,C_val,C_unit) And realizing the generation of the target control instruction.

Optionally, in the embodiment of the present invention, it may be considered that a user gesture image is captured within a detection range of the image detector, and then it is considered that control gesture control needs to be combined; otherwise, consider notIt is necessary to determine (C) by voice information (possibly also in connection with the environmental scenario of the service robot) in connection with gesture control_dir,C_opt,C_vol,C_unit) The constituent control vectors.

Optionally, the following describes a means for processing the gesture position feature according to the adaptive interval kalman filter. It should be noted that, in the process of acquiring a gesture image through a non-contact image detector, a user gesture expressed by the gesture image may have detector noise, so that the determined user gesture tends to have instability, ambiguity and ambiguity; in addition, when the user performs gesture operation, due to the fact that human factors inevitably cause unintended actions such as muscle shaking and the like, the determined gesture of the user has inaccuracy; therefore, the embodiment of the invention can process the original gesture position characteristics in the user gesture image through the adaptive interval Kalman filtering, thereby filtering the influence of the noise of the detector and the muscle jitter of the human hand on the user gesture.

Optionally, the model of the adaptive interval kalman filter may be represented as follows:

wherein,

the state vector is n multiplied by 1 at the moment k, and variables of human hand speed and human hand acceleration are introduced into the state vector in order to enable Kalman filtering to better estimate human hand position data;

is a state transition matrix of n x n, which can be designed according to the relationship among displacement, velocity and acceleration;

is an nxl control output matrix which is determined by the gravity acceleration;

is the input vector of the input vector,

and

which represents the vector of the noise, is,

generally obey a gaussian distribution;

is the m × 1 measurement vector at time k: (

Of elements and

the same, measured at the moment k, such as the position, speed, acceleration and the like of the human hand in the XYZ direction,

is an m × n observation matrix;

it should be noted that, as mentioned above,

where Φ is the state transition matrix, in embodiments of the invention

The element containing position, speed and acceleration, namely phi is the coefficient of the variables such as position, speed and acceleration when the kinematic formula is satisfied; gamma is a constant matrix that inputs the system into a vector

The number of elements of (a) is output to be consistent with the state vector; h is a constant matrix which represents the relation between the measurement vector and the state vector; here, the value of band Delta III is introducedThe individual symbols represent unknown but bounded constant perturbation matrices.

Accordingly, gesture location feature is state x 'at time k'_kCan be expressed as follows:

x′_k＝[p_x,k,V_x,k,A_x,k,p_y,k,V_y,k,A_y,k,p_z,k,V_z,k,A_z,k]

wherein p is_x,k，p_y,k，p_z,kFor the coordinates of the human hand at time k in space in three axes XYZ, V_x,k，V_y,k，V_z,kVelocity of the hand in XYZ direction at time k, A_x,k，A_y,k，A_z,kThe acceleration of the human hand in the XYZ direction is time k. Because the adaptive interval Kalman filtering is an estimator, the position of the current moment can be more accurately estimated by utilizing the gesture coordinate, the gesture speed and the acceleration of the previous moment;

in this process, the noise vector can be expressed as: w'_k＝[0,0,w'_x,0,0,w'_y,0,0,w_z]^TWherein (w'_x,w'_y,w_z) The process noise of the palm acceleration (which may be the noise not conforming to the whole acceleration change rule of the gesture) is obtained, and the noise vector can be filtered in a model of the adaptive interval Kalman filtering; thus state x 'featuring gesture position at time k-1'_k-1(as in a model formula

) Noise vector (as)

) The model of the adaptive interval Kalman filtering is adopted for processing, noise can be filtered, and the target gesture position characteristics of the muscle jitter moment k can be eliminated.

It can be seen that the gesture acceleration change rule can be determined according to the acceleration corresponding to the gesture position feature, so that the noise deviating from the gesture acceleration change rule is filtered through the model of the adaptive interval Kalman filtering, and the gesture coordinate, the gesture speed and the acceleration at the current moment are estimated according to the gesture coordinate, the gesture speed and the acceleration at the previous moment in the gesture position feature after the noise is filtered by utilizing the model of the adaptive interval Kalman filtering, so that the target gesture position feature at the current moment is determined;

and furthermore, the accuracy of the target gesture position characteristics fused by the adaptive interval Kalman filtering is improved, and the target gesture position characteristics can be used for performing coarse control operation on the service robot (as the user cannot accurately move the hand at the millimeter level precision without the help of foreign objects, the coarse control operation is performed on the service robot).

Optionally, for a means for processing a gesture posture feature according to an improved particle filter, fig. 8 is a flowchart illustrating a method for processing a gesture posture feature according to an improved particle filter, where the method may be executed by a human-computer interaction terminal, and referring to fig. 8, the method may include:

and S400, acquiring the rotation angle of the human hand represented by the gesture posture characteristic on each axis of the three-dimensional coordinate system.

And S410, determining quaternion components according to the rotation angles of the human hand on all axes of the three-dimensional coordinate system.

Optionally, the quaternion algorithm can be used to estimate the direction of the rigid body, and may be used to calculate the quaternion component; the quaternion component is a group of hypercomplex numbers and can describe the posture of a rigid body in space, and in the embodiment of the invention, the quaternion component can refer to the posture of a human hand; accordingly, the embodiment of the present invention may determine, through the original gesture posture features extracted from the gesture image, the rotation angle of the human hand in each axis of the three-dimensional coordinate system (the rotation angle may be one of the information included in the original gesture posture features), and further determine the corresponding quaternion component by using a quaternion algorithm.

And step S420, determining the posterior probability of the human hand particles according to the improved particle filtering.

Optionally, in order to reduce errors caused by using a quaternion algorithm, improved particle filtering is used to enhance data fusion (the fusion is the posture data of each particle used for expressing a human hand, and the improved particle filtering algorithm can select a better importance density function or optimize a resampling process so as to obtain accurate human hand posture data); the particle filtering is improved by adopting a Markov chain Monte Carlo method to process the resampled particles, so that the diversification of the particles is improved, the local convergence phenomenon of the standard particle filtering is avoided, and the accuracy of data estimation is improved.

And S430, iteratively processing the quaternion component according to the posterior probability to obtain a target quaternion component so as to obtain target gesture attitude characteristics.

Optionally, the target quaternion component may approximate a quaternion component of the real hand gesture.

Optionally, when the posterior probability of the human hand particle is determined, at t_kAt that moment, an approximation of the posterior probability of a human hand particle can be defined as:

wherein x is_i,kIs t_kThe ith state particle at time, N is the number of samples, ω_i,kIs t_kThe standard weight of the ith particle at time, δ being the dirac function; x is the number of_kMay be a human hand state, and in the embodiment of the present invention, 4 elements of a quaternion may be used to represent the posture of the human hand.

Therefore, the human hand particles (namely the quaternion component of the original human hand posture characteristic) can be calculated in an iterative manner through the posterior probability of the human hand particles, so that the state of the particles is more and more approximate to a true value, and a true three-dimensional gesture posture (namely a target gesture posture characteristic) is obtained;

the specific iteration mode can be as follows:

wherein K_kIs the Kalman gain, z_kIs an observed value, h is an observation operator, v_i,kIs the particle at t_kThe observation error of the ith state at the moment;

the rigid body attitude is represented using quaternions (the quaternion component is calculated to obtain the rigid body attitude), at t_k+1The quaternion component of each particle at a moment can be expressed as follows, so that the gesture feature of the target gesture is obtained;

where ω represents angular velocity and t is the sample time.

According to the embodiment of the invention, the original gesture posture characteristics extracted from the gesture image of the user can be processed by improving the particle filtering, so that the estimation accuracy of the gesture posture characteristics is greatly improved, and the method can be used for performing coarse control operation on the service robot.

It should be noted here that the weight calculation of the particles needs to be performed by combining the position estimation result of the kalman filter, and there is a certain correlation between the position and the attitude of the three-dimensional gesture data in the runaway; namely, the speed and the acceleration of the gesture have directionality, the direction needs to be calculated by a body coordinate system determined by the gesture, the superposition amount of the position of the gesture in the three-dimensional direction needs to be estimated by the gesture, and therefore by combining with the adaptive interval Kalman filtering, the precision of data estimation can be improved through the space-time constraint of the position and the gesture. The accurate position data can better calculate the particle weight so as to obtain accurate attitude data, and the accurate attitude data can better estimate the position data through speed and acceleration, so that the hand position and the attitude data are processed and fused through adaptive interval Kalman filtering and improved particle filtering, the three-dimensional gesture feature of the user can be better estimated, and the accuracy and the robustness of the determined gesture feature of the user are improved.

Optionally, further, after the target gesture position feature and the target gesture posture feature are fused and the gesture feature of the user is determined, the gesture feature which is not intended to be represented by the user can be filtered by a damping method, and the accuracy of gesture recognition is further improved by introducing a virtual spring coefficient; the method can be specifically realized by the following formula:

wherein F is robot control command input, wherein k is a virtual spring coefficient, D is a moving distance of a human hand, and tau is an elastic limit threshold, and when D is larger than tau, the robot does not respond to the three-dimensional gesture input; after the gesture features of the user are determined, if the moving distance of the hand corresponding to the gesture features of the user is greater than the set elasticity limit threshold, the gesture features need to be filtered; considering that the hand position may move violently (different from muscle jitter, which refers to frequent movement of the hand in a large range) in the interaction process, the three-dimensional gesture data at this time is unintended input data, so the data are filtered out, and the stability of the system is maintained;

correspondingly, when the gesture control instruction corresponding to the gesture feature is determined, the gesture control instruction corresponding to the unfiltered gesture feature can be determined.

An alternative application example of an embodiment of the invention may be as follows:

the user speaks a voice for controlling the service robot to move towards the direction to the man-machine interaction terminal and makes a pointing gesture; after acquiring voice transmitted by a user, a human-computer interaction terminal converts the voice into characters, extracts text features of the characters and determines text feature vectors corresponding to the text features, and determines a voice sample matched with the text feature vectors through a maximum entropy classification model so as to determine a voice control command corresponding to the voice sample and related to execution movement;

meanwhile, the man-machine interaction terminal acquires gesture position characteristics and gesture posture characteristics of a user gesture image, processes the gesture position characteristics according to adaptive interval Kalman filtering, processes the gesture posture characteristics according to improved particle filtering, fuses the processed gesture position characteristics and the gesture posture characteristics to determine the gesture characteristics of the user, and can determine a gesture control instruction related to the moving direction based on the gesture characteristics;

the man-machine interaction terminal can control the service robot to move in the direction indicated by the user according to the determined voice control command and the gesture control command; namely, the operation instruction which can be obtained by the man-machine interaction terminal through the natural language of the user is 'moving', and the moving direction is the direction of the finger of the user.

In the human-computer interaction process, the user can combine voice and gestures, so that the communication between the user and the service robot can be similar to the communication between the users, the human-computer interaction is very convenient and direct, the naturalness and the intelligence of the human-computer interaction are improved, the communication threshold of the human-computer interaction of the user is reduced, and powerful support is provided for the popularization of the human-computer interaction.

Optionally, in some human-computer interaction scenarios, the service robot often needs to operate a target object in the environment scenario according to user control, and if the user instructs the service robot to "pick up a cup on the ground", the service robot needs to identify the target object of "cup" in the environment scenario, and does not need the user to tell the robot which is the "cup", where the "cup" is located, and the like, so that the service robot can autonomously identify the cup in the environment scenario and perform the "picking up" operation; it is clear that the service robot has a certain autonomy due to the cognition on the environment, and a user is very simple in the control process, so that the target object in the environment scene is accurately identified, and the naturalness and the intelligence of man-machine interaction are favorably improved.

Optionally, fig. 9 shows a flow of a target object identification method provided in an embodiment of the present invention, where the method is applicable to a human-computer interaction terminal, and referring to fig. 9, the method may include:

and S500, acquiring an environment scene image.

Optionally, in the embodiment of the present invention, the environmental scene image may be acquired through an image acquisition device such as a camera preset on the service robot; the environment scene image may be considered as an image of an environment scene in which the service robot is located;

optionally, if the human-computer interaction terminal interacts with the service robot through the internet, the human-computer interaction terminal may obtain an environment scene image acquired by the service robot through the internet; if the service robot is internally provided with the human-computer interaction terminal, the human-computer interaction terminal can acquire the image acquisition device of the service robot and the acquired environment scene image.

And step S510, determining HOG characteristics of the environment scene image.

Optionally, in the embodiment of the present invention, an image feature in the environmental scene image may be described using a Histogram of Oriented Gradients (HOG) feature; obviously, the HOG feature is only one alternative embodiment of the image feature, and other image features may be adopted in the embodiment of the present invention.

The HOG is mainly used for calculating the statistic value of the local image gradient direction information. An advantage of HOG over other feature descriptors is that its algorithmic operation is performed at the local cell level of the image, making it well geometrically and optically invariant.

Describing image characteristics in the environment scene image by using the HOG characteristics, namely dividing the environment scene image into a certain number of sub-images, and dividing each sub-image into cell units according to a certain rule; then, for each sub-image, the gradient direction histogram (namely HOG characteristic) of each pixel point in the cell unit can be collected, and the density of each gradient direction histogram in the sub-image is calculated, so that normalization processing is carried out on each cell unit in the sub-image according to the density; and finally, combining the normalization results of the sub-images to determine the HOG characteristics of the environment scene image.

And step S520, extracting target keywords in the voice information conveyed by the user.

Optionally, the target keyword may be a keyword of a target object to be recognized in an environmental scene, and is carried in the voice information of the user; alternatively, the target keyword is generally in the form of a noun (a noun of a target object to be manipulated in the environmental scene, etc.), and follows or is associated with the action word in the voice message.

Step S530, according to the pre-trained target classification model, HOG features corresponding to the target keywords are matched from the HOG features of the environmental scene image.

And step 540, determining the object corresponding to the HOG characteristics matched in the environment scene image as the identified target object.

Alternatively, the target object may be regarded as an object operated by the service robot based on the user control, and may be an object for which the target control instruction is executed.

In the process, the target classification model can represent the HOG characteristics corresponding to each object, and the training and learning of the target classification model are important for the accuracy and efficiency of target identification; in this case, the embodiment of the present invention may adopt a deep learning method to train a target classification model; the deep learning is to develop learning from unlabeled data, which is more close to the learning mode of human brain, and the concept can be mastered by self after training; in the face of mass data, the deep learning algorithm can achieve the purpose that the traditional artificial intelligence algorithm cannot achieve, and the output result is more accurate along with the increase of data processing capacity. This will greatly improve the efficiency of computer processing information; according to different established network structures, the deep learning training method has great difference; in order to enable the robot to complete online learning in a short time and train to obtain a target classification model, the embodiment of the invention adopts a two-stage method to learn;

optionally, for any object, the embodiment of the present invention may determine a candidate set by using a reduced feature set (referred to as a first feature set) including an image feature of the object, and then arrange features in the candidate set (an arrangement manner may be implemented from large to small according to a feature value of an HOG feature, and the like, and a specific arrangement rule may not be strictly limited) by using a larger and more reliable feature set (referred to as a second feature set) including an image feature of the object, that is, the image feature of the object included in the second feature set is greater than the image feature of the object included in the first feature set, so as to select a feature with a set ordinal arranged in the candidate set as a training feature of the object, so as to obtain a training feature of the object; when any object is subjected to the processing, the training characteristics of each object can be obtained, and then the target classification model is obtained according to the training characteristics of each object.

Optionally, in the human-computer interaction, the robot may recognize an unknown object by using experience knowledge of a user, or correct a recognition error, which requires establishing a training model with tagged data, and may update the learning network parameters of the robot. Under the cooperation of users, on one hand, the robot can better know the characteristics (Features) of unknown objects through the description of the users; on the other hand, the robot can correctly recognize the object (group-route) through the shared experience of the user;

in the learning process, parameters which enable the recognition accuracy of the system to be optimal are calculated; here, data for correcting the robot parameters input in the user assistance process is taken as the learning network parameter feature value (Features) and the tag data (Ground-route) of the robot, and the learning network parameters of the robot are updated based on the feature value and the tag data.

According to the man-machine interaction method provided by the embodiment of the invention, a user can transmit control information to the man-machine interaction terminal through voice or a voice and gesture combined form, the man-machine interaction mode of the user can be similar to the communication between the users, and the man-machine interaction is very convenient, fast and direct; meanwhile, the man-machine interaction terminal can be combined with the environment scene of the service robot to identify the target object, and the user does not need to further explain the operated target object in the transmitted control information, so that the man-machine interaction process of the user is very simple; therefore, the human-computer interaction method provided by the embodiment of the invention improves the naturalness and intelligence of human-computer interaction, reduces the communication threshold of human-computer interaction of users, and provides powerful support for popularization of human-computer interaction.

In the following, the man-machine interaction device provided by the embodiment of the present invention is introduced, and the man-machine interaction device described below may be regarded as a man-machine interaction terminal, which is a program module required to implement the man-machine interaction method provided by the embodiment of the present invention. The contents of the human-computer interaction device described below and the contents of the human-computer interaction method described above may be referred to in correspondence with each other.

Fig. 10 is a block diagram of a human-computer interaction device according to an embodiment of the present invention, where the human-computer interaction device is applicable to a human-computer interaction terminal, and referring to fig. 10, the method may include:

a control information obtaining module 100, configured to obtain control information conveyed by a user, where the control information includes voice information;

a text feature extraction module 200, configured to extract a text feature of the voice information;

a text feature vector determining module 300, configured to determine a text feature vector corresponding to the text feature;

a voice sample determination module 400, configured to determine, according to a pre-trained voice classification model, a voice sample matched with the text feature vector; the voice classification model represents the attribution probability of a text feature vector and a corresponding voice sample;

a voice instruction determining module 500, configured to use a voice control instruction corresponding to the determined voice sample as a voice control instruction of the voice information;

and a target instruction generating module 600, configured to generate a target control instruction according to the voice control instruction.

Optionally, the voice sample determining module 400 is configured to determine, according to a pre-trained voice classification model, a voice sample matched with the text feature vector, and specifically includes:

determining voice samples to which the text feature vectors are possibly attributed and attribution probability of each voice sample to which the text feature vectors are possibly attributed according to the voice classification model;

and selecting the voice sample with the highest attribution probability as the voice sample matched with the text feature vector.

Optionally, fig. 11 shows another structural block diagram of the human-computer interaction device provided in the embodiment of the present invention, and as shown in fig. 10 and fig. 11, the device may further include:

the speech classification model training module 700 is configured to obtain a training corpus, where the training corpus records speech samples of each speech control instruction, and each speech control instruction corresponds to at least one speech sample; extracting text features of each voice sample to obtain a plurality of text features; respectively weighting the feature vectors of the text features to obtain the text feature vectors of the text features; and modeling the attribution probability of each text feature vector and the corresponding voice sample according to a machine learning algorithm to obtain a voice classification model.

Optionally, the speech classification model training module 700 is configured to perform feature vector weighting on each text feature respectively to obtain a text feature vector of each text feature, and specifically includes:

for a text feature, determining the occurrence frequency of words of the text feature in a corresponding voice sample and the occurrence frequency of the words in a training corpus;

determining the importance degree of the text feature in the corresponding voice sample according to the occurrence frequency of the words of the text feature in the corresponding voice sample and the training corpus; the importance degree is in direct proportion to the occurrence frequency of the words of the text characteristics in the voice sample, and is in inverse proportion to the occurrence frequency of the words of the text characteristics in the corpus;

and determining a text feature vector corresponding to the text feature according to the importance degree.

Optionally, the speech classification model training module 700 is configured to model, according to a machine learning algorithm, the attribution probability of each text feature vector and the corresponding speech sample to obtain a speech classification model, and specifically includes:

and modeling the attribution probability of each text feature vector and the corresponding voice control instruction by using a maximum entropy algorithm to obtain a maximum entropy classification model with uniform probability distribution.

Optionally, the embodiment of the present invention may further perform human-computer interaction in combination with the user gesture, and correspondingly, the control information may further include gesture information; the gesture information may include: extracting gesture position features and gesture posture features from the user gesture image;

optionally, fig. 12 shows another structural block diagram of the human-computer interaction device provided in the embodiment of the present invention, and in combination with fig. 10 and 12, the device may further include:

the adaptive interval kalman filtering processing module 800 is configured to process the gesture position feature according to adaptive interval kalman filtering to obtain a target gesture position feature;

the improved particle filter processing module 900 is configured to process the gesture posture feature according to an improved particle filter to obtain a target gesture posture feature;

a gesture feature determination module 1000, configured to fuse the target gesture position feature and the target gesture posture feature, and determine a gesture feature of the user;

a gesture control instruction determining module 1100, configured to determine a gesture control instruction corresponding to the gesture feature;

correspondingly, the target instruction generating module 600 is configured to generate a target control instruction according to the voice control instruction, and specifically includes:

and generating a target control instruction according to the voice control instruction and the gesture control instruction.

Optionally, the adaptive interval kalman filtering processing module 800 is configured to process the gesture position feature according to the adaptive interval kalman filtering to obtain a target gesture position feature, and specifically includes:

determining a gesture acceleration change rule according to the acceleration corresponding to the gesture position characteristics;

filtering noise deviating from the gesture acceleration change rule according to a model of the adaptive interval Kalman filtering;

and estimating the gesture coordinate, the gesture speed and the acceleration of the current moment according to the gesture coordinate, the gesture speed and the acceleration of the previous moment in the gesture position characteristics after the noise is filtered by utilizing the model of the adaptive interval Kalman filtering, and determining the target gesture position characteristics of the current moment.

Optionally, the improved particle filter processing module 900 is configured to process the gesture posture feature according to the improved particle filter to obtain a target gesture posture feature, and specifically includes:

acquiring the rotation angle of the human hand represented by the gesture posture characteristics on each axis of a three-dimensional coordinate system;

determining quaternion components according to the rotation angles of the human hand on all axes of a three-dimensional coordinate system;

determining the posterior probability of the human hand particles according to the improved particle filtering;

and iteratively processing the quaternion component according to the posterior probability to obtain a target quaternion component so as to obtain the gesture characteristic of the target gesture.

Optionally, the target instruction generating module 600 is configured to generate a target control instruction according to the voice control instruction and the gesture control instruction, and specifically includes:

determining a voice control variable corresponding to the voice control instruction, wherein the voice control variable comprises: the operation direction key word, the operation value corresponding to the operation key word and the operation unit which are indicated by the voice control instruction; determining a gesture control variable corresponding to the gesture control instruction;

and combining the voice control variable and the gesture control variable to form a control vector for describing a target control instruction.

Optionally, fig. 13 shows another structural block diagram of the human-computer interaction device provided in the embodiment of the present invention, and in combination with fig. 12 and 13, the device may further include:

the gesture feature filtering module 1200 is configured to filter a gesture feature of the user if a moving distance of a human hand corresponding to the gesture feature is greater than a set elasticity limit threshold;

accordingly, the gesture control instruction determination module 1100 may be configured to determine a gesture control instruction corresponding to the unfiltered gesture feature.

Optionally, fig. 14 shows yet another structural block diagram of the human-computer interaction device provided in the embodiment of the present invention, and as shown in fig. 10 and fig. 14, the device may further include:

a target object recognition module 1300 configured to obtain an image of an environmental scene; determining image features of the environmental scene image; extracting target keywords in voice information conveyed by a user; matching image features corresponding to the target keywords from the image features of the environmental scene images according to a pre-trained target classification model; the target classification model represents image characteristics corresponding to each object; determining an object corresponding to the matched image characteristics in the environment scene image as an identified target object; the target object is an object for which the target control instruction is executed.

Optionally, the training of the target classification model is implemented by a target classification model training module shown in fig. 15, fig. 15 shows another structural block diagram of the human-computer interaction device provided in the embodiment of the present invention, and with reference to fig. 14 and fig. 15, the device may further include:

the target classification model training module 1400 is configured to, for any object, determine a candidate set by using a first feature set that includes image features of the object, arrange features in the candidate set by using a second feature set that includes image features of the object, and select features of a set ordinal arranged in the candidate set as training features of the object to obtain training features of each object; wherein, the second feature set comprises more image features of the object than the first feature set; and training according to the training characteristics of each object to obtain a target classification model.

Optionally, the human-computer interaction device provided in the embodiment of the present invention may further be configured to:

taking data which are input by a user and used for correcting the parameters of the robot as the characteristic values and label data of the learning network parameters of the robot; and updating the learning network parameters of the robot according to the characteristic values and the label data.

Alternatively, the module architecture of the human-computer interaction device described above may be loaded in the human-computer interaction terminal in the form of a program. The structure of the human-computer interaction terminal can be shown in fig. 3, and includes: at least one memory and at least one processor;

wherein the memory stores a program, and the processor calls the program; the program is for:

extracting text features of the voice information;

determining a text feature vector corresponding to the text feature;

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A human-computer interaction method, comprising:

extracting text features of the voice information;

determining a text feature vector corresponding to the text feature;

generating a target control instruction according to the voice control instruction;

acquiring an environment scene image where a service robot is located;

dividing the environment scene image into a certain number of sub-images, and dividing each sub-image into cell units according to a certain rule; for each subimage, the gradient direction histogram of each pixel point in the cell unit can be collected, the density of each gradient direction histogram in the subimage is calculated, and normalization processing is carried out on each cell unit in the subimage according to the density; combining the normalization results of the sub-images to determine the HOG characteristics of the environment scene image;

extracting target keywords of a target object to be recognized in an environment scene in voice information conveyed by a user;

according to a pre-trained target classification model, matching HOG features corresponding to target keywords from HOG features of an environmental scene image;

and determining an object corresponding to the matched HOG feature in the environment scene image as the identified target object, wherein the target object is the object for which the target control instruction is executed.

2. The human-computer interaction method of claim 1, wherein the determining the text feature vector matched speech samples according to a pre-trained speech classification model comprises:

3. The human-computer interaction method according to claim 1 or 2, further comprising:

acquiring a training corpus, wherein the training corpus records voice samples of all voice control instructions, and one voice control instruction corresponds to at least one voice sample;

extracting text features of each voice sample to obtain a plurality of text features;

respectively weighting the feature vectors of the text features to obtain the text feature vectors of the text features;

and modeling the attribution probability of each text feature vector and the corresponding voice sample according to a machine learning algorithm to obtain a voice classification model.

4. The human-computer interaction method according to claim 3, wherein the weighting the feature vectors of the text features to obtain the text feature vectors of the text features comprises:

5. The human-computer interaction method of claim 3, wherein the modeling the attribution probability of each text feature vector and the corresponding voice sample according to a machine learning algorithm to obtain the voice classification model comprises:

6. The human-computer interaction method according to claim 1, wherein the control information further comprises gesture information; the gesture information includes: extracting gesture position features and gesture posture features from the user gesture image;

the method further comprises the following steps:

processing the gesture position characteristics according to adaptive interval Kalman filtering to obtain target gesture position characteristics; processing the gesture attitude characteristics according to the improved particle filtering to obtain target gesture attitude characteristics;

fusing the target gesture position characteristic and the target gesture attitude characteristic to determine the gesture characteristic of the user;

determining a gesture control instruction corresponding to the gesture feature;

the generating a target control instruction according to the voice control instruction comprises:

7. The human-computer interaction method according to claim 6, wherein the processing the gesture position features according to adaptive interval Kalman filtering to obtain target gesture position features comprises:

8. The human-computer interaction method of claim 6, wherein the processing the gesture posture feature according to the improved particle filtering to obtain a target gesture posture feature comprises:

9. The human-computer interaction method according to claim 6, wherein the generating a target control instruction according to the voice control instruction and the gesture control instruction comprises:

10. The human-computer interaction method of claim 1, further comprising:

for any object, determining a candidate set through a first feature set containing the image features of the object, arranging the features in the candidate set through a second feature set containing the image features of the object, and selecting the features with set sequence positions arranged in the candidate set as the training features of the object to obtain the training features of each object; wherein, the second feature set comprises more image features of the object than the first feature set;

and training according to the training characteristics of each object to obtain a target classification model.

11. The human-computer interaction method of claim 1, further comprising:

taking data which are input by a user and used for correcting the parameters of the robot as the characteristic values and label data of the learning network parameters of the robot;

and updating the learning network parameters of the robot according to the characteristic values and the label data.

12. A human-computer interaction device, comprising:

the target instruction generating module is used for generating a target control instruction according to the voice control instruction and acquiring an environment scene image where the service robot is located;

13. The human-computer interaction device of claim 12, wherein the control information further comprises gesture information; the gesture information includes: extracting gesture position features and gesture posture features from the user gesture image;

the device further comprises:

the adaptive interval Kalman filtering processing module is used for processing the gesture position characteristics according to adaptive interval Kalman filtering to obtain target gesture position characteristics;

the improved particle filtering processing module is used for processing the gesture attitude characteristics according to improved particle filtering to obtain target gesture attitude characteristics;

the gesture feature determination module is used for fusing the target gesture position feature and the target gesture posture feature and determining the gesture feature of the user;

the gesture control instruction determining module is used for determining a gesture control instruction corresponding to the gesture feature;

the target instruction generating module is configured to generate a target control instruction according to the voice control instruction, and specifically includes:

14. A human-computer interaction terminal, comprising: at least one memory and at least one processor;

extracting text features of the voice information;

determining a text feature vector corresponding to the text feature;

generating a target control instruction according to the voice control instruction,

acquiring an environment scene image where a service robot is located;

15. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, carry out a method of human-computer interaction as claimed in any one of claims 1 to 11.