CN106055244B - Man-machine interaction method based on Kinect and voice - Google Patents

Man-machine interaction method based on Kinect and voice

Info

Publication number
CN106055244B
Authority
CN
China
Prior art keywords
coordinate system
voice
points
kinect
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610306998.7A
Other languages
Chinese (zh)
Other versions
CN106055244A (en)
Inventor
闵华松
齐诗萌
李潇
林云汉
吴凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Science and Engineering WUSE
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN201610306998.7A priority Critical patent/CN106055244B/en
Publication of CN106055244A publication Critical patent/CN106055244A/en
Application granted granted Critical
Publication of CN106055244B publication Critical patent/CN106055244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser

Abstract

The invention discloses a human-computer interaction method based on Kinect and voice, comprising the following steps: 1) using a Kinect sensor to obtain accurate spatial position and attitude information, in the Kinect coordinate system K, of each object in the scene, completing target detection and recognition; 2) fusing the depth image and the RGB image separately acquired by the Kinect to obtain three-dimensional point cloud data; 3) recognizing the spatial point cloud objects by processing the three-dimensional point cloud data to obtain a semantic description file; 4) performing a coordinate transformation on the object coordinate system O to obtain a three-dimensional scene semantic map description file in the coordinate system R; 5) receiving the user's voice input and processing the input signal to obtain text information; 6) inputting the text information and the XML semantic map into an intelligent inference engine, which generates execution instructions and outputs text information containing the response and guidance information for the user.

Description

Man-machine interaction method based on Kinect and voice
Technical Field
The invention relates to the technical field of robots, in particular to a man-machine interaction method based on Kinect and voice.
Background
A conventional human-computer interaction system adopts a WIMP interface, i.e. a graphical user interface built from windows, menus, icons and pointing devices, with information entered through keys, knobs or other touch devices. Such a system can only offer the limited options preset by its designer, cannot interact with environmental information, and requires large amounts of information to be entered manually by operators, so in both service and manufacturing settings it must be operated by skilled workers. Optimizing the structure or improving the guidance offered to the user can reduce the difficulty of use, but it cannot truly save labor cost by reducing the number of staff required.
Relevant patents found in a literature search include the following. The invention patent "A man-machine interaction method, device and robot", application number CN201511016826.8, published on March 23, 2016, provides an interaction method based on voice and image information; the system can determine the identity of a user from the user's voice and can interpret the user's input from the user's actions. The invention patent "Catering service system", application number CN201510658482.4, published on March 23, 2016, provides a human-computer interaction method that obtains user instructions through a voice processing unit and obtains the user's position through a microphone array.
However, the above patents only address how to obtain user information through multimedia technology and cannot obtain scene information, so they must be used in a specific scene; once the scene changes significantly, the interaction system either fails to respond or executes incorrectly.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a man-machine interaction method based on Kinect and voice, addressing the above defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: a man-machine interaction method based on Kinect and voice comprises the following steps:
1) processing the three-dimensional point cloud data to obtain positions in the coordinate system K; the coordinate system K is established with the geometric center of the Kinect as the origin, the direction perpendicular to the lens plane and pointing outward as the positive Z axis, and the line connecting the centers of the three Kinect lenses as the X axis;
2) fusing the depth image and the RGB respectively acquired by the Kinect to obtain three-dimensional point cloud data;
3) identifying the space point cloud object: processing the three-dimensional point cloud data to obtain a semantic description file;
4) carrying out coordinate transformation on the object coordinate system O to obtain a three-dimensional scene semantic map description file under the coordinate system R; the object coordinate system O takes the geometric center of the point cloud as its origin, the direction of the longest line segment through the origin within the object as the Z axis, and the plane through the origin perpendicular to the Z axis as the XY plane; the coordinate system R takes the ground as the XY plane, the projection of the geometric center of the mechanical arm base onto that plane as the origin, and the upward direction perpendicular to the ground through the origin as the positive Z axis, with its Y axis parallel to the Y axis of the coordinate system K;
5) receiving voice input of a user, and processing an input signal to obtain text information;
6) inputting the text information and the XML semantic map into an intelligent inference engine, wherein the inference engine generates execution instructions and outputs text information of the response and guidance information for the user.
According to the scheme, the spatial point cloud object recognition process in step 3) comprises preprocessing, key point extraction and descriptor extraction, followed by feature matching against an object feature database, finally yielding the semantic description file.
According to the scheme, in the step 3):
3.1) preprocessing, which filters out point cloud data that is too far from or too close to the sensor;
3.2) detecting the characteristic points of the point cloud data by adopting an ISS algorithm, wherein the specific process is as follows:
3.2.1) for each point p_i in the input point cloud data, query all points p_j within radius r_frame and calculate the weights according to Equation 1:
W_ij = 1 / ||p_i − p_j||,   ||p_i − p_j|| < r_frame   (1)
3.2.2) calculate the weighted covariance matrix according to Equation 2:
cov(p_i) = Σ_j W_ij (p_j − p_i)(p_j − p_i)^T / Σ_j W_ij   (2)
3.2.3) compute the eigenvalues λ_i^1, λ_i^2, λ_i^3 of the covariance matrix and arrange them in descending order;
3.2.4) set ratio thresholds γ_21 and γ_32, and retain as key feature points the points satisfying λ_i^2 / λ_i^1 < γ_21 and λ_i^3 / λ_i^2 < γ_32;
3.3) calculating the feature descriptors of the key feature points by the following specific method:
constructing a unique, unambiguous and stable local reference frame (LRF) by computing the covariance matrix of the points lying on the local surface in the neighborhood of the keypoint, and rotating the local surface, with the keypoint as origin, until the LRF is aligned with the Ox, Oy and Oz axes of the object coordinate system O, so that the points have rotation invariance;
then the following steps are performed for each of the axes Ox, Oy and Oz in turn, taking that axis as the current axis:
3.3.1) the local surface is rotated around the current axis by a specified angle;
3.3.2) the rotated local surface points are projected onto the XY, XZ and YZ planes;
3.3.3) establishing a projection distribution matrix that records only the number of points falling in each sub-domain; the number of sub-domains determines the dimension of the matrix and, like the specified rotation angle, is a parameter of the algorithm;
3.3.4) calculating the central moments of the distribution matrix, i.e. μ_11, μ_21, μ_12, μ_22, and e (the Shannon entropy);
3.3.5) cascading the calculated values to form sub-features;
the above steps are performed cyclically, the number of iterations depending on the number of specified rotations; finally, the sub-features of the different coordinate axes are concatenated to form the final RoPS descriptor;
3.4) feature matching, the specific method being as follows:
this patent uses a threshold-based feature matching method: if the distance between two descriptors is smaller than a set threshold, the two features are considered a consistent match.
The distance used against the threshold characterizes the difference between two object clusters (a cluster is made up of the set of descriptors of one object): it is the Manhattan distance between the geometric centers of the two sets plus the Manhattan distance between their per-dimension standard deviations, as shown in Equations 3 to 5:
D(A, B) = L1(C_A, C_B) + L1(std_A, std_B)   (3)
where D(A, B) is the distance between the two object clusters A and B, C_A(i) and C_B(i) are the centers of clusters A and B in dimension i, L1 denotes the Manhattan distance, and std_A(i) and std_B(i) are the standard deviations of clusters A and B in dimension i,
std_A(i) = sqrt( (1/|A|) Σ_{j=1..|A|} ( a_j(i) − C_A(i) )² )   (4)
with std_B calculated in the same way; the L1 distance between two descriptors a and b is
L1(a, b) = Σ_{i=1..n} | a(i) − b(i) |   (5)
where n is the size of the feature descriptor (135 dimensions for RoPS), a_j(i) is the value of dimension i of the RoPS descriptor of the j-th keypoint in cluster A, |A| is the number of keypoints in cluster A, and |B| is the number of keypoints in cluster B.
According to the scheme, in step 4), the mechanical arm is placed at a suitable position and the coordinate system R is established, the origin of the coordinate system K having coordinates (d, l, h) in the coordinate system R; an object coordinate system O is established using the PCA method, the attitude of the object is obtained through two coordinate system transformations, from coordinate system O to coordinate system K and then to coordinate system R, and a coordinate transformation under the coordinate system K yields the pose information under the coordinate system R; the pose information corresponding to the semantic description file under the coordinate system R is solved, and the XML semantic map is produced.
According to the scheme, the voice recognition process in the step 5) specifically comprises the following steps:
5.1) preprocessing: collecting the user's voice information through a microphone array, processing the raw input speech signal, filtering out unimportant information and background noise, and performing endpoint detection, framing and pre-emphasis on the speech signal (a minimal sketch of this front-end processing is given after step 5.5);
5.2) feature extraction: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
5.3) carrying out acoustic model modeling by adopting a Hidden Markov Model (HMM), and matching the voice to be recognized with the acoustic model in the recognition process so as to obtain a recognition result;
5.4) performing grammatical and semantic analysis on the training text database and training an N-Gram language model based on a statistical model, thereby improving the recognition rate and narrowing the search space;
5.5) for the input speech signal, building a recognition network from the trained HMM acoustic model, the language model and the dictionary, and searching the network for the optimal path according to a search algorithm, i.e. the word string that outputs the speech signal with maximum probability, thereby determining the text contained in the speech sample.
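To make steps 5.1) and 5.2) concrete, the following is a minimal NumPy sketch of the front-end processing only (pre-emphasis, Hamming-window framing and a crude energy-based endpoint check); the sampling rate, frame sizes and energy threshold are illustrative assumptions rather than values taken from the patent, and the HMM decoding of steps 5.3)-5.5) is not shown.

```python
import numpy as np

def preprocess_speech(signal, fs=16000, pre_emph=0.97,
                      frame_ms=25, hop_ms=10, energy_thresh=1e-4):
    """Pre-emphasis, framing with a Hamming window, and a crude energy-based
    endpoint check, loosely following steps 5.1)-5.2)."""
    # Pre-emphasis: boost high frequencies to balance the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    # Simple endpoint detection: keep frames whose short-time energy is above a threshold.
    energy = (frames ** 2).mean(axis=1)
    voiced = frames[energy > energy_thresh]
    return voiced   # these frames would feed feature extraction / the HMM acoustic model

# Example on a synthetic one-second signal: silence, a 440 Hz tone, silence.
sig = np.concatenate([np.zeros(4000),
                      np.sin(2 * np.pi * 440 * np.arange(8000) / 16000),
                      np.zeros(4000)])
print(preprocess_speech(sig).shape)
```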
The invention has the following beneficial effects: by recognizing object positions it overcomes the overly narrow restriction on product placement in traditional automated equipment; and the combination of voice with object position information can also be applied in the service industry.
drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 shows the Kinect sensor model and a schematic of the K coordinate system;
FIG. 2 is a schematic illustration of a K coordinate system and ground contrast;
FIG. 3 is an overall flow chart of object recognition;
FIG. 4 is a feature description sub-flowchart;
FIG. 5 is a diagram illustrating the relationship between a K coordinate system and an R coordinate system;
FIG. 6 is an overall flow chart of object pose determination;
FIG. 7 is an overall flow chart of voice interaction;
fig. 8 is a block diagram of the whole system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, a man-machine interaction method based on Kinect and voice includes the following two parts:
a first part of scene interaction, comprising the steps of:
firstly, correctly placing a Kinect and establishing a K coordinate system;
the Kinect is placed right opposite to the object, the detection range of the Kinect is 1.8-3.6 meters, the horizontal visual field is 53 degrees, the vertical visual field is 47 degrees, and the object is arranged corresponding to the object to guarantee that the Kinect can correctly acquire data within the range. Then, a coordinate system K with the center of Kinect as the origin is established as shown in FIG. 1, and Kinect is related to the ground as shown in FIG. 2, wherein the z-axis forms an angle θ with the horizontal plane.
Step two, the Kinect sensor completes target detection and identification;
the Kinect separately acquires a depth image and an RGB image, and three-dimensional point cloud data is obtained after fusing them;
First, point cloud data that is too far from or too close to the sensor is filtered out in preprocessing, which effectively reduces the computational cost, speeds up processing and improves the real-time performance of the system.
After preprocessing, the ISS algorithm is used for feature point detection, the detected feature points are then described with the S/C-RoPS algorithm, and finally feature matching against an object feature database yields the semantic description file identifying the object.
The flow of point cloud data acquisition is shown in fig. 3.
The three steps of extracting key points, calculating feature descriptors and 3D feature matching are described in detail below.
The specific process of extracting the key points is as follows:
(1) For each point p_i in the input point cloud data, query all points p_j within radius r_frame and calculate the weights according to Equation 1:
W_ij = 1 / ||p_i − p_j||,   ||p_i − p_j|| < r_frame   (1)
(2) Calculate the weighted covariance matrix according to Equation 2:
cov(p_i) = Σ_j W_ij (p_j − p_i)(p_j − p_i)^T / Σ_j W_ij   (2)
(3) Compute the eigenvalues λ_i^1, λ_i^2, λ_i^3 of the covariance matrix and arrange them in descending order.
(4) Set ratio thresholds γ_21 and γ_32, and retain as key feature points the points satisfying λ_i^2 / λ_i^1 < γ_21 and λ_i^3 / λ_i^2 < γ_32 (a numerical sketch of this selection follows).
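The following is a minimal, brute-force NumPy sketch of the ISS selection in steps (1)-(4); the radius and ratio thresholds are illustrative assumptions, and a real implementation (e.g. the one in PCL) would use a spatial index rather than an O(N²) neighbour search.

```python
import numpy as np

def iss_keypoints(points, r_frame=0.05, gamma21=0.8, gamma32=0.8):
    """Minimal ISS keypoint selection following steps (1)-(4).
    'points' is an (N, 3) array; neighbours are found by brute force for clarity."""
    keypoints = []
    for i, p_i in enumerate(points):
        d = np.linalg.norm(points - p_i, axis=1)
        mask = (d < r_frame) & (d > 0)                  # neighbours p_j inside the radius
        if not np.any(mask):
            continue
        w = 1.0 / d[mask]                               # Equation (1): W_ij = 1 / ||p_i - p_j||
        diff = points[mask] - p_i
        # Equation (2): weighted covariance of the neighbourhood.
        cov = (w[:, None, None] * np.einsum('nj,nk->njk', diff, diff)).sum(0) / w.sum()
        eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # lambda1 >= lambda2 >= lambda3
        l1, l2, l3 = eigvals
        if l1 > 0 and l2 > 0 and l2 / l1 < gamma21 and l3 / l2 < gamma32:
            keypoints.append(i)                         # step (4): ratio test
    return np.array(keypoints)

# Example on a random cloud; the returned array holds the indices of the keypoints.
cloud = np.random.default_rng(0).uniform(0, 0.2, size=(500, 3))
print(iss_keypoints(cloud, r_frame=0.05).shape)
```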
The calculation method of the feature descriptor is as follows:
A unique, unambiguous and stable local reference frame (LRF) is first constructed by computing the covariance matrix of the points lying on the local surface in the neighborhood of the keypoint; the local surface is then rotated, with the keypoint as origin, until the LRF is aligned with the Ox, Oy and Oz axes, which makes the points rotation invariant. The following steps are then performed for each of the axes Ox, Oy and Oz in turn, taking that axis as the current axis:
1) rotating the local surface around the current axis by a specified angle;
2) projecting the rotated local surface point onto XY, XZ and YZ planes;
3) establishing a projection distribution matrix that records only the number of points falling in each sub-domain; the number of sub-domains determines the dimension of the matrix and, like the specified rotation angle, is a parameter of the algorithm;
4) calculating the central moments of the distribution matrix, i.e. μ_11, μ_21, μ_12, μ_22, and e (the Shannon entropy);
5) concatenating the computed values to form a sub-feature.
These steps are repeated, the number of iterations depending on the number of specified rotations. Finally, the sub-features of the different coordinate axes are concatenated to form the final RoPS descriptor; a sketch of these projection statistics is given below.
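The sketch below illustrates the projection statistics described above for a single current axis (here the z-axis): rotate the local surface, project onto the XY, XZ and YZ planes, histogram the projections into a distribution matrix, and collect the central moments μ_11, μ_21, μ_12, μ_22 and the entropy e. The number of rotations and bins are illustrative assumptions, and this is a simplification for one axis only, not the full 135-dimensional S/C-RoPS descriptor of the patent.

```python
import numpy as np

def rops_subfeatures(local_points, n_rotations=3, n_bins=5):
    """Projection statistics for one coordinate axis (the z-axis): rotate,
    project onto XY/XZ/YZ, build a distribution matrix, collect moments + entropy."""
    feats = []
    for k in range(n_rotations):
        theta = 2 * np.pi * k / n_rotations                    # the specified rotation angle
        c, s = np.cos(theta), np.sin(theta)
        rot_z = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
        pts = local_points @ rot_z.T
        for ax in [(0, 1), (0, 2), (1, 2)]:                    # XY, XZ, YZ projections
            hist, _, _ = np.histogram2d(pts[:, ax[0]], pts[:, ax[1]], bins=n_bins)
            dist = hist / hist.sum()                            # projection distribution matrix
            idx = np.arange(n_bins) + 1
            mi = (dist.sum(1) * idx).sum()                      # marginal mean along rows
            mj = (dist.sum(0) * idx).sum()                      # marginal mean along columns
            ci, cj = np.meshgrid(idx - mi, idx - mj, indexing='ij')
            mu = lambda p, q: (ci ** p * cj ** q * dist).sum()  # central moment mu_pq
            entropy = -(dist[dist > 0] * np.log(dist[dist > 0])).sum()
            feats.extend([mu(1, 1), mu(2, 1), mu(1, 2), mu(2, 2), entropy])
    return np.array(feats)

# 3 rotations x 3 projections x 5 statistics = 45 values for this single axis.
print(rops_subfeatures(np.random.default_rng(1).normal(size=(200, 3))).shape)
```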
Shape or color information of the local surface is added to RoPS, extending and enriching the encoded information and producing the S/C-RoPS descriptor; a block diagram of the algorithm is shown in FIG. 4, and the accuracy of feature matching is thereby improved.
The method adopts a confidence-based decision-level fusion algorithm to fuse the information from the S-RoPS descriptor and the C-RoPS descriptor. The idea is to run object recognition independently with the S-RoPS descriptor and with the C-RoPS descriptor, obtaining the highest confidence under each single-modality method; the fusion strategy then compares the confidences of all candidate models produced by the two independent methods and selects the candidate model with the highest confidence.
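A decision-level fusion of this kind reduces to comparing the best candidates returned by the two single-modality recognisers; the sketch below assumes each recogniser returns (model, confidence) pairs, which is an illustrative interface rather than the patent's.

```python
def fuse_srops_crops(srops_candidates, crops_candidates):
    """Decision-level fusion sketch: each argument is a list of
    (model_name, confidence) results produced independently by the S-RoPS and
    C-RoPS recognisers; the candidate with the highest confidence wins."""
    return max(srops_candidates + crops_candidates, key=lambda mc: mc[1])

# Hypothetical candidate lists for illustration only.
print(fuse_srops_crops([("mug", 0.71), ("bowl", 0.40)],
                       [("mug", 0.83), ("can", 0.35)]))
```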
The feature matching method is as follows:
This patent uses threshold-based feature matching: if the distance between two descriptors is smaller than a set threshold, the two features are considered a consistent match.
The distance used against the threshold characterizes the difference between two object clusters (a cluster is made up of the set of descriptors of one object): it is the Manhattan distance between the geometric centers of the two sets plus the Manhattan distance between their per-dimension standard deviations, as shown in Equations 3 to 5:
D(A, B) = L1(C_A, C_B) + L1(std_A, std_B)   (3)
std_A(i) = sqrt( (1/|A|) Σ_{j=1..|A|} ( a_j(i) − C_A(i) )² )   (4)
std_B is calculated in the same way as std_A, and n denotes the size of the feature descriptor; the L1 distance between two descriptors a and b is
L1(a, b) = Σ_{i=1..n} | a(i) − b(i) |   (5)
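The following sketch computes the cluster distance of Equations 3-5 with NumPy; the descriptor arrays, the 135-dimensional size and the threshold value are illustrative assumptions.

```python
import numpy as np

def cluster_distance(A, B):
    """Equations (3)-(5): A and B are (|A|, n) and (|B|, n) arrays of descriptors
    for the keypoints of two objects; the distance is the Manhattan (L1) distance
    between the cluster centres plus the L1 distance between the per-dimension
    standard deviations."""
    C_A, C_B = A.mean(axis=0), B.mean(axis=0)           # geometric centres
    std_A, std_B = A.std(axis=0), B.std(axis=0)         # per-dimension standard deviations, Eq. (4)
    l1 = lambda a, b: np.abs(a - b).sum()               # Eq. (5)
    return l1(C_A, C_B) + l1(std_A, std_B)              # Eq. (3)

rng = np.random.default_rng(2)
A = rng.normal(0.0, 1.0, size=(40, 135))                # 135-dimensional RoPS descriptors
B = rng.normal(0.1, 1.0, size=(35, 135))
threshold = 30.0                                         # hypothetical matching threshold
print(cluster_distance(A, B) < threshold)
```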
And thirdly, the mechanical arm is placed at a suitable position and the coordinate system R is established; the pose is solved in the K coordinate system, and the position and attitude information under K is converted into position and attitude information under the coordinate system R through coordinate transformation and coordinate system transformation (the object coordinate system O is a temporary variable introduced only to solve the pose; apart from its origin it has no practical meaning, and the transformation used is from K to R rather than from O to R), and the XML semantic map is produced.
The mechanical arm is placed at a suitable position and the coordinate system R is established as shown in FIG. 5; the origin of the coordinate system K has coordinates (d, l, h) in the coordinate system R. An object coordinate system O is established using the PCA method, and the corresponding pose information in the R coordinate system is obtained through two coordinate system transformations and one coordinate transformation under the K coordinate system.
1) Calculate the geometric center of the object point cloud,
P̄ = (1/N) Σ_{i=1..N} P_i,
where i indexes the points and N is their number; de-center all points, P_i' = P_i − P̄, and arrange the coordinates of all de-centered points into a 3 × N matrix A = [P_1', P_2', ..., P_N'].
2) Let M = A·A^T and compute the eigenvalues and eigenvectors of M: λ_i·V_i = M·V_i, i = 1, 2, 3, normalizing the eigenvectors so that ||V_i|| = 1. The long-axis direction of the object corresponds to the eigenvector of the largest eigenvalue; with λ_1 ≤ λ_2 ≤ λ_3, the rotation matrix of the object coordinate system relative to the coordinate system K is formed from the ordered eigenvectors as its columns (the eigenvector of the largest eigenvalue giving the Z axis), and the translation vector is the geometric center P̄ of the object point cloud. The pose of the object coordinate system under the coordinate system K is the homogeneous transformation composed of this rotation matrix and translation vector.
Let camC = {P_i} denote the measured point cloud; transforming camC by the inverse of this pose expresses the point cloud in the model-library object coordinate system. The short-axis and secondary-long-axis planes are determined from the long axis and the center point, and the directions of the short axis and the secondary long axis are determined from the extreme-value distribution of the points in those planes (a numerical sketch of the PCA pose computation is given below).
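The sketch below follows steps 1)-2) with NumPy: geometric centre, de-centred 3×N matrix, eigen-decomposition of M = A·A^T, and a homogeneous pose in coordinate system K with the long axis mapped to Z. The column ordering and the sign fix for a right-handed frame are assumptions where the patent's exact convention is not recoverable.

```python
import numpy as np

def pca_object_pose(points):
    """PCA-based object pose sketch: centre the (N, 3) point cloud, build the
    3xN matrix A, decompose M = A A^T, and assemble a homogeneous pose in K."""
    centre = points.mean(axis=0)                         # geometric centre of the point cloud
    A = (points - centre).T                              # 3 x N de-centred coordinates
    M = A @ A.T
    eigvals, eigvecs = np.linalg.eigh(M)                 # ascending: lambda1 <= lambda2 <= lambda3
    R = eigvecs                                          # columns V1, V2, V3; V3 (long axis) -> Z axis
    if np.linalg.det(R) < 0:                             # keep a right-handed frame (assumed convention)
        R[:, 0] *= -1
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, centre                      # rotation + translation = pose of O in K
    return T

# Example on an elongated synthetic cloud; the long axis ends up as the Z column.
rng = np.random.default_rng(3)
elongated = rng.normal(size=(300, 3)) * np.array([0.01, 0.02, 0.2]) + np.array([0.1, 0.0, 2.0])
print(np.round(pca_object_pose(elongated), 3))
```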
In the matching stage, in order to obtain a transformation matrix from an actual object to an object in a model library, a three-point method is adopted to calculate the pose of six degrees of freedom, and two corresponding three-dimensional points are integrated into a retaining pocketmodP},{objP, if the rigid body transformation relation is satisfied
Figure GDA0002480671040000128
Wherein
Figure GDA0002480671040000129
Figure GDA00024806710400001210
For the rotation matrix and translation vector of the two-point set, the least square method is used to solve the optimal solution to obtain the solution that minimizes E in the formula 8
Figure GDA00024806710400001211
And
Figure GDA00024806710400001212
Figure GDA0002480671040000131
The transformation matrix from the actual object to the model-library object is then the homogeneous matrix composed of the optimal rotation matrix and translation vector, and the pose matrix of the actual object with respect to the sensor coordinate system is obtained by combining this transformation with the object pose computed above.
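One standard way to obtain the R and t that minimise E in Equation 8 is the SVD (Kabsch) construction sketched below; the patent does not state which least-squares solver it uses, so this is illustrative only.

```python
import numpy as np

def fit_rigid_transform(obj_pts, mod_pts):
    """Least-squares estimate of R, t minimising
    E = sum_i || mod_p_i - (R * obj_p_i + t) ||^2  (Eq. 8),
    solved here with the standard SVD (Kabsch) construction."""
    mu_obj, mu_mod = obj_pts.mean(0), mod_pts.mean(0)
    H = (obj_pts - mu_obj).T @ (mod_pts - mu_mod)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_mod - R @ mu_obj
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t                           # homogeneous transform from R and t
    return T

# Self-check: recover a known rotation about Z and a translation from three points.
rng = np.random.default_rng(4)
obj = rng.normal(size=(3, 3))                            # three corresponding points
ang = 0.4
R_true = np.array([[np.cos(ang), -np.sin(ang), 0],
                   [np.sin(ang),  np.cos(ang), 0],
                   [0, 0, 1]])
mod = obj @ R_true.T + np.array([0.5, -0.2, 1.0])
print(np.allclose(fit_rigid_transform(obj, mod)[:3, :3], R_true))   # True
```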
The rotation matrix can be converted into yaw α, pitch β and roll γ to describe the attitude, as in Equation 11, and the translation vector can be converted into center coordinates to describe the position, where r_ij denotes the element in row i, column j of the rotation matrix.
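A common Z-Y-X (yaw-pitch-roll) extraction consistent with the description of Equation 11 is sketched below; the exact angle convention used in the patent is not recoverable from the text, so treat this as one plausible reading.

```python
import numpy as np

def rotation_to_ypr(R):
    """Yaw-pitch-roll from a rotation matrix under the common Z-Y-X convention;
    r_ij corresponds to R[i-1, j-1]."""
    yaw = np.arctan2(R[1, 0], R[0, 0])                         # alpha from r21, r11
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))   # beta from r31, r32, r33
    roll = np.arctan2(R[2, 1], R[2, 2])                        # gamma from r32, r33
    return yaw, pitch, roll

# Round-trip check with a known set of angles.
a, b, g = 0.3, -0.2, 0.1
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
Rx = np.array([[1, 0, 0], [0, np.cos(g), -np.sin(g)], [0, np.sin(g), np.cos(g)]])
print(np.round(rotation_to_ypr(Rz @ Ry @ Rx), 3))              # [ 0.3 -0.2  0.1]
```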
The relationship between the coordinate system R and the coordinate system K is shown in FIG. 5, and the transformation between the two is given by Equation 12, which maps the coordinates {x_k, y_k, z_k} of an object in the coordinate system K to its coordinates {x, y, z} in the coordinate system R, where θ denotes the tilt angle of the Kinect with respect to the horizontal plane.
The attitude matrix of the object with respect to the coordinate system R is then obtained by applying the rotation part of this K-to-R transformation to the attitude matrix of the object in the coordinate system K.
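The sketch below applies a K-to-R mapping of the kind described by Equation 12, assuming the Kinect is tilted by θ about its horizontal X axis and that its origin sits at (d, l, h) in the coordinate system R; the exact rotation in the patent's Equation 12 depends on axis conventions shown only in FIG. 5, so the matrix here is an assumption.

```python
import numpy as np

def kinect_to_robot(p_k, theta_deg, d, l, h):
    """Map a point from the K coordinate system to the R coordinate system,
    assuming a tilt of theta about the (horizontal) X axis and a K origin at
    (d, l, h) in R; illustrative only, not the patent's exact Equation 12."""
    th = np.radians(theta_deg)
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(th), -np.sin(th)],
                      [0, np.sin(th),  np.cos(th)]])
    T = np.eye(4)
    T[:3, :3] = rot_x
    T[:3, 3] = [d, l, h]
    return (T @ np.append(p_k, 1.0))[:3]      # homogeneous K -> R mapping

print(np.round(kinect_to_robot(np.array([0.1, 0.0, 2.0]),
                               theta_deg=-20, d=0.4, l=0.0, h=1.2), 3))
```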
The second part is voice man-machine interaction, which comprises the following steps:
Step one: the user issues a voice command, which is converted into text information after processing.
After the user's voice is received, the text information is obtained through preprocessing and speech decoding; the specific flow is shown in FIG. 7.
Step two: the text information and the XML semantic map are input into the intelligent inference engine, which generates an execution instruction and outputs text information.
The user constructs a semantic map file of the current scene through the voice-controlled real-time three-dimensional map generation module; the speech recognition node and the speech synthesis node realize the man-machine dialogue by sending and receiving text, respectively; and the intelligent inference engine node analyzes and gives feedback in combination with the map file, refines the solution the user expects through further dialogue, and finally generates the solution and sends it to the scheme parsing and motion planning module. The PocketSphinx open-source speech recognition system is used for speech recognition, and the Ekho open-source speech synthesis system is used for speech synthesis.
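As a toy illustration of this dialogue loop, the sketch below matches recognised text against object entries in a semantic map and returns an instruction plus a reply for speech synthesis; the XML schema, object names and reply format are hypothetical, since the patent does not publish the actual map file format or inference rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML semantic-map schema for illustration only.
SEMANTIC_MAP = """
<scene frame="R">
  <object name="cup"    x="0.42" y="0.10" z="0.76" yaw="0.0" pitch="0.0" roll="0.0"/>
  <object name="bottle" x="0.15" y="-0.22" z="0.80" yaw="1.2" pitch="0.0" roll="0.0"/>
</scene>
"""

def infer(text, xml_map):
    """Toy stand-in for the inference engine: match recognised text against
    object names in the semantic map and emit an instruction plus a reply."""
    scene = ET.fromstring(xml_map)
    for obj in scene.findall("object"):
        if obj.get("name") in text.lower():
            pos = tuple(float(obj.get(k)) for k in ("x", "y", "z"))
            return {"instruction": ("grasp", obj.get("name"), pos),
                    "reply": f"Moving to the {obj.get('name')} at {pos} in frame R."}
    return {"instruction": None, "reply": "I could not find that object; please rephrase."}

print(infer("please pick up the cup", SEMANTIC_MAP))
```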
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (2)

1. A man-machine interaction method based on Kinect and voice is characterized by comprising the following steps:
1) acquiring accurate spatial position and attitude information of each object in the scene in the coordinate system K by using a Kinect sensor, to complete target detection and recognition; the coordinate system K is established with the geometric center of the Kinect as the origin, the direction perpendicular to the lens plane and pointing outward as the positive Z axis, and the line connecting the centers of the three Kinect lenses as the X axis;
2) fusing the depth image and the RGB respectively acquired by the Kinect to obtain three-dimensional point cloud data;
3) identifying the space point cloud object: processing the three-dimensional point cloud data to obtain a semantic description file; the step 3) is as follows:
3.1) preprocessing, wherein the preprocessing step is used for filtering point cloud data too far away or too close to a sensor;
3.2) detecting the characteristic points of the point cloud data by adopting an ISS algorithm, wherein the specific process is as follows:
3.2.1) for each point p_i in the input point cloud data, querying all points p_j within radius r_frame and calculating the weights according to Equation 1:
W_ij = 1 / ||p_i − p_j||,   ||p_i − p_j|| < r_frame   (1)
3.2.2) calculating the weighted covariance matrix according to Equation 2:
cov(p_i) = Σ_j W_ij (p_j − p_i)(p_j − p_i)^T / Σ_j W_ij   (2)
3.2.3) computing the eigenvalues λ_i^1, λ_i^2, λ_i^3 of the covariance matrix and arranging them in descending order;
3.2.4) setting ratio thresholds γ_21 and γ_32, and retaining as key feature points the points satisfying λ_i^2 / λ_i^1 < γ_21 and λ_i^3 / λ_i^2 < γ_32;
3.3) calculating the feature descriptors of the key feature points by the following specific method:
firstly, a unique, unambiguous and stable local reference frame (LRF) is constructed by calculating the covariance matrix of the points on the local surface in the neighborhood of the key point, and the local surface is rotated, with the key point as origin, until the LRF is aligned with the Ox, Oy and Oz axes of the object coordinate system O, so that the points have rotation invariance;
then the following steps are performed for each of the axes Ox, Oy and Oz in turn, taking that axis as the current axis:
3.3.1) the local surface is rotated around the current axis by a specified angle;
3.3.2) the rotated local surface points are projected onto the XY, XZ and YZ planes;
3.3.3) establishing a projection distribution matrix that records only the number of points contained in each sub-domain, wherein the number of sub-domains determines the dimension of the matrix and, like the specified rotation angle, is a parameter of the feature descriptor calculation;
3.3.4) calculating the central moments of the distribution matrix, i.e. μ_11, μ_21, μ_12, μ_22, and e (the Shannon entropy);
3.3.5) cascading the calculated values to form sub-features;
the above steps are performed cyclically, the number of iterations depending on the number of specified rotations; finally, the sub-features of the different coordinate axes are concatenated to form the final RoPS descriptor;
3.4) feature matching, the specific method being as follows:
using a threshold-based feature matching method, wherein two features are considered a consistent match if the distance between the two descriptors is smaller than a set threshold;
the distance used against the threshold characterizes the difference between two object clusters, namely the Manhattan distance between the geometric centers of the two sets plus the Manhattan distance between their per-dimension standard deviations, calculated according to Equations (3) and (5):
D(A, B) = L1(C_A, C_B) + L1(std_A, std_B)   (3)
wherein D(A, B) represents the distance between the two object clusters A and B; for each dimension i, C_A(i) and C_B(i) are the centers of clusters A and B in that dimension, L1 denotes the Manhattan distance formula, std_A(i) represents the standard deviation of cluster A in dimension i, and std_B(i) represents the standard deviation of cluster B in dimension i,
std_A(i) = sqrt( (1/|A|) Σ_{j=1..|A|} ( a_j(i) − C_A(i) )² )   (4)
with std_B calculated analogously; n represents the size of the feature descriptor; the L1 distance between two descriptors a and b is
L1(a, b) = Σ_{i=1..n} | a(i) − b(i) |   (5)
a_j(i) represents the value of dimension i of the RoPS descriptor of the j-th keypoint in cluster A;
b_j(i) represents the value of dimension i of the RoPS descriptor of the j-th keypoint in cluster B;
| A | represents the number of key points in the cluster A;
| B | represents the number of key points in the cluster B;
4) carrying out coordinate transformation on the object coordinate system O to obtain a three-dimensional scene semantic map description file under a coordinate system R;
the step 4) is as follows:
selecting a proper position to place the mechanical arm and establishing the coordinate system R, wherein the origin of the coordinate system K has coordinates (d, l, h) in the coordinate system R; establishing an object coordinate system O by using a PCA method, and obtaining the attitude of the object through two coordinate system transformations, from the coordinate system O to the coordinate system K and then to the coordinate system R;
5) receiving voice input of a user, and processing an input signal to obtain text information;
6) inputting the text information and the XML semantic map into an intelligent inference engine, wherein the inference engine generates execution instructions and outputs text information of the response and guidance information for the user.
2. The Kinect and voice-based human-computer interaction method as claimed in claim 1, wherein the step 5) voice recognition process specifically comprises the steps of:
5.1) preprocessing: collecting the user's voice information through a microphone array, processing the raw input speech signal, filtering out unimportant information and background noise, and performing endpoint detection, framing and pre-emphasis on the speech signal;
5.2) feature extraction: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
5.3) performing acoustic model modeling by adopting a Hidden Markov Model (HMM), and matching the voice to be recognized with the acoustic model in the recognition process so as to obtain a recognition result;
5.4) carrying out grammar and semantic analysis on the training text database, and training based on a statistical model to obtain an N-Gram language model, thereby improving the recognition rate and reducing the search range;
5.5) aiming at the input voice signal, establishing a recognition network according to the trained HMM acoustic model, language model and dictionary, and searching an optimal path in the network according to a search algorithm, wherein the path is a word string capable of outputting the voice signal with the maximum probability, thereby determining the characters contained in the voice sample.
CN201610306998.7A 2016-05-10 2016-05-10 Man-machine interaction method based on Kinect and voice Active CN106055244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610306998.7A CN106055244B (en) 2016-05-10 2016-05-10 Man-machine interaction method based on Kinect and voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610306998.7A CN106055244B (en) 2016-05-10 2016-05-10 Man-machine interaction method based on Kinect and voice

Publications (2)

Publication Number Publication Date
CN106055244A CN106055244A (en) 2016-10-26
CN106055244B true CN106055244B (en) 2020-08-04

Family

ID=57176838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610306998.7A Active CN106055244B (en) 2016-05-10 2016-05-10 Man-machine interaction method based on Kinect and voice

Country Status (1)

Country Link
CN (1) CN106055244B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108873707A (en) * 2017-05-10 2018-11-23 杭州欧维客信息科技股份有限公司 Speech-sound intelligent control system
CN109839622B (en) * 2017-11-29 2022-08-12 武汉科技大学 Multi-target tracking method for parallel computing particle probability hypothesis density filtering
CN111666797B (en) * 2019-03-08 2023-08-08 深圳市速腾聚创科技有限公司 Vehicle positioning method, device and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103472916A (en) * 2013-09-06 2013-12-25 东华大学 Man-machine interaction method based on human body gesture recognition
CN104571485A (en) * 2013-10-28 2015-04-29 中国科学院声学研究所 System and method for human and machine voice interaction based on Java Map

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103472916A (en) * 2013-09-06 2013-12-25 东华大学 Man-machine interaction method based on human body gesture recognition
CN104571485A (en) * 2013-10-28 2015-04-29 中国科学院声学研究所 System and method for human and machine voice interaction based on Java Map

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Intrinsic shape signatures: A shape descriptor for 3D object recognition; Yu Zhong; 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops); 2009-11-30; pp. 1-8 *
RoPs (Rotational Projection Statistics) feature; Point Cloud Library; http://pointclouds.org/documentation/tutorials/ropses feature.php; 2010-11-12; pp. 1-5 *
A real-time 3D semantic map generation method; Wu Fan et al.; Computer Engineering and Applications; 2015-11-09; Vol. 53, No. 6; pp. 67-72 *
Research on natural-language-based parser technology for a sorting robot; Xiong Zhiheng et al.; Computer Engineering and Applications; 2015-12-21; Vol. 58, No. 3; pp. 113-119 *
Fundamentals of speech recognition and an introduction to CMU Sphinx; kylinfish; https://www.cnblogs.com/kylinfish/articles/3627188.html; 2014-03-26; pp. 1-8 *

Also Published As

Publication number Publication date
CN106055244A (en) 2016-10-26

Similar Documents

Publication Publication Date Title
CN106682598B (en) Multi-pose face feature point detection method based on cascade regression
Matuszek et al. Learning from unscripted deictic gesture and language for human-robot interactions
Gao et al. Sign language recognition based on HMM/ANN/DP
CN110097553A (en) The semanteme for building figure and three-dimensional semantic segmentation based on instant positioning builds drawing system
Schauerte et al. Multimodal saliency-based attention for object-based scene analysis
CN111432989A (en) Artificially enhanced cloud-based robot intelligence framework and related methods
CN106095109B (en) The method for carrying out robot on-line teaching based on gesture and voice
CN110554774A (en) AR-oriented navigation type interactive normal form system
JP2007538318A (en) Sign-based human-machine interaction
CN105931218A (en) Intelligent sorting method of modular mechanical arm
CN106055244B (en) Man-machine interaction method based on Kinect and voice
CN113361636B (en) Image classification method, system, medium and electronic device
CN108320051B (en) Mobile robot dynamic collision avoidance planning method based on GRU network model
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN110781920A (en) Method for identifying semantic information of cloud components of indoor scenic spots
Geng et al. Combining features for chinese sign language recognition with kinect
CN113012122A (en) Category-level 6D pose and size estimation method and device
CN112101243A (en) Human body action recognition method based on key posture and DTW
Fransen et al. Using vision, acoustics, and natural language for disambiguation
CN109255815B (en) A kind of object detection and recognition methods based on order spherical harmonic
WO2021103558A1 (en) Rgb-d data fusion-based robot vision guiding method and apparatus
Ryumin et al. Automatic detection and recognition of 3D manual gestures for human-machine interaction
Canal et al. Gesture based human multi-robot interaction
CN111695408A (en) Intelligent gesture information recognition system and method and information data processing terminal
CN116249607A (en) Method and device for robotically gripping three-dimensional objects

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant