CN106055244B - Man-machine interaction method based on Kinect and voice - Google Patents

Man-machine interaction method based on Kinect and voice

Info

Publication number
CN106055244B
Authority
CN
China
Prior art keywords
coordinate system
voice
points
kinect
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610306998.7A
Other languages
Chinese (zh)
Other versions
CN106055244A (en)
Inventor
闵华松
齐诗萌
李潇
林云汉
吴凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Science and Engineering WUSE
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN201610306998.7A priority Critical patent/CN106055244B/en
Publication of CN106055244A publication Critical patent/CN106055244A/en
Application granted granted Critical
Publication of CN106055244B publication Critical patent/CN106055244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser

Abstract

The invention discloses a human-computer interaction method based on Kinect and voice, comprising the following steps: 1) using a Kinect sensor to obtain accurate spatial position and attitude information, in the Kinect coordinate system K, of each object in the scene, completing target detection and recognition; 2) fusing the depth image and the RGB image separately acquired by the Kinect to obtain three-dimensional point cloud data; 3) recognizing the spatial point cloud objects by processing the three-dimensional point cloud data to obtain a semantic description file; 4) performing a coordinate transformation on the object coordinate system O to obtain a three-dimensional scene semantic map description file in the coordinate system R; 5) receiving the user's voice input and processing the input signal to obtain text information; 6) inputting the text information and the XML semantic map into an intelligent inference engine, which generates execution instructions and outputs text information containing the response and guidance information for the user.

Description

Man-machine interaction method based on Kinect and voice
Technical Field
The invention relates to the technical field of robots, in particular to a man-machine interaction method based on Kinect and voice.
Background
A conventional human-computer interaction system adopts a WIMP interface, i.e. a graphical user interface built from windows, menus, icons and pointing devices, with information entered through keys, knobs or other touch devices. Such a system can only offer the limited options preset by its designer, cannot interact with environmental information, and requires large amounts of information to be entered manually by operators, so in both service and manufacturing settings it must be operated by skilled workers. Optimizing the structure or improving the guidance offered to the user can reduce the difficulty of use, but it cannot truly save labor cost by reducing the number of staff required.
Relevant patents found in a literature search include the following. The invention patent "A man-machine interaction method, device and robot", application number CN201511016826.8, published on March 23, 2016, provides an interaction method based on voice and image information; the system can determine the identity of a user from the user's voice and can interpret the user's input from the user's actions. The invention patent "Catering service system", application number CN201510658482.4, published on March 23, 2016, provides a human-computer interaction method that obtains user instructions through a voice processing unit and obtains the user's position through a microphone array.
However, the above patents only address how to obtain user information through multimedia technology and cannot obtain scene information, so they must be used in a specific scene; once the scene changes significantly, the interaction system either fails to respond or executes incorrectly.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a man-machine interaction method based on Kinect and voice, addressing the above defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: a man-machine interaction method based on Kinect and voice comprises the following steps:
1) processing the three-dimensional point cloud data to obtain positions in the coordinate system K; the coordinate system K is established with the geometric center of the Kinect as the origin, the direction perpendicular to the lens plane and pointing outward as the positive Z axis, and the line connecting the centers of the three Kinect lenses as the X axis;
2) fusing the depth image and the RGB respectively acquired by the Kinect to obtain three-dimensional point cloud data;
3) identifying the space point cloud object: processing the three-dimensional point cloud data to obtain a semantic description file;
4) carrying out coordinate transformation on the object coordinate system O to obtain a three-dimensional scene semantic map description file under the coordinate system R; the object coordinate system O takes the geometric center of the point cloud as its origin, the direction of the longest line segment through the origin within the object as the Z axis, and the plane through the origin perpendicular to the Z axis as the XY plane; the coordinate system R takes the ground as the XY plane, the projection of the geometric center of the mechanical arm base onto that plane as the origin, and the upward direction perpendicular to the ground through the origin as the positive Z axis, with its Y axis parallel to the Y axis of the coordinate system K;
5) receiving voice input of a user, and processing an input signal to obtain text information;
6) inputting the text information and the XML semantic map into an intelligent inference engine, wherein the inference engine generates execution instructions and outputs text information of the response and guidance information for the user.
According to the scheme, the spatial point cloud object recognition process in step 3) comprises preprocessing, key point extraction and descriptor extraction, followed by feature matching against an object feature database, finally yielding the semantic description file.
According to the scheme, in the step 3):
3.1) preprocessing, which filters out point cloud data that is too far from or too close to the sensor;
3.2) detecting the characteristic points of the point cloud data by adopting an ISS algorithm, wherein the specific process is as follows:
3.2.1) for each point p_i in the input point cloud data, query all points p_j within radius r_frame and calculate the weights according to Equation 1:
W_ij = 1 / ||p_i − p_j||,   ||p_i − p_j|| < r_frame   (1)
3.2.2) calculate the weighted covariance matrix according to Equation 2:
cov(p_i) = Σ_j W_ij (p_j − p_i)(p_j − p_i)^T / Σ_j W_ij   (2)
3.2.3) compute the eigenvalues λ_i^1, λ_i^2, λ_i^3 of the covariance matrix and arrange them in descending order;
3.2.4) set ratio thresholds γ_21 and γ_32, and retain as key feature points the points satisfying λ_i^2 / λ_i^1 < γ_21 and λ_i^3 / λ_i^2 < γ_32;
3.3) calculating the feature descriptors of the key feature points by the following specific method:
constructing a unique, unambiguous and stable local reference frame (LRF) by computing the covariance matrix of the points lying on the local surface in the neighborhood of the keypoint, and rotating the local surface, with the keypoint as origin, until the LRF is aligned with the Ox, Oy and Oz axes of the object coordinate system O, so that the points have rotation invariance;
then the following steps are performed for each of the axes Ox, Oy and Oz in turn, taking that axis as the current axis:
3.3.1) the local surface is rotated around the current axis by a specified angle;
3.3.2) the rotated local surface points are projected onto the XY, XZ and YZ planes;
3.3.3) establishing a projection distribution matrix that records only the number of points falling in each sub-domain; the number of sub-domains determines the dimension of the matrix and, like the specified rotation angle, is a parameter of the algorithm;
3.3.4) calculating the central moments of the distribution matrix, i.e. μ_11, μ_21, μ_12, μ_22, and e (the Shannon entropy);
3.3.5) cascading the calculated values to form sub-features;
the above steps are performed cyclically, the number of iterations depending on the number of specified rotations; finally, the sub-features of the different coordinate axes are concatenated to form the final RoPS descriptor;
3.4) feature matching, the specific method being as follows:
this patent uses a threshold-based feature matching method: if the distance between two descriptors is smaller than a set threshold, the two features are considered a consistent match.
The distance used against the threshold characterizes the difference between two object clusters (a cluster is made up of the set of descriptors of one object): it is the Manhattan distance between the geometric centers of the two sets plus the Manhattan distance between their per-dimension standard deviations, as shown in Equations 3 to 5:
D(A, B) = L1(C_A, C_B) + L1(std_A, std_B)   (3)
where D(A, B) is the distance between the two object clusters A and B, C_A(i) and C_B(i) are the centers of clusters A and B in dimension i, L1 denotes the Manhattan distance, and std_A(i) and std_B(i) are the standard deviations of clusters A and B in dimension i,
std_A(i) = sqrt( (1/|A|) Σ_{j=1..|A|} ( a_j(i) − C_A(i) )² )   (4)
with std_B calculated in the same way; the L1 distance between two descriptors a and b is
L1(a, b) = Σ_{i=1..n} | a(i) − b(i) |   (5)
where n is the size of the feature descriptor (135 dimensions for RoPS), a_j(i) is the value of dimension i of the RoPS descriptor of the j-th keypoint in cluster A, |A| is the number of keypoints in cluster A, and |B| is the number of keypoints in cluster B.
According to the scheme, in step 4), the mechanical arm is placed at a suitable position and the coordinate system R is established, the origin of the coordinate system K having coordinates (d, l, h) in the coordinate system R; an object coordinate system O is established using the PCA method, the attitude of the object is obtained through two coordinate system transformations, from coordinate system O to coordinate system K and then to coordinate system R, and a coordinate transformation under the coordinate system K yields the pose information under the coordinate system R; the pose information corresponding to the semantic description file under the coordinate system R is solved, and the XML semantic map is produced.
According to the scheme, the voice recognition process in the step 5) specifically comprises the following steps:
5.1) preprocessing: collecting the user's voice information through a microphone array, processing the raw input speech signal, filtering out unimportant information and background noise, and performing endpoint detection, framing and pre-emphasis on the speech signal (a minimal sketch of this front-end processing is given after step 5.5);
5.2) feature extraction: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
5.3) carrying out acoustic model modeling by adopting a Hidden Markov Model (HMM), and matching the voice to be recognized with the acoustic model in the recognition process so as to obtain a recognition result;
5.4) performing grammatical and semantic analysis on the training text database and training an N-Gram language model based on a statistical model, thereby improving the recognition rate and narrowing the search space;
5.5) for the input speech signal, building a recognition network from the trained HMM acoustic model, the language model and the dictionary, and searching the network for the optimal path according to a search algorithm, i.e. the word string that outputs the speech signal with maximum probability, thereby determining the text contained in the speech sample.
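To make steps 5.1) and 5.2) concrete, the following is a minimal NumPy sketch of the front-end processing only (pre-emphasis, Hamming-window framing and a crude energy-based endpoint check); the sampling rate, frame sizes and energy threshold are illustrative assumptions rather than values taken from the patent, and the HMM decoding of steps 5.3)-5.5) is not shown.

```python
import numpy as np

def preprocess_speech(signal, fs=16000, pre_emph=0.97,
                      frame_ms=25, hop_ms=10, energy_thresh=1e-4):
    """Pre-emphasis, framing with a Hamming window, and a crude energy-based
    endpoint check, loosely following steps 5.1)-5.2)."""
    # Pre-emphasis: boost high frequencies to balance the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    # Simple endpoint detection: keep frames whose short-time energy is above a threshold.
    energy = (frames ** 2).mean(axis=1)
    voiced = frames[energy > energy_thresh]
    return voiced   # these frames would feed feature extraction / the HMM acoustic model

# Example on a synthetic one-second signal: silence, a 440 Hz tone, silence.
sig = np.concatenate([np.zeros(4000),
                      np.sin(2 * np.pi * 440 * np.arange(8000) / 16000),
                      np.zeros(4000)])
print(preprocess_speech(sig).shape)
```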
The invention has the following beneficial effects: by recognizing object positions it overcomes the overly narrow restriction on product placement in traditional automated equipment; and the combination of voice with object position information can also be applied in the service industry.
drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 shows the Kinect sensor model and a schematic of the K coordinate system;
FIG. 2 is a schematic illustration of a K coordinate system and ground contrast;
FIG. 3 is an overall flow chart of object recognition;
FIG. 4 is a feature description sub-flowchart;
FIG. 5 is a diagram illustrating the relationship between a K coordinate system and an R coordinate system;
FIG. 6 is an overall flow chart of object pose determination;
FIG. 7 is an overall flow chart of voice interaction;
fig. 8 is a block diagram of the whole system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, a man-machine interaction method based on Kinect and voice includes the following two parts:
a first part of scene interaction, comprising the steps of:
firstly, correctly placing a Kinect and establishing a K coordinate system;
the Kinect is placed right opposite to the object, the detection range of the Kinect is 1.8-3.6 meters, the horizontal visual field is 53 degrees, the vertical visual field is 47 degrees, and the object is arranged corresponding to the object to guarantee that the Kinect can correctly acquire data within the range. Then, a coordinate system K with the center of Kinect as the origin is established as shown in FIG. 1, and Kinect is related to the ground as shown in FIG. 2, wherein the z-axis forms an angle θ with the horizontal plane.
Step two, the Kinect sensor completes target detection and identification;
the Kinect separately acquires a depth image and an RGB image, and three-dimensional point cloud data is obtained after fusing them;
First, point cloud data that is too far from or too close to the sensor is filtered out in preprocessing, which effectively reduces the computational cost, speeds up processing and improves the real-time performance of the system.
After preprocessing, the ISS algorithm is used for feature point detection, the detected feature points are then described with the S/C-RoPS algorithm, and finally feature matching against an object feature database yields the semantic description file identifying the object.
The flow of point cloud data acquisition is shown in fig. 3.
The three steps of extracting key points, calculating feature descriptors and 3D feature matching are described in detail below.
The specific process of extracting the key points is as follows:
(1) For each point p_i in the input point cloud data, query all points p_j within radius r_frame and calculate the weights according to Equation 1:
W_ij = 1 / ||p_i − p_j||,   ||p_i − p_j|| < r_frame   (1)
(2) Calculate the weighted covariance matrix according to Equation 2:
cov(p_i) = Σ_j W_ij (p_j − p_i)(p_j − p_i)^T / Σ_j W_ij   (2)
(3) Compute the eigenvalues λ_i^1, λ_i^2, λ_i^3 of the covariance matrix and arrange them in descending order.
(4) Set ratio thresholds γ_21 and γ_32, and retain as key feature points the points satisfying λ_i^2 / λ_i^1 < γ_21 and λ_i^3 / λ_i^2 < γ_32 (a numerical sketch of this selection follows).
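The following is a minimal, brute-force NumPy sketch of the ISS selection in steps (1)-(4); the radius and ratio thresholds are illustrative assumptions, and a real implementation (e.g. the one in PCL) would use a spatial index rather than an O(N²) neighbour search.

```python
import numpy as np

def iss_keypoints(points, r_frame=0.05, gamma21=0.8, gamma32=0.8):
    """Minimal ISS keypoint selection following steps (1)-(4).
    'points' is an (N, 3) array; neighbours are found by brute force for clarity."""
    keypoints = []
    for i, p_i in enumerate(points):
        d = np.linalg.norm(points - p_i, axis=1)
        mask = (d < r_frame) & (d > 0)                  # neighbours p_j inside the radius
        if not np.any(mask):
            continue
        w = 1.0 / d[mask]                               # Equation (1): W_ij = 1 / ||p_i - p_j||
        diff = points[mask] - p_i
        # Equation (2): weighted covariance of the neighbourhood.
        cov = (w[:, None, None] * np.einsum('nj,nk->njk', diff, diff)).sum(0) / w.sum()
        eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # lambda1 >= lambda2 >= lambda3
        l1, l2, l3 = eigvals
        if l1 > 0 and l2 > 0 and l2 / l1 < gamma21 and l3 / l2 < gamma32:
            keypoints.append(i)                         # step (4): ratio test
    return np.array(keypoints)

# Example on a random cloud; the returned array holds the indices of the keypoints.
cloud = np.random.default_rng(0).uniform(0, 0.2, size=(500, 3))
print(iss_keypoints(cloud, r_frame=0.05).shape)
```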
The calculation method of the feature descriptor is as follows:
A unique, unambiguous and stable local reference frame (LRF) is first constructed by computing the covariance matrix of the points lying on the local surface in the neighborhood of the keypoint; the local surface is then rotated, with the keypoint as origin, until the LRF is aligned with the Ox, Oy and Oz axes, which makes the points rotation invariant. The following steps are then performed for each of the axes Ox, Oy and Oz in turn, taking that axis as the current axis:
1) rotating the local surface around the current axis by a specified angle;
2) projecting the rotated local surface point onto XY, XZ and YZ planes;
3) establishing a projection distribution matrix that records only the number of points falling in each sub-domain; the number of sub-domains determines the dimension of the matrix and, like the specified rotation angle, is a parameter of the algorithm;
4) calculating the central moments of the distribution matrix, i.e. μ_11, μ_21, μ_12, μ_22, and e (the Shannon entropy);
5) concatenating the computed values to form a sub-feature.
These steps are repeated, the number of iterations depending on the number of specified rotations. Finally, the sub-features of the different coordinate axes are concatenated to form the final RoPS descriptor; a sketch of these projection statistics is given below.
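The sketch below illustrates the projection statistics described above for a single current axis (here the z-axis): rotate the local surface, project onto the XY, XZ and YZ planes, histogram the projections into a distribution matrix, and collect the central moments μ_11, μ_21, μ_12, μ_22 and the entropy e. The number of rotations and bins are illustrative assumptions, and this is a simplification for one axis only, not the full 135-dimensional S/C-RoPS descriptor of the patent.

```python
import numpy as np

def rops_subfeatures(local_points, n_rotations=3, n_bins=5):
    """Projection statistics for one coordinate axis (the z-axis): rotate,
    project onto XY/XZ/YZ, build a distribution matrix, collect moments + entropy."""
    feats = []
    for k in range(n_rotations):
        theta = 2 * np.pi * k / n_rotations                    # the specified rotation angle
        c, s = np.cos(theta), np.sin(theta)
        rot_z = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
        pts = local_points @ rot_z.T
        for ax in [(0, 1), (0, 2), (1, 2)]:                    # XY, XZ, YZ projections
            hist, _, _ = np.histogram2d(pts[:, ax[0]], pts[:, ax[1]], bins=n_bins)
            dist = hist / hist.sum()                            # projection distribution matrix
            idx = np.arange(n_bins) + 1
            mi = (dist.sum(1) * idx).sum()                      # marginal mean along rows
            mj = (dist.sum(0) * idx).sum()                      # marginal mean along columns
            ci, cj = np.meshgrid(idx - mi, idx - mj, indexing='ij')
            mu = lambda p, q: (ci ** p * cj ** q * dist).sum()  # central moment mu_pq
            entropy = -(dist[dist > 0] * np.log(dist[dist > 0])).sum()
            feats.extend([mu(1, 1), mu(2, 1), mu(1, 2), mu(2, 2), entropy])
    return np.array(feats)

# 3 rotations x 3 projections x 5 statistics = 45 values for this single axis.
print(rops_subfeatures(np.random.default_rng(1).normal(size=(200, 3))).shape)
```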
Shape or color information of the local surface is added to RoPS, extending and enriching the encoded information and producing the S/C-RoPS descriptor; a block diagram of the algorithm is shown in FIG. 4, and the accuracy of feature matching is thereby improved.
The method adopts a confidence-based decision-level fusion algorithm to fuse the information from the S-RoPS descriptor and the C-RoPS descriptor. The idea is to run object recognition independently with the S-RoPS descriptor and with the C-RoPS descriptor, obtaining the highest confidence under each single-modality method; the fusion strategy then compares the confidences of all candidate models produced by the two independent methods and selects the candidate model with the highest confidence.
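A decision-level fusion of this kind reduces to comparing the best candidates returned by the two single-modality recognisers; the sketch below assumes each recogniser returns (model, confidence) pairs, which is an illustrative interface rather than the patent's.

```python
def fuse_srops_crops(srops_candidates, crops_candidates):
    """Decision-level fusion sketch: each argument is a list of
    (model_name, confidence) results produced independently by the S-RoPS and
    C-RoPS recognisers; the candidate with the highest confidence wins."""
    return max(srops_candidates + crops_candidates, key=lambda mc: mc[1])

# Hypothetical candidate lists for illustration only.
print(fuse_srops_crops([("mug", 0.71), ("bowl", 0.40)],
                       [("mug", 0.83), ("can", 0.35)]))
```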
The feature matching method is as follows:
This patent uses threshold-based feature matching: if the distance between two descriptors is smaller than a set threshold, the two features are considered a consistent match.
The distance used against the threshold characterizes the difference between two object clusters (a cluster is made up of the set of descriptors of one object): it is the Manhattan distance between the geometric centers of the two sets plus the Manhattan distance between their per-dimension standard deviations, as shown in Equations 3 to 5:
D(A, B) = L1(C_A, C_B) + L1(std_A, std_B)   (3)
std_A(i) = sqrt( (1/|A|) Σ_{j=1..|A|} ( a_j(i) − C_A(i) )² )   (4)
std_B is calculated in the same way as std_A, and n denotes the size of the feature descriptor; the L1 distance between two descriptors a and b is
L1(a, b) = Σ_{i=1..n} | a(i) − b(i) |   (5)
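The following sketch computes the cluster distance of Equations 3-5 with NumPy; the descriptor arrays, the 135-dimensional size and the threshold value are illustrative assumptions.

```python
import numpy as np

def cluster_distance(A, B):
    """Equations (3)-(5): A and B are (|A|, n) and (|B|, n) arrays of descriptors
    for the keypoints of two objects; the distance is the Manhattan (L1) distance
    between the cluster centres plus the L1 distance between the per-dimension
    standard deviations."""
    C_A, C_B = A.mean(axis=0), B.mean(axis=0)           # geometric centres
    std_A, std_B = A.std(axis=0), B.std(axis=0)         # per-dimension standard deviations, Eq. (4)
    l1 = lambda a, b: np.abs(a - b).sum()               # Eq. (5)
    return l1(C_A, C_B) + l1(std_A, std_B)              # Eq. (3)

rng = np.random.default_rng(2)
A = rng.normal(0.0, 1.0, size=(40, 135))                # 135-dimensional RoPS descriptors
B = rng.normal(0.1, 1.0, size=(35, 135))
threshold = 30.0                                         # hypothetical matching threshold
print(cluster_distance(A, B) < threshold)
```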
And thirdly, the mechanical arm is placed at a suitable position and the coordinate system R is established; the pose is solved in the K coordinate system, and the position and attitude information under K is converted into position and attitude information under the coordinate system R through coordinate transformation and coordinate system transformation (the object coordinate system O is a temporary variable introduced only to solve the pose; apart from its origin it has no practical meaning, and the transformation used is from K to R rather than from O to R), and the XML semantic map is produced.
The mechanical arm is placed at a suitable position and the coordinate system R is established as shown in FIG. 5; the origin of the coordinate system K has coordinates (d, l, h) in the coordinate system R. An object coordinate system O is established using the PCA method, and the corresponding pose information in the R coordinate system is obtained through two coordinate system transformations and one coordinate transformation under the K coordinate system.
1) Calculate the geometric center of the object point cloud,
P̄ = (1/N) Σ_{i=1..N} P_i,
where i indexes the points and N is their number; de-center all points, P_i' = P_i − P̄, and arrange the coordinates of all de-centered points into a 3 × N matrix A = [P_1', P_2', ..., P_N'].
2) Let M = A·A^T and compute the eigenvalues and eigenvectors of M: λ_i·V_i = M·V_i, i = 1, 2, 3, normalizing the eigenvectors so that ||V_i|| = 1. The long-axis direction of the object corresponds to the eigenvector of the largest eigenvalue; with λ_1 ≤ λ_2 ≤ λ_3, the rotation matrix of the object coordinate system relative to the coordinate system K is formed from the ordered eigenvectors as its columns (the eigenvector of the largest eigenvalue giving the Z axis), and the translation vector is the geometric center P̄ of the object point cloud. The pose of the object coordinate system under the coordinate system K is the homogeneous transformation composed of this rotation matrix and translation vector.
Let camC = {P_i} denote the measured point cloud; transforming camC by the inverse of this pose expresses the point cloud in the model-library object coordinate system. The short-axis and secondary-long-axis planes are determined from the long axis and the center point, and the directions of the short axis and the secondary long axis are determined from the extreme-value distribution of the points in those planes (a numerical sketch of the PCA pose computation is given below).
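The sketch below follows steps 1)-2) with NumPy: geometric centre, de-centred 3×N matrix, eigen-decomposition of M = A·A^T, and a homogeneous pose in coordinate system K with the long axis mapped to Z. The column ordering and the sign fix for a right-handed frame are assumptions where the patent's exact convention is not recoverable.

```python
import numpy as np

def pca_object_pose(points):
    """PCA-based object pose sketch: centre the (N, 3) point cloud, build the
    3xN matrix A, decompose M = A A^T, and assemble a homogeneous pose in K."""
    centre = points.mean(axis=0)                         # geometric centre of the point cloud
    A = (points - centre).T                              # 3 x N de-centred coordinates
    M = A @ A.T
    eigvals, eigvecs = np.linalg.eigh(M)                 # ascending: lambda1 <= lambda2 <= lambda3
    R = eigvecs                                          # columns V1, V2, V3; V3 (long axis) -> Z axis
    if np.linalg.det(R) < 0:                             # keep a right-handed frame (assumed convention)
        R[:, 0] *= -1
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, centre                      # rotation + translation = pose of O in K
    return T

# Example on an elongated synthetic cloud; the long axis ends up as the Z column.
rng = np.random.default_rng(3)
elongated = rng.normal(size=(300, 3)) * np.array([0.01, 0.02, 0.2]) + np.array([0.1, 0.0, 2.0])
print(np.round(pca_object_pose(elongated), 3))
```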
In the matching stage, in order to obtain a transformation matrix from an actual object to an object in a model library, a three-point method is adopted to calculate the pose of six degrees of freedom, and two corresponding three-dimensional points are integrated into a retaining pocketmodP},{objP, if the rigid body transformation relation is satisfied
Figure GDA0002480671040000128
Wherein
Figure GDA0002480671040000129
Figure GDA00024806710400001210
For the rotation matrix and translation vector of the two-point set, the least square method is used to solve the optimal solution to obtain the solution that minimizes E in the formula 8
Figure GDA00024806710400001211
And
Figure GDA00024806710400001212
Figure GDA0002480671040000131
The transformation matrix from the actual object to the model-library object is then the homogeneous matrix composed of the optimal rotation matrix and translation vector, and the pose matrix of the actual object with respect to the sensor coordinate system is obtained by combining this transformation with the object pose computed above.
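One standard way to obtain the R and t that minimise E in Equation 8 is the SVD (Kabsch) construction sketched below; the patent does not state which least-squares solver it uses, so this is illustrative only.

```python
import numpy as np

def fit_rigid_transform(obj_pts, mod_pts):
    """Least-squares estimate of R, t minimising
    E = sum_i || mod_p_i - (R * obj_p_i + t) ||^2  (Eq. 8),
    solved here with the standard SVD (Kabsch) construction."""
    mu_obj, mu_mod = obj_pts.mean(0), mod_pts.mean(0)
    H = (obj_pts - mu_obj).T @ (mod_pts - mu_mod)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_mod - R @ mu_obj
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t                           # homogeneous transform from R and t
    return T

# Self-check: recover a known rotation about Z and a translation from three points.
rng = np.random.default_rng(4)
obj = rng.normal(size=(3, 3))                            # three corresponding points
ang = 0.4
R_true = np.array([[np.cos(ang), -np.sin(ang), 0],
                   [np.sin(ang),  np.cos(ang), 0],
                   [0, 0, 1]])
mod = obj @ R_true.T + np.array([0.5, -0.2, 1.0])
print(np.allclose(fit_rigid_transform(obj, mod)[:3, :3], R_true))   # True
```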
The rotation matrix can be converted into yaw α, pitch β and roll γ to describe the attitude, as in Equation 11, and the translation vector can be converted into center coordinates to describe the position, where r_ij denotes the element in row i, column j of the rotation matrix.
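A common Z-Y-X (yaw-pitch-roll) extraction consistent with the description of Equation 11 is sketched below; the exact angle convention used in the patent is not recoverable from the text, so treat this as one plausible reading.

```python
import numpy as np

def rotation_to_ypr(R):
    """Yaw-pitch-roll from a rotation matrix under the common Z-Y-X convention;
    r_ij corresponds to R[i-1, j-1]."""
    yaw = np.arctan2(R[1, 0], R[0, 0])                         # alpha from r21, r11
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))   # beta from r31, r32, r33
    roll = np.arctan2(R[2, 1], R[2, 2])                        # gamma from r32, r33
    return yaw, pitch, roll

# Round-trip check with a known set of angles.
a, b, g = 0.3, -0.2, 0.1
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
Rx = np.array([[1, 0, 0], [0, np.cos(g), -np.sin(g)], [0, np.sin(g), np.cos(g)]])
print(np.round(rotation_to_ypr(Rz @ Ry @ Rx), 3))              # [ 0.3 -0.2  0.1]
```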
The relationship between the coordinate system R and the coordinate system K is shown in FIG. 5, and the transformation between the two is given by Equation 12, which maps the coordinates {x_k, y_k, z_k} of an object in the coordinate system K to its coordinates {x, y, z} in the coordinate system R, where θ denotes the tilt angle of the Kinect with respect to the horizontal plane.
The attitude matrix of the object with respect to the coordinate system R is then obtained by applying the rotation part of this K-to-R transformation to the attitude matrix of the object in the coordinate system K.
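The sketch below applies a K-to-R mapping of the kind described by Equation 12, assuming the Kinect is tilted by θ about its horizontal X axis and that its origin sits at (d, l, h) in the coordinate system R; the exact rotation in the patent's Equation 12 depends on axis conventions shown only in FIG. 5, so the matrix here is an assumption.

```python
import numpy as np

def kinect_to_robot(p_k, theta_deg, d, l, h):
    """Map a point from the K coordinate system to the R coordinate system,
    assuming a tilt of theta about the (horizontal) X axis and a K origin at
    (d, l, h) in R; illustrative only, not the patent's exact Equation 12."""
    th = np.radians(theta_deg)
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(th), -np.sin(th)],
                      [0, np.sin(th),  np.cos(th)]])
    T = np.eye(4)
    T[:3, :3] = rot_x
    T[:3, 3] = [d, l, h]
    return (T @ np.append(p_k, 1.0))[:3]      # homogeneous K -> R mapping

print(np.round(kinect_to_robot(np.array([0.1, 0.0, 2.0]),
                               theta_deg=-20, d=0.4, l=0.0, h=1.2), 3))
```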
The second part is voice man-machine interaction, which comprises the following steps:
Step one: the user issues a voice command, which is converted into text information after processing.
After the user's voice is received, the text information is obtained through preprocessing and speech decoding; the specific flow is shown in FIG. 7.
Step two: the text information and the XML semantic map are input into the intelligent inference engine, which generates an execution instruction and outputs text information.
The user constructs a semantic map file of the current scene through the voice-controlled real-time three-dimensional map generation module; the speech recognition node and the speech synthesis node realize the man-machine dialogue by sending and receiving text, respectively; and the intelligent inference engine node analyzes and gives feedback in combination with the map file, refines the solution the user expects through further dialogue, and finally generates the solution and sends it to the scheme parsing and motion planning module. The PocketSphinx open-source speech recognition system is used for speech recognition, and the Ekho open-source speech synthesis system is used for speech synthesis.
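As a toy illustration of this dialogue loop, the sketch below matches recognised text against object entries in a semantic map and returns an instruction plus a reply for speech synthesis; the XML schema, object names and reply format are hypothetical, since the patent does not publish the actual map file format or inference rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML semantic-map schema for illustration only.
SEMANTIC_MAP = """
<scene frame="R">
  <object name="cup"    x="0.42" y="0.10" z="0.76" yaw="0.0" pitch="0.0" roll="0.0"/>
  <object name="bottle" x="0.15" y="-0.22" z="0.80" yaw="1.2" pitch="0.0" roll="0.0"/>
</scene>
"""

def infer(text, xml_map):
    """Toy stand-in for the inference engine: match recognised text against
    object names in the semantic map and emit an instruction plus a reply."""
    scene = ET.fromstring(xml_map)
    for obj in scene.findall("object"):
        if obj.get("name") in text.lower():
            pos = tuple(float(obj.get(k)) for k in ("x", "y", "z"))
            return {"instruction": ("grasp", obj.get("name"), pos),
                    "reply": f"Moving to the {obj.get('name')} at {pos} in frame R."}
    return {"instruction": None, "reply": "I could not find that object; please rephrase."}

print(infer("please pick up the cup", SEMANTIC_MAP))
```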
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (2)

1. A man-machine interaction method based on Kinect and voice is characterized by comprising the following steps:
1) acquiring accurate spatial position and attitude information of each object in the scene in the coordinate system K by using a Kinect sensor, to complete target detection and recognition; the coordinate system K is established with the geometric center of the Kinect as the origin, the direction perpendicular to the lens plane and pointing outward as the positive Z axis, and the line connecting the centers of the three Kinect lenses as the X axis;
2) fusing the depth image and the RGB respectively acquired by the Kinect to obtain three-dimensional point cloud data;
3) identifying the space point cloud object: processing the three-dimensional point cloud data to obtain a semantic description file; the step 3) is as follows:
3.1) preprocessing, wherein the preprocessing step is used for filtering point cloud data too far away or too close to a sensor;
3.2) detecting the characteristic points of the point cloud data by adopting an ISS algorithm, wherein the specific process is as follows:
3.2.1) for each point p_i in the input point cloud data, querying all points p_j within radius r_frame and calculating the weights according to Equation 1:
W_ij = 1 / ||p_i − p_j||,   ||p_i − p_j|| < r_frame   (1)
3.2.2) calculating the weighted covariance matrix according to Equation 2:
cov(p_i) = Σ_j W_ij (p_j − p_i)(p_j − p_i)^T / Σ_j W_ij   (2)
3.2.3) computing the eigenvalues λ_i^1, λ_i^2, λ_i^3 of the covariance matrix and arranging them in descending order;
3.2.4) setting ratio thresholds γ_21 and γ_32, and retaining as key feature points the points satisfying λ_i^2 / λ_i^1 < γ_21 and λ_i^3 / λ_i^2 < γ_32;
3.3) calculating the feature descriptors of the key feature points by the following specific method:
firstly, a unique, unambiguous and stable local reference frame (LRF) is constructed by calculating the covariance matrix of the points on the local surface in the neighborhood of the key point, and the local surface is rotated, with the key point as origin, until the LRF is aligned with the Ox, Oy and Oz axes of the object coordinate system O, so that the points have rotation invariance;
then the following steps are performed for each of the axes Ox, Oy and Oz in turn, taking that axis as the current axis:
3.3.1) the local surface is rotated around the current axis by a specified angle;
3.3.2) the rotated local surface points are projected onto the XY, XZ and YZ planes;
3.3.3) establishing a projection distribution matrix that records only the number of points contained in each sub-domain, wherein the number of sub-domains determines the dimension of the matrix and, like the specified rotation angle, is a parameter of the feature descriptor calculation;
3.3.4) calculating the central moments of the distribution matrix, i.e. μ_11, μ_21, μ_12, μ_22, and e (the Shannon entropy);
3.3.5) cascading the calculated values to form sub-features;
the above steps are performed cyclically, the number of iterations depending on the number of specified rotations; finally, the sub-features of the different coordinate axes are concatenated to form the final RoPS descriptor;
3.4) feature matching, the specific method being as follows:
using a threshold-based feature matching method, wherein two features are considered a consistent match if the distance between the two descriptors is smaller than a set threshold;
the distance used against the threshold characterizes the difference between two object clusters, namely the Manhattan distance between the geometric centers of the two sets plus the Manhattan distance between their per-dimension standard deviations, calculated according to Equations (3) and (5):
D(A, B) = L1(C_A, C_B) + L1(std_A, std_B)   (3)
wherein D(A, B) represents the distance between the two object clusters A and B; for each dimension i, C_A(i) and C_B(i) are the centers of clusters A and B in that dimension, L1 denotes the Manhattan distance formula, std_A(i) represents the standard deviation of cluster A in dimension i, and std_B(i) represents the standard deviation of cluster B in dimension i,
std_A(i) = sqrt( (1/|A|) Σ_{j=1..|A|} ( a_j(i) − C_A(i) )² )   (4)
with std_B calculated analogously; n represents the size of the feature descriptor; the L1 distance between two descriptors a and b is
L1(a, b) = Σ_{i=1..n} | a(i) − b(i) |   (5)
a_j(i) represents the value of dimension i of the RoPS descriptor of the j-th keypoint in cluster A;
b_j(i) represents the value of dimension i of the RoPS descriptor of the j-th keypoint in cluster B;
| A | represents the number of key points in the cluster A;
| B | represents the number of key points in the cluster B;
4) carrying out coordinate transformation on the object coordinate system O to obtain a three-dimensional scene semantic map description file under a coordinate system R;
the step 4) is as follows:
selecting a proper position to place the mechanical arm and establishing the coordinate system R, wherein the origin of the coordinate system K has coordinates (d, l, h) in the coordinate system R; establishing an object coordinate system O by using a PCA method, and obtaining the attitude of the object through two coordinate system transformations, from the coordinate system O to the coordinate system K and then to the coordinate system R;
5) receiving voice input of a user, and processing an input signal to obtain text information;
6) inputting the text information and the XML semantic map into an intelligent inference engine, wherein the inference engine generates execution instructions and outputs text information of the response and guidance information for the user.
2. The Kinect and voice-based human-computer interaction method as claimed in claim 1, wherein the step 5) voice recognition process specifically comprises the steps of:
5.1) preprocessing: collecting the user's voice information through a microphone array, processing the raw input speech signal, filtering out unimportant information and background noise, and performing endpoint detection, framing and pre-emphasis on the speech signal;
5.2) feature extraction: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
5.3) performing acoustic model modeling by adopting a Hidden Markov Model (HMM), and matching the voice to be recognized with the acoustic model in the recognition process so as to obtain a recognition result;
5.4) carrying out grammar and semantic analysis on the training text database, and training based on a statistical model to obtain an N-Gram language model, thereby improving the recognition rate and reducing the search range;
5.5) aiming at the input voice signal, establishing a recognition network according to the trained HMM acoustic model, language model and dictionary, and searching an optimal path in the network according to a search algorithm, wherein the path is a word string capable of outputting the voice signal with the maximum probability, thereby determining the characters contained in the voice sample.
CN201610306998.7A 2016-05-10 2016-05-10 Man-machine interaction method based on Kinect and voice Active CN106055244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610306998.7A CN106055244B (en) 2016-05-10 2016-05-10 Man-machine interaction method based on Kinect and voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610306998.7A CN106055244B (en) 2016-05-10 2016-05-10 Man-machine interaction method based on Kinect and voice

Publications (2)

Publication Number Publication Date
CN106055244A CN106055244A (en) 2016-10-26
CN106055244B true CN106055244B (en) 2020-08-04

Family

ID=57176838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610306998.7A Active CN106055244B (en) 2016-05-10 2016-05-10 Man-machine interaction method based on Kinect and voice

Country Status (1)

Country Link
CN (1) CN106055244B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108873707A (en) * 2017-05-10 2018-11-23 杭州欧维客信息科技股份有限公司 Speech-sound intelligent control system
CN109839622B (en) * 2017-11-29 2022-08-12 武汉科技大学 Multi-target tracking method for parallel computing particle probability hypothesis density filtering
CN111666797B (en) * 2019-03-08 2023-08-08 深圳市速腾聚创科技有限公司 Vehicle positioning method, device and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103472916A (en) * 2013-09-06 2013-12-25 东华大学 Man-machine interaction method based on human body gesture recognition
CN104571485A (en) * 2013-10-28 2015-04-29 中国科学院声学研究所 System and method for human and machine voice interaction based on Java Map

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103472916A (en) * 2013-09-06 2013-12-25 东华大学 Man-machine interaction method based on human body gesture recognition
CN104571485A (en) * 2013-10-28 2015-04-29 中国科学院声学研究所 System and method for human and machine voice interaction based on Java Map

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Intrinsic shape signatures: A shape descriptor for 3D object recognition; Yu Zhong; 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops); 2009-11-30; pp. 1-8 *
RoPs (Rotational Projection Statistics) feature; Point Cloud Library; http://pointclouds.org/documentation/tutorials/ropses feature.php; 2010-11-12; pp. 1-5 *
A real-time 3D semantic map generation method; Wu Fan et al.; Computer Engineering and Applications; 2015-11-09; Vol. 53, No. 6; pp. 67-72 *
Research on natural-language-based parser technology for a sorting robot; Xiong Zhiheng et al.; Computer Engineering and Applications; 2015-12-21; Vol. 58, No. 3; pp. 113-119 *
Fundamentals of speech recognition and an introduction to CMU Sphinx; kylinfish; https://www.cnblogs.com/kylinfish/articles/3627188.html; 2014-03-26; pp. 1-8 *

Also Published As

Publication number Publication date
CN106055244A (en) 2016-10-26

Similar Documents

Publication Publication Date Title
CN106682598B (en) Multi-pose face feature point detection method based on cascade regression
Matuszek et al. Learning from unscripted deictic gesture and language for human-robot interactions
Gao et al. Sign language recognition based on HMM/ANN/DP
CN110097553A (en) The semanteme for building figure and three-dimensional semantic segmentation based on instant positioning builds drawing system
Schauerte et al. Multimodal saliency-based attention for object-based scene analysis
CN111432989A (en) Artificially enhanced cloud-based robot intelligence framework and related methods
CN106095109B (en) The method for carrying out robot on-line teaching based on gesture and voice
CN110554774A (en) AR-oriented navigation type interactive normal form system
JP2007538318A (en) Sign-based human-machine interaction
CN105931218A (en) Intelligent sorting method of modular mechanical arm
CN106055244B (en) Man-machine interaction method based on Kinect and voice
CN113361636B (en) Image classification method, system, medium and electronic device
CN108320051B (en) Mobile robot dynamic collision avoidance planning method based on GRU network model
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN110781920A (en) Method for identifying semantic information of cloud components of indoor scenic spots
Geng et al. Combining features for chinese sign language recognition with kinect
CN113012122A (en) Category-level 6D pose and size estimation method and device
CN112101243A (en) Human body action recognition method based on key posture and DTW
Fransen et al. Using vision, acoustics, and natural language for disambiguation
CN109255815B (en) A kind of object detection and recognition methods based on order spherical harmonic
WO2021103558A1 (en) Rgb-d data fusion-based robot vision guiding method and apparatus
Ryumin et al. Automatic detection and recognition of 3D manual gestures for human-machine interaction
Canal et al. Gesture based human multi-robot interaction
CN111695408A (en) Intelligent gesture information recognition system and method and information data processing terminal
CN116249607A (en) Method and device for robotically gripping three-dimensional objects

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant