CN106055244A

CN106055244A - Man-machine interaction method based on Kincet and voice

Info

Publication number: CN106055244A
Application number: CN201610306998.7A
Authority: CN
Inventors: 闵华松; 齐诗萌; 李潇; 林云汉; 吴凡
Original assignee: Wuhan University of Science and Engineering WUSE
Current assignee: Wuhan University of Science and Engineering WUSE
Priority date: 2016-05-10
Filing date: 2016-05-10
Publication date: 2016-10-26
Anticipated expiration: 2036-05-10
Also published as: CN106055244B

Abstract

The invention discloses a human-computer interaction method based on Kincet and voice, comprising the following steps: 1) using a Kinect sensor to acquire accurate spatial position and attitude information of each object in the kinect coordinate system K in the scene, and completing target detection and recognition; 2) ) The depth image and RGB collected by Kinect are fused to obtain 3D point cloud data; 3) Spatial point cloud object recognition: processing the 3D point cloud data to obtain a semantic description file; 4) Coordinate transformation of the object coordinate system O Obtain the 3D scene semantic map description file under the coordinate system R; 5) Receive user voice input, process the input signal to obtain text information; 6) Input the text information and XML semantic map into the intelligent reasoning machine, the reasoning machine generates execution instructions and Outputs text messages for user replies and guidance messages.

Description

A Human-Computer Interaction Method Based on Kincet and Voice

技术领域technical field

本发明涉及机器人技术领域，尤其涉及一种基于Kincet和语音的人机交互方法。The invention relates to the technical field of robots, in particular to a human-computer interaction method based on Kincet and voice.

背景技术Background technique

传统的人机交互系统，多采用WIMP界面形成了以窗口、菜单、图符和指示装置为基础的图形用户界面，通过按键、旋钮或其他触碰装置输入信息。这种交互系统只能根据交互系统设计者预设的信息提供有限的选项供人选择，无法与环境信息进行交互大量信息还需要操作人员手动输入，无论是应用在服务环节还是生产制造中都需要有熟练工作人员操作。无论如何优化其结构或改进对用户的引导方式，都只能降低使用难度不能真正减少工作人员数量节约人力成本的目的。Traditional human-computer interaction systems mostly use WIMP interface to form a graphical user interface based on windows, menus, icons and pointing devices, and input information through buttons, knobs or other touch devices. This kind of interactive system can only provide limited options for people to choose according to the information preset by the interactive system designer, and cannot interact with environmental information. A large amount of information needs to be manually input by the operator, whether it is applied in the service link or in manufacturing. There are skilled staff to operate. No matter how to optimize its structure or improve the way of guiding users, it can only reduce the difficulty of use and cannot really reduce the number of staff and save labor costs.

文献检索查到相关专利：2016年3月23日公开的申请号为CN201511016826.8的发明专利《一种人机互动的方法、装置及机器人》，提出了一种基于语音和图像信息的交互方法，系统可以通过用户的语音信息确定用户的身份并能通过用户的动作判断用户的输入。2016年3月23日公开的申请号为CN201510658482.4的发明专利《餐饮服务系统》，提出了一种基于语音处理单元获取用户指令和通过麦克风阵列得出用户位置的人机交互方法。Relevant patents were found by literature search: the invention patent "A Method, Device and Robot for Human-Computer Interaction" published on March 23, 2016 with the application number CN201511016826.8, proposed an interactive method based on voice and image information , the system can determine the user's identity through the user's voice information and judge the user's input through the user's actions. The invention patent "Catering Service System" published on March 23, 2016 with the application number CN201510658482.4 proposes a human-computer interaction method based on the voice processing unit to obtain user instructions and obtain the user's location through the microphone array.

但是，上述专利只涉及如何通过多媒体技术获取用户信息，但无法通过获取场景信息，必须保证交互系统用在特定的场景，一旦场景发生较大变化交互系统将无法应答或出现执行错误。However, the above-mentioned patent only involves how to obtain user information through multimedia technology, but cannot obtain scene information. It must be ensured that the interactive system is used in a specific scene. Once the scene changes greatly, the interactive system will not be able to respond or an execution error will occur.

发明内容Contents of the invention

本发明要解决的技术问题在于针对现有技术中的缺陷，提供一种基于Kincet和语音的人机交互方法。The technical problem to be solved by the present invention is to provide a human-computer interaction method based on Kincet and voice in view of the defects in the prior art.

本发明解决其技术问题所采用的技术方案是：一种基于Kincet和语音的人机交互方法，包括以下步骤：The technical solution adopted by the present invention to solve the technical problems is: a kind of human-computer interaction method based on Kincet and voice, comprising the following steps:

2)处理三维点云数据求取在K坐标系下的位置；所述坐标系K为以kinect几何中心为原点，以垂直于镜头向外的方向为Z轴正方向，以Kincet三个镜头的圆心的连线为X轴，建立坐标系；2) process the three-dimensional point cloud data to obtain the position under the K coordinate system; the coordinate system K is to take the geometric center of the kinect as the origin, take the direction perpendicular to the outward direction of the lens as the positive direction of the Z axis, and take the three lenses of the Kincet The line connecting the center of the circle is the X axis, and the coordinate system is established;

1)将Kinect分别采集的深度图像和RGB经过融合处理后得到三维点云数据；1) The depth image and RGB collected by Kinect are fused to obtain 3D point cloud data;

3)空间点云物体识别：对三维点云数据进行处理得到语义描述文件；3) Spatial point cloud object recognition: process the 3D point cloud data to obtain a semantic description file;

4)对物体坐标系O进行坐标变换得到坐标系R下的三维场景语义地图描述文件；物体坐标系O以点云的几何中兴为原点，以过原点的物体内部最长的线段方向为Z轴，过原点垂直于Z轴的平面就是XY平面；坐标系R以地面为XY平面，机械臂底座的几何中心为在XY平面上的投影为原点，过原点垂直于地面向上为Z轴正方向，Y轴均平行于K坐标系的y轴；4) Perform coordinate transformation on the object coordinate system O to obtain the 3D scene semantic map description file under the coordinate system R; the object coordinate system O takes the geometric center of the point cloud as the origin, and takes the longest line segment direction inside the object passing through the origin as the Z axis , the plane passing through the origin and perpendicular to the Z axis is the XY plane; the coordinate system R takes the ground as the XY plane, the geometric center of the manipulator base is the projection on the XY plane as the origin, and passing through the origin perpendicular to the ground is the positive direction of the Z axis. The Y axis is parallel to the y axis of the K coordinate system;

5)接收用户语音输入，对输入信号进行处理，得到文本信息；5) Receive user voice input, process the input signal, and obtain text information;

6)将文本信息和XML语义地图输入智能推理机，推理机产生执行指令并输出对用户的答复和引导信息的文本信息。6) Input the text information and XML semantic map into the intelligent inference engine, and the inference engine generates execution instructions and outputs text information of replies to users and guidance information.

按上述方案，所述步骤3)空间点云物体识别过程包括预处理、关键点提取、描述子提取，再通过物体特征数据库进行特征匹配，最后获得语义描述文件。According to the above scheme, the step 3) object recognition process of the spatial point cloud includes preprocessing, key point extraction, descriptor extraction, and then feature matching through the object feature database, and finally obtains the semantic description file.

按上述方案，所述步骤3)中：According to the above scheme, in the step 3):

3.1)预处理，所述预处理步骤用于滤除距离传感器过远或过近的点云数据；3.1) preprocessing, the preprocessing step is used to filter out point cloud data that is too far or too close to the sensor;

3.2)采用ISS算法对点云数据进行特征点检测，具体过程如下：3.2) Using the ISS algorithm to detect feature points on the point cloud data, the specific process is as follows:

3.2.1)查询输入点云数据中每一个点p_i半径r_frame内所有点p_j，并按照公式1计算权重；3.2.1) Query all points p _j within the radius r _frame of each point p _i in the input point cloud data, and calculate the weight according to formula 1;

W_ij＝1/||p_i-p_j||，|p_i-p_j|＜r_frame (1)W _ij ＝1/||p _i -p _j ||, |p _i -p _j |<r _frame (1)

3.2.2)根据权重按照公式2计算协方差矩阵3.2.2) Calculate the covariance matrix according to formula 2 according to the weight

$C C O o V V (({p p}_{i i})) = = \underset{| | {p p}_{i i} - - {p p}_{j j} | | < < {r r}_{f f r r a a m m e e}}{Σ Σ} {w w}_{i i j j} (({p p}_{i i} - - {p p}_{j j})) {(({p p}_{i i} - - {p p}_{j j}))}^{T T} / / \underset{| | {p p}_{i i} - - {p p}_{j j} | | < < {r r}_{f f r r a a m m e e}}{Σ Σ} {w w}_{i i j j} - - - - - - ((22))$

3.2.3)计算协方差矩阵的特征值并将特征值按照从大到小顺序排列；3.2.3) Calculate the eigenvalues of the covariance matrix And arrange the eigenvalues in descending order;

3.2.4)设置比率阈值γ₂₁和γ₃₂，保留满足和的点集，这些点即为关键特征点；3.2.4) Set ratio thresholds γ ₂₁ and γ ₃₂ , keep satisfying and The set of points, these points are the key feature points;

3.3)关键特征点的特征描述子计算，具体方法如下：3.3) The feature descriptor calculation of key feature points, the specific method is as follows:

首先通过计算位于关键点邻域局部表面的点的协方差矩阵来构建一个独特的、明确的和稳定的局部参考坐标系LRF，以关键点作为起始点，旋转局部表面直到LRF与物体坐标系O的Ox，Oy和Oz轴对齐，这样可以使点具有旋转不变性；Firstly, a unique, clear and stable local reference coordinate system LRF is constructed by calculating the covariance matrix of points located on the local surface in the neighborhood of the key point, with the key point as the starting point, the local surface is rotated until the LRF is consistent with the object coordinate system O The Ox, Oy and Oz axes are aligned, which makes the point invariant to rotation;

然后对每个轴Ox，Oy，Oz执行如下几步，我们把这些轴作为当前轴：Then perform the following steps for each axis Ox, Oy, Oz, we take these axes as the current axis:

3.3.1)局部表面以指定角度绕当前轴旋转；3.3.1) The local surface rotates around the current axis at a specified angle;

3.3.2)被旋转的局部表面点投影到XY，XZ和YZ平面上；3.3.2) The rotated local surface points are projected onto the XY, XZ and YZ planes;

3.3.3)建立投影分布矩阵，这个矩阵仅仅显示每个子域包含的点的数量，子域的数量代表矩阵的维数，和指定角度一样它也是本算法的一个参量；3.3.3) Establish a projection distribution matrix, which only shows the number of points contained in each subfield, and the number of subfields represents the dimension of the matrix, which is also a parameter of the algorithm like the specified angle;

3.3.4)计算分布矩阵中心距，即μ₁₁、μ₂₁、μ₁₂、μ₂₂和e；3.3.4) Calculate the center distance of the distribution matrix, namely μ ₁₁ , μ ₂₁ , μ ₁₂ , μ ₂₂ and e;

3.3.5)计算的值级联组成子特征；3.3.5) The calculated values are cascaded to form sub-features;

循环执行上述步骤，迭代次数取决于给定的旋转的数目；最后，将不同坐标轴的子特征级联形成最终的RoPS描述子；The above steps are executed cyclically, and the number of iterations depends on the number of given rotations; finally, the sub-features of different coordinate axes are concatenated to form the final RoPS descriptor;

3.4)特征值匹配，具体方法如下：3.4) Eigenvalue matching, the specific method is as follows:

本专利中使用基于阀值的特征匹配法，基于阈值的匹配模式下，如果两个描述子之间的距离小于设定的阈值，则表明两个特征一致匹配。This patent uses a threshold-based feature matching method. In the threshold-based matching mode, if the distance between two descriptors is less than the set threshold, it indicates that the two features match consistently.

阀值所使用的距离公式为表征两个物体聚类之间的差异(一个聚类由多个描述子集合构成)，即两个集合的几何中心加上它们每一维度的标准偏差的曼哈顿距离之和如式3和式5：The distance formula used by the threshold is to characterize the difference between two object clusters (a cluster is composed of multiple descriptor sets), that is, the geometric center of the two sets plus the Manhattan distance of their standard deviation in each dimension The sum is as formula 3 and formula 5:

D(A,B)＝L₁(C_A,C_B)+L₁(std_A,std_B) (3)D(A,B)＝L ₁ (C _A ,C _B )+L ₁ (std _A ,std _B ) (3)

其中，D(A,B)代表两个物体聚类即A和B的距离差，C_A(i),C_B(i)分别为A、B某一维度的中心，L1代表曼哈顿距离公式，std_A(i)代表聚类A某一维度的标准偏差，std_B(i)代表聚类B某一维度的标准偏差；Among them, D(A,B) represents the distance difference between two object clusters, that is, A and B, C _A (i), C _B (i) are the centers of a certain dimension of A and B, respectively, and L1 represents the Manhattan distance formula, std _A (i) represents the standard deviation of a certain dimension of cluster A, and std _B (i) represents the standard deviation of a certain dimension of cluster B;

${std std}_{A A} ((i i)) = = \sqrt{\frac{11}{| | A A | |} {Σ Σ}_{j j = = 11}^{| | A A | |} {(({a a}_{j j} ((i i)) - - {C C}_{A A} ((i i))))}^{22}},, i i = = 11,, ... ...,, n no - - - - - - ((44))$

两个描述子a和b的L₁距离如下：The L1 distance of two descriptors _a and b is as follows:

${L L}_{11} ((a a,, b b)) = = {Σ Σ}_{i i = = 11}^{n no} | | a a ((i i)) - - b b ((i i)) | | - - - - - - ((55)),,$

其中，n代表特征描述子的大小，即RoPS的维度135；Among them, n represents the size of the feature descriptor, that is, the dimension of RoPS is 135;

a_j(i)代表A聚类中第j个关键点的RoPS描述子的i维度的值；a _j (i) represents the value of the i dimension of the RoPS descriptor of the jth key point in the A cluster;

|A|代表聚类A中关键点的数量；|A| represents the number of key points in cluster A;

|B|代表聚类B中关键点的数量。|B| represents the number of keypoints in cluster B.

按上述方案，所述步骤4)中，具体如下：选取合适的位置放置机械臂，建立坐标系R，坐标系K原点在坐标系R中的坐标为(d,l,h),利用PCA法建立物体坐标系O，经过用坐标系O到坐标系K再到坐标系R两次坐标系变换得出物体的姿态；从坐标系K下的坐标进行坐标变换得到R坐标系下的姿态信息，求出R坐标系下语义描述文件对应的位姿信息，再生产XML语义地图。According to the above scheme, in the step 4), the details are as follows: select a suitable position to place the mechanical arm, establish a coordinate system R, the coordinates of the origin of the coordinate system K in the coordinate system R are (d, l, h), and use the PCA method The object coordinate system O is established, and the attitude of the object is obtained through two coordinate system transformations from the coordinate system O to the coordinate system K and then to the coordinate system R; the coordinate transformation is performed from the coordinates under the coordinate system K to obtain the attitude information under the R coordinate system, Calculate the pose information corresponding to the semantic description file in the R coordinate system, and reproduce the XML semantic map.

按上述方案，所述步骤5)语音识别过程具体包括如下步骤：According to the above scheme, described step 5) speech recognition process specifically includes the following steps:

5.1)预处理：通过麦克风阵列采集用户语音信息，对输入的原始语音信号进行处理，滤除掉其中的不重要的信息以及背景噪声，并进行语音信号的端点检测、语音分帧以及预加重处理；5.1) Preprocessing: collect user voice information through the microphone array, process the input original voice signal, filter out the unimportant information and background noise, and perform endpoint detection, voice framing and pre-emphasis processing of the voice signal ;

5.2)特征提取：提取出反映语音信号特征的关键特征参数形成特征矢量序列；5.2) feature extraction: extract key feature parameters that reflect the characteristics of the speech signal to form a feature vector sequence;

5.3)采用隐马尔科夫模型(HMM)进行声学模型建模，在识别的过程中将待识别的语音与声学模型进行匹配，从而获取识别结果；5.3) The hidden Markov model (HMM) is used for acoustic model modeling, and the speech to be recognized is matched with the acoustic model during the recognition process to obtain the recognition result;

5.4)对训练文本数据库进行语法、语义分析，经过基于统计模型训练得到N-Gram语言模型，从而提高识别率，减少搜索范围。5.4) Perform grammatical and semantic analysis on the training text database, and obtain an N-Gram language model through statistical model training, thereby improving the recognition rate and reducing the search range.

5.5)针对输入的语音信号，根据己经训练好的HMM声学模型、语言模型及字典建立一个识别网络，根据搜索算法在该网络中寻找最佳的一条路径，这个路径就是能够以最大概率输出该语音信号的词串，从而确定这个语音样本所包含的文字。5.5) For the input speech signal, build a recognition network according to the trained HMM acoustic model, language model and dictionary, and search for the best path in the network according to the search algorithm. The word string of the speech signal, so as to determine the text contained in the speech sample.

本发明产生的有益效果是：通过识别物体的位置解决了传统自动化设备，产品位置限定范围太小的缺点；同时语音与物体位置信息的结合能在服务行业中有所应用；The beneficial effects produced by the present invention are: by identifying the position of the object, the shortcomings of the traditional automation equipment and the limited range of the product position are solved; at the same time, the combination of voice and object position information can be applied in the service industry;

附图说明Description of drawings

下面将结合附图及实施例对本发明作进一步说明，附图中：The present invention will be further described below in conjunction with accompanying drawing and embodiment, in the accompanying drawing:

图1是kinect传感器模型；以及K坐标系示意图；Fig. 1 is a kinect sensor model; and a schematic diagram of the K coordinate system;

图2是K坐标系和地面对比的示意图；Figure 2 is a schematic diagram of the comparison between the K coordinate system and the ground;

图3是物体识别整体流程图；Figure 3 is an overall flowchart of object recognition;

图4是特征描述子流程图；Fig. 4 is a feature description sub-flow chart;

图5是K坐标系和R坐标系的关系示意图；Fig. 5 is a schematic diagram of the relationship between the K coordinate system and the R coordinate system;

图6是物体位姿求取整体流程图；Fig. 6 is an overall flow chart of object pose calculation;

图7是语音交互整体流程图；Figure 7 is an overall flow chart of voice interaction;

图8是系统整体框图。Figure 8 is a block diagram of the overall system.

具体实施方式detailed description

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅用以解释本发明，并不用于限定本发明。In order to make the object, technical solution and advantages of the present invention more clear, the present invention will be further described in detail below in conjunction with the examples. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

如图1所示，一种基于Kincet和语音的人机交互方法，包括以下两个部分：As shown in Figure 1, a human-computer interaction method based on Kincet and voice includes the following two parts:

第一部分场景交互，其中包括以下步骤：The first part of scene interaction includes the following steps:

步骤一、正确安放Kinect，建立K坐标系；Step 1. Place the Kinect correctly and establish the K coordinate system;

将Kinect放在物体的正对面，Kinect探测范围为1.8～3.6米，水平视野为53°垂直视野为47°，物体对应该保证摆设的物体在范围之内确保Kinect能正确采集数据。然后如图1所示建立以kinect的中心为原点的坐标系K，Kinect与地面关系如图2，其中z轴与水平面的夹角为θ。Put the Kinect directly opposite the object. The detection range of the Kinect is 1.8-3.6 meters, the horizontal field of view is 53° and the vertical field of view is 47°. Objects should be placed within the range to ensure that the Kinect can collect data correctly. Then establish a coordinate system K with the center of the kinect as the origin as shown in Figure 1, and the relationship between Kinect and the ground is shown in Figure 2, where the angle between the z-axis and the horizontal plane is θ.

步骤二、Kinect传感器完成目标检测与识别；Step 2, the Kinect sensor completes target detection and recognition;

Kinect分别采集深度图像和RGB经过融合处理后得到三维点云数据；Kinect collects depth images and RGB respectively to obtain 3D point cloud data after fusion processing;

首先经过预处理滤除距离传感器过远或过近的点云数据，这样可以有效降低计算成本，提高处理速度，改善系统实时性。First, the point cloud data that is too far or too close to the sensor is filtered out through preprocessing, which can effectively reduce computing costs, increase processing speed, and improve system real-time performance.

预处理之后，选择ISS算法进行特征点检测。然后对检测到特征点以S/C-RoPS算法进行特征描述。再通过物体特征数据库进行特征匹配得设别出物体到语义描述文件。After preprocessing, select the ISS algorithm for feature point detection. Then, the detected feature points are characterized by the S/C-RoPS algorithm. Then carry out feature matching through the object feature database to identify the object and transfer it to the semantic description file.

点云数据采集流程如图3。The point cloud data acquisition process is shown in Figure 3.

下面详细叙述提取关键点、计算特征描述子和3D特征匹配三个步骤。The three steps of extracting key points, calculating feature descriptors and 3D feature matching are described in detail below.

其中关键点提取的具体过程如下：The specific process of key point extraction is as follows:

(1)查询输入点云数据中每一个点p_i半径r_frame内所有点，并按照公式1计算权重(1) Query all points within the radius r _frame of each point p _i in the input point cloud data, and calculate the weight according to formula 1

(2)根据权重按照公式2计算协方差矩阵；(2) Calculate the covariance matrix according to formula 2 according to the weight;

(3)计算协方差矩阵的特征值并将特征值按照从大到小顺序排列；(3) Calculate the eigenvalues of the covariance matrix And arrange the eigenvalues in descending order;

(4)设置比率阈值γ₂₁和γ₃₂保留满足和的点集，这些点即为关键特征点。(4) Set ratio thresholds γ ₂₁ and γ ₃₂ to keep satisfying and The set of points, these points are the key feature points.

其中特征描述子的计算方法如下：The calculation method of the feature descriptor is as follows:

首先通过计算位于关键点邻域局部表面的点的协方差矩阵来构建一个独特的、明确的和稳定的局部参考坐标系(LRF)，以关键点作为起始点，旋转局部表面直到LRF与Ox，Oy和Oz轴对齐，这样可以使点具有旋转不变性；然后对每个轴Ox，Oy，Oz执行如下几步，我们把这些轴作为当前轴：Firstly, a unique, clear and stable local reference frame (LRF) is constructed by calculating the covariance matrix of points located on the local surface in the neighborhood of the key point, with the key point as the starting point, rotating the local surface until LRF and Ox, The Oy and Oz axes are aligned so that the point has rotation invariance; then perform the following steps for each axis Ox, Oy, Oz, we take these axes as the current axis:

1)局部表面以指定角度绕当前轴旋转；1) The local surface rotates around the current axis at a specified angle;

2)被旋转的局部表面点投影到XY，XZ和YZ平面上；2) The rotated local surface points are projected onto the XY, XZ and YZ planes;

3)建立投影分布矩阵，这个矩阵仅仅显示每个子域包含的点的数量，子域的数量代表矩阵的维数，和指定角度一样它也是本算法的一个参量；3) Establish a projection distribution matrix, which only shows the number of points contained in each subfield, and the number of subfields represents the dimension of the matrix, which is also a parameter of the algorithm the same as the specified angle;

4)计算分布矩阵中心距，即μ₁₁、μ₂₁、μ₁₂、μ₂₂和e；4) Calculate the center distance of the distribution matrix, namely μ ₁₁ , μ ₂₁ , μ ₁₂ , μ ₂₂ and e;

5)计算的值级联组成子特征。5) The calculated values are concatenated to form sub-features.

循环执行这几步多次，迭代次数取决于给定的旋转的数目。最后，将不同坐标轴的子特征级联形成最终的RoPS描述子。The loop executes these steps multiple times, the number of iterations depends on the number of rotations given. Finally, the sub-features of different coordinate axes are concatenated to form the final RoPS descriptor.

将局部表面的形状或颜色信息加入RoPS，对编码信息进行扩展和改进，生成一种S/C-RoPS描述子，算法的框图如图4所示，特征匹配的准确度得到了优化。Add the shape or color information of the local surface to RoPS, expand and improve the coding information, and generate a S/C-RoPS descriptor. The block diagram of the algorithm is shown in Figure 4, and the accuracy of feature matching is optimized.

本专利采用一种基于置信度的决策层融合算法对S-RoPS描述子和C-RoPS描述子进行数据信息融合。具体思路是单独使用S-RoPS或C-RoPS描述子进行物体识别，这样可以获得每个单模式方法下的最高置信度，融合策略是对两种独立方法所生成的所有候选模型结果的置信度进行比较，选择具有最高置信度的候选模型。This patent adopts a confidence-based decision-level fusion algorithm to fuse the data information of the S-RoPS descriptor and the C-RoPS descriptor. The specific idea is to use S-RoPS or C-RoPS descriptors alone for object recognition, so that the highest confidence under each single-mode method can be obtained, and the fusion strategy is the confidence of all candidate model results generated by two independent methods A comparison is made and the candidate model with the highest confidence is selected.

其中特征值匹配方法如下：The eigenvalue matching method is as follows:

本专利中使用基于阀值的特征匹配法。基于阈值的匹配模式下，如果两个描述子之间的距离小于设定的阈值，则表明两个特征一致匹配。In this patent, a threshold-based feature matching method is used. In the threshold-based matching mode, if the distance between two descriptors is less than the set threshold, it indicates that the two features match consistently.

阀值所使用的距离公式为表征两个物体聚类之间的差异(一个聚类由多个描述子集合构成)，即两个集合的几何中心加上它们每一维度的标准偏差的曼哈顿距离之和如式3和式5：The distance formula used by the threshold is to characterize the difference between two object clusters (a cluster is composed of multiple descriptor sets), that is, the geometric center of the two sets plus the Manhattan distance of the standard deviation of each dimension The sum is as formula 3 and formula 5:

std_B的计算与std_A类似，n代表特征描述子的大小The calculation of std _B is similar to std _A , n represents the size of the feature descriptor

${L L}_{11} ((a a,, b b)) = = {Σ Σ}_{i i = = 11}^{n no} | | a a ((i i)) - - b b ((i i)) | | - - - - - - ((55))$

步骤三、选取合适的位置机械臂，并建立坐标系R,求取K坐标系下的位姿；通过坐标变换和坐标系变换将K下的位置和姿态信息转换为坐标系R下的坐标和姿态信息(物体坐标系O是为了求取姿态而产生的临时变量，没有实际意义非原点的点故而是K到R而非O到R)，生产XML语义地图。Step 3. Select the appropriate position of the robot arm, establish the coordinate system R, and obtain the pose in the K coordinate system; through coordinate transformation and coordinate system transformation, convert the position and attitude information in K into coordinates and Attitude information (the object coordinate system O is a temporary variable generated for the purpose of obtaining the attitude. It has no practical meaning and is a point other than the origin, so it is from K to R instead of O to R), and an XML semantic map is produced.

选取合适的位置放置机械臂，如图5所示建立坐标系R，坐标系K原点在坐标系R中的坐标为(d,l,h),利用PCA法建立物体坐标系O，经过两次坐标系变换，以及一次对K坐标系下的坐标变换，求出R坐标系下对应的位姿信息。再生产XML语义地图。具体流程如图6。Select a suitable position to place the robot arm, establish the coordinate system R as shown in Figure 5, the coordinates of the origin of the coordinate system K in the coordinate system R are (d, l, h), use the PCA method to establish the object coordinate system O, after two Coordinate system transformation, and a coordinate transformation in the K coordinate system to obtain the corresponding pose information in the R coordinate system. Reproduction of XML Semantic Maps. The specific process is shown in Figure 6.

1)计算物体点云的几何中心i代表点数量，对所有点集去中心化将去中心化后的所有点的坐标排列成3×N的矩阵1) Calculate the geometric center of the object point cloud i represents the number of points, decentralized for all point sets Arrange the coordinates of all points after decentralization into a 3×N matrix

$A A = = [\begin{matrix} {x x}_{11} & {x x}_{22} & ... ... & {x x}_{n no} \\ {y the y}_{11} & {y the y}_{22} & ... ... & {y the y}_{n no} \\ {z z}_{11} & {z z}_{22} & ... ... & {z z}_{n no} \end{matrix}];; - - - - - - ((66))$

2)令M＝A·A^T，计算M的特征值与特征向量：λ_i·V_i＝M·V_i,i＝1,2,3，并将特征向量正规化||V_i||＝1，物体的长轴方向对应最大特征值的特征向量，设λ₁≤λ₂≤λ₃，则可得物体坐标系相对坐标系K的旋转矩阵平移矩阵即为物体点云的几何中心则物体坐标系在坐标系K下的位姿如式7：2) Let M=A· ^AT , calculate the eigenvalues and eigenvectors of M: λ _i ·V _i =M·V _i , i=1,2,3, and normalize the eigenvectors ||V _i || =1, the long axis direction of the object corresponds to the eigenvector of the largest eigenvalue, if λ ₁ ≤λ ₂ ≤λ ₃ , then the rotation matrix of the object coordinate system relative to the coordinate system K can be obtained The translation matrix is the geometric center of the object point cloud Then the pose of the object coordinate system under the coordinate system K is as in formula 7:

${T T}_{mod mod}^{c c a a m m} = = [\begin{matrix} {R R}_{mod mod}^{c c a a m m} & {P P}_{mod mod}^{c c a a m m} \\ 00_{11 \times \times 33} & 11 \end{matrix}] - - - - - - ((77))$

设^camC＝{P_i}，则 ^modC表示模型库物体坐标系下的点云，依据长轴和中心点，确定短轴和次长轴平面，再依据平面点的极值分布确定短轴和次长轴方向。Suppose ^cam C＝{P _i }, then ^mod C represents the point cloud under the object coordinate system of the model library. According to the major axis and the center point, determine the plane of the minor axis and the second major axis, and then determine the direction of the minor axis and the minor axis according to the extreme value distribution of the plane points.

在匹配阶段，为了得到实际物体到模型库物体的变换矩阵，采用三点法计算六自由度位姿，对两相对应的三维点集合{^modP}，{^objP}，若满足刚体变换关系其中为两点集的旋转矩阵和平移向量，利用最小二乘法求解最优解，得到使公式8中E最小时的和 In the matching stage, in order to obtain the transformation matrix from the actual object to the model library object, the three-point method is used to calculate the six-degree-of-freedom pose. For two corresponding three-dimensional point sets { ^mod P}, { ^obj P}, if the rigid body transformation relationship is satisfied in is the rotation matrix and translation vector of the two point sets, using the least squares method to solve the optimal solution, and obtain the minimum E in formula 8 and

$E E. = = {Σ Σ}_{i i = = 11}^{n no} {| | (({R R}_{o o b b j j}^{mod mod} \cdot &Center Dot; {P P}_{o o b b j j} + + {t t}_{o o b b j j}^{mod mod})) - - {P P}_{mod mod} | |}^{22} - - - - - - ((88))$

则实际物体到模型库物体的变换矩阵如式9：Then the transformation matrix from the actual object to the model library object is shown in formula 9:

${T T}_{o o b b j j}^{mod mod} = = [\begin{matrix} {R R}_{o o b b j j}^{mod mod} & {t t}_{o o b b j j}^{mod mod} \\ 00_{11 \times \times 33} & 11 \end{matrix}] - - - - - - ((99))$

实际物体到传感器坐标系的位姿矩阵如式10：The pose matrix from the actual object to the sensor coordinate system is shown in Formula 10:

${T T}_{o o b b j j}^{c c a a m m} = = {T T}_{o o b b j j}^{mod mod} \cdot &Center Dot; {T T}_{mod mod}^{c c a a m m} - - - - - - ((1010))$

旋转矩阵可转化为偏转角α、俯仰角β、翻滚角γ描述其姿态如公式11，平移矩阵可转化为中心坐标描述其位置。The rotation matrix can be converted into yaw angle α, pitch angle β, and roll angle γ to describe its attitude as in formula 11, and the translation matrix can be converted into center coordinates to describe its position.

$β β = = A A t t a a n no 22 ((- - {r r}_{3131},, \sqrt{{r r}_{1111}^{22} + + {r r}_{21 twenty one}^{22}}))$

$α α = = A A t t a a n no 22 ((\frac{{r r}_{21 twenty one}}{cos cos β β},, \frac{{r r}_{1111}}{cos cos β β}))$

$γ γ = = A A t t a a n no 22 ((\frac{{r r}_{3232}}{cos cos β β},, \frac{{r r}_{3333}}{cos cos β β})) - - - - - - ((1111))$

其中r_ij代表旋转矩阵i行j列对应的元素。Where r _ij represents the element corresponding to row j of the rotation matrix i.

坐标系R与坐标系K的关系如图5，两者的变换矩阵如式12。The relationship between the coordinate system R and the coordinate system K is shown in Figure 5, and the transformation matrix of the two is shown in Formula 12.

$[\begin{matrix} x x \\ y the y \\ z z \end{matrix}] = = [\begin{matrix} 00 & s the s i i n no θ θ & - - c c o o s the s θ θ \\ 11 & 00 & 00 \\ 00 & - - cos cos θ θ & - - sin sin θ θ \end{matrix}] [\begin{matrix} {x x}_{k k} \\ {y the y}_{k k} \\ {z z}_{k k} \end{matrix}] + + [\begin{matrix} d d \\ l l \\ h h \end{matrix}] - - - - - - ((1212))$

其中，θ代表Kinect相对水平面的倾斜角，{x,y,z}为物体在坐标系R下的坐标值，{x_k,y_k,z_k}为物体在坐标系K下的坐标值。Among them, θ represents the inclination angle of Kinect relative to the horizontal plane, {x, y, z} is the coordinate value of the object in the coordinate system R, and {x _k , y _k , z _k } is the coordinate value of the object in the coordinate system K.

物体到坐标系R的姿态矩阵如下；The attitude matrix of the object to the coordinate system R is as follows;

${T T}_{o o b b j j}^{r r o o b b} = = {T T}_{c c a a m m}^{r r o o b b} \cdot &Center Dot; {T T}_{o o b b j j}^{c c a a m m} - - - - - - ((1313))$

其中in

${T T}_{c c a a m m}^{r r o o b b} = = [\begin{matrix} 00 & sin sin θ θ & - - c c o o s the s θ θ \\ 11 & 00 & 00 \\ 00 & - - sin sin θ θ & - - c c o o s the s θ θ \end{matrix}] - - - - - - ((1414))$

第二部分、语音人机交互，其中包括以下步骤：The second part, voice human-computer interaction, includes the following steps:

步骤一、用户发出语音命令，经处理将其转变为文本信息。Step 1. The user issues a voice command, which is converted into text information after processing.

接收到用户的语音后，经过预处理和语音解码最终得到文本信息，具体流程如图7：After receiving the user's voice, the text information is finally obtained after preprocessing and voice decoding. The specific process is shown in Figure 7:

步骤二、将文本信息和XML语义地图输入智能推理机，推理机产生执行指令并输出文本信息；Step 2. Input the text information and XML semantic map into the intelligent inference engine, and the inference engine generates execution instructions and outputs text information;

用户通过语音控制三维地图实时生成模块构建当前场景的语义地图文件，语音识别和语音合成节点分别通过发送和接收文本来实现人机对话，智能推理机节点则结合地图文件进行分析和反馈信息，通过深度对话完善用户期望最终生成解决方案并发送给方案解析与运动规划模块。语音识别使用的是PocketSphnix开源语音识别系统，语音合成使用的是Ekho开源语音合成系统。The user controls the real-time generation module of the 3D map by voice to construct the semantic map file of the current scene. The speech recognition and speech synthesis nodes realize man-machine dialogue by sending and receiving text respectively, and the intelligent reasoning machine node combines the map file for analysis and feedback information. In-depth dialogue improves user expectations and finally generates a solution and sends it to the solution analysis and motion planning module. Speech recognition uses the PocketSphnix open source speech recognition system, and speech synthesis uses the Ekho open source speech synthesis system.

应当理解的是，对本领域普通技术人员来说，可以根据上述说明加以改进或变换，而所有这些改进和变换都应属于本发明所附权利要求的保护范围。It should be understood that those skilled in the art can make improvements or changes based on the above description, and all these improvements and changes should belong to the protection scope of the appended claims of the present invention.

Claims

1. one kind based on Kincet and the man-machine interaction method of voice, it is characterised in that comprise the following steps:

1) use Kinect sensor obtain each object in scene at coordinate system K exact space position and attitude information, complete Object detection and recognition；Described coordinate system K is with kinect geometric center as initial point, to be perpendicular to camera lens outwardly direction as Z Axle positive direction, with the line in the center of circle of tri-camera lenses of Kincet as X-axis, crosses and sets up coordinate system；

2) depth image gathered respectively by Kinect and RGB obtain three dimensional point cloud after fusion treatment；

3) spatial point cloud object identification: three dimensional point cloud is carried out process and obtains semantic description file；

4) object coordinates system O is carried out the three-dimensional scenic semanteme map that coordinate transform obtains under coordinate system R and describes file；

5) receive user speech input, input signal is processed, obtains text message；

6) text message and XML semanteme map being inputted intellgence reasoning machine, inference machine produces and performs instruction and export user's Reply and the text message of guidance information.

The most according to claim 1 based on Kincet with the man-machine interaction method of voice, it is characterised in that described step 3) Spatial point cloud object identification process includes that pretreatment, key point are extracted, described son extraction, then is carried out by object features data base Characteristic matching, finally obtains semantic description file.

The most according to claim 1 based on Kincet with the man-machine interaction method of voice, it is characterised in that described step 3) In:

3.1) pretreatment, described pre-treatment step is for filtering the cloud data that range sensor is too far away or too close；

3.2) using ISS algorithm that cloud data is carried out feature point detection, detailed process is as follows:

3.2.1) each some p in inquiry input cloud data_iRadius r_frameInterior had a p_j, and calculate weight according to formula 1；

3.2.2) covariance matrix is calculated according to weight according to formula 2

C O V (p_{i}) = \underset{| p_{i} - p_{j} | < r_{f r a m e}}{Σ} w_{i j} (p_{i} - p_{j}) {(p_{i} - p_{j})}^{T} / \underset{| p_{i} - p_{j} | < r_{f r a m e}}{Σ} w_{i j} - - - (2)

3.2.3) eigenvalue of covariance matrix is calculatedAnd eigenvalue is arranged according to descending order；

3.2.4) rate threshold γ is set₂₁And γ₃₂, retain and meetWithPoint set, these point be Key feature points；

3.3) Feature Descriptor of key feature points calculates, and concrete grammar is as follows:

First pass through calculate the covariance matrix of point being positioned at key point neighborhood local surfaces build one unique, clear and definite With stable local referential system LRF, using key point as starting point, rotate local surfaces until LRF and object coordinates system O Ox, Oy and Oz axle alignment, so can make that a little there is rotational invariance；

Then to each axle Ox, Oy, Oz perform following several steps, we using these axles as current axis:

3.3.1) local surfaces rotates around current axis with specified angle；

3.3.2) the local surfaces spot projection rotated is in XY, XZ and YZ plane；

3.3.3) setting up projective distribution matrix, this matrix only shows the quantity of the point that each subdomain comprises, the quantity of subdomain Represent the dimension of matrix, the same with specified angle it be also the parameter of this algorithm；

3.3.4) distribution matrix centre-to-centre spacing, i.e. μ are calculated₁₁、μ₂₁、μ₁₂、μ₂₂And e；

3.3.5) the value cascade composition subcharacter calculated；

Circulation performs above-mentioned steps, and iterations depends on the number of the rotation given；Finally, by the subcharacter of different coordinate axess Cascade forms final RoPS and describes son；

3.4) eigenvalue coupling, concrete grammar is as follows:

This patent uses characteristic matching method based on threshold values, under match pattern based on threshold value, if two describe between son Distance less than set threshold value, then show that two features are unanimously mated；

The range formula that threshold values is used is to characterize the difference between two object clusters, and the geometric center of i.e. two set adds The manhatton distance sum of the standard deviation of they every dimension, such as formula (3) and formula (5):

D (A, B)=L₁(C_A,C_B)+L₁(std_A,std_B) (3)

Wherein, D (A, B) represents the range difference of two i.e. A and B of object cluster, C_A(i),C_BI () is respectively in A, B dimension The heart, L1 represents manhatton distance formula, std_AI () represents the standard deviation of cluster A dimension, std_BI () represents cluster B The standard deviation of dimension；

{std}_{A} (i) = \sqrt{\frac{1}{| A |} Σ_{j = 1}^{| A |} {(a_{j} (i) - C_{A} (i))}^{2}}, i = 1, ..., n - - - (4)

N representative feature describes the size of son；

Two L describing sub-a and b₁Distance is as follows:

L_{1} (a, b) = Σ_{i = 1}^{n} | a (i) - b (i) | - - - (5),

a_jI () represents the RoPS of jth key point in A cluster and describes the value of the i dimension of son；

| A | represents the quantity of key point in cluster A；

| B | represents the quantity of key point in cluster B.

The most according to claim 1 based on Kincet with the man-machine interaction method of voice, it is characterised in that described step 4) In, specific as follows: to choose suitable position placement machine mechanical arm, set up coordinate system R, coordinate system O initial point coordinate in coordinate system R For (d, l h), utilize PCA method to set up object coordinates system O, through twice coordinate transform, obtain semantic description file under R coordinate system Corresponding posture information, reproduction XML semanteme map.

The most according to claim 1 based on Kincet with the man-machine interaction method of voice, it is characterised in that described step 5) Speech recognition process specifically includes following steps:

5.1) pretreatment: gather user speech information by microphone array, processes the primary speech signal of input, filter Remove unessential information therein and background noise, the end-point detection of lang tone signal of going forward side by side, voice framing and pre-add Heavily process；

5.2) feature extraction: the key characterization parameter extracting reflection phonic signal character forms feature vector sequence；

5.3) HMM (HMM) is used to carry out acoustic model modeling, by voice to be identified during identifying Mate with acoustic model, thus obtain recognition result；

5.4) training text data base is carried out grammer, semantic analysis, through obtaining N-Gram language based on statistical model training Model, thus improve discrimination, reduce hunting zone.

5.5) for the voice signal of input, one is set up according to oneself trained good HMM acoustic model, language model and dictionary Identifying network, find an optimal paths according to searching algorithm in the network, this path is exactly can be with maximum of probability Export the word string of this voice signal, so that it is determined that the word that this speech samples is comprised.