CN110598510B - Vehicle-mounted gesture interaction technology - Google Patents

Vehicle-mounted gesture interaction technology

Info

Publication number
CN110598510B
CN110598510B (application CN201810606708.XA)
Authority
CN
China
Prior art keywords
point
value
depth
neighborhood
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810606708.XA
Other languages
Chinese (zh)
Other versions
CN110598510A (en)
Inventor
Zhou Qinna (周秦娜)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Point Cloud Intelligent Technology Co ltd
Original Assignee
Shenzhen Point Cloud Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Point Cloud Intelligent Technology Co ltd filed Critical Shenzhen Point Cloud Intelligent Technology Co ltd
Priority to CN201810606708.XA priority Critical patent/CN110598510B/en
Publication of CN110598510A publication Critical patent/CN110598510A/en
Application granted granted Critical
Publication of CN110598510B publication Critical patent/CN110598510B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

A vehicle-mounted gesture interaction technology, comprising the following steps: (1) identifying a moving object using an improved moving object detection algorithm; (2) judging whether the moving object identified in step (1) is a human palm by using a gesture recognition control method. The improved moving object detection algorithm comprises the following steps: 2.1 initializing; 2.2 detecting whether a pixel point is a motion point; 2.3 carrying out kmeans clustering on the motion points; 2.4 region growing and extracting a region; and 2.5 updating the pixel points. The gesture recognition control method comprises the following steps: 3.1 feature selection and model training; 3.2 judging whether the target image is a human palm. The feature selection and model training comprises the following steps: 3.1.1 collecting training data; 3.1.2 selecting sample points from the data to be trained; 3.1.3 calculating optimal division values for all the sample points; and 3.1.4 establishing a random forest corresponding to the sample points based on the optimal division value calculation result.

Description

Vehicle-mounted gesture interaction technology
Technical Field
The invention relates to the field of image recognition and processing, in particular to a vehicle-mounted gesture interaction technology.
Background
With the progress of science and technology, automobiles offer more and more functions, their internal information systems are becoming more complex, and operating them is becoming more demanding for users. Operating conventional buttons and touch screens requires the driver's eyes and hands at the same time, which affects driving safety. Voice interaction is quick, but because a running vehicle is noisy and full of interference, voice recognition is often not accurate enough.
Inside the automobile, using gestures to interact with the car, compared with traditional buttons or voice interaction, has the advantages of being quick, accurate, safe and highly resistant to interference.
Traditional vehicle-mounted gesture interaction technology uses an RGB camera and locates the hand through its skin color, but this method has limitations: dark skin, dim lighting or night-time conditions, or the color of the seats inside the vehicle can strongly interfere with gesture recognition by an RGB camera. The invention adopts a depth camera, detects moving objects based on the basic principle of motion detection, takes the motion detection algorithm of the traditional RGB camera as a reference, improves on that basis, and can detect moving objects better.
The current common scheme for extracting the palm with a depth camera is based on a depth threshold: any object farther than a given distance value is discarded. In actual driving, however, the user's hands are usually located at or below the middle of the steering wheel, and the user cannot be required to raise the hand above the steering wheel when using the product; in addition, the movement of the user and of other moving objects causes interference, so determining whether a given moving object is a person's palm is difficult. In the invention, the camera shoots from the top down; objects that move during driving include the steering wheel, the human body, the head and the shoulders, and the human hand can appear at any position in the camera's field of view.
Drawings
The specification and the drawings show the main steps of the technical scheme.
Fig. 1 shows two general parts of the present technical solution, providing a vehicle gesture interaction technique: identifying a moving object using an improved moving object detection algorithm; and then judging whether the identified moving object is a human palm or not by using a gesture identification control method.
Fig. 2 shows the main steps of the improved moving object detection algorithm used in the present solution to identify moving objects.
Fig. 3 shows a main step of determining whether a recognized moving object is a human palm according to a gesture recognition control method used in the present technical solution.
Fig. 4 shows the main steps of feature selection and model training; and
Fig. 5 shows the main steps of determining whether the target image is a human palm.
Disclosure of Invention
The invention provides a vehicle-mounted gesture interaction technology, which comprises the following steps: (1) identifying a moving object using an improved moving object detection algorithm; (2) judging whether the moving object identified in step (1) is a human palm by using a gesture recognition control method.
The moving object detection algorithm comprises the following steps: initializing; detecting whether the pixel point is a motion point or not; carrying out kmeans clustering on the motion points; growing a region; extracting a region; and updating the pixel point.
The gesture recognition control method comprises the following steps: feature selection and model training; and judging whether the target image is a human palm.
Further, the gesture recognition control method forms part of the pixel-updating step of the moving object detection algorithm: it judges whether the target image is a human hand; if so, the history record information of the pixel points is updated, and if not, the information is kept unchanged. This increases the variation of the depth information at the motion points, so that moving pixel points can be extracted more effectively next time.
The invention combines a depth camera with gesture technology; the depth camera overcomes interference from illumination, skin color and ornaments inside the vehicle. The motion detection algorithm of the traditional RGB camera is taken as a reference and improved upon, so that moving objects can be detected better. In this method the depth camera shoots downward from the vehicle roof, and the gesture recognition control method uses machine-learning random forest training with an improved feature selection step for the decision trees, in order to judge whether a target image is a hand.
Detailed Description
In order to further explain the technical scheme, the depth camera is combined with the gesture technology, an improved moving object detection algorithm is used for detecting a moving object, then a gesture recognition control method is used for judging whether the recognized moving object is a human palm, and the specific implementation mode is described below with reference to the attached drawings.
Fig. 1 shows two general parts of the present technical solution, providing a vehicle gesture interaction technique: identifying a moving object using an improved moving object detection algorithm; and then judging whether the identified moving object is a human palm or not by using a gesture identification control method.
Fig. 2 shows the main steps of the improved moving object detection algorithm used in the present solution to identify moving objects.
Further, the improved moving object detection algorithm is characterized in that, in step 2.1, several tens of consecutive frames of depth maps are obtained through the depth camera, and a history record library is created for each pixel point.
Further, the improved moving object detection algorithm is characterized in that, in step 2.2, for each pixel point of each frame obtained by the camera, whether the point is a moving pixel point is detected, which specifically includes the following steps: 2.2.1 setting a counter a to 0; 2.2.2 calculating the difference between the current depth value of the pixel point and a depth value in the history record library, and adding 1 to the counter a if the difference is larger than a certain set threshold value; 2.2.3 after step 2.2.2 has been performed for every history entry of the pixel in the history record library, setting the pixel as a motion point if the value of the counter a is greater than a threshold value.
Preferably, the certain set threshold is not uniquely fixed and can be adjusted according to actual needs; the threshold value compared with the value of the counter a is not uniquely fixed and can be adjusted according to actual needs.
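By way of illustration, steps 2.1 and 2.2 could be sketched in Python roughly as follows, assuming the depth maps are available as NumPy arrays; the function names and the two threshold values are illustrative placeholders, not values taken from this description.

import numpy as np

def init_history(first_frames):
    # Step 2.1: build the history record library from several tens of
    # consecutive depth frames -> one (H, W, N) array, N values per pixel.
    return np.stack(first_frames, axis=-1).astype(np.float32)

def detect_motion_points(depth_frame, history, diff_threshold=30.0, count_threshold=10):
    # Step 2.2: for every pixel, counter a counts how many history entries
    # differ from the current depth value by more than diff_threshold.
    diffs = np.abs(history - depth_frame[..., None])      # (H, W, N)
    counter_a = (diffs > diff_threshold).sum(axis=-1)     # (H, W)
    # A pixel whose counter exceeds count_threshold is marked as a motion point.
    return counter_a > count_threshold                    # boolean motion mask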
Further, the improved moving object detection algorithm is characterized in that, after all the motion points are obtained, step 2.3 is executed to perform kmeans clustering on all the motion points, which specifically comprises the following steps: 2.3.1 randomly selecting a part of the pixel points from all the pixel points as initial cluster centers; 2.3.2 for each of the remaining pixel points, assigning it to the cluster most similar to it according to its similarity (distance) to the cluster centers of step 2.3.1; 2.3.3 recalculating the cluster center of each new cluster obtained, i.e. calculating the mean of all objects in the new cluster; 2.3.4 calculating a criterion function; if the function converges, the algorithm terminates, otherwise steps 2.3.2, 2.3.3 and 2.3.4 are executed recursively, yielding a set of categories; 2.3.5 each category obtained in step 2.3.4 has a pixel center point and the motion pixel points corresponding to it; a threshold for the number of category elements is set, categories that do not reach the threshold are removed, and a category serial number is then assigned to each motion point.
Preferably, the threshold value of the number of the category elements is not uniquely fixed, and can be adjusted according to actual needs.
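A minimal sketch of the kmeans clustering of step 2.3 is shown below. It assumes the motion points are clustered by their pixel coordinates and uses scikit-learn's standard k-means, since this description does not name a particular implementation; the cluster count and the element-count threshold are placeholder values.

import numpy as np
from sklearn.cluster import KMeans

def cluster_motion_points(motion_mask, k=8, min_cluster_size=50):
    # Collect the coordinates of all motion points.
    ys, xs = np.nonzero(motion_mask)
    points = np.column_stack([xs, ys]).astype(np.float32)
    if len(points) < k:
        return np.full(len(points), -1), points

    # Steps 2.3.1-2.3.4: standard k-means until the criterion function converges.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(points)

    # Step 2.3.5: drop categories whose element count does not reach the
    # threshold, then assign consecutive category serial numbers.
    keep = [c for c in range(k) if np.sum(labels == c) >= min_cluster_size]
    remap = {c: i for i, c in enumerate(keep)}
    labels = np.array([remap.get(int(l), -1) for l in labels])   # -1 = discarded
    return labels, points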
Further, the improved moving object detection algorithm is characterized in that, after the categories are obtained, step 2.4 is executed to perform region growing, which specifically includes the following steps: 2.4.1 comparing the depth value of a pixel point that has already been detected as a motion point with that of a nearby new pixel point to be detected; if the difference between the depth values is smaller than a set threshold value, the new pixel point is similar to the pixel point already detected as a motion point, and the new pixel point is therefore also set as a motion point; 2.4.2 according to step 2.4.1, if the new pixel point is judged to be a new motion point in two categories, the two categories have similar attributes, so the two categories are merged and given the same category serial number; this continues until all motion point detection is completed.
Preferably, the set threshold value compared with the depth difference value is not uniquely fixed, and can be adjusted according to actual needs.
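The region growing and category merging of step 2.4 might look roughly like the following sketch; the 4-connected neighbourhood, the union-find bookkeeping used to merge categories, and the depth threshold value are assumptions of this sketch, not details given in the description.

import numpy as np
from collections import deque

def region_grow(depth, labels_img, depth_threshold=20.0):
    # labels_img: -1 for non-motion pixels, a category serial number elsewhere.
    h, w = depth.shape
    parent = {}                      # union-find parents for merged categories

    def find(c):
        while parent.get(c, c) != c:
            c = parent[c]
        return c

    queue = deque(zip(*np.nonzero(labels_img >= 0)))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w):
                continue
            # Step 2.4.1: a nearby pixel with a small depth difference is similar.
            if abs(float(depth[ny, nx]) - float(depth[y, x])) >= depth_threshold:
                continue
            if labels_img[ny, nx] < 0:            # new motion point joins this category
                labels_img[ny, nx] = labels_img[y, x]
                queue.append((ny, nx))
            elif find(labels_img[ny, nx]) != find(labels_img[y, x]):
                # Step 2.4.2: the same pixel is reached from two categories -> merge them.
                parent[find(labels_img[ny, nx])] = find(labels_img[y, x])

    # Write the merged serial numbers back.
    for c in np.unique(labels_img[labels_img >= 0]):
        labels_img[labels_img == c] = find(c)
    return labels_img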
Further, the improved moving object detection algorithm is characterized in that, after all the motion points are detected, it is judged whether the picture extracted in step 2.5 is a human hand; if it is judged to be a hand, the history record information of the pixel points is updated so as to increase the variation of the depth information of the motion points, and if it is judged not to be a hand, the information is kept unchanged.
Fig. 3 shows a main step of determining whether a recognized moving object is a human palm according to a gesture recognition control method used in the present technical solution. The method is characterized by comprising the following steps of: 3.1, feature selection and model training; 3.2 judging whether the target image is a human palm.
Further, step 3.1 includes the following steps: 3.1.2 selecting sample points from the data to be trained; 3.1.3 calculating the optimal division values of all the sample points; and 3.1.4 establishing a random forest corresponding to the sample points based on the optimal division value calculation result.
Further, before step 3.1.2, step 3.1.1 is performed: the images to be trained are obtained through at least two cameras, of which at least one is a depth camera and at least one is an RGB camera. The purpose of this step is to collect training data.
Preferably, the camera shoots downward from the vehicle roof, and the person to be recorded wears blue gloves on both hands and freely makes various gestures and actions with the palms inside the vehicle, including actions performed while driving, such as operating the steering wheel or the hand brake. The positions of the blue pixel points in the RGB image are extracted and used to obtain the corresponding hand region in the depth image, thereby labelling the palm data.
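A minimal sketch of this blue-glove labelling step is given below, assuming the RGB and depth images are registered pixel to pixel and using OpenCV for the colour conversion; the HSV range chosen for "blue" is an illustrative guess rather than a value from the description.

import numpy as np
import cv2

def label_palm_from_blue_glove(bgr_image, depth_map):
    # Extract the blue glove in HSV space; the bounds below are a rough guess.
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    blue_mask = cv2.inRange(hsv, (100, 80, 50), (130, 255, 255)) > 0
    valid = depth_map > 0                     # ignore pixels with no depth reading
    positive = blue_mask & valid              # palm pixels -> positive samples
    negative = (~blue_mask) & valid           # everything else -> negative samples
    return positive, negative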
Further, in the gesture recognition control method, in the step 3.1.2, a palm portion in the depth map is used as a positive sample, a non-palm portion is used as a negative sample, and the same number of pixels are randomly selected from the positive sample portion and the negative sample portion to be used as sample points to be trained.
Further, the gesture recognition control method is characterized in that the calculation of the optimal division value in step 3.1.3 includes the following steps: 3.1.3.1 calculating the depth average of each neighborhood of the sample point; 3.1.3.2 calculating the difference between the depth average of each neighborhood of the sample point and the depth value of the sample point; 3.1.3.3 calculating the information entropy; and 3.1.3.4 obtaining the optimal division value.
Further, the gesture recognition control method is characterized in that the depth average calculation in step 3.1.3.1 includes the following steps: 3.1.3.1.1 randomly selecting a sample point P; 3.1.3.1.2 calculating the average of the depth values of a square neighborhood centered on P; 3.1.3.1.3 calculating the average of the depth values of the neighborhood of every sample point using the method of step 3.1.3.1.2.
Further, the gesture recognition control method is characterized in that, in step 3.1.3.1.2, the average of the square neighborhood depth values centered on P is calculated as follows: the size of the neighborhood of point P takes the values 3, 5, 7, 9, ..., 2n+1 in sequence (the side length of the square neighborhood, in pixels), with P as the center point of the square neighborhood; if part of the neighborhood lies beyond the depth map, only the average of the depth values of the points within range is calculated.
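A small helper illustrating this neighbourhood averaging with boundary clipping could look as follows (the function name and array layout are hypothetical):

import numpy as np

def neighborhood_mean(depth, y, x, size):
    # Mean depth of the size x size (size = 2n+1) square neighbourhood centred
    # on (y, x); any part of the neighbourhood outside the depth map is ignored.
    half = size // 2
    h, w = depth.shape
    y0, y1 = max(0, y - half), min(h, y + half + 1)
    x0, x1 = max(0, x - half), min(w, x + half + 1)
    return float(depth[y0:y1, x0:x1].mean())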
Further, the gesture recognition control method is characterized in that the information entropy calculation in step 3.1.3.3 proceeds as follows: 3.1.3.3.1 dividing all positive and negative sample points into several equal parts, each containing positive and negative sample points in the same proportion; 3.1.3.3.2 for every sample point in one equal part, when the neighborhood size is 3, there is a difference d between its depth value and the 3-neighborhood depth average; 3.1.3.3.3 each difference d can divide all the differences into two parts, one larger than d and one smaller than d; 3.1.3.3.4 the final information entropy s is obtained from the information entropy formula; 3.1.3.3.5 by the same method as in steps 3.1.3.3.2 to 3.1.3.3.4, the corresponding information entropy is obtained for neighborhood sizes 5, 7 and 9.
Preferably, the information entropy is defined as follows: in a source, what is considered is not the uncertainty of a single symbol occurring, but the average uncertainty over everything the source may emit. If the source symbol can take n values U_1, ..., U_i, ..., U_n with corresponding probabilities P_1, ..., P_i, ..., P_n, and the symbols occur independently of one another, then the average uncertainty of the source is the statistical average (expectation) of the single-symbol uncertainty -log P_i, which is called the information entropy:
H(U) = E[-log P_i] = -∑_{i=1}^{n} P_i log P_i
The logarithm is usually taken to base 2, giving the entropy in bits; other bases and corresponding units may be used, and units can be converted with the change-of-base formula.
Further, by way of example, for a 3-neighborhood and 100 sample points, 100 difference values can be calculated, denoted (d1, d2, d3, ..., d100). From these 100 differences, k values (0 < k < 100) are randomly selected; for each selected difference used as a division value, the information entropy score of dividing the differences by that value, denoted S, is obtained from the definition of entropy. (For example, if 30 points fall on the left and 70 on the right after division, the information entropy is S = -(0.3×log(0.3) + 0.7×log(0.7)).)
Further, the step of obtaining the optimal division value in step 3.1.3.4 is as follows: 3.1.3.4.1 for the 3-neighborhood, selecting the largest of the information entropy values corresponding to all sample points, denoted S3, as the score of the 3-neighborhood, and recording the division value d at that moment as D3; 3.1.3.4.2 in the same way as step 3.1.3.4.1, when the neighborhood is 5 the score is S5 and the value is D5; when the neighborhood is 7 the score is S7 and the value is D7; when the neighborhood is 2n+1 the score is S(2n+1) and the value is D(2n+1); 3.1.3.4.3 the S value with the largest score is denoted Sm, the corresponding neighborhood is denoted m, and the corresponding division value is denoted Dm.
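Steps 3.1.3.2 to 3.1.3.4 could be sketched as below, scoring each candidate division value with the entropy of the resulting split as in the worked example above; the helper names, the number of candidate thresholds k and the use of base-2 logarithms are assumptions of this sketch.

import numpy as np

def split_entropy(diffs, d):
    # Entropy of splitting the difference values at threshold d,
    # e.g. a 30/70 split gives -(0.3*log2(0.3) + 0.7*log2(0.7)).
    p_left = float(np.mean(diffs < d))
    probs = np.array([p_left, 1.0 - p_left])
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

def best_division(depth, sample_points, neighborhood_sizes=(3, 5, 7, 9), k=20):
    # Returns (m, Dm): the neighbourhood size and division value with the
    # largest split entropy over the given sample points (list of (y, x)).
    def nb_mean(y, x, size):
        half = size // 2
        h, w = depth.shape
        return float(depth[max(0, y - half):min(h, y + half + 1),
                           max(0, x - half):min(w, x + half + 1)].mean())

    best_m, best_d, best_s = None, None, -1.0
    for size in neighborhood_sizes:
        # Step 3.1.3.2: neighbourhood mean minus the sample point's own depth.
        diffs = np.array([nb_mean(y, x, size) - float(depth[y, x])
                          for y, x in sample_points])
        # Steps 3.1.3.3/3.1.3.4: try k randomly chosen differences as division values.
        for d in np.random.choice(diffs, size=min(k, len(diffs)), replace=False):
            s = split_entropy(diffs, d)
            if s > best_s:
                best_m, best_d, best_s = size, float(d), s
    return best_m, best_d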
Further, in step 3.1.4, a random forest corresponding to the sample points is established by the following steps: 3.1.4.1 constructing a decision tree based on the optimal division neighborhood and the optimal division value; 3.1.4.2 building the random forest based on the decision trees.
Preferably, each decision tree is a binary tree, and the decision forest is formed by a plurality of decision trees; each decision tree can be trained with the extracted pixel points or with different pixel points.
Further, in step 3.1.4.1, the steps of constructing a decision tree are: 3.1.4.1.1 storing the optimal neighborhood m and the optimal division value Dm obtained in step 3.1.3.4.3 on the root node of the decision tree as (m, Dm); 3.1.4.1.2 dividing one equal part of sample points into two parts based on the optimal division value stored on the root node, with the points whose d is larger than Dm on the left and the points whose d is smaller than Dm on the right; 3.1.4.1.3 recursively performing the contents of steps 3.1.3.3, 3.1.3.4 and 3.1.4.1 on the left and right parts until the class of a left or right subtree contains only positive samples or only negative samples, or the maximum depth of the tree is reached; 3.1.4.1.4 when the maximum depth is reached, the leaf nodes store the numbers of positive and negative sample points, thereby forming a decision tree.
Further, in step 3.1.4.2, a decision tree can be formed for each equal part of sample points, and the random forest is formed from these decision trees.
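A compact sketch of the tree and forest construction of step 3.1.4 is given below; the node and leaf dictionary layout, the maximum depth and the per-node sampling of candidate division values are illustrative choices, while the leaf nodes store positive and negative sample counts as the description requires.

import numpy as np

def build_tree(diffs_by_size, labels, depth=0, max_depth=10, k=20):
    # diffs_by_size: {neighbourhood size: array of 'neighbourhood mean minus
    #                 centre depth' differences, one per sample point}
    # labels: 1 for palm (positive) sample points, 0 for non-palm (negative).
    n_pos = int(labels.sum())
    n_neg = int(len(labels) - n_pos)
    if depth >= max_depth or n_pos == 0 or n_neg == 0:
        return {"leaf": (n_pos, n_neg)}       # leaf stores positive/negative counts

    def split_entropy(vals, d):
        p = float(np.mean(vals < d))
        probs = np.array([p, 1.0 - p])
        probs = probs[probs > 0]
        return float(-np.sum(probs * np.log2(probs)))

    # Choose the (neighbourhood, division value) pair with the largest entropy.
    best = None
    for size, vals in diffs_by_size.items():
        for d in np.random.choice(vals, size=min(k, len(vals)), replace=False):
            s = split_entropy(vals, d)
            if best is None or s > best[0]:
                best = (s, size, float(d))
    _, m, dm = best

    left = diffs_by_size[m] > dm              # left branch holds points with d > Dm

    def subset(mask):
        return {sz: v[mask] for sz, v in diffs_by_size.items()}, labels[mask]

    return {"node": (m, dm),
            "left": build_tree(*subset(left), depth + 1, max_depth, k),
            "right": build_tree(*subset(~left), depth + 1, max_depth, k)}

def build_forest(aliquots):
    # One decision tree per equal part of sample points -> the random forest.
    return [build_tree(diffs, labels) for diffs, labels in aliquots]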
Preferably, the feature selection and model training steps are completed offline; those skilled in the art understand that once model training is complete, it does not need to be repeated each time a target image is predicted, and the judgment is made based on the trained model.
Further, step 3.2, judging whether the target image is a human palm, can be further divided into the following steps: 3.2.1 judging whether each point of the target image is a point of a human palm based on the random forest obtained by the model training; 3.2.2 judging whether the target image is a human palm based on the result of step 3.2.1.
Further, step 3.2.1 can be further divided into the following steps: 3.2.1.1 calculating the depth difference: for each pixel point, take a decision tree, calculate the depth average of the optimal neighborhood m stored on its root node, and calculate the difference between that average and the depth value of the pixel point; 3.2.1.2 recursing down the decision tree: compare the difference calculated in step 3.2.1.1 with the optimal division value Dm stored in the node; if the difference is smaller than Dm recurse into the left branch, and if it is larger than Dm recurse into the right branch, continuing until a leaf node is reached, where the numbers of positive and negative samples are stored; 3.2.1.3 counting the positive and negative sample numbers of the point over all trees in the random forest and judging: for the pixel point, sum the positive and negative sample counts from all trees in the random forest; if the total positive count is larger than the negative count, the pixel point is a hand; if the total positive count is smaller than the negative count, the pixel point is not a hand.
Further, the method of judging whether the target image is a human hand in step 3.2.2 is as follows: after step 3.2.1 has been performed on every point in the target image, count the number of pixels predicted to be a hand and the number predicted not to be a hand; if the number of pixels predicted to be a hand is greater, the target image is a hand.
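Steps 3.2.1 and 3.2.2 might be implemented roughly as in the following sketch, which reuses the hypothetical node layout of the training sketch above; note that it branches left on differences greater than Dm so that prediction stays consistent with that training sketch.

import numpy as np

def predict_pixel(tree, depth, y, x):
    # Walk one decision tree for pixel (y, x); return the (positive, negative)
    # counts stored on the leaf node that is reached.
    h, w = depth.shape
    while "leaf" not in tree:
        m, dm = tree["node"]
        half = m // 2
        nb = depth[max(0, y - half):min(h, y + half + 1),
                   max(0, x - half):min(w, x + half + 1)]
        diff = float(nb.mean()) - float(depth[y, x])
        # Same branch convention as the training sketch: left holds diff > Dm.
        tree = tree["left"] if diff > dm else tree["right"]
    return tree["leaf"]

def is_hand_region(forest, depth, region_pixels):
    # Step 3.2.1: per-pixel vote over all trees; step 3.2.2: per-region vote.
    hand_pixels = 0
    for y, x in region_pixels:
        pos = neg = 0
        for tree in forest:
            p, n = predict_pixel(tree, depth, y, x)
            pos += p
            neg += n
        if pos > neg:
            hand_pixels += 1
    return hand_pixels > len(region_pixels) - hand_pixels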
Preferably, for pixel points judged to be a hand, the history information in the history record library is updated, and for points judged not to be a hand, the records in the history record library are kept unchanged; this increases the variation of the depth information of the motion points, so that moving pixel points can be extracted more effectively next time.
The above is a specific implementation of the technical scheme, through which the problem of interaction between the driver and the automobile during driving can be solved.

Claims (7)

1. A vehicle-mounted gesture interaction control method comprises the following steps:
(1) Identifying a moving object using an improved moving object detection algorithm; the step for identifying a moving object includes the steps of:
1.1) initializing;
1.2) detecting whether a pixel point is a motion point;
1.3) carrying out kmeans clustering on the motion points;
1.4) region growing;
1.5) extracting a region;
1.6) updating the pixel points;
(2) Judging whether the moving object identified in step (1) is a human palm by using a gesture recognition control method;
2.1) feature selection and model training;
2.1.1) collecting training data: images to be trained are obtained through at least two cameras, of which at least one is a depth camera and at least one is an RGB camera;
2.1.2) selecting sample points from the data to be trained: the palm part in the depth map is taken as positive samples and the non-palm part as negative samples, and the same number of pixel points are randomly selected from the positive and negative sample parts as the sample points to be trained;
2.1.3) calculating the optimal division values of all the sample points; the optimal division value calculation comprises the following steps:
2.1.3.1) calculating the depth average value of each neighborhood of the sample point;
a sample point P is randomly selected, the average of the depth values of the square neighborhood centered on P is calculated, and the average of the depth values of the neighborhood of every sample point is calculated; the average of the square neighborhood depth values centered on P is calculated as follows: the size of the neighborhood of point P takes the values 3, 5, 7, 9, ..., 2n+1 in sequence (the side length of the square neighborhood, in pixels), with P as the center point of the square neighborhood; if part of the neighborhood extends beyond the depth map, only the average of the depth values of the points within range is calculated;
2.1.3.2) calculating the difference value between the depth average value of each neighborhood of the sample point and the depth value of the sample point;
2.1.3.3) calculating the information entropy;
dividing all positive and negative sample points into several equal parts, each containing positive and negative sample points in the same proportion; for every sample point in one equal part, when the neighborhood is 3, there is a difference d between its depth value and the 3-neighborhood depth average; each difference d can divide all the differences into two parts, one larger than d and one smaller than d; the final information entropy s is obtained from the information entropy formula, and the corresponding information entropy is obtained in sequence when the neighborhood in 2.1.3.1 is 5, 7 and 9;
2.1.3.4) obtaining the optimal division value;
for the 3-neighborhood, selecting the largest of the information entropy values corresponding to all sample points, denoted S3, as the score of the 3-neighborhood, and recording the division value d at that moment as D3; in the same way, when the neighborhood is 5 the score is S5 and the value is D5; when the neighborhood is 7 the score is S7 and the value is D7; when the neighborhood is 2n+1 the score is S(2n+1) and the value is D(2n+1); the S value with the largest score is selected and denoted Sm, the corresponding neighborhood is denoted m, and the corresponding d is denoted Dm;
2.1.4) establishing a random forest corresponding to the sample points based on the optimal division value calculation result, specifically comprising the following steps:
2.1.4.1) constructing a decision tree based on the optimal division neighborhood and the optimal division value;
storing the optimal neighborhood m and the optimal division value Dm obtained in step 2.1.3.4 on the root node of the decision tree as (m, Dm); dividing one equal part of sample points into two parts based on the optimal division value stored on the root node, with the points whose d is larger than Dm on the left and the points whose d is smaller than Dm on the right; recursively executing the contents of steps 2.1.3.3, 2.1.3.4 and 2.1.4.1 on the left and right parts until the categories of the left and right subtrees contain only positive samples or only negative samples, or the maximum depth of the tree is reached; when the maximum depth is reached, the leaf nodes store the numbers of positive and negative sample points, thereby forming a decision tree;
2.1.4.2) forming a decision tree for each equal part of sample points; the random forest is formed from these decision trees, and model training is completed;
2.2) judging whether the target image is a human palm, further comprising: judging whether each point of the target image is a point of a human palm based on the random forest obtained by the model training, and then judging whether the target image is a human palm.
2. The interaction control method according to claim 1, wherein in step 1.2, for each pixel point of each frame image acquired by the camera, whether the point is a moving pixel point is detected, specifically comprising the following steps:
1.2.1. setting a counter a to 0;
1.2.2. calculating the difference between the current depth value of the pixel point and the depth value in the history record library, and adding 1 to the counter a if the difference is larger than a certain set threshold value;
1.2.3. after step 1.2.2 is performed on each history of the pixel in the history repository, if the value of the counter a is greater than a threshold value, the pixel is set as a motion point.
3. The control method according to claim 1 or 2, characterized in that a history record library is created for each pixel point by acquiring several tens of consecutive frames of depth maps by means of a depth camera.
4. The control method according to any one of claims 1 or 2, characterized in that after obtaining all the motion points, step 1.3 is performed, and kmeans clustering is performed on all the motion points, and specifically comprises the following steps:
1.3.1) randomly selecting a part of the pixel points from all the pixel points as initial cluster centers;
1.3.2) for each of the remaining pixel points, assigning it to the cluster most similar to it according to its similarity to the cluster centers of 1.3.1;
1.3.3) recalculating the cluster center of each new cluster obtained, i.e. calculating the mean of all objects in the new cluster;
1.3.4) calculating a criterion function; if the function converges, the algorithm terminates, otherwise steps 1.3.2, 1.3.3 and 1.3.4 are executed recursively to obtain a set of categories;
1.3.5) each category obtained in step 1.3.4 has a pixel center point and the motion pixel points corresponding to it; a threshold for the number of category elements is set, categories that do not reach the threshold are removed, and a category serial number is then assigned to each motion point.
5. The control method according to claim 4, wherein after the category is obtained, step 1.4 is performed to perform region growing, and the method specifically comprises the steps of:
1.4.1) comparing the depth value of a pixel point that has already been detected as a motion point with that of a nearby new pixel point to be detected; if the depth difference between the two is smaller than a set threshold value, the new pixel point is similar to the pixel point already detected as a motion point, and the new pixel point is therefore set as a motion point;
1.4.2) according to step 1.4.1, if the new pixel point is judged to be a new motion point in two categories, the two categories have similar attributes, so the two categories are merged and given the same category serial number, until all motion point detection is completed.
6. The control method according to claim 5, wherein it is judged whether the picture extracted in step 1.5 is a human hand; if it is judged to be a hand, the history record information of the pixel points is updated so as to increase the variation of the depth information of the motion points, and if it is judged not to be a hand, the information is kept unchanged.
7. The control method according to claim 6, wherein the step 2.2 of determining whether the target image is a human palm includes the steps of:
2.2.1) judging whether each point of the target image is a point of a human palm based on the random forest obtained by the model training;
calculating the depth difference value: for each pixel point, take a decision tree in the trained model, calculate the depth average of the optimal neighborhood m stored on its root node, and calculate the difference between that average and the depth value of the pixel point; compare the calculated difference with the optimal division value Dm stored in the node; if the difference is smaller than Dm recurse into the left branch, and if it is larger than Dm recurse into the right branch, continuing until a leaf node is reached and the numbers of positive and negative samples stored on that leaf node are obtained; sum the positive and negative sample counts of the point over all trees in the random forest; if the total positive count is larger than the negative count, the pixel point is a hand, otherwise it is not a hand;
2.2.2) judging whether the target image is a human palm based on the result of step 2.2.1;
after step 2.2.1 has been performed on every point in the target image, count the number of pixels predicted to be a hand and the number predicted not to be a hand; if the number of pixels predicted to be a hand is greater, the target image is a hand, otherwise it is not a hand.
CN201810606708.XA 2018-06-13 2018-06-13 Vehicle-mounted gesture interaction technology Active CN110598510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810606708.XA CN110598510B (en) 2018-06-13 2018-06-13 Vehicle-mounted gesture interaction technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810606708.XA CN110598510B (en) 2018-06-13 2018-06-13 Vehicle-mounted gesture interaction technology

Publications (2)

Publication Number Publication Date
CN110598510A CN110598510A (en) 2019-12-20
CN110598510B true CN110598510B (en) 2023-07-04

Family

ID=68849600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810606708.XA Active CN110598510B (en) 2018-06-13 2018-06-13 Vehicle-mounted gesture interaction technology

Country Status (1)

Country Link
CN (1) CN110598510B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413145A (en) * 2013-08-23 2013-11-27 南京理工大学 Articulation point positioning method based on depth image
CN106778813A (en) * 2016-11-24 2017-05-31 金陵科技学院 The self-adaption cluster partitioning algorithm of depth image
CN106845513A (en) * 2016-12-05 2017-06-13 华中师范大学 Staff detector and method based on condition random forest
CN107688779A (en) * 2017-08-18 2018-02-13 北京航空航天大学 A kind of robot gesture interaction method and apparatus based on RGBD camera depth images

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140204013A1 (en) * 2013-01-18 2014-07-24 Microsoft Corporation Part and state detection for gesture recognition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413145A (en) * 2013-08-23 2013-11-27 南京理工大学 Articulation point positioning method based on depth image
CN106778813A (en) * 2016-11-24 2017-05-31 金陵科技学院 The self-adaption cluster partitioning algorithm of depth image
CN106845513A (en) * 2016-12-05 2017-06-13 华中师范大学 Staff detector and method based on condition random forest
CN107688779A (en) * 2017-08-18 2018-02-13 北京航空航天大学 A kind of robot gesture interaction method and apparatus based on RGBD camera depth images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Gesture Recognition Based on Depth Images; Chen Hao, Lu Haiming; Journal of Inner Mongolia University (Natural Science Edition); 2014-01-31; Vol. 45, No. 1; full text *

Also Published As

Publication number Publication date
CN110598510A (en) 2019-12-20

Similar Documents

Publication Publication Date Title
CN107203753B (en) Action recognition method based on fuzzy neural network and graph model reasoning
CN108460356B (en) Face image automatic processing system based on monitoring system
CN107492251B (en) Driver identity recognition and driving state monitoring method based on machine learning and deep learning
Shrivastava et al. Training region-based object detectors with online hard example mining
US8620024B2 (en) System and method for dynamic gesture recognition using geometric classification
EP1934941B1 (en) Bi-directional tracking using trajectory segment analysis
JP4513898B2 (en) Image identification device
CN111563417B (en) Pyramid structure convolutional neural network-based facial expression recognition method
CN112966691B (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN109671102B (en) Comprehensive target tracking method based on depth feature fusion convolutional neural network
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
CN112734775A (en) Image annotation, image semantic segmentation and model training method and device
CN106960181B (en) RGBD data-based pedestrian attribute identification method
CN105956560A (en) Vehicle model identification method based on pooling multi-scale depth convolution characteristics
CN106778501A (en) Video human face ONLINE RECOGNITION method based on compression tracking with IHDR incremental learnings
CN110032932B (en) Human body posture identification method based on video processing and decision tree set threshold
CN111886600A (en) Device and method for instance level segmentation of image
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN110599463A (en) Tongue image detection and positioning algorithm based on lightweight cascade neural network
CN111444817B (en) Character image recognition method and device, electronic equipment and storage medium
CN111563549B (en) Medical image clustering method based on multitasking evolutionary algorithm
CN110580499B (en) Deep learning target detection method and system based on crowdsourcing repeated labels
US20210019547A1 (en) System and a method for efficient image recognition
CN114140696A (en) Commodity identification system optimization method, commodity identification system optimization device, commodity identification equipment and storage medium
CN110598510B (en) Vehicle-mounted gesture interaction technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220125

Address after: 518063 2W, Zhongdian lighting building, Gaoxin South 12th Road, Nanshan District, Shenzhen, Guangdong

Applicant after: Shenzhen point cloud Intelligent Technology Co.,Ltd.

Address before: 518023 No. 3039 Baoan North Road, Luohu District, Shenzhen City, Guangdong Province

Applicant before: Zhou Qinna

GR01 Patent grant