CN112223288B - Visual fusion service robot control method - Google Patents

Visual fusion service robot control method

Info

Publication number
CN112223288B
CN112223288B CN202011073216.2A
Authority
CN
China
Prior art keywords
service robot
user
target object
target
res
Prior art date
Legal status
Active
Application number
CN202011073216.2A
Other languages
Chinese (zh)
Other versions
CN112223288A (en)
Inventor
Duan Feng (段峰)
Zhang Lina (张丽娜)
Current Assignee
Nankai University
Original Assignee
Nankai University
Priority date
Filing date
Publication date
Application filed by Nankai University
Priority to CN202011073216.2A
Publication of CN112223288A
Application granted
Publication of CN112223288B

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems

Abstract

The invention belongs to the technical field of medical service robots and in particular relates to a vision-fused service robot control method. The method comprises the following steps: step S1, acquiring and analysing scene information around the service robot to obtain the potential target objects and their pixel positions; step S2, ranking the target objects by saliency, displaying the target object pictures on the human-computer interaction interface in that order, and asking the user through an inquiry window whether each object should be grasped; step S3, recognizing the user's electroencephalogram signals with a brain-computer interface to identify the user's intention and decide whether to grasp the object shown on the human-computer interaction interface; if yes, going to step S4, otherwise continuing with the next target object picture; step S4, moving the service robot to the vicinity of the target object selected by the user and controlling the mechanical arm of the service robot by visual servoing to complete the grasp. Controlling the service robot through electroencephalogram signals with visual assistance reduces the user's visual burden and fatigue, helps the user obtain desired items easily, and can improve the self-care level of patients with severe motor impairment.

Description

Visual fusion service robot control method
Technical Field
The invention belongs to the technical field of medical service robots, and particularly relates to a service robot control method with vision fusion.
Background
With global population aging, the number of patients suffering from stroke, Alzheimer's disease, Parkinson's disease, amyotrophic lateral sclerosis, spinal cord injury, muscular dystrophy and similar conditions keeps growing, and these diseases bring great inconvenience and burden to patients' lives. For such severely motor-impaired patients, developing intelligent service devices to assist their daily lives is essential. A large proportion of elderly people cannot take care of themselves, so operating a complex system is difficult for them; speaking and gesturing may also be difficult. A brain-computer interface can build a bridge between human-brain intention and external devices, and is very likely to become the most suitable interaction mode between severely motor-impaired patients and service equipment. A brain-computer interface system based on electroencephalogram (EEG) signals can translate the conscious intent of the human brain into commands without manual operation or voice commands, enabling communication between the brain and external devices, and is therefore a promising way to realize such a service system.
In practical application scenarios, most existing brain-computer interface systems are not accurate enough, and the number of control commands they can realize is relatively limited. In addition, a brain-computer interface system cannot perceive spatial information about the environment, and users experience mental fatigue after long periods of operation, which degrades control performance. It is therefore difficult for a brain-computer interface system alone to operate service equipment completely, accurately and comfortably; the intelligence of such systems is insufficient to infer the user's operating intention, and an operating burden on the user is hard to avoid. The field of medical service robots urgently needs an intelligent method and system that helps infer the intention of severely motor-impaired patients, assists the medical service robot in completing device operation, and brings convenience to patients' lives.
Disclosure of Invention
Fusing computer vision into the control method is an effective way to assist the user in operating the system. Like a human facing a particular scene, regions of interest are processed automatically while regions of no interest are selectively ignored. Human vision can quickly search for and locate targets of interest; introducing this visual attention mechanism, i.e. visual saliency, into the service robot's vision greatly improves the visual information processing task. Visual servoing technology can also be introduced into the service robot. Visual servoing refers to automatically receiving and processing images through a visual sensor and using the information fed back from the images to further control or adaptively adjust the robot. Applying visual servoing to the service robot simplifies the user's control operations. Integrating computer vision technologies such as visual saliency detection and visual servoing into a service robot controlled by a brain-computer interface system optimizes the system, reduces the operating burden on disabled users, and is worth exploring.
In order to achieve the purpose, the invention adopts the following technical scheme:
a vision-fused service robot control method comprises the following steps:
step S1, acquiring and analysing scene information around the service robot to obtain potential target objects and their pixel positions;
step S2, ranking the target objects by saliency, displaying the target object pictures on the human-computer interaction interface in that order, and asking the user through an inquiry window whether to grasp each object;
step S3, recognizing the user's electroencephalogram signals with a brain-computer interface to identify the user's intention and decide whether to grasp the object shown on the human-computer interaction interface; if yes, going to step S4, and if not, continuing to display the next target object picture;
and step S4, moving the service robot to the vicinity of the target object selected by the user, and controlling the mechanical arm of the service robot by visual servoing to complete the grasp.
In a further optimization of the present technical solution, the step S1 specifically includes the following steps,
collecting image and depth information with a camera, performing target detection on the collected image with a neural network model, and using the n detected objects O1, O2, …, On and their corresponding quantity ratios ω1, ω2, …, ωn over the m scenes E1, E2, …, Em to identify the scene E in which the service robot is located; naive Bayes is used to train a scene recognition classifier; when the l-th feature is a continuous value, it is assumed to follow a Gaussian distribution, so that for a scene Ei, P(al|Ei) is the probability that feature al occurs:

P(al|Ei) = 1/(√(2π)·σEi,l) · exp(−(al − μEi,l)² / (2σ²Ei,l)),

where μEi,l denotes the mean of the l-th dimensional feature in the samples of class Ei, σ²Ei,l denotes the variance of the l-th dimensional feature in the samples of class Ei, and CNBC is the naive Bayes probability model:

CNBC = argmax over Ei of P(Ei) · ∏l P(al|Ei),
in a scene E, a first screening removes background objects and objects that the service robot cannot grasp; a second screening of the remaining objects selects the c target objects k1, k2, …, kc that the user is most likely to choose in scene E; res1 and res2 are the results of the first and second screening respectively: res1 screens the n objects On according to the screening condition S, and res2 screens the first-screening result res1 according to the scene E,

S(On) = 1 if On can be grasped by the service robot, and S(On) = 0 if On is a background object or cannot be grasped,
res1 = classifier1(On, S),
res2 = classifier2(res1, E),
the neural network model yields a rectangular recognition box (xk, yk, wk, hk) for each target object k, where x and y are the abscissa and ordinate of the pixel at the upper-left corner of the rectangular recognition box, and w and h are the width and height of the box.
In a further optimization of the present technical solution, the ranking of the target objects by saliency in step S2 includes the following steps:
ranking the target items using a two-dimensional Gaussian distribution combined with the saliency detection result, presenting them on the human-computer interaction interface in descending order of the saliency of their recognition boxes, and popping up an inquiry window asking whether to grasp; the saliency of a recognition box is given by:

G(i, j) = 1/(2π·σx·σy) · exp(−((i − xcenter)²/(2σx²) + (j − ycenter)²/(2σy²))), where σx and σy are the standard deviations of the two-dimensional Gaussian in the horizontal and vertical directions,
Obj(k) = Σ over the pixels (i, j) of box k of H(i, j)·G(i, j),
Result = rank[Obj(k)],

wherein i and j are the abscissa and ordinate of a pixel point on the image; xcenter = x + w/2 and ycenter = y + h/2 are the horizontal and vertical coordinates of the centre of the detected rectangular box; H(i, j) is the saliency-detection output value of the pixel; G(i, j) is the Gaussian distribution probability value of the pixel within the corresponding rectangular recognition box; multiplying H(i, j) and G(i, j) for every point of the rectangular recognition box and summing gives the saliency of the box.
In a further optimization of the present technical solution, in step S4 the mechanical arm is controlled by the visual servo module to grasp: from the rectangular recognition box (xk, yk, wk, hk) of the target item k, the point cloud data of all points are denoised and averaged to obtain the position of the target object relative to the depth camera, which is converted by a coordinate transformation into the position (p.x, p.y, p.z) relative to the mechanical-arm coordinate system; the mechanical arm consists of 5 revolute joints and connecting links, the first 4 revolute joints controlling the arm's posture and the last revolute joint controlling the gripper; so that the arm can grasp the target object stably, link l4 is held horizontal while grasping, i.e.:

θ2 + θ3 + θ4 = 0,

from the geometric analysis one obtains:

l1 + l2·cosθ2 + l3·cos(θ2 + θ3) = p.z,
[equation image: the corresponding constraint relating l2, l3, θ2, θ3 to the horizontal distance √(p.x² + p.y²)],

for convenience of presentation, a side length l and an angle are introduced:

[equation images: definitions of the auxiliary side length l and angle in terms of p.x, p.y, p.z and the link lengths],

the joint angles θ1, θ2, θ3, θ4 are then obtained from the formulas below; θ5 is the grasping angle, taken in the range 0.2-0.3 rad:

θ1 = arctan(p.x, p.y),
[equation images: closed-form expressions for θ2 and θ3 in terms of l, l2 and l3],
θ4 = −(θ2 + θ3).
Different from the prior art, the present technical solution has the following beneficial effects:
based on this service robot control method, the service robot is controlled through electroencephalogram signals with the assistance of computer vision, which reduces the user's visual burden and fatigue, helps the user easily obtain desired items, and can improve the self-care level of patients with severe motor impairment.
Drawings
FIG. 1 is a flow chart of the vision-fused service robot control method;
FIG. 2 is a schematic diagram of scene-based target determination and saliency ranking;
FIG. 3 is a kinematic analysis diagram of the mechanical arm;
FIG. 4 is a schematic view of the vision-fused service robot control system;
FIG. 5 is a schematic diagram of a human-machine interface.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Fig. 1 shows the flowchart of the vision-fused service robot control method. The invention provides a vision-fused service robot control method comprising the following specific steps:
and step S1, acquiring scene information of the service robot, and analyzing to obtain potential target objects and pixel positions thereof.
The service robot carries out target inference based on scenes and obtains potential target objects and pixel positions thereof.
The current image and depth information are acquired with the depth camera mounted on the service robot. Target detection is performed on the acquired image with a neural network model; the detection model used in this embodiment is YOLOv3. The backbone network for YOLOv3 feature extraction is Darknet-53, in which the fully connected layer is replaced by 52 convolutional layers and 1×1 convolutions. YOLOv3 uses multi-scale feature-fusion prediction to detect objects of different sizes. For an input image of size 416×416×3, feature maps at the scales 13×13, 26×26 and 52×52 are obtained for detecting large, medium and small targets respectively. Nine anchor boxes of different sizes, i.e. the prior boxes of the target-detection process, are defined in the original image pixels, so that the output of each scale is evenly assigned 3 anchor boxes. Each anchor box outputs 5 parameters, the target coordinates x and y, the target size w and h and the confidence, plus c class scores, giving 5 + c channels in total; the 3 anchor boxes of each scale therefore output (5 + c) × 3 channels.
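For illustration, a detection step of this kind can be sketched in Python with OpenCV's DNN module; the configuration/weight file names, the 0.5/0.4 thresholds and the post-processing below are assumptions for the sketch, not values taken from the embodiment.

import cv2
import numpy as np

# Assumed file names; any YOLOv3 cfg/weights pair trained on the relevant household classes would do.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect_objects(image_bgr, conf_thr=0.5, nms_thr=0.4):
    """Return a list of (class_id, confidence, (x, y, w, h)) detections in pixel coordinates."""
    h_img, w_img = image_bgr.shape[:2]
    blob = cv2.dnn.blobFromImage(image_bgr, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(layer_names)            # one output per scale (13x13, 26x26, 52x52)

    boxes, confidences, class_ids = [], [], []
    for out in outputs:
        for row in out:                           # row = [cx, cy, w, h, objectness, class scores...]
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id] * row[4])
            if confidence < conf_thr:
                continue
            cx, cy, bw, bh = row[0] * w_img, row[1] * h_img, row[2] * w_img, row[3] * h_img
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(confidence)
            class_ids.append(class_id)

    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thr, nms_thr)
    return [(class_ids[i], confidences[i], tuple(boxes[i])) for i in np.array(keep).flatten()]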
The n detected objects O1, O2, …, On and their corresponding quantity ratios ω1, ω2, …, ωn over the m scenes E1, E2, …, Em are used to identify the scene E in which the service robot is located. A naive Bayes algorithm is used to train the scene-recognition classifier. Considering the non-linear variation of the features, when the l-th feature takes continuous values it is assumed to follow a Gaussian distribution; P(al|Ei) is the probability that feature al occurs, given by

P(al|Ei) = 1/(√(2π)·σEi,l) · exp(−(al − μEi,l)² / (2σ²Ei,l)),

where μEi,l denotes the mean of the l-th dimensional feature in the samples of class Ei, σ²Ei,l denotes the variance of the l-th dimensional feature in the samples of class Ei, and CNBC is the naive Bayes probability model:

CNBC = argmax over Ei of P(Ei) · ∏l P(al|Ei).
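A minimal Python sketch of such a Gaussian naive Bayes scene classifier, using the object quantity-ratio vector as the feature; the scene labels and training samples shown are hypothetical.

import numpy as np

class GaussianNaiveBayesScene:
    """Per-scene Gaussian model of the object quantity-ratio features omega_1 ... omega_n."""

    def fit(self, X, y, eps=1e-6):
        # X: (num_samples, n) quantity ratios, y: scene labels E_i
        self.classes = np.unique(y)
        self.prior = {c: float(np.mean(y == c)) for c in self.classes}        # P(E_i)
        self.mu = {c: X[y == c].mean(axis=0) for c in self.classes}           # mean of each feature
        self.var = {c: X[y == c].var(axis=0) + eps for c in self.classes}     # variance of each feature
        return self

    def predict(self, x):
        # C_NBC = argmax_i P(E_i) * prod_l P(a_l | E_i), computed in log space for numerical stability
        def log_posterior(c):
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                                    + (x - self.mu[c]) ** 2 / self.var[c])
            return np.log(self.prior[c]) + log_lik
        return max(self.classes, key=log_posterior)

# Hypothetical training data: rows are ratio vectors for (cup, remote, sofa) in two scenes.
X = np.array([[0.5, 0.2, 0.3], [0.4, 0.3, 0.3], [0.1, 0.1, 0.8], [0.2, 0.0, 0.8]])
y = np.array(["living_room", "living_room", "bedroom", "bedroom"])
clf = GaussianNaiveBayesScene().fit(X, y)
print(clf.predict(np.array([0.45, 0.25, 0.30])))   # -> "living_room"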
In the scene E, a first screening is performed using the label S to remove background objects and objects that the service robot cannot grasp: in this screen, background items and items that the service robot cannot grasp are labelled 0, and items that the service robot can grasp are labelled 1. The remaining objects are screened a second time to remove items that are unlikely or extremely unlikely to appear in the user's scene E, and the c target objects k1, k2, …, kc most likely to be chosen are selected. In the formulas, res1 and res2 are the results of the first and second screening: res1 screens the n objects On according to the screening condition S, and res2 screens the first-screening result res1 according to the scene E.
S(On) = 1 if On can be grasped by the service robot, and S(On) = 0 if On is a background item or cannot be grasped,
res1 = classifier1(On, S),
res2 = classifier2(res1, E),
The neural network model yields a rectangular recognition box (xk, yk, wk, hk) for each target object k, where x and y are the abscissa and ordinate of the pixel at the upper-left corner of the rectangular recognition box, and w and h are the width and height of the box.
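The two-stage screening can be sketched as two simple filters; the graspability table standing in for the label S and the scene-conditional candidate sets standing in for classifier2 are illustrative assumptions.

# Hypothetical graspability table standing in for the label S:
# 1 = can be grasped by the service robot, 0 = background / cannot be grasped.
GRASPABLE = {"cup": 1, "remote": 1, "chips": 1, "vase": 1, "sofa": 0, "tea_table": 0, "tv": 0}

# Hypothetical scene-conditional candidate sets standing in for classifier2 trained on scene E.
LIKELY_IN_SCENE = {
    "living_room": {"remote", "cup", "chips", "vase"},
    "kitchen": {"cup", "chips"},
}

def screen(detections, scene, c=4):
    """detections: list of (label, box). Returns up to c candidate target items."""
    res1 = [d for d in detections if GRASPABLE.get(d[0], 0) == 1]           # first screening (condition S)
    res2 = [d for d in res1 if d[0] in LIKELY_IN_SCENE.get(scene, set())]   # second screening (scene E)
    return res2[:c]

# Example: in a living-room scene the sofa is dropped and the graspable items remain.
print(screen([("sofa", (0, 0, 300, 200)), ("remote", (40, 60, 30, 10)), ("cup", (90, 80, 25, 30))],
             "living_room"))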
Step S2: rank the target objects by saliency, display the target object pictures on the human-computer interaction interface in that order, and ask the user through the inquiry window whether to grasp each one.
Fig. 2 is a schematic diagram of scene-based target determination and saliency ranking. The saliency-detection model used in this embodiment is EML-Net. In the encoding stage, EML-Net trains two pre-trained deep networks, DenseNet and NASNet, independently; the output of each is taken from its last layer, with the final fully connected layer replaced by a 1×1 convolution. In the decoding stage, multi-level features are taken from the two trained networks, four layers from DenseNet and three layers from NASNet, seven in total; a 1×1 convolution is applied to each to obtain seven feature maps, these are upsampled to the size of the largest feature map, and a final 1×1 convolution produces the result; the output is normalized and used as the saliency value of each pixel.
The target items are ranked using a two-dimensional Gaussian distribution combined with the saliency-detection result, the saliency being computed inside the rectangular boxes obtained by YOLOv3. Arranging the target items in descending order of saliency reflects the likely intention of the user. The items are displayed on the human-computer interaction interface in descending order of the saliency of their recognition boxes, and an inquiry window asking whether to grasp pops up. The specific formulas are as follows:
G(i, j) = 1/(2π·σx·σy) · exp(−((i − xcenter)²/(2σx²) + (j − ycenter)²/(2σy²))), where σx and σy are the standard deviations of the two-dimensional Gaussian in the horizontal and vertical directions,
Obj(k) = Σ over the pixels (i, j) of box k of H(i, j)·G(i, j),
Result = rank[Obj(k)],
wherein i and j are the abscissa and ordinate of a pixel on the image; xcenter = x + w/2 and ycenter = y + h/2 are the horizontal and vertical coordinates of the centre of the detected rectangular box; H(i, j) is the saliency-detection output value of the pixel; G(i, j) is the Gaussian distribution probability value of the pixel within the corresponding rectangular recognition box; multiplying H(i, j) and G(i, j) for every point of the rectangular recognition box and summing gives the saliency Obj(k) of the box. Computing the saliency in this way concentrates the salient region on the centre of the item as much as possible and removes the influence of the size of the rectangular recognition box on the result.
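A sketch of this saliency-weighted ranking is given below; because the patent's Gaussian parameters are not reproduced here, the choice of standard deviation (a quarter of the box size) is an assumption, while normalizing the Gaussian weights to sum to one reflects the stated goal of removing the influence of box size.

import numpy as np

def gaussian_weight(w, h, sigma_scale=0.25):
    """Normalized 2D Gaussian G(i, j) over a w-by-h box, centred on the box centre.
    sigma = sigma_scale * box size is an assumed parametrization."""
    sx, sy = max(w * sigma_scale, 1e-6), max(h * sigma_scale, 1e-6)
    jj, ii = np.meshgrid(np.arange(w), np.arange(h))      # jj: column index (x), ii: row index (y)
    g = np.exp(-(((jj - w / 2.0) ** 2) / (2 * sx ** 2) + ((ii - h / 2.0) ** 2) / (2 * sy ** 2)))
    return g / g.sum()                                    # weights sum to 1, so box size cancels out

def box_saliency(H, box):
    """Obj(k) = sum over the box of H(i, j) * G(i, j); H is the normalized saliency map."""
    x, y, w, h = box
    patch = H[y:y + h, x:x + w]
    return float(np.sum(patch * gaussian_weight(patch.shape[1], patch.shape[0])))

def rank_targets(H, boxes):
    """Return box indices ordered from most to least salient (the Result = rank[Obj(k)] step)."""
    scores = [box_saliency(H, b) for b in boxes]
    return sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)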
Step S3: recognize the user's electroencephalogram signals with the brain-computer interface, identify the user's intention, and decide whether to grasp the object shown on the human-computer interaction interface; if yes, go to step S4, otherwise continue with the next target object picture. The brain-computer interface acquires the electroencephalogram signals generated by the user and processes and recognizes them as the selections corresponding to 'yes' and 'no'.
Electroencephalogram (EEG) signals usually show certain rhythmic and spatial-distribution characteristics; by extracting these characteristics with suitable computational methods and recognizing them, the underlying conscious state can be distinguished and converted into the thinking intention of the human brain, which is then used to control external devices. Because the EEG signal is a very weak, non-linear electrophysiological signal whose amplitude is at the millivolt level and whose frequency content lies between 0.5 and 50 Hz, signal amplification and denoising are required before feature extraction.
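A band-pass and notch filtering step of this kind can be sketched with SciPy as follows; the sampling rate, filter order and 50 Hz notch frequency are assumed values.

import numpy as np
from scipy import signal

def preprocess_eeg(eeg, fs=256.0, band=(0.5, 50.0), notch_hz=50.0):
    """Band-pass 0.5-50 Hz and notch out power-line interference.
    eeg: array of shape (channels, samples); fs and notch_hz are assumed values."""
    b, a = signal.butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    eeg = signal.filtfilt(b, a, eeg, axis=-1)             # zero-phase band-pass
    bn, an = signal.iirnotch(notch_hz, Q=30.0, fs=fs)     # power-frequency notch
    return signal.filtfilt(bn, an, eeg, axis=-1)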
The brain-computer interface used in this embodiment is an exogenous brain-computer interface based on the steady-state visual evoked potential (SSVEP). The SSVEP is an evoked potential produced when an optical signal stimulates the visual system: when a visual stimulus flickers periodically at a specific frequency, the visual system produces an evoked response with a stable frequency signature. The SSVEP-based brain-computer interface flickers the 'yes' and 'no' options at different fixed frequencies, stimulating the occipital region of the user's brain to generate EEG signals at the corresponding frequencies. The acquired EEG signals are amplified by the amplifier, preprocessed to remove power-frequency interference and artifact signals, and recognized by canonical correlation analysis; the recognition result is mapped to the user's 'yes' or 'no' judgement. The 'yes' output signal controls the service robot to fetch the target object through visual servoing; the 'no' output signal switches the human-computer interaction interface to the next target object picture.
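A canonical-correlation-analysis decision of this kind can be sketched as follows; the window length, the use of two harmonics and scikit-learn's CCA are assumptions, while the 11 Hz and 13 Hz frequencies correspond to the 'yes' and 'no' flicker blocks of the embodiment described below.

import numpy as np
from sklearn.cross_decomposition import CCA

def cca_correlation(X, Y):
    """Largest canonical correlation between an EEG window X (samples, channels) and reference Y."""
    cca = CCA(n_components=1)
    u, v = cca.fit_transform(X, Y)
    return abs(np.corrcoef(u[:, 0], v[:, 0])[0, 1])

def classify_ssvep(eeg_window, fs, freqs=(11.0, 13.0), n_harmonics=2):
    """Return the stimulation frequency (e.g. the 'yes'/'no' flicker) with the highest correlation."""
    t = np.arange(eeg_window.shape[0]) / fs
    best_f, best_r = None, -1.0
    for f in freqs:
        refs = []
        for h in range(1, n_harmonics + 1):                       # sine/cosine references per harmonic
            refs += [np.sin(2 * np.pi * h * f * t), np.cos(2 * np.pi * h * f * t)]
        r = cca_correlation(eeg_window, np.column_stack(refs))
        if r > best_r:
            best_f, best_r = f, r
    return best_f, best_r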
Step S4: according to the target object selected by the user, move the service robot to the vicinity of the target object and control the mechanical arm of the service robot by visual servoing to complete the grasp. If the user selects 'yes', the service robot moves near the target object and the mechanical arm is controlled by visual servoing to complete the grasp; if the user selects 'no', the human-computer interaction interface displays the next target item picture and step S3 is repeated.
The specific procedure for visual-servo control of the mechanical arm to grasp the object is as follows. Using the depth information acquired by the depth camera, the point cloud data of all points inside the rectangular recognition box (xk, yk, wk, hk) of the target item k are denoised and averaged to obtain the position of the target object relative to the depth camera, which is converted by a coordinate transformation into the position (p.x, p.y, p.z) relative to the mechanical-arm coordinate system. The arm consists of 5 revolute joints and connecting links; the first 4 revolute joints control the arm's posture and the last revolute joint controls the gripper. First it is computed whether the target object lies within the arm's workspace. If it does, the servo grasping module controls the arm to grasp. If it lies outside the workspace, this is fed back to the movable platform so that the service robot moves until the object is inside the workspace, after which the grasping action is completed. To ensure that the arm grasps stably, the grasping posture of the arm is fixed; see Fig. 3, the kinematic analysis diagram of the arm. Link l4 is held horizontal while grasping, i.e.:

θ2 + θ3 + θ4 = 0,

from the geometric analysis one obtains:

l1 + l2·cosθ2 + l3·cos(θ2 + θ3) = p.z,
[equation image: the corresponding constraint relating l2, l3, θ2, θ3 to the horizontal distance √(p.x² + p.y²)],

for convenience of presentation, a side length l and an angle are introduced:

[equation images: definitions of the auxiliary side length l and angle in terms of p.x, p.y, p.z and the link lengths],

the joint angles θ1, θ2, θ3, θ4 are then obtained from the formulas below; θ5 is the grasping angle, taken as 0.2-0.3 rad according to actual needs:

θ1 = arctan(p.x, p.y),
[equation images: closed-form expressions for θ2 and θ3 in terms of l, l2 and l3],
θ4 = −(θ2 + θ3).
Each revolute joint is rotated in turn to the angle obtained from the kinematic analysis, completing the grasping action.
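Since the closed-form expressions for θ2 and θ3 appear only as images in the original, the following Python sketch solves the same grasp pose under stated assumptions: θ2 and θ3 are measured from the vertical so that l1 + l2·cosθ2 + l3·cos(θ2 + θ3) = p.z, the horizontal reach to the wrist is taken as √(p.x² + p.y²) − l4, and the elbow-up solution is chosen. It is one standard way to solve this geometry, not necessarily the patent's own formulas.

import numpy as np

def inverse_kinematics(p, l1, l2, l3, l4, theta5=0.25):
    """Sketch of the grasp-pose IK for the 5-joint arm with link l4 held horizontal.
    p = (p.x, p.y, p.z) is the target position in the arm frame; theta5 is the grasp
    angle, taken in the 0.2-0.3 rad range as in the embodiment."""
    px, py, pz = p
    theta1 = np.arctan2(px, py)                  # base yaw, written as theta1 = arctan(p.x, p.y)

    r = np.hypot(px, py) - l4                    # assumed horizontal distance to the wrist centre
    z = pz - l1                                  # height of the wrist centre above joint 2
    l = np.hypot(r, z)                           # the auxiliary side length l
    if l < 1e-9 or not (abs(l2 - l3) <= l <= l2 + l3):
        raise ValueError("target outside the arm workspace")

    phi = np.arctan2(r, z)                                        # angle of l measured from the vertical
    alpha = np.arccos((l**2 + l2**2 - l3**2) / (2 * l * l2))      # law of cosines at the shoulder
    beta = np.arccos((l2**2 + l3**2 - l**2) / (2 * l2 * l3))      # interior elbow angle

    theta2 = phi - alpha                         # shoulder pitch (elbow-up branch)
    theta3 = np.pi - beta                        # elbow pitch
    theta4 = -(theta2 + theta3)                  # keeps l4 horizontal: theta2 + theta3 + theta4 = 0
    return theta1, theta2, theta3, theta4, theta5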
Fig. 4 is a schematic diagram of the vision-fused service robot control system. The system consists of two main parts, an SSVEP-based brain-computer interface part and a service robot part. The SSVEP-based brain-computer interface part comprises the electroencephalogram acquisition equipment, the human-computer interaction interface and a first processing unit; the service robot part comprises the movable platform, the recognition and positioning module, the servo grasping module and a second processing unit.
The electroencephalogram acquisition equipment comprises a biological-signal amplifier with multiple sampling channels, a high-performance active-electrode system for recording non-invasive electrophysiological signals (the gamma box for short), an EEG cap and a number of EEG electrodes. The acquisition equipment used in this embodiment is from g.tec and comprises a g.GAMMAcap, a g.GAMMAsys gamma box and a g.USBamp amplifier.
Referring to Fig. 5, the schematic diagram of the human-computer interaction interface, the interface has two pages: the first page is the movement-navigation interface and the second page is the recognition-and-grasp interface. The display used in this embodiment is a 23.8-inch LCD screen. The movement-navigation interface controls the robot's movement and navigation and carries 5 flicker blocks at 6 Hz, 7 Hz, 8 Hz, 9 Hz and 10 Hz, corresponding to the commands 'forward', 'backward', 'left turn', 'right turn' and 'next page'; the first 4 allow navigation to any position, and 'next page' switches the flicker interface to the recognition-and-grasp interface. The recognition-and-grasp interface lets the user select, in order, which recognized target object to grasp. It has 3 flicker blocks at 9 Hz, 11 Hz and 13 Hz, corresponding to the commands 'previous page', 'yes' and 'no'; 'previous page' switches back to the movement-navigation interface, and after the scene-based target determination is completed the 'yes' and 'no' commands appear in a pop-up together with the picture of the target object, so that the user can choose whether to grasp it.
The first processing unit and the second processing unit exchange data between the SSVEP-based brain-computer interface and the service robot over a TCP/IP communication protocol.
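A minimal sketch of such a TCP/IP exchange is shown below; the address, port and JSON message format are assumptions.

import json
import socket

HOST, PORT = "192.168.1.10", 9000      # assumed address of the second processing unit

def send_decision(decision, target_id):
    """Send one BCI decision ('yes'/'no') and the currently displayed target id over TCP."""
    msg = json.dumps({"decision": decision, "target": target_id}).encode("utf-8")
    with socket.create_connection((HOST, PORT), timeout=2.0) as sock:
        sock.sendall(msg + b"\n")

# Example: the user accepted the first candidate item.
# send_decision("yes", 0)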
The service robot used in this embodiment is a TurtleBot comprising the movable platform, the recognition and positioning module and the servo grasping module. The movable platform is a Kobuki chassis, a two-wheel differential base on which the robot can move forward and backward and turn left and right stably. The visual sensor of the recognition and positioning module is a PrimeSense depth camera; when the PrimeSense acquires an image, the image is recognized with the scene-based target-determination method and the position information in the recognition result is passed to the servo grasping module. The main component of the servo grasping module is a Turtlebot_Arm robotic arm, which is responsible for automatically grasping the target object.
The working environment of the service robot in this embodiment is a home. The user sits about 70 cm in front of the display screen and wears the EEG cap; 9 electrode channels over the occipital lobe are selected as the signal sources for analysing the steady-state visual evoked potential response. The gamma box and the amplifier are connected through the electrode wires, and acquisition and recognition of the EEG signals are completed once they are connected to the first processing unit. The user controls the robot's movement and navigation by gazing at the 'forward', 'backward', 'left turn' and 'right turn' flicker blocks of the movement-navigation interface. When the user has navigated the service robot near the item to be fetched in the living room, gazing at the 'next page' flicker block switches the human-computer interaction interface to the recognition-and-grasp interface; the service robot recognizes several items with the object recognition and positioning module and performs scene analysis. After the environment is analysed as a living-room scene, objects that cannot be grasped, such as the sofa and the tea table, are removed, and, combined with the scene, the television remote control, tea cup, potato chips, vase and so on are taken as target objects in order of saliency. The television remote control is presented first on the recognition-and-grasp interface; if the user gazes at the 'yes' flicker block, the TurtleBot automatically adjusts the arm to the grasping posture and completes the grasp; if the user gazes at the 'no' flicker block, the interface presents the second-ranked target object, the tea cup; and so on, until the user gazes at the 'yes' flicker block, whereupon the TurtleBot adjusts the arm to the grasping posture and completes the grasp. After a successful grasp, the user returns to the movement-navigation interface by gazing at the 'previous page' flicker block and navigates the robot to the delivery destination. This completes one successful item-delivery task by the service robot.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article or terminal that comprises the element. Further, herein, "greater than", "less than", "more than" and the like are understood to exclude the stated number, while "above", "below", "within" and the like are understood to include it.
Although the embodiments have been described, those skilled in the art can make other variations and modifications to these embodiments once they learn of the basic inventive concept. The above embodiments are therefore only examples of the present invention and are not intended to limit its scope; all equivalent structures or equivalent processes made using the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of the present invention.

Claims (3)

1. A vision-fused service robot control method is characterized by comprising the following steps:
step S1, acquiring and analysing scene information around the service robot to obtain potential target objects and their pixel positions;
step S2, ranking the target objects by saliency, displaying the target object pictures on the human-computer interaction interface in that order, and asking the user through an inquiry window whether to grasp each object;
the ranking of the target objects by saliency in step S2 includes the following steps:
ranking the target items using a two-dimensional Gaussian distribution combined with the saliency detection result, presenting them on the human-computer interaction interface in descending order of the saliency of their recognition boxes, and popping up an inquiry window asking whether to grasp; the saliency of a recognition box is given by the following formulas:
G(i, j) = 1/(2π·σx·σy) · exp(−((i − xcenter)²/(2σx²) + (j − ycenter)²/(2σy²))), where σx and σy are the standard deviations of the two-dimensional Gaussian in the horizontal and vertical directions,
Obj(k) = Σ over the pixels (i, j) of box k of H(i, j)·G(i, j),
Result = rank[Obj(k)],
wherein i and j are the abscissa and ordinate of a pixel point on the image; (xk, yk, wk, hk) is the rectangular recognition box of the potential target object k obtained in step S1, x and y being the abscissa and ordinate of the pixel at the upper-left corner of the rectangular recognition box and w and h its width and height; xcenter = x + w/2 and ycenter = y + h/2 are the horizontal and vertical coordinates of the centre of the detected rectangular box; H(i, j) is the saliency-detection output value of the pixel; G(i, j) is the Gaussian distribution probability value of the pixel within the corresponding rectangular recognition box; multiplying H(i, j) and G(i, j) for every point of the rectangular recognition box and summing gives the saliency Obj(k) of the box; rank is a ranking function;
step S3, recognizing the user's electroencephalogram signals with a brain-computer interface to identify the user's intention and decide whether to grasp the object shown on the human-computer interaction interface; if yes, going to step S4, and if not, continuing to display the next target object picture;
and step S4, moving the service robot to the vicinity of the target object selected by the user, and controlling the mechanical arm of the service robot by visual servoing to complete the grasp.
2. The vision-fused service robot control method according to claim 1, wherein the step S1 specifically includes the following steps,
collecting image and depth information with a camera, performing target detection on the collected image with a neural network model, and using the n detected objects O1, O2, …, On and their corresponding quantity ratios ω1, ω2, …, ωn over the m scenes E1, E2, …, Em to identify the scene E in which the service robot is located; naive Bayes is used to train a scene recognition classifier; when the l-th feature is a continuous value, it is assumed to follow a Gaussian distribution, so that for a scene Ei, P(al|Ei) is the probability that feature al occurs:

P(al|Ei) = 1/(√(2π)·σEi,l) · exp(−(al − μEi,l)² / (2σ²Ei,l)),

where μEi,l denotes the mean of the l-th dimensional feature in the samples of class Ei, σ²Ei,l denotes the variance of the l-th dimensional feature in the samples of class Ei, and CNBC is the naive Bayes probability model:

CNBC = argmax over Ei of P(Ei) · ∏l P(al|Ei),
in a scene E, a first screening removes background objects and objects that the service robot cannot grasp; a second screening of the remaining objects selects the c target objects k1, k2, …, kc that the user is most likely to choose in scene E; res1 and res2 are the results of the first and second screening respectively: res1 screens the n objects On according to the screening condition S, and res2 screens the first-screening result res1 according to the scene E,

S(On) = 1 if On can be grasped by the service robot, and S(On) = 0 if On is a background object or cannot be grasped,
res1 = classifier1(On, S),
res2 = classifier2(res1, E),
the neural network model yields a rectangular recognition box (xk, yk, wk, hk) for each target object k, where x and y are the abscissa and ordinate of the pixel at the upper-left corner of the rectangular recognition box, and w and h are the width and height of the box.
3. The vision-fused service robot control method according to claim 1, wherein in step S4 the visual servo module controls the mechanical arm to grasp: from the rectangular recognition box (xk, yk, wk, hk) of the target item k, the point cloud data of all points are denoised and averaged to obtain the position of the target object relative to the depth camera, which is converted by a coordinate transformation into the position (p.x, p.y, p.z) relative to the mechanical-arm coordinate system; the mechanical arm consists of 5 revolute joints and connecting links, the first 4 revolute joints controlling the arm's posture and the last revolute joint controlling the gripper; so that the arm can grasp the target object stably, link l4 is held horizontal while grasping, i.e.:
θ2 + θ3 + θ4 = 0,
from the geometric analysis one obtains:
l1 + l2·cosθ2 + l3·cos(θ2 + θ3) = p.z,
[equation image: the corresponding constraint relating l2, l3, θ2, θ3 to the horizontal distance √(p.x² + p.y²)],
for convenience of presentation, a side length l and an angle are introduced:
[equation images: definitions of the auxiliary side length l and angle in terms of p.x, p.y, p.z and the link lengths],
the joint angles θ1, θ2, θ3, θ4 are then obtained from the formulas below; θ5 is the grasping angle, taken in the range 0.2-0.3 rad:
θ1 = arctan(p.x, p.y),
[equation images: closed-form expressions for θ2 and θ3 in terms of l, l2 and l3],
θ4 = −(θ2 + θ3).
CN202011073216.2A 2020-10-09 2020-10-09 Visual fusion service robot control method Active CN112223288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011073216.2A CN112223288B (en) 2020-10-09 2020-10-09 Visual fusion service robot control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011073216.2A CN112223288B (en) 2020-10-09 2020-10-09 Visual fusion service robot control method

Publications (2)

Publication Number Publication Date
CN112223288A CN112223288A (en) 2021-01-15
CN112223288B true CN112223288B (en) 2021-09-14

Family

ID=74120081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011073216.2A Active CN112223288B (en) 2020-10-09 2020-10-09 Visual fusion service robot control method

Country Status (1)

Country Link
CN (1) CN112223288B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113138668B (en) * 2021-04-25 2023-07-18 清华大学 Automatic driving wheelchair destination selection method, device and system
CN115476366B (en) * 2021-06-15 2024-01-09 北京小米移动软件有限公司 Control method, device, control equipment and storage medium for foot robot
CN113499138B (en) * 2021-07-07 2022-08-09 南开大学 Active navigation system for surgical operation and control method thereof
CN114146283A (en) * 2021-08-26 2022-03-08 上海大学 Attention training system and method based on target detection and SSVEP
CN115730236B (en) * 2022-11-25 2023-09-22 杭州电子科技大学 Medicine identification acquisition method, equipment and storage medium based on man-machine interaction

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824080A (en) * 2014-02-21 2014-05-28 北京化工大学 Robot SLAM object state detection method in dynamic sparse environment
CN107139179A (en) * 2017-05-26 2017-09-08 西安电子科技大学 A kind of intellect service robot and method of work
CN109015635A (en) * 2018-08-08 2018-12-18 西安科技大学 A kind of service robot control method based on brain-machine interaction
CN109176521A (en) * 2018-09-19 2019-01-11 北京因时机器人科技有限公司 A kind of mechanical arm and its crawl control method and system
CN109366508A (en) * 2018-09-25 2019-02-22 中国医学科学院生物医学工程研究所 A kind of advanced machine arm control system and its implementation based on BCI
CN109531584A (en) * 2019-01-31 2019-03-29 北京无线电测量研究所 A kind of Mechanical arm control method and device based on deep learning
CN109977970A (en) * 2019-03-27 2019-07-05 浙江水利水电学院 Character recognition method under water conservancy project complex scene based on saliency detection
CN111126335A (en) * 2019-12-31 2020-05-08 珠海大横琴科技发展有限公司 SAR ship identification method and system combining significance and neural network
CN111191650A (en) * 2019-12-30 2020-05-22 北京市新技术应用研究所 Object positioning method and system based on RGB-D image visual saliency
CN111515945A (en) * 2020-04-10 2020-08-11 广州大学 Control method, system and device for mechanical arm visual positioning sorting and grabbing
CN111640116A (en) * 2020-05-29 2020-09-08 广西大学 Aerial photography graph building segmentation method and device based on deep convolutional residual error network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8965580B2 (en) * 2012-06-21 2015-02-24 Rethink Robotics, Inc. Training and operating industrial robots
US9616569B2 (en) * 2015-01-22 2017-04-11 GM Global Technology Operations LLC Method for calibrating an articulated end effector employing a remote digital camera
PT108690B (en) * 2015-07-13 2023-04-04 Fund D Anna Sommer Champalimaud E Dr Carlos Montez Champalimaud SYSTEM AND METHOD FOR BRAIN-MACHINE INTERFACE FOR OPERANT LEARNING

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824080A (en) * 2014-02-21 2014-05-28 北京化工大学 Robot SLAM object state detection method in dynamic sparse environment
CN107139179A (en) * 2017-05-26 2017-09-08 西安电子科技大学 A kind of intellect service robot and method of work
CN109015635A (en) * 2018-08-08 2018-12-18 西安科技大学 A kind of service robot control method based on brain-machine interaction
CN109176521A (en) * 2018-09-19 2019-01-11 北京因时机器人科技有限公司 A kind of mechanical arm and its crawl control method and system
CN109366508A (en) * 2018-09-25 2019-02-22 中国医学科学院生物医学工程研究所 A kind of advanced machine arm control system and its implementation based on BCI
CN109531584A (en) * 2019-01-31 2019-03-29 北京无线电测量研究所 A kind of Mechanical arm control method and device based on deep learning
CN109977970A (en) * 2019-03-27 2019-07-05 浙江水利水电学院 Character recognition method under water conservancy project complex scene based on saliency detection
CN111191650A (en) * 2019-12-30 2020-05-22 北京市新技术应用研究所 Object positioning method and system based on RGB-D image visual saliency
CN111126335A (en) * 2019-12-31 2020-05-08 珠海大横琴科技发展有限公司 SAR ship identification method and system combining significance and neural network
CN111515945A (en) * 2020-04-10 2020-08-11 广州大学 Control method, system and device for mechanical arm visual positioning sorting and grabbing
CN111640116A (en) * 2020-05-29 2020-09-08 广西大学 Aerial photography graph building segmentation method and device based on deep convolutional residual error network

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Heterogeneous Sensor Fusion Framework for Autonomous Mobile Robot Obstacle Avoidance;Ali Zia;《2010 10th International Conference on Intelligent Systems Design and Applications》;20101231;全文 *
On the Distribution of Salient Objects in Web Images and Its Influence on Salient Object Detection;Boris Schauerte;《PLOS ONE 》;20150722;全文 *
visual saliency detection by spatially weighted dissimilarity;Lijuan Duan;《Proceedings of IEEE Conference on Computer Vision and Pattern Recognition》;20111231;全文 *
Saliency detection based on convex-hull background prior and object prior; Hui Kai; China Master's Theses Full-text Database (Information Science and Technology); 20200215; full text *
Environment modeling for mobile robots based on visual saliency; Tang Lisha; Design and Analysis; 20190930; full text *
Object grasp detection with a multimodal convolutional neural network; Wei Yingzi; Journal of Shenyang Ligong University; 20190831; pp. 36-38 *
Target detection against a sea-sky background combined with visual saliency; Wang Guanjun; Computer and Multimedia Technology; 20200531; full text *
Correlation filter tracking algorithm fusing saliency and motion information; Zhang Weijun; Acta Automatica Sinica; 20200914; full text *

Also Published As

Publication number Publication date
CN112223288A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
CN112223288B (en) Visual fusion service robot control method
Bell et al. Control of a humanoid robot by a noninvasive brain–computer interface in humans
EP3721320B1 (en) Communication methods and systems
CN110666791B (en) RGBD robot nursing system and method based on deep learning
US9389685B1 (en) Vision based brain-computer interface systems for performing activities of daily living
CN108646915B (en) Method and system for controlling mechanical arm to grab object by combining three-dimensional sight tracking and brain-computer interface
Mao et al. A brain–robot interaction system by fusing human and machine intelligence
Pathirage et al. A vision based P300 brain computer interface for grasping using a wheelchair-mounted robotic arm
Zhong et al. A dynamic user interface based BCI environmental control system
CN111399652A (en) Multi-robot hybrid system based on layered SSVEP and visual assistance
CN110673721B (en) Robot nursing system based on vision and idea signal cooperative control
Vijayprasath et al. Experimental explorations on EOG signal processing for realtime applications in labview
Sasaki et al. Robot control system based on Electrooculography and Electromyogram
CN112464768A (en) Fatigue detection method based on self-attention multi-feature fusion
Li et al. An adaptive P300 model for controlling a humanoid robot with mind
CN115050104A (en) Continuous gesture action recognition method based on multichannel surface electromyographic signals
CN108415568B (en) Robot intelligent idea control method based on modal migration complex network
CN113887374B (en) Brain control water drinking system based on dynamic convergence differential neural network
CN112836549A (en) User information detection method and system and electronic equipment
CN112936259B (en) Man-machine cooperation method suitable for underwater robot
US11687074B2 (en) Method for controlling moving body based on collaboration between the moving body and human, and apparatus for controlling the moving body thereof
Naijian et al. Coordination control strategy between human vision and wheelchair manipulator based on BCI
Zhang et al. Mind control of a service robot with visual servoing
Nandikolla et al. Hybrid bci controller for a semi-autonomous wheelchair
CN113947815A (en) Man-machine gesture cooperative control method based on myoelectricity sensing and visual sensing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant