Disclosure of Invention
The invention aims to provide a method and a system for a robot to board an elevator, which avoid wasting the time of elevator passengers and improve the robot's ability to interact in a friendly manner.
The technical scheme provided by the invention is as follows:
in one aspect, a method for a robot to board an elevator is provided, comprising:
when the elevator door is opened, acquiring an image in the elevator;
inputting the image into a pre-trained neural network model to obtain a target detection result and a preset area detection result, wherein the preset area is an area which needs to be occupied when the robot enters the elevator;
and judging whether the elevator can be entered according to the target detection result and the preset area detection result; if so, entering the elevator; otherwise, releasing the elevator and calling an elevator again.
Further preferably, the determining whether the elevator can be entered according to the target detection result and a preset area detection result specifically includes:
judging whether the preset area is occupied according to the preset area detection result, and if not, judging that the elevator can be entered;
if the preset area is determined to be occupied according to the preset area detection result, judging, according to the target detection result, whether the occupant is an object rather than a person;
if the occupant is an object, judging that the elevator cannot be entered;
if the occupant is not an object, outputting avoidance information, waiting for a preset time, acquiring a new image in the elevator, inputting the new image into the neural network model to obtain a new target detection result and a new preset area detection result, and judging again whether the elevator can be entered according to the new results; and when the number of times the image in the elevator has been re-acquired is greater than a preset threshold, judging that the elevator cannot be entered.
Further preferably, the inputting the image into a pre-trained neural network model, and the obtaining the target detection result and the preset region detection result specifically includes:
inputting the images into a pre-trained neural network model, and extracting feature maps of different layers;
carrying out feature fusion and resolution improvement on the feature maps of different layers;
and outputting a target detection result and a preset region detection result according to the feature map obtained after feature fusion and resolution improvement.
Further preferably, the inputting the image into a pre-trained neural network model, and the extracting feature maps of different layers specifically includes:
inputting the image into a pre-trained neural network model, extracting a first feature map from the last layer, and extracting a second feature map from an earlier layer, wherein the resolution of the second feature map is twice that of the first feature map;
the performing feature fusion and resolution improvement on the feature maps of different layers specifically includes:
processing the first feature map according to a preset step so that the resolution of the first feature map is doubled, to obtain a third feature map;
performing 1 × 1 convolution processing on the second feature map, and then fusing the second feature map with the third feature map to obtain a fourth feature map;
processing the fourth feature map multiple times according to the preset step, raising its resolution each time, to obtain a fifth feature map;
the outputting of the target detection result and the preset area detection result according to the feature map obtained after feature fusion and resolution improvement specifically comprises:
and outputting a target detection result and a preset area detection result according to the fifth feature map.
Further preferably, the processing the first feature map according to the preset step so that the resolution of the first feature map is doubled, to obtain a third feature map, specifically includes:
deconvolving the first feature map;
performing channel separation on the deconvolved first feature map;
performing convolution processing on the channel-separated first feature map using two different convolution kernels;
and performing channel fusion on the feature map obtained after the convolution processing to obtain a third feature map.
In another aspect, there is also provided a system for a robot to board an elevator, comprising:
the image acquisition module is used for acquiring an image in the elevator when the elevator door is opened;
the detection module is used for inputting the image into a pre-trained neural network model to obtain a target detection result and a preset area detection result, wherein the preset area is an area which needs to be occupied when the robot enters the elevator;
and the processing module is used for judging whether the elevator can be entered or not according to the target detection result and a preset area detection result, entering the elevator if the elevator can be entered, releasing the elevator if the elevator can not be entered, and calling the elevator again.
Further preferably, the processing module comprises:
the processing unit is used for judging whether the preset area is occupied or not according to the detection result of the preset area, and if not, judging that the elevator can be entered;
the processing unit is further configured to determine, according to the target detection result, whether the occupant is an object rather than a person if the preset area is determined to be occupied according to the preset area detection result;
the processing unit is further used for judging that the elevator cannot be entered if the occupant is an object;
the processing unit is further used for outputting avoidance information if the occupant is not an object, acquiring a new image in the elevator after waiting for a preset time, inputting the new image into the neural network model to obtain a new target detection result and a new preset area detection result, judging again whether the elevator can be entered according to the new results, and judging that the elevator cannot be entered when the number of times the image in the elevator has been re-acquired is greater than a preset threshold.
Further preferably, the detection module comprises:
the feature extraction unit is used for inputting the image into a pre-trained neural network model and extracting feature maps of different layers;
the resolution improving unit is used for carrying out feature fusion and resolution improvement on the feature maps of different layers;
and the task output unit is used for outputting a target detection result and a preset area detection result according to the feature map obtained after feature fusion and resolution improvement.
Further preferably, the feature extraction unit is further configured to input the image into a pre-trained neural network model, extract a first feature map from the last layer, and extract a second feature map from an earlier layer, where the resolution of the second feature map is twice that of the first feature map;
the resolution increasing unit includes:
the resolution improving subunit is configured to process the first feature map according to a preset step so that the resolution of the first feature map is doubled, to obtain a third feature map;
the feature fusion subunit is configured to perform 1 × 1 convolution processing on the second feature map and then fuse the second feature map with the third feature map to obtain a fourth feature map;
the resolution improving subunit is further configured to process the fourth feature map multiple times according to the preset step, raising its resolution each time, to obtain a fifth feature map;
and the task output unit is further used for outputting a target detection result and a preset area detection result according to the fifth feature map.
Further preferably, the resolution improving subunit is further configured to deconvolve the first feature map; perform channel separation on the deconvolved first feature map; perform convolution processing on the channel-separated first feature map using two different convolution kernels; and perform channel fusion on the feature maps obtained after the convolution processing to obtain a third feature map.
Compared with the prior art, the method and the system for a robot to board an elevator have the following beneficial effects. The invention judges whether the elevator can be entered by acquiring an image of the elevator interior and detecting whether the preset area in the elevator is occupied and what occupies it; when the elevator cannot be entered, the elevator is released directly, which avoids taking up the time of the passengers in the elevator and improves the robot's ability to interact in a friendly manner. In addition, because the judgment is made from an image and a neural network model, the invention avoids the problem that arises with a depth sensor, where the object type is unknown, occluded parts cannot be detected, and the available space is misjudged; judgment accuracy is thereby improved.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, without inventive effort, other drawings and embodiments can be derived from them.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention and do not represent the actual structure of a product. In addition, to keep the drawings concise and easy to understand, where several components in a drawing have the same structure or function, only one of them is schematically illustrated or labeled. In this document, "one" means not only "only one" but also covers the case of "more than one".
According to a first embodiment provided by the present invention, as shown in fig. 1, a method of riding an elevator by a robot includes:
S100, when the elevator door is opened, obtaining an image in the elevator;
S200, inputting the image into a pre-trained neural network model to obtain a target detection result and a preset area detection result, wherein the preset area is an area which needs to be occupied when the robot enters the elevator;
S300, judging whether the elevator can be entered according to the target detection result and the preset area detection result; if so, entering the elevator; otherwise, releasing the elevator and calling an elevator again.
Specifically, when the robot needs to take the elevator, it first travels to the elevator and then calls it. When the elevator door opens, the robot faces the elevator directly and acquires an image of the elevator interior through a camera mounted on the robot. Alternatively, the camera may be installed inside the elevator: after the robot calls the elevator, the server dispatches a target elevator to the floor where the robot is located, and when the target elevator arrives at that floor, the server controls the camera in the elevator to acquire the image of the elevator interior.
After the image in the elevator is acquired, it is recognized by a pre-trained neural network model to obtain a target detection result and a preset area detection result. A target is a person or an object commonly found at the deployment site; in a hospital setting, for example, targets include people and objects commonly found in the hospital, such as hospital beds, carts, and wheelchairs. The preset area is the position that the robot needs to occupy after entering the elevator. Since the size of the robot is known, a preset area that can accommodate the robot can be defined in the elevator in advance, and the preset area is larger than the robot. When entering the elevator, the robot can only move forward or backward and cannot move sideways, so the preset area is a fixed area in the elevator.
The neural network model is an improvement on the Objects as Points base structure. The backbone uses a MobileNetV3 structure to extract features, and an intermediate module is added to raise the resolution of the feature maps extracted by MobileNetV3. The resolution-improved feature map is then fed into a multi-task output module that outputs the detection results; that is, a region judgment branch is added on top of Objects as Points.
After the neural network model is built, a large number of images of elevator interiors are collected, and the targets and the occupancy state of the preset area in each image are labeled manually. The labeled images are divided into a training set and a test set. The constructed neural network model is trained on the training set, and the trained model is tested on the test set. According to the test results, data enhancement is applied to specific training samples in the training set, the enhanced data set is again divided into a training set and a test set, and the model is trained and tested again. The steps of data enhancement, training, and testing are repeated until the detection performance reaches its limit, yielding the trained neural network model.
After the target detection result and the preset area detection result are obtained, whether the elevator can be entered is judged from them. For example, if the preset area is detected as unoccupied, the robot can enter the elevator directly; if the preset area is detected as occupied by a hospital bed, the robot cannot enter the elevator and needs to release it and call another elevator.
The invention judges whether the elevator can be entered by acquiring an image of the elevator interior and detecting whether the preset area in the elevator is occupied and what occupies it; when the elevator cannot be entered, the elevator is released directly, which avoids taking up the time of the passengers in the elevator and improves the robot's ability to interact in a friendly manner. In addition, because the judgment is made from an image and a neural network model, the invention avoids the problem that arises with a depth sensor, where the object type is unknown, occluded parts cannot be detected, and the available space is misjudged; judgment accuracy is thereby improved.
According to a second embodiment provided by the present invention, as shown in fig. 2, a method for a robot to board an elevator, based on the first embodiment, the determining whether the elevator can be entered according to the target detection result and the preset area detection result specifically comprises:
judging whether the preset area is occupied according to the preset area detection result, and if not, judging that the elevator can be entered;
if the preset area is determined to be occupied according to the preset area detection result, judging, according to the target detection result, whether the occupant is an object rather than a person;
if the occupant is an object, judging that the elevator cannot be entered;
if the occupant is not an object, outputting avoidance information, waiting for a preset time, acquiring a new image in the elevator, inputting the new image into the neural network model to obtain a new target detection result and a new preset area detection result, and judging again whether the elevator can be entered according to the new results; and when the number of times the image in the elevator has been re-acquired is greater than a preset threshold, judging that the elevator cannot be entered.
Specifically, after a target detection result and a preset area detection result are obtained, the state of the preset area is judged according to the preset area detection result, and if the preset area is not occupied, the robot can be directly judged to enter the elevator.
If, according to the preset area detection result, at least 10% of the preset area (the proportion can be set according to actual conditions) is classified as occupied, the object occupying the preset area is analyzed according to the target detection result. If the preset area is occupied by a non-pedestrian such as a hospital bed, the robot cannot enter the elevator; the elevator is released directly, another elevator is called, and a new detection and judgment process begins.
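The occupancy threshold described above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: it assumes the preset area detection result is available as a 2D grid of 0/1 flags (1 meaning the cell is classified as occupied) and that the preset area is given as half-open cell bounds. The function names and the grid representation are hypothetical.

```python
def occupied_fraction(grid, bounds):
    """Fraction of preset-area cells flagged as occupied.

    grid   -- 2D list of 0/1 flags, one per classified cell (assumed format)
    bounds -- (row0, col0, row1, col1), half-open cell bounds of the preset area
    """
    row0, col0, row1, col1 = bounds
    cells = [grid[r][c] for r in range(row0, row1) for c in range(col0, col1)]
    if not cells:
        raise ValueError("preset area covers no cells")
    return sum(cells) / len(cells)


def is_preset_area_occupied(grid, bounds, threshold=0.10):
    """Treat the area as occupied when at least `threshold` of it is flagged.

    The 10% default follows the example in the text and is configurable.
    """
    return occupied_fraction(grid, bounds) >= threshold
```

For a preset area of 8 cells, a single occupied cell gives a fraction of 0.125, which is above the 10% example threshold, so the area would be treated as occupied.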
If a pedestrian occupies the preset area, the robot makes a voice announcement prompting the pedestrian to make way. After waiting for a preset time (for example, 5 seconds), a new image of the elevator interior is captured and run through the neural network model to obtain a new target detection result and a new preset area detection result, from which it is judged whether the pedestrian has made way. If the pedestrian has made way, the robot is judged able to enter the elevator and drives directly into it.
If the pedestrian has still not made way, the voice announcement is made again, and after the preset time the whole judgment process, namely steps S100, S200, and S300, is executed again. When the number of times the whole judgment process has been repeated exceeds a preset threshold (for example, 2 or 3 times), the pedestrian in the elevator is either unable or unwilling to make way; the robot then releases the elevator directly, gives up on it, and calls an elevator again.
In this scheme, when the preset area is occupied by a pedestrian, repeated detection and judgment make it possible to determine whether the pedestrian is cooperating; when the pedestrian does not cooperate, the elevator is released directly, which avoids wasting the time of both the pedestrian and the robot.
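The judgment flow of this embodiment can be sketched as a small retry loop. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the callables capture, detect, and announce are hypothetical stand-ins for the camera, the neural network model, and the voice broadcast, and detect is assumed to return a pair (occupied, occupant_is_person). The default wait of 5 seconds and the retry limit of 3 follow the examples in the text.

```python
import time
from enum import Enum


class Decision(Enum):
    ENTER = "enter"
    RELEASE = "release"  # release this elevator and call an elevator again


def decide_boarding(capture, detect, announce,
                    wait_s=5.0, max_retries=3, sleep=time.sleep):
    """Decide whether to enter, re-checking while a person blocks the area.

    capture()     -- returns an image of the elevator interior (hypothetical)
    detect(image) -- returns (occupied, occupant_is_person) (hypothetical)
    announce(msg) -- plays the avoidance prompt (hypothetical)
    """
    retries = 0
    while True:
        occupied, occupant_is_person = detect(capture())
        if not occupied:
            return Decision.ENTER       # preset area is free
        if not occupant_is_person:
            return Decision.RELEASE     # blocked by an object such as a bed
        if retries >= max_retries:
            return Decision.RELEASE     # person did not make way in time
        announce("Please make way for the robot.")
        sleep(wait_s)
        retries += 1
```

Passing the camera, model, and speaker in as callables keeps the loop independent of where the camera is mounted (on the robot or in the elevator), and the sleep parameter lets the waiting time be replaced in tests.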
According to a third embodiment provided by the present invention, as shown in fig. 3, a method for a robot to board an elevator, based on the first embodiment or the second embodiment, the step S200 of inputting the image into a pre-trained neural network model, and the obtaining of the target detection result and the preset area detection result specifically includes:
S210, inputting the image into a pre-trained neural network model, and extracting feature maps of different layers;
S220, performing feature fusion and resolution improvement on the feature maps of different layers;
and S230, outputting a target detection result and a preset region detection result according to the feature map obtained after feature fusion and resolution improvement.
Specifically, after the acquired image of the elevator interior is input into the trained neural network model, feature maps of different layers are extracted by MobileNetV3. In a convolutional neural network, a high-level feature map has stronger semantics and lower resolution and captures global and contour features, whereas a low-level feature map has weaker semantics and higher resolution and captures local and detail features.
Feature fusion is performed on the feature maps of different layers so that the features in the low-level feature map and those in the high-level feature map complement each other. The resolution of the fused feature map is then raised, and the target detection result and the preset area detection result are obtained from the resolution-improved feature map. Fusing features and raising the resolution allows both target detection and region detection in the image to be handled well.
Illustratively, the preset area detection result is obtained as follows. Suppose the resolution of the resolution-improved feature map is 1/n of that of the acquired image of the elevator interior (the original image), so that each point in the final feature map corresponds to an n × n area of the original image. Classifying each point in the final feature map is then equivalent to classifying each n × n area of the original image as occupied or unoccupied. The occupancy of each n × n area inside the preset area is then extracted to give the preset area detection result.
In this scheme, complementary fusion of the features in the low-level and high-level feature maps improves the efficiency and accuracy of target detection.
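The correspondence between feature-map points and n × n areas of the original image can be sketched as follows. This is a minimal illustration under assumptions not stated in the source: the preset area is taken to be an axis-aligned pixel rectangle in the original image, the per-point classification is taken to be a 2D list of 0/1 flags, and the function names are hypothetical.

```python
def region_to_cells(x0, y0, x1, y1, n):
    """Map a pixel rectangle [x0, x1) x [y0, y1) to feature-map cell bounds.

    Each feature-map point covers an n x n pixel area, so the start is
    rounded down and the end is rounded up to include partly covered cells.
    Returns half-open bounds (row0, col0, row1, col1).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    row0, col0 = y0 // n, x0 // n
    row1, col1 = -(-y1 // n), -(-x1 // n)  # ceiling division
    return row0, col0, row1, col1


def preset_area_cells(occupancy, bounds):
    """Extract the occupancy flags of the cells inside the preset area."""
    row0, col0, row1, col1 = bounds
    return [row[col0:col1] for row in occupancy[row0:row1]]
```

With n = 4, for example, a preset area spanning pixels x 8 to 20 and y 4 to 16 maps to feature-map rows 1 to 3 and columns 2 to 4.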
According to a fourth embodiment of the present invention, as shown in fig. 4, a method for a robot to board an elevator, comprises:
S100, when the elevator door is opened, acquiring an image in the elevator;
S211, inputting the image into a pre-trained neural network model, extracting a first feature map from the last layer, and extracting a second feature map from an earlier layer, wherein the resolution of the second feature map is twice that of the first feature map;
S221, processing the first feature map according to a preset step so that the resolution of the first feature map is doubled, to obtain a third feature map;
S222, performing 1 × 1 convolution processing on the second feature map, and then fusing it with the third feature map to obtain a fourth feature map;
S223, processing the fourth feature map multiple times according to the preset step, raising its resolution each time, to obtain a fifth feature map;
S231, outputting a target detection result and a preset area detection result according to the fifth feature map, wherein the preset area is an area which needs to be occupied when the robot enters the elevator;
S300, judging whether the elevator can be entered according to the target detection result and the preset area detection result; if so, entering the elevator; otherwise, releasing the elevator and calling an elevator again.
Specifically, in a network architecture, resolution generally decreases and the number of channels increases as depth increases. As shown in fig. 5, assuming the feature maps of the last layer extracted from MobileNetV3 have a resolution of 1/32 of the original image, each feature map of the last layer is defined as a first feature map; the feature maps of the layer whose resolution is 1/16 of the original image are then selected, and each of them is defined as a second feature map.
The first feature map is then processed according to the preset step so that its resolution is doubled, giving a third feature map whose resolution is 1/16 of that of the original image.
A 1 × 1 convolution is applied to the second feature map so that its number of channels equals that of the third feature map, and the convolved second feature map is then fused with the third feature map to obtain a fourth feature map.
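The channel matching and fusion can be illustrated with a small pure-Python sketch. A 1 × 1 convolution mixes channels independently at every spatial position, so it can change the channel count without changing the resolution. The source does not say how the two maps are fused; element-wise addition is assumed here purely for illustration, and the function names and the nested-list layout [channel][row][column] are hypothetical.

```python
def conv1x1(fmap, weights):
    """Apply a 1 x 1 convolution (no bias) to a feature map.

    fmap    -- nested list indexed [c_in][row][col]
    weights -- nested list indexed [c_out][c_in]
    Returns a nested list indexed [c_out][row][col].
    """
    c_in, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(w_row[c] * fmap[c][y][x] for c in range(c_in))
             for x in range(width)]
            for y in range(height)
        ]
        for w_row in weights
    ]


def fuse_add(a, b):
    """Fuse two equally shaped feature maps by element-wise addition.

    Addition is an assumption; the source only says the maps are fused.
    """
    if len(a) != len(b):
        raise ValueError("channel counts differ; apply conv1x1 first")
    return [
        [[pa + pb for pa, pb in zip(row_a, row_b)]
         for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]
```

Here a second feature map with three channels would be projected to the channel count of the third feature map by conv1x1 and then passed to fuse_add together with the third feature map to give the fourth feature map.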
The resolution enhancement processing of the preset step is then applied to the fourth feature map twice to obtain a fifth feature map whose resolution is 1/4 of that of the original image. After the fifth feature map is obtained, the objects in the image are detected through the Objects as Points base structure, and region classification detection is performed through the added region judgment branch.
For region classification detection, each point in the fifth feature map is classified as occupied or unoccupied. One point in the fifth feature map corresponds to one 4 × 4 area of the original image, so the original image is in effect divided into 4 × 4 areas, and classifying each point in the fifth feature map classifies each 4 × 4 area of the original image. This gives the occupancy of the areas of the original image, from which the occupancy of the preset area is extracted to give the preset area detection result.
Preferably, in step S221, processing the first feature map according to the preset step so that the resolution of the first feature map is doubled, to obtain a third feature map, specifically includes:
deconvolving the first feature map;
performing channel separation on the deconvolved first feature map;
performing convolution processing on the channel-separated first feature map using two different convolution kernels;
and performing channel fusion on the feature map obtained after the convolution processing to obtain a third feature map.
Specifically, the preset step comprises deconvolution, channel separation, convolution processing, and channel fusion, as shown in fig. 5. The first feature map, whose resolution is 1/32 of that of the original image, is first deconvolved and then channel-separated; a 3 × 3 convolution and a 5 × 5 convolution are then applied to the separated channels respectively, and channel fusion finally gives the third feature map. The resolution of the third feature map is thereby doubled to 1/16 of that of the original image.
The fourth feature map, obtained by applying a 1 × 1 convolution to the second feature map and fusing the result with the third feature map, also has a resolution of 1/16 of that of the original image. Resolution enhancement is then applied to the fourth feature map twice according to the preset step, giving a fifth feature map whose resolution is 1/4 of that of the original image. Raising the resolution of the fifth feature map to 1/4 of the original image makes the division of the original image into regions reasonable, avoids the loss of region classification accuracy caused by regions that are too large or too small, and remains compatible with detection by the Objects as Points algorithm.
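The resolution bookkeeping of the preset step can be checked with a short shape calculation. This sketch tracks only tensor shapes, not values, and rests on several assumptions that the source does not state: the deconvolution uses kernel 4, stride 2, and padding 1 and keeps the channel count; the channels are split into two halves; the 3 × 3 and 5 × 5 convolutions use padding 1 and 2 so that they preserve spatial size; and the input image is 512 × 512 with 64 channels. All function names are hypothetical.

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel, padding):
    """Output length of a stride-1 convolution along one axis."""
    return size + 2 * padding - kernel + 1


def upsample_block_shape(shape):
    """Shape after deconvolution, channel split, 3x3 / 5x5 conv, and concat."""
    channels, height, width = shape
    height, width = deconv_out(height), deconv_out(width)  # doubles H and W
    group_a = channels // 2            # channels sent to the 3 x 3 branch
    group_b = channels - group_a       # channels sent to the 5 x 5 branch
    size_a = (conv_out(height, 3, 1), conv_out(width, 3, 1))
    size_b = (conv_out(height, 5, 2), conv_out(width, 5, 2))
    if size_a != size_b:
        raise ValueError("branches must agree spatially before channel fusion")
    return (group_a + group_b, size_a[0], size_a[1])


def pipeline_shapes(image_size=512, channels=64):
    """Shapes of the first, third, fourth, and fifth feature maps."""
    first = (channels, image_size // 32, image_size // 32)  # 1/32 resolution
    third = upsample_block_shape(first)                     # 1/16 resolution
    fourth = third             # fusion with the projected second map keeps shape
    fifth = upsample_block_shape(upsample_block_shape(fourth))  # 1/4 resolution
    return first, third, fourth, fifth
```

Under these assumptions, a 512 × 512 input gives spatial sizes of 16, 32, 32, and 128 for the first, third, fourth, and fifth feature maps, which matches the 1/32, 1/16, 1/16, and 1/4 ratios in the text.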
It should be understood that, in the foregoing embodiments, the sequence numbers of the steps do not mean the execution sequence, and the execution sequence of the steps should be determined by functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
According to a fifth embodiment provided by the present invention, as shown in fig. 6, a system for a robot to board an elevator, includes:
the image acquisition module 10 is used for acquiring an image in the elevator when the elevator door is opened;
the detection module 20 is configured to input the image into a pre-trained neural network model, and obtain a target detection result and a preset area detection result, where the preset area is an area that the robot needs to occupy when entering the elevator;
and the processing module 30 is used for judging whether the elevator can be entered according to the target detection result and a preset area detection result, if so, entering the elevator, otherwise, releasing the elevator, and calling the elevator again.
Specifically, this embodiment is a device embodiment corresponding to the method embodiment above; for its specific effects, refer to the above embodiments, which are not repeated here.
According to a sixth embodiment of the present invention, in the system for boarding an elevator by a robot, on the basis of the fifth embodiment, the processing module 30 includes:
the processing unit is used for judging whether the preset area is occupied or not according to the detection result of the preset area, and if not, judging that the elevator can be entered;
the processing unit is also used for judging, according to the target detection result, whether the occupant is an object rather than a person if the preset area is determined to be occupied according to the preset area detection result;
the processing unit is also used for judging that the elevator cannot be entered if the occupant is an object;
and the processing unit is further used for outputting avoidance information if the occupant is not an object, acquiring a new image in the elevator after waiting for a preset time, inputting the new image into the neural network model to obtain a new target detection result and a new preset area detection result, judging again whether the elevator can be entered according to the new results, and judging that the elevator cannot be entered when the number of times the image in the elevator has been re-acquired is greater than a preset threshold.
Specifically, this embodiment is a device embodiment corresponding to the method embodiment above; for its specific effects, refer to the above embodiments, which are not repeated here.
According to a seventh embodiment of the present invention, in the system for boarding an elevator by a robot, based on the fifth or sixth embodiment, the detection module 20 includes:
the feature extraction unit is used for inputting the image into a pre-trained neural network model and extracting feature maps of different layers;
the resolution improving unit is used for carrying out feature fusion and resolution improvement on the feature maps of different layers;
and the task output unit is used for outputting a target detection result and a preset area detection result according to the feature map obtained after feature fusion and resolution improvement.
Specifically, this embodiment is a device embodiment corresponding to the method embodiment above; for its specific effects, refer to the above embodiments, which are not repeated here.
According to an eighth embodiment of the present invention, in the system for a robot to board an elevator, on the basis of the seventh embodiment,
the feature extraction unit is further configured to input the image into a pre-trained neural network model, extract a first feature map from the last layer, and extract a second feature map from an earlier layer, where the resolution of the second feature map is twice that of the first feature map;
the resolution increasing unit includes:
the resolution improving subunit is configured to process the first feature map according to a preset step so that the resolution of the first feature map is doubled, to obtain a third feature map;
the feature fusion subunit is configured to perform 1 × 1 convolution processing on the second feature map and then fuse the second feature map with the third feature map to obtain a fourth feature map;
the resolution improving subunit is further configured to process the fourth feature map multiple times according to the preset step, raising its resolution each time, to obtain a fifth feature map;
and the task output unit is further used for outputting a target detection result and a preset area detection result according to the fifth feature map.
Preferably, the resolution improving subunit is further configured to deconvolve the first feature map; perform channel separation on the deconvolved first feature map; perform convolution processing on the channel-separated first feature map using two different convolution kernels; and perform channel fusion on the feature maps obtained after the convolution processing to obtain a third feature map.
Specifically, this embodiment is a device embodiment corresponding to the method embodiment above; for its specific effects, refer to the above embodiments, which are not repeated here.
It should be noted that the above embodiments can be freely combined as needed. The foregoing describes only preferred embodiments of the present invention. Those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.