WO2023202062A1 - Target docking method based on image recognition and terminal device and medium thereof - Google Patents

Target docking method based on image recognition and terminal device and medium thereof

Info

Publication number
WO2023202062A1
WO2023202062A1 PCT/CN2022/132656 CN2022132656W WO2023202062A1 WO 2023202062 A1 WO2023202062 A1 WO 2023202062A1 CN 2022132656 W CN2022132656 W CN 2022132656W WO 2023202062 A1 WO2023202062 A1 WO 2023202062A1
Authority
WO
WIPO (PCT)
Prior art keywords
target object
target
candidate
mobile robot
convolution
Prior art date
Application number
PCT/CN2022/132656
Other languages
French (fr)
Chinese (zh)
Inventor
王雷
陈熙
Original Assignee
深圳市正浩创新科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市正浩创新科技股份有限公司
Publication of WO2023202062A1 publication Critical patent/WO2023202062A1/en

Links

Images

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application belongs to the field of image recognition technology, and in particular relates to a target docking method based on image recognition, a terminal device and its medium.
  • When the mobile robot is working, it is usually driven to reach the designated destination before it starts to perform a series of related operations. Generally, the mobile robot determines the destination by identifying the target position of the target object. Since targets are usually small, detecting and recognizing the target object often involves a large amount of computation, which in turn makes position recognition of the target object slow, inefficient, and inaccurate. Due to inaccurate positioning of the target, the mobile robot is prone to yaw and may be unable to reach the target's location accurately.
  • a target docking method based on image recognition, a terminal device and a medium thereof are provided.
  • the embodiment of this application provides a target docking method based on image recognition, including:
  • the initial feature map is subjected to a preset number of cross-convolution fusions to extract multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter.
  • Cross-convolution fusion includes performing different convolution residual processing on the initial feature map and fusing the results obtained from the different convolution residual processing;
  • the embodiment of the present application provides a target docking device based on image recognition, including:
  • the acquisition module is used to acquire the environment image of the mobile robot while it is moving toward the target;
  • the processing module is used to extract the initial feature map about the target object from the environment image
  • The processing module is also used to perform a preset number of cross-convolution fusions on the initial feature map to extract multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter, where cross-convolution fusion includes performing different convolution residual processing on the initial feature map and fusing the results obtained by the different convolution residual processing;
  • the processing module is also used to determine the target position information of the target object in the environment image based on the multiple candidate position parameters and the confidence corresponding to each candidate position parameter;
  • the processing module is also used to control the motion state of the mobile robot according to the target position information, so that the mobile robot can dock with the target object.
  • Embodiments of the present application provide a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the method of the first aspect is implemented.
  • Embodiments of the present application provide a computer-readable storage medium that stores a computer program.
  • When the computer program is executed by a processor, the method of the first aspect is implemented.
  • Embodiments of the present application provide a computer program product, which when the computer program product is run on a terminal device, causes the terminal device to execute the method described in any one of the above first aspects.
  • Figure 1 is a schematic diagram of a mobile robot in an embodiment of the present application.
  • Figure 2 is a flow chart of a target docking method based on image recognition provided by an embodiment of the present application.
  • Figure 3 is an example of extracting an initial feature map in the embodiment of the present application.
  • Figure 4 is a schematic diagram of performing cross-convolution fusion with a preset number of times in an embodiment of the present application.
  • Figure 5 is a detailed flow chart of step 203 when only one cross-convolution fusion is performed in the embodiment of the present application.
  • Figure 6 is an example diagram of the network structure of the first cross-convolution fusion in the embodiment of this application.
  • Figure 7 is a processing flow chart of the i-th cross-convolution fusion and candidate position information extraction steps in the embodiment of the present application.
  • Figure 8 is an example diagram of the network structure of the i-th cross-convolution fusion according to this embodiment of the present application.
  • Figure 9 is a structural example diagram of multiple convolution layers and related functions used to obtain candidate location information in the embodiment of the present application.
  • Figure 10 is a schematic diagram of the sigmoid function in the embodiment of the present application.
  • Figure 11 is an example diagram of feature map 1 in the embodiment of the present application.
  • Figure 12 is a schematic diagram of candidate location information in an embodiment of the present application.
  • FIG. 13 is a flow chart of an implementation method for determining target position information of a target object in an environmental image in an embodiment of the present application.
  • Figure 14 is a flow chart for controlling the motion state of the mobile robot in the embodiment of the present application.
  • Figure 15 is one of the schematic diagrams of a target object in an environmental image in an embodiment of the present application.
  • Figure 16 is the second schematic diagram of the target object in the environment image in the embodiment of the present application.
  • Figure 17 is the third schematic diagram of the target object in the environment image in the embodiment of the present application.
  • FIG. 18 is a schematic diagram of the internal modules of a control unit 102 provided by an embodiment of the present application.
  • Figure 19 is a schematic diagram of a terminal device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a mobile robot in an embodiment of the present application.
  • the mobile robot 100 can be various types of sweeping robots, mopping robots, food delivery robots, transport robots, lawn mowing robots, etc.
  • the embodiment of the present application does not limit the specific type and function of the mobile robot 100. It can be understood that the mobile robot in this embodiment may also include other devices with self-moving functions.
  • the mobile robot 100 is provided with a camera 101 .
  • the camera 101 is used to capture images of the environment around the mobile robot 100 .
  • the camera 101 may be fixed, or may be non-fixed and rotatable, which is not limited in the embodiments of the present application.
  • the environmental images captured by the camera 101 may be color images, black and white images, infrared images, etc., which are not limited in the embodiments of the present application.
  • the camera 101 is connected to the control unit 102 inside the mobile robot 100 .
  • the control unit 102 is also connected to the driving components of the mobile robot 100, such as the steering shaft, steering wheel, motor, etc. of the mobile robot 100, and is used to control the movement, steering, etc. of the mobile robot 100.
  • the control unit 102 can receive the environmental image captured by the camera 101, process the environmental image according to the target docking method based on image recognition provided in the embodiment of the present application, and adjust the forward direction of the mobile robot 100 so that the mobile robot 100 advances towards the target and docks.
  • the target object in the embodiment of the present application may refer to a target shelf, a target charging stand, a target location, etc., which is not limited in the embodiment of the present application.
  • the target object may be a target charging stand.
  • the mobile robot 100 moves toward the target charging base and docks with it to achieve charging.
  • the target docking method based on image recognition provided by the embodiment of the present application will be described in detail below.
  • the target docking method can be implemented by the control unit 102 inside the mobile robot 100 or a cloud platform used to control the mobile robot 100.
  • the embodiment of the present application does not limit the implementation subject of the target docking method. A detailed description will be given below with the control unit 102 as the execution subject.
  • Figure 2 is a flow chart of a target docking method based on image recognition provided by an embodiment of the present application. The process includes steps:
  • the control unit 102 can obtain an environment image through the camera 101 on the mobile robot 100 .
  • the environment image may be a color image, a black and white image, or an infrared image, which is not limited in the embodiments of the present application.
  • the control unit 102 receives a return to home charging instruction.
  • The return-to-home charging instruction may be return-to-home information automatically generated by the control unit 102 when the remaining power of the mobile robot 100 is lower than the preset minimum power threshold, or it may be return-to-home information issued by the mobile phone terminal or cloud platform based on user operations.
  • the return-to-home charging command can control the mobile robot 100 to move to the target charging base and obtain an image of the environment through the camera 101 on the mobile robot 100.
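  • As a rough illustration of the trigger logic just described, the sketch below generates a return-to-home charging instruction when the remaining power falls below a preset minimum power threshold or when a user-initiated request arrives; the threshold value, class name, and function name are illustrative assumptions, not anything specified by this application.

```python
from dataclasses import dataclass

@dataclass
class ReturnToHomeInstruction:
    reason: str  # e.g. "low_battery" or "user_request"

MIN_POWER_THRESHOLD = 0.15  # assumed preset minimum power threshold (15%)

def maybe_generate_return_instruction(remaining_power: float,
                                      user_requested: bool = False):
    """Return a return-to-home charging instruction when the remaining power is
    below the preset minimum power threshold, or when the mobile phone
    terminal / cloud platform issues a return request based on user operations."""
    if remaining_power < MIN_POWER_THRESHOLD:
        return ReturnToHomeInstruction(reason="low_battery")
    if user_requested:
        return ReturnToHomeInstruction(reason="user_request")
    return None

print(maybe_generate_return_instruction(0.1))  # -> ReturnToHomeInstruction(reason='low_battery')
```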
  • the control unit 102 may extract an initial feature map about the target object from the environment image through at least one convolution layer.
  • Figure 3 is an example of extracting an initial feature map in the embodiment of the present application.
  • the first item in the convolution layer specification parameter W represents the number of convolution kernels
  • the second item represents the number of channels
  • the third and fourth items represent the size of the convolution kernel
  • the bias parameter B represents the bias value.
  • The bias value is randomly initialized the first time and is subsequently corrected in reverse through the gradient.
  • The specification parameter W of the convolution layer 301 is <32×12×3×3>, which means that the convolution layer 301 uses 32 convolution kernels of size 3×3 to perform a convolution operation on an image with 12 input channels.
  • The specification parameter W of the convolution layer 302 is <64×32×1×1>, which means that 64 convolution kernels of size 1×1 are used to perform a convolution operation on an image with 32 input channels.
  • Relu refers to the linear rectification function (rectified linear unit, ReLU).
  • the control unit 102 performs a convolution operation on the environment image through the convolution layer 301 and the convolution layer 302 to obtain an initial feature map of the target object.
  • control unit 102 may perform a convolution operation on the environment image through one or more convolution layers to obtain an initial feature map of the target object.
  • the embodiments of this application do not limit the number of convolutional layers.
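  • A minimal sketch of the Figure 3 structure, assuming PyTorch: two convolution layers with the specification parameters described above (<32×12×3×3> followed by <64×32×1×1>) and Relu activations. The padding, the 128×128 input size, and the layer names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InitialFeatureExtractor(nn.Module):
    """Sketch of Figure 3: two convolution layers producing the initial feature map."""
    def __init__(self):
        super().__init__()
        # Layer 301: 32 convolution kernels of size 3x3 applied to a 12-channel input.
        self.conv301 = nn.Conv2d(in_channels=12, out_channels=32, kernel_size=3, padding=1)
        # Layer 302: 64 convolution kernels of size 1x1 applied to the 32-channel output.
        self.conv302 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.conv301(x))
        return self.relu(self.conv302(x))

# Usage sketch: one 12-channel, 128x128 input yields a 64-channel initial feature map.
initial_feature_map = InitialFeatureExtractor()(torch.randn(1, 12, 128, 128))
```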
  • the candidate position information includes multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter.
  • Figure 4 is a schematic diagram of performing cross-convolution fusion with a preset number of times in an embodiment of the present application.
  • the control unit 102 can perform a preset number of cross-convolution fusions on the initial feature map to obtain candidate position information of the target object.
  • Cross-convolution fusion specifically fuses the convolution recognition result obtained by convolution processing with the residual recognition result obtained by residual processing to obtain a fusion result. The more convolutions applied, that is, the deeper the convolution layer sits in the structure, the higher-level the semantic information extracted; the fewer the convolutions, that is, the shallower the convolution layer, the lower-level the semantic information extracted.
  • In this way, the convolution recognition result, which contains the low-level semantic information obtained by the convolution processing, and the residual recognition result, which contains the high-level semantic information obtained by the residual processing, are fused, making the feature information richer without adding extra convolution layers and achieving high performance and improved recognition accuracy with a lightweight network.
  • control unit 102 may perform only one cross-convolution fusion on the initial feature map.
  • Figure 5 is a detailed flow chart of step S203 when only one cross-convolution fusion is performed in the embodiment of the present application. As shown in Figure 5, step S203 may include the following steps:
  • Figure 6 is an example diagram of the network structure of the first cross-convolution fusion in the embodiment of this application.
  • the control unit 102 can perform convolution processing on the initial feature map through the convolution layer 605 to obtain a convolution recognition result.
  • The control unit 102 can perform residual processing on the initial feature map through a residual network composed of convolution layer 601, convolution layer 602, convolution layer 603, and convolution layer 604 to obtain the residual recognition result.
  • Add in Figure 6 represents the identity mapping, which is the same as the identity mapping in the traditional residual network, and will not be described in detail in the embodiment of this application.
  • In the residual network composed of convolution layer 601, convolution layer 602, convolution layer 603, convolution layer 604, and the corresponding identity mapping, convolution layer 602, convolution layer 603, and the identity mapping form a residual block.
  • the residual network includes one residual block.
  • The residual network may include more than one residual block according to actual needs.
  • the embodiment of the present application does not limit the number of residual blocks.
  • The control unit 102 can fuse the convolution recognition result output by convolution layer 605 and the residual recognition result output by convolution layer 604 through a merged-array operation (i.e., a concat operation) to obtain the fusion result.
  • When the control unit 102 performs only one cross-convolution fusion on the initial feature map, after that single cross-convolution fusion produces the fusion result, the candidate position information of the target object can be extracted from the fusion result, where the candidate position information includes multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter.
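  • The sketch below illustrates the cross-convolution fusion pattern of Figure 6, assuming PyTorch: a convolution branch (layer 605), a residual branch with one residual block (layers 601-604 plus the identity mapping), and a concat of the two outputs. The channel counts, kernel sizes, and activation placement are illustrative assumptions; only the overall branch-and-concat pattern follows the description above.

```python
import torch
import torch.nn as nn

class CrossConvFusion(nn.Module):
    """Sketch of one cross-convolution fusion: convolution branch + residual branch + concat."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Convolution branch (roughly layer 605 in Figure 6).
        self.conv_branch = nn.Conv2d(channels, channels, kernel_size=1)
        # Residual branch (roughly layers 601-604): entry convolution, one residual
        # block (two convolutions plus an identity shortcut), exit convolution.
        self.res_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.res_block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.res_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        conv_result = self.conv_branch(x)                    # convolution recognition result
        r = self.relu(self.res_in(x))
        r = self.relu(self.res_block(r) + r)                 # residual block with identity mapping (Add)
        res_result = self.res_out(r)                         # residual recognition result
        return torch.cat([conv_result, res_result], dim=1)   # concat fusion

# Usage sketch: fusing a 64-channel initial feature map doubles the channel count.
fusion_result = CrossConvFusion()(torch.randn(1, 64, 128, 128))
```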
  • control unit 102 may perform multiple cross-convolution fusions on the initial feature maps.
  • the processing flow of the first cross-convolution fusion is similar to the above-mentioned step S2031, step S2032, and step S2033, which will not be described in detail in this embodiment of the present application.
  • Figure 7 shows the i-th cross-convolution fusion and candidate position information extraction steps in the embodiment of the present application.
  • FIG. 8 is an example diagram of the network structure of the i-th cross-convolution fusion according to this embodiment of the present application.
  • The control unit 102 can use the (i-1)-th fusion result as the i-th input feature and perform convolution processing on it to obtain the i-th convolution recognition result.
  • The (i-1)-th fusion result is also subjected to residual processing to obtain the i-th residual recognition result.
  • The control unit 102 can perform residual processing on the (i-1)-th fusion result through the residual network composed of convolution layer 801, convolution layer 802, convolution layer 803, convolution layer 804, convolution layer 805, and convolution layer 806, combined with the Relu functions and identity mappings (i.e., the Add operations shown in Figure 8), to obtain the i-th residual recognition result.
  • the convolutional layer 802, the convolutional layer 803 and the corresponding identity mapping form a residual block
  • the convolutional layer 804, the convolutional layer 805 and the corresponding identity mapping form another residual block.
  • the residual network in the example of Figure 8 includes two residual blocks.
  • the number of residual blocks in the residual network may be one or more according to actual needs.
  • the embodiment of the present application does not limit the number of residual blocks.
  • control unit 102 can fuse the convolution recognition result output by the convolution layer 807 and the residual recognition result output by the convolution layer 806 through a concat operation to obtain a fusion result.
  • The control unit 102 can obtain the candidate position information of the target object from any (i.e., the i-th) fusion result. Preferably, more accurate candidate position information of the target object can be obtained from the K-th fusion result. Therefore, in some embodiments, after K cross-convolution fusions, the control unit 102 may obtain the candidate position information of the target object from the last (K-th) fusion result.
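  • A schematic loop for chaining the K fusions described above: the (i-1)-th fusion result feeds the i-th fusion, and the last (K-th) result is kept for candidate position extraction. The steps are stand-in 1×1 convolutions purely so the example runs on its own; real steps would be modules like the CrossConvFusion sketch above, with channel counts adjusted because each concat changes the channel dimension.

```python
import torch
import torch.nn as nn

def run_cross_conv_fusions(initial_feature_map: torch.Tensor,
                           fusion_steps: nn.ModuleList) -> torch.Tensor:
    """Apply K cross-convolution fusion steps in sequence and return the K-th result."""
    x = initial_feature_map
    for step in fusion_steps:   # i = 1 .. K
        x = step(x)             # the i-th fusion result becomes the next input feature
    return x                    # the last (K-th) fusion result

# Stand-in steps: K = 3 placeholder modules with matching channel counts.
placeholder_steps = nn.ModuleList([nn.Conv2d(64, 64, kernel_size=1) for _ in range(3)])
final_fusion_result = run_cross_conv_fusions(torch.randn(1, 64, 128, 128), placeholder_steps)
```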
  • How the control unit 102 obtains the candidate position information of the target object from the fusion result is described in detail below.
  • The control unit 102 can extract multiple fused feature maps from the fusion result through multiple convolution layers and related activation functions, such as the Relu function (linear rectification function) and the Sigmoid function (S-shaped growth curve function), to obtain the candidate position information of the target object.
  • the fused feature map includes candidate position information.
  • the candidate position information includes multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter.
  • The control unit 102 extracts multiple fused feature maps from the fusion result through multiple convolution layers and related functions. Through the fused feature maps, the multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter can be represented clearly and accurately.
  • Figure 9 is a structural example diagram of the multiple convolution layers and related functions used to obtain candidate position information in the embodiment of the present application. As shown in Figure 9, the fusion result, after being processed by the convolution layer 901, is processed in three different ways.
  • The first processing method is: processing through the convolution layer 902, the Relu function, and the convolution layer 903, and normalizing the result to between 0 and 1 through the sigmoid function, thereby obtaining feature map 1.
  • Feature map 1 shows the confidence level corresponding to the target object.
  • Figure 10 is a schematic diagram of the sigmoid function in the embodiment of the present application.
  • the abscissa axis x represents the pixel value
  • the ordinate axis y represents probability (confidence).
  • Feature map 1 can be represented by 1×1×128×128, where the first 1 represents one image and the second 1 represents one parameter, that is, the probability of whether each pixel contains the target object, which can be represented by the confidence obj_value.
  • 128×128 represents the size of the feature map, and feature map 1 has 128×128 pixels.
  • Figure 11 is an example diagram of feature map 1 in the embodiment of the present application. As shown in Figure 11, feature map 1 has 128×128 pixels, and the parameter at each pixel represents the probability of whether that pixel contains the target object.
  • The second processing method is to process through the convolution layer 904, the Relu function, and the convolution layer 905 to obtain feature map 2 and feature map 3.
  • The output of the convolution layer 905 can be represented by 1×2×128×128.
  • 1 represents one image.
  • 2 represents two sets of 128×128 feature map outputs.
  • The values of these feature maps represent x and y, denoted x_value and y_value; that is, there are 128×128 x_value parameters and 128×128 y_value parameters.
  • The third processing method is to process through the convolution layer 906, the Relu function, and the convolution layer 907 to obtain feature map 4 and feature map 5.
  • The output of the convolution layer 907 can be represented by 1×2×128×128.
  • 1 represents one image.
  • 2 represents two sets of 128×128 feature map outputs.
  • The values of these feature maps represent w and h, denoted w_value and h_value; that is, there are 128×128 w_value parameters and 128×128 h_value parameters.
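  • The three processing branches of Figure 9 can be sketched as follows, assuming PyTorch: one branch produces the 1×1×128×128 confidence map (squashed to between 0 and 1 by the sigmoid function), one produces the 1×2×128×128 (x_value, y_value) maps, and one produces the 1×2×128×128 (w_value, h_value) maps. The intermediate channel count of 128 and the kernel sizes are illustrative assumptions; only the output shapes follow the description above.

```python
import torch
import torch.nn as nn

class CandidateHead(nn.Module):
    """Sketch of the three branches in Figure 9 extracting candidate position information."""
    def __init__(self, in_channels: int = 128):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)  # roughly layer 901
        self.obj_branch = nn.Sequential(   # roughly layers 902-903 -> feature map 1
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, 1, kernel_size=1),
        )
        self.xy_branch = nn.Sequential(    # roughly layers 904-905 -> feature maps 2 and 3
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, 2, kernel_size=1),
        )
        self.wh_branch = nn.Sequential(    # roughly layers 906-907 -> feature maps 4 and 5
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, 2, kernel_size=1),
        )

    def forward(self, fusion_result: torch.Tensor):
        f = self.shared(fusion_result)
        obj = torch.sigmoid(self.obj_branch(f))   # 1 x 1 x 128 x 128 confidence map (obj_value)
        xy = self.xy_branch(f)                    # 1 x 2 x 128 x 128 (x_value, y_value)
        wh = self.wh_branch(f)                    # 1 x 2 x 128 x 128 (w_value, h_value)
        return obj, xy, wh

# Usage sketch with a 128-channel, 128x128 fusion result.
obj_map, xy_map, wh_map = CandidateHead()(torch.randn(1, 128, 128, 128))
```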
  • FIG. 12 is a schematic diagram of candidate location information in an embodiment of the present application.
  • the fusion feature map corresponding to obj_value at the bottom is feature map 1 in Figure 9
  • the fusion feature map corresponding to x_value at the top is feature map 2 in Figure 9
  • The fusion feature map corresponding to y_value at the top is feature map 3 in Figure 9.
  • the fusion feature map corresponding to the middle w_value is feature map 4 in Figure 9
  • the fusion feature map corresponding to the middle h_value is feature map 5 in Figure 9.
  • The candidate position information includes the multiple candidate position parameters of the target object in the environment image, that is, the candidate position parameters in feature map 2, feature map 3, feature map 4, and feature map 5, where x_value represents the x coordinate of the target object in the environment image, y_value represents the y coordinate of the target object in the environment image, w_value represents the width of the target object in the environment image, and h_value represents the height of the target object in the environment image.
  • the candidate location information also includes the confidence corresponding to each candidate location parameter, that is, the confidence parameter obj_value in feature map 1.
  • the embodiment of the present application provides an example as shown in Figure 9 for obtaining candidate position information about the target object from the fusion result.
  • the control unit 102 may also obtain candidate position information about the target object from the fusion result through other methods, which is not limited in the embodiments of the present application.
  • The x coordinate of the above-mentioned target object in the environment image may refer to the x coordinate of the upper-left corner vertex of the target object, the x coordinate of the center point of the target object, or the coordinate of a specified point associated with the target object, which is not limited in the embodiments of this application.
  • Likewise, the y coordinate of the target object in the environment image may refer to the y coordinate of the upper-left corner vertex of the target object, the y coordinate of the center point of the target object, or the coordinate of a specified point associated with the target object, which is not limited in the embodiments of this application.
  • The control unit 102 may screen the multiple candidate position parameters for a suitable candidate position parameter to serve as the target position information of the target object in the environment image, based on the confidence corresponding to each candidate position parameter. For example, as shown in Figure 12, among all the confidence parameters obj_value in the fusion feature map corresponding to obj_value, assume the largest confidence parameter is the one at coordinates (1,1); then the candidate position parameters corresponding to coordinates (1,1) can be extracted from the remaining four fusion feature maps as the target position information of the target object in the environment image. Selecting the candidate position parameters corresponding to the maximum confidence parameter as the target position information of the target object in the environment image is therefore one implementation. This application also provides another implementation, as follows:
  • FIG. 13 is a flow chart of an implementation method for determining target position information of a target object in an environmental image in an embodiment of the present application. The process includes:
  • The control unit 102 can extract the candidate position parameters corresponding to the same position, together with the confidence corresponding to those candidate position parameters. For example, for the position at coordinates (1,1), each of the five fused feature maps shown in Figure 12 contributes the parameter at the (1,1) position, namely x(1,1), y(1,1), w(1,1), h(1,1), and obj(1,1), so the candidate parameter set (x(1,1), y(1,1), w(1,1), h(1,1), obj(1,1)) corresponding to the (1,1) position can be obtained. Therefore, in this embodiment of the present application, one position corresponds to one candidate parameter set; for example, the candidate parameter set corresponding to the position at coordinates (1,1) is (x(1,1), y(1,1), w(1,1), h(1,1), obj(1,1)).
  • In this way, 128×128 candidate parameter sets (x_value, y_value, w_value, h_value, obj_value) can be extracted.
  • The amount of data is 128×128×5 values, which can be expressed as the collection of candidate parameter sets from (x(0,0), y(0,0), w(0,0), h(0,0), obj(0,0)) through (x(127,127), y(127,127), w(127,127), h(127,127), obj(127,127)).
  • Here x(0,0) represents the x_value corresponding to the coordinate (0,0) position, and so on for the other parameters, which will not be described one by one here.
  • The control unit 102 can extract the corresponding confidence levels from all candidate parameter sets and compare each confidence level with a preset confidence threshold, thereby obtaining the comparison results between the confidence levels and the preset confidence threshold. It can be understood that the comparison results can determine the set of confidence levels greater than the preset confidence threshold, the set of confidence levels smaller than the preset confidence threshold, and the set of confidence levels equal to the preset confidence threshold.
  • The control unit 102 may select the candidate parameter sets whose confidence levels are greater than the preset confidence threshold as the target candidate parameter set. For example, if the preset confidence threshold is set to 0.7, then the candidate parameter sets whose confidence is greater than 0.7 are used as the target candidate parameter set.
  • In other words, the target candidate parameter set can be expressed as the collection of candidate parameter sets (x_value, y_value, w_value, h_value, obj_value) whose obj_value is greater than the preset confidence threshold.
  • If the target candidate parameter set is an empty set, it means that no target object is found in the current environment image; the control unit 102 can then control the mobile robot 100 to rotate left or right until a target object is detected in the environment image.
  • the target position information of the target object can be determined through step 2043.
  • If the target candidate parameter set includes only one candidate parameter set, the control unit 102 can directly use that candidate parameter set as the target position information of the target object.
  • The control unit 102 can also select, according to preset conditions, the candidate position parameters that meet the preset conditions from the target candidate parameter set as a new target candidate parameter set, so as to achieve further screening.
  • the control unit 102 determines the candidate parameter set corresponding to the maximum confidence as the target candidate parameter set.
  • For example, suppose the target candidate parameter set includes three candidate parameter sets, namely (x(1,1), y(1,1), w(1,1), h(1,1), obj(1,1)), (x(2,2), y(2,2), w(2,2), h(2,2), obj(2,2)), and (x(3,3), y(3,3), w(3,3), h(3,3), obj(3,3)).
  • The control unit 102 detects that, among obj(1,1), obj(2,2), and obj(3,3), obj(1,1) has the largest value. The control unit 102 can then determine the candidate parameter set (x(1,1), y(1,1), w(1,1), h(1,1), obj(1,1)) corresponding to obj(1,1) as the target candidate parameter set. In this embodiment, determining the candidate parameter set corresponding to the maximum confidence as the target candidate parameter set achieves better recognition accuracy.
  • The control unit 102 may determine the target position information of the target object based on the candidate position parameters in the target candidate parameter set. For example, if the target candidate parameter set is (x(1,1), y(1,1), w(1,1), h(1,1), obj(1,1)), then x(1,1) can be determined as the x coordinate of the target object in the environment image, y(1,1) as the y coordinate of the target object in the environment image, w(1,1) as the width of the target object in the environment image, and h(1,1) as the height of the target object in the environment image.
  • suitable candidate location parameters are screened based on the confidence level corresponding to the candidate location parameters, thereby improving the recognition accuracy.
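  • The screening procedure described above can be sketched as below, assuming PyTorch tensors with the 1×1×128×128 and 1×2×128×128 layouts from Figure 9: gather the per-position candidate parameter sets, keep those whose confidence exceeds the preset threshold (0.7 in the example above), and return the one with maximum confidence, or None when the target candidate parameter set is empty. The function name and the exact tensor layout are illustrative assumptions.

```python
import torch

def select_target_position(obj: torch.Tensor, xy: torch.Tensor, wh: torch.Tensor,
                           conf_threshold: float = 0.7):
    """Return (x, y, w, h, obj) for the best candidate position, or None if no
    candidate exceeds the preset confidence threshold."""
    obj = obj[0, 0]                         # 128 x 128 confidence map
    x, y = xy[0, 0], xy[0, 1]               # 128 x 128 x_value and y_value maps
    w, h = wh[0, 0], wh[0, 1]               # 128 x 128 w_value and h_value maps

    mask = obj > conf_threshold             # comparison with the preset confidence threshold
    if not mask.any():
        return None                         # empty target candidate parameter set: no target found

    # Among the remaining candidate parameter sets, keep the one with maximum confidence.
    masked_obj = torch.where(mask, obj, torch.zeros_like(obj))
    idx = torch.argmax(masked_obj)
    r, c = divmod(int(idx), obj.shape[1])
    return (x[r, c].item(), y[r, c].item(), w[r, c].item(), h[r, c].item(), obj[r, c].item())

# Usage sketch with random maps shaped like the outputs of the three branches above.
print(select_target_position(torch.rand(1, 1, 128, 128),
                             torch.rand(1, 2, 128, 128),
                             torch.rand(1, 2, 128, 128)))
```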
  • S205 Control the motion state of the mobile robot according to the target position information so that the mobile robot can dock with the target object.
  • After the control unit 102 determines the target position information of the target object in the environment image, it can control the motion state of the mobile robot according to that position information, so that the mobile robot docks with the target object.
  • the motion state may include but is not limited to the direction of the motion posture and the speed of the forward motion.
  • the direction of the motion posture may be to the left, right, forward, or offset by a certain angle.
  • FIG. 14 is a flow chart for controlling the motion state of the mobile robot in the embodiment of the present application. The process includes:
  • The control unit 102 first determines the center point position of the environment image. It can be understood that the control unit 102 can establish a coordinate system using a vertex of the environment image as the origin, for example using the upper-left vertex of the environment image as the origin, so as to describe the center point position of the environment image through the coordinate system. For example, as shown in Figure 15, the upper-left corner vertex of the environment image is used as the origin of the coordinate system.
  • If each frame of the environment image is 640×480, then the upper-right corner vertex of the environment image is (640,0), the lower-left corner vertex is (0,480), and the lower-right corner vertex is (640,480); the center point position of the environment image is (640/2, 480/2), that is, (320, 240).
  • The control unit 102 can identify the upper-left corner coordinates (x0, y0) and the lower-right corner coordinates (x1, y1) of the target object.
  • If the x coordinate and y coordinate in the target position information are the coordinates of the upper-left corner vertex of the target object, then x0 is that x coordinate and y0 is that y coordinate.
  • The coordinate x1 of the lower-right corner is equal to the x coordinate plus the width w, and y1 is equal to the y coordinate plus the height h.
  • After the control unit 102 identifies the upper-left corner coordinates (x0, y0) and the lower-right corner coordinates (x1, y1) of the target object, it can calculate the center point position (Xc, Yc) of the target object.
  • The center point position (Xc, Yc) of the target object can be expressed as Xc = (x0 + x1)/2 and Yc = (y0 + y1)/2.
  • the control unit 102 can adjust the motion state of the mobile robot according to the offset between the center point position of the target object and the center point position of the environment image.
  • the positive and negative values of the offset may indicate the direction of the offset. For example, if the offset corresponding to the x-coordinate is positive, it means that the offset direction is to the right; if the offset corresponding to the x-coordinate is negative, it means the offset direction is to the left.
  • the size of the offset can indicate the degree of offset. It can be understood that an offset of 0 means there is no offset.
  • Based on the offset between the center point position of the target object and the center point position of the environment image, the control unit 102 can determine the direction and magnitude of the target object's offset from the center of the environment image, and thereby control the mobile robot to move in a motion state that corrects the deviation, allowing the mobile robot to dock with the target object.
  • In this way, the offset between the mobile robot and the target object can be accurately identified, and the motion state of the mobile robot can be adjusted according to the offset, allowing the mobile robot to dock accurately with the target object.
  • The embodiment of the present application provides a specific implementation that includes the following steps: determine the forward direction and forward distance of the mobile robot based on the offset; adjust the movement posture direction of the mobile robot according to the forward direction; and control the moving distance of the mobile robot along the movement posture direction so that the mobile robot docks with the target object. If the offset direction is determined to be to the left according to the offset, the forward direction of the mobile robot is determined to be forward-left; if the offset direction is determined to be to the right, the forward direction of the mobile robot is determined to be forward-right.
  • the forward distance can be positively related to the size of the offset.
  • the forward direction and forward distance of the mobile robot are determined based on the offset, which can more accurately control the movement attitude direction and forward distance of the mobile robot, and improve the docking effect between the mobile robot and the target object.
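  • A small sketch of the offset-based control described above: compute the image center and the target center, use the sign of the offset to pick forward-left, forward-right, or straight ahead, and make the forward distance positively related to the offset magnitude. The pixel-to-distance gain and the return format are illustrative assumptions.

```python
def plan_motion(target_center_x: float, image_width: int = 640,
                gain: float = 0.01) -> tuple:
    """Decide the forward direction and forward distance from the horizontal offset."""
    image_center_x = image_width / 2            # 320 for a 640x480 frame
    offset = target_center_x - image_center_x   # positive: target is to the right of center
    if offset > 0:
        direction = "forward-right"
    elif offset < 0:
        direction = "forward-left"
    else:
        direction = "forward"
    forward_distance = abs(offset) * gain       # positively related to the offset magnitude
    return direction, forward_distance

# Usage sketch with the center-point formula above: Xc = (x0 + x1) / 2.
x0, x1 = 350.0, 450.0
print(plan_motion(target_center_x=(x0 + x1) / 2))  # e.g. ('forward-right', 0.8)
```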
  • The mobile robot may be a four-wheeled mobile robot.
  • The four-wheeled robot includes two driving wheels and two steering wheels located on the left and right sides of the robot.
  • the driving wheel is the rear wheel with driving force
  • the steering wheel is the front wheel without independent power, which can also be called the driven wheel.
  • The height of the charging base in the image is determined by its actual physical position and should be within the field of view of the camera; for example, the range of the height of the charging base can be (0-480). In this example, the height of the charging base is set to 200.
  • If the x coordinate of the center point of the charging base is equal to 320, that is, the charging base is at the center of the image, the four-wheeled robot is driven straight forward until it touches the charging base.
  • If the center point position of the charging base is less than 320, that is, the charging base is located to the left of the center point of the image, the four-wheeled robot is driven to move to the left.
  • The specific driving method includes: the right driving wheel is driven while the left driving wheel is not, so that the robot turns toward the left, as sketched below.
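  • A minimal differential-drive sketch of the charging-base example above: when the base center x coordinate equals the image center (320), drive both wheels forward; when it is less than 320 (base to the left), drive only the right wheel so the robot turns left. The symmetric right-of-center case, the speed value of 1.0, and the tolerance parameter are assumptions; the text only spells out the centered and left cases.

```python
def drive_toward_base(base_center_x: float, image_center_x: float = 320.0,
                      tolerance: float = 0.0) -> tuple:
    """Return (left_wheel, right_wheel) speed commands in arbitrary units."""
    if abs(base_center_x - image_center_x) <= tolerance:
        return 1.0, 1.0   # base centered: drive straight forward until it touches the base
    if base_center_x < image_center_x:
        return 0.0, 1.0   # base left of center: drive only the right wheel (turn left)
    return 1.0, 0.0       # base right of center: drive only the left wheel (turn right, assumed)

print(drive_toward_base(200.0))  # base left of image center -> (0.0, 1.0)
```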
  • Figure 15 is one of the schematic diagrams of a target object in an environmental image in an embodiment of the present application.
  • the origin coordinate is (0,0)
  • the vertex coordinate of the environment image is (640,0).
  • the center point 1502 of the target object is on the right side of the center point 1501 of the environment image. Then the control unit 102 may determine that the forward direction of the mobile robot is forward to the right, and the right arrow shown in FIG. 15 is the forward direction to the right of the mobile robot.
  • Figure 16 is the second schematic diagram of the target object in the environment image in the embodiment of the present application.
  • the origin coordinate is (0,0)
  • the vertex coordinate of the environment image is (640,0).
  • If the position of the center point 1602 of the target object coincides with the position of the center point 1601 of the environment image, the control unit 102 may determine that the forward direction of the mobile robot is straight ahead, and the forward arrow shown in Figure 16 is the forward direction of the mobile robot.
  • Figure 17 is the third schematic diagram of the target object in the environment image in the embodiment of the present application.
  • the origin coordinate is (0,0)
  • the vertex coordinate of the environment image is (640,0).
  • the center point 1702 of the target object is to the left of the center point 1701 of the environment image. Then the control unit 102 may determine that the forward direction of the mobile robot is forward left, and the left arrow shown in FIG. 17 is the forward left direction of the mobile robot.
  • the control unit 102 can adjust the movement posture direction of the mobile robot according to the forward direction. For example, if the forward direction is forward to the right, the control unit 102 can adjust the steering wheel of the mobile robot to rotate to the right, thereby adjusting the direction of the mobile robot's movement posture. If the forward direction is forward to the left, the control unit 102 can adjust the steering wheel of the mobile robot to rotate to the left, thereby adjusting the direction of the movement attitude of the mobile robot.
  • the steering wheels of the mobile robot may be the two front wheels of the mobile robot or the two rear wheels of the mobile robot, which is not limited in the embodiments of the present application.
  • the mobile robot 100 can be moved toward the target object, thereby allowing the mobile robot 100 to dock with the target object.
  • the target object may be a target charging stand, and the mobile robot 100 moves toward and docks with the target charging stand during the return-to-home charging process to achieve charging.
  • FIG. 18 is a schematic diagram of the internal modules of a control unit 102 provided by an embodiment of the present application.
  • the internal modules of the control unit 102 include:
  • the acquisition module 1801 is used to execute or implement step 201 in the various embodiments corresponding to Figure 2 mentioned above.
  • the processing module 1802 is used to execute or implement steps 202, 203, 204, and 205 in the respective embodiments corresponding to FIG. 2 .
  • FIG 19 is a schematic diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device 1900 may be the mobile robot 100 in the above embodiment.
  • the terminal device 1900 includes a memory 1902, a processor 1901, and a computer program 1903 stored in the memory 1902 and executable on the processor 1901.
  • When the processor 1901 executes the computer program 1903, the methods of the embodiments corresponding to Figures 2, 5, 7, 13, or 14 are implemented.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • When the computer program is executed by a processor, the steps in each of the above method embodiments can be implemented.
  • Embodiments of the present application provide a computer program product.
  • When the computer program product is run on a mobile terminal, the steps in each of the above method embodiments can be implemented.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • this application can implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium.
  • When the computer program is executed by a processor, the steps of each of the above method embodiments may be implemented.
  • the computer program includes computer program code, which may be in the form of source code, object code, executable file or some intermediate form.
  • the computer-readable medium may at least include: any entity or device capable of carrying computer program code to the camera device/terminal device, recording media, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunications signals, and software distribution media.
  • The software distribution media include, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc.
  • In some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunications signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A target docking method based on image recognition, comprising: acquiring an environment image during the movement of a mobile robot to a target object, extracting an initial feature map related to the target object from the environment image, performing cross-convolution fusion on the initial feature map for a preset number of times, so as to extract multiple candidate position parameters of the target object in the environment image and a confidence corresponding to each candidate position parameter, determining target position information of the target object in the environment image, and controlling a motion state of the mobile robot according to the target position information, such that the mobile robot docks with the target object.

Description

Target docking method based on image recognition, terminal device and medium thereof
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202210437107.7, filed with the China Patent Office on April 22, 2022 and entitled "Target docking method, device, equipment and medium based on image recognition", the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the field of image recognition technology, and in particular relates to a target docking method based on image recognition, a terminal device, and a medium thereof.
Background
The statements herein merely provide background information relevant to the present application and do not necessarily constitute exemplary techniques.
With the development of modern science and technology, various types of small mobile robots have raised society's level of production. For example, household devices such as sweeping robots, mopping robots, and lawn mowing robots bring convenience to people's home lives, and various types of transport robots bring higher efficiency to factory transportation.
When a mobile robot is working, it is usually driven to reach the designated destination before it starts to perform a series of related operations; generally, the mobile robot determines the destination by identifying the target position of the target object. Since targets are usually small, detecting and recognizing the target object often involves a large amount of computation, which in turn makes position recognition of the target object slow, inefficient, and inaccurate. Due to inaccurate positioning of the target, the mobile robot is prone to yaw and may be unable to reach the target's location accurately.
Summary
According to various embodiments of the present application, a target docking method based on image recognition, a terminal device, and a medium thereof are provided.
An embodiment of this application provides a target docking method based on image recognition, including:
acquiring an environment image while the mobile robot moves toward a target object;
extracting an initial feature map about the target object from the environment image;
performing a preset number of cross-convolution fusions on the initial feature map to extract multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter, where cross-convolution fusion includes performing different convolution residual processing on the initial feature map and fusing the results obtained from the different convolution residual processing;
determining target position information of the target object in the environment image based on the multiple candidate position parameters and the confidence corresponding to each candidate position parameter; and
controlling the motion state of the mobile robot according to the target position information, so that the mobile robot docks with the target object.
An embodiment of the present application provides a target docking device based on image recognition, including:
an acquisition module, used to acquire an environment image while the mobile robot moves toward the target object;
a processing module, used to extract an initial feature map about the target object from the environment image;
the processing module is also used to perform a preset number of cross-convolution fusions on the initial feature map to extract multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter, where cross-convolution fusion includes performing different convolution residual processing on the initial feature map and fusing the results obtained by the different convolution residual processing;
the processing module is also used to determine target position information of the target object in the environment image based on the multiple candidate position parameters and the confidence corresponding to each candidate position parameter;
the processing module is also used to control the motion state of the mobile robot according to the target position information, so that the mobile robot docks with the target object.
An embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the method of the first aspect is implemented.
An embodiment of the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the method of the first aspect is implemented.
An embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the method described in any one of the above first aspects.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the application will become apparent from the description, the drawings, and the claims.
Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments or exemplary technologies of the present application, the drawings required for describing the embodiments or exemplary technologies are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, drawings of other embodiments can be obtained from these drawings without creative effort.
Figure 1 is a schematic diagram of a mobile robot in an embodiment of the present application.
Figure 2 is a flow chart of a target docking method based on image recognition provided by an embodiment of the present application.
Figure 3 is an example diagram of extracting an initial feature map in the embodiment of the present application.
Figure 4 is a schematic diagram of performing a preset number of cross-convolution fusions in an embodiment of the present application.
Figure 5 is a detailed flow chart of step 203 when only one cross-convolution fusion is performed in the embodiment of the present application.
Figure 6 is an example diagram of the network structure of the first cross-convolution fusion in the embodiment of this application.
Figure 7 is a processing flow chart of the i-th cross-convolution fusion and candidate position information extraction steps in the embodiment of the present application.
Figure 8 is an example diagram of the network structure of the i-th cross-convolution fusion in the embodiment of the present application.
Figure 9 is a structural example diagram of the multiple convolution layers and related functions used to obtain candidate position information in the embodiment of the present application.
Figure 10 is a schematic diagram of the sigmoid function in the embodiment of the present application.
Figure 11 is an example diagram of feature map 1 in the embodiment of the present application.
Figure 12 is a schematic diagram of candidate position information in an embodiment of the present application.
Figure 13 is a flow chart of an implementation for determining the target position information of the target object in the environment image in an embodiment of the present application.
Figure 14 is a flow chart for controlling the motion state of the mobile robot in the embodiment of the present application.
Figure 15 is the first schematic diagram of the target object in the environment image in an embodiment of the present application.
Figure 16 is the second schematic diagram of the target object in the environment image in the embodiment of the present application.
Figure 17 is the third schematic diagram of the target object in the environment image in the embodiment of the present application.
Figure 18 is a schematic diagram of the internal modules of a control unit 102 provided by an embodiment of the present application.
Figure 19 is a schematic diagram of a terminal device provided by an embodiment of the present application.
The realization of the purposes, functional features, and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description of the Embodiments
In order to make the purposes, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Figure 1 is a schematic diagram of a mobile robot in an embodiment of the present application. The mobile robot 100 may be any of various sweeping robots, mopping robots, food delivery robots, transport robots, lawn mowing robots, and the like. The embodiments of the present application do not limit the specific type or function of the mobile robot 100. It can be understood that the mobile robot in this embodiment may also include other devices with a self-moving function.
The mobile robot 100 is provided with a camera 101. The camera 101 is used to capture images of the environment around the mobile robot 100. The camera 101 may be fixed, or may be non-fixed and rotatable, which is not limited in the embodiments of the present application. The environment images captured by the camera 101 may be color images, black-and-white images, infrared images, and the like, which is likewise not limited in the embodiments of the present application.
The camera 101 is connected to the control unit 102 inside the mobile robot 100. The control unit 102 is also connected to the driving components of the mobile robot 100, such as the steering shaft, steering wheels, and motor of the mobile robot 100, and is used to control the movement, steering, and the like of the mobile robot 100.
In the embodiments of the present application, the control unit 102 can receive the environment image captured by the camera 101, process the environment image according to the target docking method based on image recognition provided by the embodiments of the present application, and adjust the forward direction of the mobile robot 100 so that the mobile robot 100 advances toward the target object and docks with it. The target object in the embodiments of the present application may be a target shelf, a target charging base, a target location, and so on, which is not limited in the embodiments of the present application. For example, when the mobile robot 100 is in a return-to-base charging scenario, the target object may be a target charging base. During the return-to-base charging process, the mobile robot 100 moves toward the target charging base and docks with it in order to charge.
The target docking method based on image recognition provided by the embodiments of the present application is described in detail below. The target docking method may be implemented by the control unit 102 inside the mobile robot 100 or by a cloud platform used to control the mobile robot 100; the embodiments of the present application do not limit the implementation subject of the target docking method. The following detailed description takes the control unit 102 as the execution subject.
Figure 2 is a flowchart of the target docking method based on image recognition provided by an embodiment of the present application. The process includes the following steps:
S201. Obtain an environment image while the mobile robot is moving toward the target object.
In the embodiments of the present application, while the mobile robot 100 is moving toward the target object, the control unit 102 can obtain the environment image through the camera 101 on the mobile robot 100. It can be understood that the environment image may be a color image, a black-and-white image, or an infrared image, which is not limited in the embodiments of the present application.
In the scenario where the mobile robot 100 returns to base for charging, the control unit 102 receives a return-to-base charging instruction. The return-to-base charging instruction may be return information automatically generated by the control unit 102 when the remaining power of the mobile robot 100 falls below a preset minimum power threshold, or information issued by a mobile phone terminal or cloud platform, in response to a user operation, instructing the robot to return to its destination. The return-to-base charging instruction can control the mobile robot 100 to move toward the target charging base and obtain environment images through the camera 101 on the mobile robot 100.
S202. Extract an initial feature map of the target object from the environment image.
In the embodiments of the present application, the control unit 102 may extract the initial feature map of the target object from the environment image through at least one convolution layer. Figure 3 is an example diagram of extracting an initial feature map in an embodiment of the present application. In the embodiments of the present application, the first item in the specification parameter W of a convolution layer represents the number of convolution kernels, the second item represents the number of channels, and the third and fourth items represent the size of the convolution kernel; the bias parameter B represents the bias value, which is randomly generated initially and subsequently corrected through gradient backpropagation. As shown in Figure 3, the specification parameter W of the convolution layer 301 is <32×12×3×3>, meaning that the convolution layer 301 uses 32 convolution kernels of size 3×3 to perform a convolution operation on an input image with 12 channels. The specification parameter W of the convolution layer 302 is <64×32×1×1>, meaning that 64 convolution kernels of size 1×1 are used to perform a convolution operation on an input with 32 channels. In addition, ReLU (linear rectification function) is an activation function, which is not described in detail in the embodiments of the present application. In the example of Figure 3, the control unit 102 performs convolution operations on the environment image through the convolution layer 301 and the convolution layer 302 to obtain the initial feature map of the target object.
In practical applications, the control unit 102 may perform convolution operations on the environment image through one or more convolution layers to obtain the initial feature map of the target object. The embodiments of the present application do not limit the number of convolution layers.
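As an illustration of S202, the following is a minimal sketch of the two convolution layers of Figure 3, assuming a PyTorch implementation; the framework, the padding, and the 512×512 input size are assumptions made for the example, while the kernel counts, kernel sizes, channel counts, and ReLU activations follow the description above.

```python
import torch
import torch.nn as nn

class InitialFeatureExtractor(nn.Module):
    """Sketch of the stem in Figure 3: W = <32x12x3x3> followed by W = <64x32x1x1>."""
    def __init__(self):
        super().__init__()
        # Convolution layer 301: 32 kernels of size 3x3 over a 12-channel input
        self.conv301 = nn.Conv2d(in_channels=12, out_channels=32, kernel_size=3, padding=1)
        # Convolution layer 302: 64 kernels of size 1x1 over the 32-channel result
        self.conv302 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.conv301(x))
        return self.relu(self.conv302(x))  # initial feature map of the target object

# Example: a 12-channel input frame yields a 64-channel initial feature map.
initial_map = InitialFeatureExtractor()(torch.randn(1, 12, 512, 512))
print(initial_map.shape)  # torch.Size([1, 64, 512, 512])
```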
S203. Perform a preset number of cross-convolution fusions on the initial feature map to extract multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter.
Cross-convolution fusion is used to perform different convolution-residual processing on the initial feature map and to fuse the results obtained from the different convolution-residual processing. The candidate position information includes multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter.
Figure 4 is a schematic diagram of performing a preset number of cross-convolution fusions in an embodiment of the present application. As shown in Figure 4, the control unit 102 can perform a preset number of cross-convolution fusions on the initial feature map to obtain the candidate position information of the target object. In this embodiment, cross-convolution fusion specifically fuses the convolution recognition result obtained by convolution processing with the residual recognition result obtained by residual processing to obtain a fusion result. The more convolutions are applied, the deeper the structure in which the convolution layer sits and the more high-level the semantic information extracted at that point; the fewer convolutions, the shallower the structure and the more low-level the extracted semantic information. Meanwhile, since the residual processing is implemented through multiple convolution layers in a residual network, the semantic information it yields is deeper. Therefore, fusing the convolution recognition result, which contains low-level semantic information from the convolution processing, with the residual recognition result, which contains high-level semantic information from the residual processing, enriches the feature information without adding extra convolution layers, achieving high performance and improved recognition accuracy with a lightweight network.
In some embodiments, the control unit 102 may perform only one cross-convolution fusion on the initial feature map. Figure 5 is a detailed flowchart of step S203 when only one cross-convolution fusion is performed in an embodiment of the present application. As shown in Figure 5, step S203 may include the following steps:
S2031. Use the initial feature map as the input feature of the first cross-convolution fusion and perform convolution processing to obtain a convolution recognition result.
Figure 6 is an example diagram of the network structure of the first cross-convolution fusion in an embodiment of the present application. In the embodiments of the present application, as shown in Figure 6, the control unit 102 can perform convolution processing on the initial feature map through the convolution layer 605 to obtain the convolution recognition result.
S2032. Perform residual processing on the initial feature map to obtain a residual recognition result.
In the embodiments of the present application, as shown in Figure 6, the control unit 102 can perform residual processing on the initial feature map through a residual network composed of the convolution layer 601, the convolution layer 602, the convolution layer 603, and the convolution layer 604 to obtain the residual recognition result. It can be understood that Add in Figure 6 represents an identity mapping, which is the same as the identity mapping in a conventional residual network and is not described in detail in the embodiments of the present application.
In the embodiments of the present application, in the residual network composed of the convolution layer 601, the convolution layer 602, the convolution layer 603, the convolution layer 604, and the corresponding identity mapping, the convolution layer 602, the convolution layer 603, and the identity mapping form one residual block. In the example of Figure 6, the residual network includes one residual block. In practical applications, the residual network may include more than one residual block as needed; the embodiments of the present application do not limit the number of residual blocks.
S2033. Fuse the convolution recognition result and the residual recognition result to obtain a fusion result.
In the embodiments of the present application, as shown in Figure 6, the control unit 102 can fuse the convolution recognition result output by the convolution layer 605 and the residual recognition result output by the convolution layer 604 through an array concatenation operation (i.e., a concat operation) to obtain the fusion result.
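To make steps S2031 to S2033 concrete, the following is a hedged sketch of one cross-convolution fusion, again assuming PyTorch. The channel widths and kernel sizes of the individual layers are illustrative and are not taken from Figure 6, but the structure, a plain convolution branch, a residual branch with an identity mapping, and a concat fusion, follows the text above.

```python
import torch
import torch.nn as nn

class CrossConvFusion(nn.Module):
    """Sketch of one cross-convolution fusion (steps S2031 to S2033):
    a convolution branch and a residual branch whose outputs are concatenated."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Convolution branch (plays the role of a layer such as 605 in Figure 6)
        self.conv_branch = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Residual branch: entry convolution, one residual block, exit convolution
        self.entry_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.res_conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.res_conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.exit_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        conv_result = self.relu(self.conv_branch(x))             # convolution recognition result
        h = self.relu(self.entry_conv(x))
        h = h + self.res_conv2(self.relu(self.res_conv1(h)))     # residual block with identity mapping (Add)
        residual_result = self.relu(self.exit_conv(h))           # residual recognition result
        return torch.cat([conv_result, residual_result], dim=1)  # concat fusion -> fusion result
```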
S2034. Obtain, from the fusion result, multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter.
In some embodiments, the control unit 102 performs only one cross-convolution fusion on the initial feature map. After the one cross-convolution fusion has produced the fusion result, the candidate position information of the target object can be extracted from that fusion result, where the candidate position information includes multiple candidate position parameters in the environment image and the confidence corresponding to each candidate position parameter.
In some embodiments, the control unit 102 may perform multiple cross-convolution fusions on the initial feature map. In embodiments where multiple cross-convolution fusions are performed, the processing flow of the first cross-convolution fusion is similar to the above steps S2031, S2032, and S2033 and is not repeated here.
In embodiments where multiple cross-convolution fusions are performed, the processing flow of the i-th cross-convolution fusion is shown in Figure 7. Figure 7 is a processing flowchart of the i-th cross-convolution fusion and the candidate position information extraction step in an embodiment of the present application, where 2 ≤ i ≤ K and K is the preset number of fusions. The process includes:
S2035. Use the (i-1)-th fusion result as the i-th input feature and perform convolution processing to obtain the i-th convolution recognition result.
Figure 8 is an example diagram of the network structure of the i-th cross-convolution fusion in an embodiment of the present application. As shown in Figure 8, the control unit 102 can use the (i-1)-th fusion result as the i-th input feature and perform convolution processing to obtain the i-th convolution recognition result.
S2036. Perform residual processing on the (i-1)-th fusion result to obtain the i-th residual recognition result.
As shown in Figure 8, the control unit 102 can perform residual processing on the (i-1)-th fusion result through a residual network composed of the convolution layer 801, the convolution layer 802, the convolution layer 803, the convolution layer 804, the convolution layer 805, and the convolution layer 806, combined with the ReLU function and the identity mapping (i.e., the Add function, an accumulation function, shown in Figure 8), to obtain the i-th residual recognition result.
In the residual network in Figure 8, the convolution layer 802, the convolution layer 803, and the corresponding identity mapping form one residual block, and the convolution layer 804, the convolution layer 805, and the corresponding identity mapping form another residual block. It can be seen that the residual network in the example of Figure 8 includes two residual blocks. In practical applications, the number of residual blocks in the residual network may be one or more as needed; the embodiments of the present application do not limit the number of residual blocks.
S2037. Fuse the i-th convolution recognition result and the i-th residual recognition result to obtain the i-th fusion result.
In the embodiments of the present application, as shown in Figure 8, the control unit 102 can fuse the convolution recognition result output by the convolution layer 807 and the residual recognition result output by the convolution layer 806 through a concat operation to obtain the fusion result.
S2038. Obtain, from the i-th fusion result, multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter.
In the embodiments of the present application, depending on actual needs, the control unit 102 can obtain the candidate position information of the target object from the fusion result of any iteration (i.e., the i-th). Preferably, the control unit 102 can obtain more accurate candidate position information of the target object from the K-th fusion result. Therefore, in some embodiments, after K cross-convolution fusions, the control unit 102 may obtain the candidate position information of the target object from the last (K-th) fusion result.
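The chaining of fusions described in S2035 to S2038 could be sketched as follows, reusing the CrossConvFusion module from the previous sketch. The 1×1 projection that restores the channel count after each concat is an assumption made only so the blocks can be stacked; the patent does not specify how channel widths evolve between fusions.

```python
import torch
import torch.nn as nn

class FusionStack(nn.Module):
    """Sketch of K cross-convolution fusions chained as in steps S2035 to S2037."""
    def __init__(self, channels: int = 64, k: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([CrossConvFusion(channels) for _ in range(k)])
        # Assumed 1x1 projections that bring each concatenated output back to `channels`
        self.projections = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, kernel_size=1) for _ in range(k)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The (i-1)-th fusion result is the input feature of the i-th cross-convolution fusion.
        for block, proj in zip(self.blocks, self.projections):
            x = proj(block(x))
        return x  # the K-th fusion result, from which the candidate position information is read
```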
The process by which the control unit 102 obtains the candidate position information of the target object from the fusion result is described in detail below.
In the embodiments of the present application, the control unit 102 can extract multiple fused feature maps from the fusion result through multiple convolution layers and related activation functions, such as the ReLU function (linear rectification function) and the sigmoid function (S-shaped growth curve function), and thereby obtain the candidate position information of the target object. The fused feature maps contain the candidate position information, and the candidate position information includes multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter. Because the fused feature maps extracted through the multiple convolution layers and related functions can clearly and accurately represent the multiple candidate position parameters of the target object in the environment image and the confidence corresponding to each candidate position parameter, analyzing these fused feature maps allows the candidate position information of the target object to be determined more quickly and more accurately. Figure 9 is an example structural diagram of the multiple convolution layers and related functions used to obtain the candidate position information in an embodiment of the present application. As shown in Figure 9, the output of the convolution layer 901 applied to the fusion result is processed in three ways.
The first processing path is: processing through the convolution layer 902, the ReLU function, and the convolution layer 903, then normalizing to the range 0 to 1 through the sigmoid function, thereby obtaining feature map 1, which shows the confidence corresponding to the target object. Figure 10 is a schematic diagram of the sigmoid function in an embodiment of the present application, in which the horizontal axis x represents the pixel value and the vertical axis y represents the probability (confidence). Feature map 1 can be expressed as 1×1×128×128, where the first 1 represents one image and the second 1 represents one parameter, namely the probability that each pixel contains the target object, denoted by the confidence obj_value. 128×128 is the size of the feature map; feature map 1 has 128×128 pixels. Figure 11 is an example diagram of feature map 1 in an embodiment of the present application. As shown in Figure 11, feature map 1 has 128×128 pixels, and the parameter at each pixel represents the probability that the pixel contains the target object.
The second processing path is: processing through the convolution layer 904, the ReLU function, and the convolution layer 905, thereby obtaining feature map 2 and feature map 3. The output of the convolution layer 905 can be expressed as 1×2×128×128, where 1 represents one image and 2 represents two 128×128 feature map outputs whose values represent the x and y magnitudes, denoted x_value and y_value; that is, there are 128×128 values each for x_value and y_value.
The third processing path is: processing through the convolution layer 906, the ReLU function, and the convolution layer 907, thereby obtaining feature map 4 and feature map 5. The output of the convolution layer 907 can be expressed as 1×2×128×128, where 1 represents one image and 2 represents two 128×128 feature map outputs whose values represent the w and h magnitudes, denoted w_value and h_value; that is, there are 128×128 values each for w_value and h_value.
According to the above processing, five fused feature maps can be obtained: feature map 1, feature map 2, feature map 3, feature map 4, and feature map 5. These five fused feature maps are used to represent the candidate position information. Figure 12 is a schematic diagram of the candidate position information in an embodiment of the present application. In Figure 12, the fused feature map corresponding to obj_value at the bottom is feature map 1 in Figure 9, the fused feature map corresponding to x_value at the top is feature map 2 in Figure 9, the fused feature map corresponding to y_value at the top is feature map 3 in Figure 9, the fused feature map corresponding to w_value in the middle is feature map 4 in Figure 9, and the fused feature map corresponding to h_value in the middle is feature map 5 in Figure 9. As shown in Figure 12, the candidate position information includes multiple candidate position parameters of the target object in the environment image, namely the candidate position parameters in feature map 2, feature map 3, feature map 4, and feature map 5: x_value represents the x coordinate of the target object in the environment image, y_value represents the y coordinate of the target object in the environment image, w_value represents the width of the target object in the environment image, and h_value represents the height of the target object in the environment image. The candidate position information also includes the confidence corresponding to each candidate position parameter, namely the confidence parameter obj_value in feature map 1.
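A hedged sketch of the three processing paths of Figure 9 is given below (PyTorch assumed). The intermediate channel widths are illustrative; only the branch structure and the output shapes, 1×1×128×128 for the confidence map and 1×2×128×128 for the coordinate and size maps, follow the description above.

```python
import torch
import torch.nn as nn

class CandidateHead(nn.Module):
    """Sketch of the three processing paths of Figure 9 that produce feature maps 1 to 5."""
    def __init__(self, in_channels: int = 64, mid_channels: int = 64):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, mid_channels, 3, padding=1)  # convolution layer 901
        self.obj_head = nn.Sequential(  # path 1: conv 902, ReLU, conv 903, sigmoid
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(mid_channels, 1, 1), nn.Sigmoid())
        self.xy_head = nn.Sequential(   # path 2: conv 904, ReLU, conv 905
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(mid_channels, 2, 1))
        self.wh_head = nn.Sequential(   # path 3: conv 906, ReLU, conv 907
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(mid_channels, 2, 1))

    def forward(self, fusion_result: torch.Tensor):
        f = self.shared(fusion_result)
        obj = self.obj_head(f)  # 1x1x128x128: obj_value per pixel (feature map 1)
        xy = self.xy_head(f)    # 1x2x128x128: x_value and y_value (feature maps 2 and 3)
        wh = self.wh_head(f)    # 1x2x128x128: w_value and h_value (feature maps 4 and 5)
        return obj, xy, wh

obj, xy, wh = CandidateHead()(torch.randn(1, 64, 128, 128))
print(obj.shape, xy.shape, wh.shape)  # [1,1,128,128], [1,2,128,128], [1,2,128,128]
```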
The embodiments of the present application provide the example shown in Figure 9 for obtaining the candidate position information of the target object from the fusion result. In practical applications, the control unit 102 may also obtain the candidate position information of the target object from the fusion result in other ways, which is not limited in the embodiments of the present application.
It can be understood that the x coordinate of the target object in the environment image may refer to the x coordinate of the upper-left vertex of the target object, the x coordinate of the center point of the target object, or the coordinate of a designated point associated with the target object, which is not limited in the embodiments of the present application. Similarly, the y coordinate of the target object in the environment image may refer to the y coordinate of the upper-left vertex of the target object, the y coordinate of the center point of the target object, or the coordinate of a designated point associated with the target object, which is likewise not limited in the embodiments of the present application.
S204. Determine the target position information of the target object in the environment image according to the multiple candidate position parameters and the confidence corresponding to each candidate position parameter.
In the embodiments of the present application, the control unit 102 can select a suitable candidate position parameter from the multiple candidate position parameters as the target position information of the target object in the environment image according to the confidence corresponding to each candidate position parameter. For example, in Figure 12, among all the confidence parameters obj_value in the fused feature map corresponding to obj_value, assume the largest confidence parameter is the one at coordinate (1,1); then the candidate position parameters corresponding to coordinate (1,1) can be extracted from the remaining four fused feature maps as the target position information of the target object in the environment image. Selecting the candidate position parameters corresponding to the maximum confidence parameter as the target position information of the target object in the environment image is therefore one implementation. The present application also provides another implementation, as follows:
Figure 13 is a flowchart of an implementation for determining the target position information of the target object in the environment image in an embodiment of the present application. The process includes:
S2041. Extract the candidate position parameters corresponding to the same position in the multiple fused feature maps and the confidence corresponding to those candidate position parameters, and form a candidate parameter set for the target object.
In the embodiments of the present application, the control unit 102 can extract the candidate position parameters corresponding to the same position and the confidence corresponding to those candidate position parameters. For example, for the position at coordinate (1,1), the parameter at coordinate (1,1) is selected from each of the five fused feature maps of Figure 12, namely x_{1,1}, y_{1,1}, w_{1,1}, h_{1,1}, and obj_{1,1}, giving the candidate parameter set (x_{1,1}, y_{1,1}, w_{1,1}, h_{1,1}, obj_{1,1}) corresponding to the position (1,1). Therefore, in the embodiments of the present application, one position corresponds to one candidate parameter set. For example, the candidate parameter set corresponding to the position (1,1) is (x_{1,1}, y_{1,1}, w_{1,1}, h_{1,1}, obj_{1,1}).
By analogy, for the candidate position information shown in Figure 12, 128×128 candidate parameter sets (x_value, y_value, w_value, h_value, obj_value) can be extracted, amounting to 128*128*5 values in total, which can be expressed by the following formula:

$$\begin{pmatrix}
(x_{0,0}, y_{0,0}, w_{0,0}, h_{0,0}, obj_{0,0}) & \cdots & (x_{0,127}, y_{0,127}, w_{0,127}, h_{0,127}, obj_{0,127}) \\
\vdots & \ddots & \vdots \\
(x_{127,0}, y_{127,0}, w_{127,0}, h_{127,0}, obj_{127,0}) & \cdots & (x_{127,127}, y_{127,127}, w_{127,127}, h_{127,127}, obj_{127,127})
\end{pmatrix}$$

where x_{0,0} represents the x_value corresponding to the coordinate (0,0), and the other parameters follow by analogy and are not described one by one here.
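For illustration, the per-position candidate parameter sets of step S2041 can be formed by stacking the five fused feature maps; the sketch below assumes NumPy arrays and fills the maps with random values purely as placeholders.

```python
import numpy as np

# The five 128x128 fused feature maps (random placeholders for illustration).
obj_map, x_map, y_map, w_map, h_map = np.random.rand(5, 128, 128)

# candidate_sets[i, j] = (x_{i,j}, y_{i,j}, w_{i,j}, h_{i,j}, obj_{i,j})
candidate_sets = np.stack([x_map, y_map, w_map, h_map, obj_map], axis=-1)
print(candidate_sets.shape)  # (128, 128, 5), i.e. 128*128*5 values in total
```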
S2042. Select a target candidate parameter set from the candidate parameter sets according to the result of comparing the confidence with a preset confidence threshold.
In the embodiments of the present application, the control unit 102 can extract the corresponding confidence from every candidate parameter set and compare each confidence with the preset confidence threshold, thereby obtaining the comparison results between the confidences and the preset confidence threshold. It can be understood that these comparison results identify the set of confidences greater than the preset confidence threshold, the set of confidences smaller than the preset confidence threshold, and the set of confidences equal to the preset confidence threshold.
In a preferred embodiment, the control unit 102 may take the candidate parameter sets corresponding to the confidences greater than the preset confidence threshold as the target candidate parameter sets. For example, if the preset confidence threshold is set to 0.7, the candidate parameter sets whose confidence is greater than 0.7 are taken as the target candidate parameter sets. The target candidate parameter sets can be expressed by the following formula:

$$\left\{\, (x_{i,j},\, y_{i,j},\, w_{i,j},\, h_{i,j},\, obj_{i,j}) \;\middle|\; obj_{i,j} > 0.7,\ 0 \le i, j \le 127 \,\right\}$$

In one case, the target candidate parameter set is empty, which indicates that no target object has been found in the current environment image. The control unit 102 can then control the mobile robot 100 to rotate left or right until a target object is detected in the environment image.
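A minimal sketch of the screening in step S2042, using a candidate_sets array shaped like the one built in the previous sketch and the example threshold of 0.7; the handling of the empty case is indicated only by a comment, since the rotation command itself depends on the robot platform.

```python
import numpy as np

CONF_THRESHOLD = 0.7  # preset confidence threshold from the example above

# candidate_sets as built in the previous sketch: shape (128, 128, 5), obj_value last.
candidate_sets = np.random.rand(128, 128, 5)

mask = candidate_sets[..., 4] > CONF_THRESHOLD
target_candidates = candidate_sets[mask]  # target candidate parameter sets, shape (N, 5)

if target_candidates.size == 0:
    # Empty set: no target object in the current environment image; the control unit
    # would rotate the mobile robot left or right and run detection again.
    pass
```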
In other cases, there is at least one target candidate parameter set, and the target position information of the target object can then be determined through step S2043.
S2043. Determine the target position information of the target object according to the candidate position parameters in the target candidate parameter set.
In one case, there is exactly one target candidate parameter set; the control unit 102 can then directly use that candidate parameter set as the target position information of the target object.
In another case, there are multiple target candidate parameter sets; the control unit 102 can then, according to a preset condition, select the candidate position parameters that satisfy the preset condition from the target candidate parameter sets as a new target candidate parameter set, achieving further screening. A preferred implementation provided by the embodiments of the present application is as follows:
When a confidence in the candidate parameter sets is the maximum, the control unit 102 determines the candidate parameter set corresponding to the maximum confidence as the target candidate parameter set. For example, suppose the target candidate parameter sets include three candidate parameter sets, namely (x_{1,1}, y_{1,1}, w_{1,1}, h_{1,1}, obj_{1,1}), (x_{2,2}, y_{2,2}, w_{2,2}, h_{2,2}, obj_{2,2}), and (x_{3,3}, y_{3,3}, w_{3,3}, h_{3,3}, obj_{3,3}). The control unit 102 detects that, among obj_{1,1}, obj_{2,2}, and obj_{3,3}, the value of obj_{1,1} is the largest. The control unit 102 can then determine the candidate parameter set (x_{1,1}, y_{1,1}, w_{1,1}, h_{1,1}, obj_{1,1}) corresponding to obj_{1,1} as the target candidate parameter set. In this implementation, determining the candidate parameter set corresponding to the maximum confidence as the target candidate parameter set achieves better recognition accuracy.
Finally, the control unit 102 can determine the target position information of the target object according to the candidate position parameters in the target candidate parameter set. For example, if the target candidate parameter set is (x_{1,1}, y_{1,1}, w_{1,1}, h_{1,1}, obj_{1,1}), then x_{1,1} can be determined as the x coordinate of the target object in the environment image, y_{1,1} as the y coordinate of the target object in the environment image, w_{1,1} as the width of the target object in the environment image, and h_{1,1} as the height of the target object in the environment image.
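The maximum-confidence selection of step S2043 could then be sketched as follows, operating on a target_candidates array shaped like the one produced by the previous sketch.

```python
import numpy as np

# target_candidates as produced by the previous sketch, shape (N, 5).
target_candidates = np.random.rand(3, 5)

best = target_candidates[np.argmax(target_candidates[:, 4])]  # largest obj_value
x, y, w, h, obj = best
# (x, y) is the target object's position and (w, h) its width and height in the environment image.
```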
In this embodiment, suitable candidate position parameters are screened through the confidence corresponding to the candidate position parameters, thereby improving the recognition accuracy.
S205. Control the motion state of the mobile robot according to the target position information so that the mobile robot docks with the target object.
In the embodiments of the present application, after determining the target position information of the target object in the environment image, the control unit 102 can control the motion state of the mobile robot according to that position information so that the mobile robot docks with the target object. The motion state may include, but is not limited to, the motion attitude direction and the forward speed; the motion attitude direction may be to the left, to the right, straight ahead, or offset by a certain angle. Specifically, Figure 14 is a flowchart of controlling the motion state of the mobile robot in an embodiment of the present application. The process includes:
S2051. Obtain the center point position of the environment image.
In the embodiments of the present application, the control unit 102 first determines the center point position of the environment image. It can be understood that the control unit 102 can establish a coordinate system with a vertex of the environment image as the origin, for example the top-left vertex of the environment image, and describe the center point position of the environment image in that coordinate system. For example, as shown in Figure 15, the top-left vertex of the environment image is taken as the origin of the coordinate system. If the width and height of each frame of the environment image are 640*480, then the top-right vertex of the environment image is (640, 0), the bottom-left vertex is (0, 480), the bottom-right vertex is (640, 480), and the center point position of the environment image is (640/2, 480/2), i.e., (320, 240).
S2052. Calculate the center point position of the target object according to the target position information.
In the embodiments of the present application, after the control unit 102 has determined the x coordinate, y coordinate, width, and height of the target object, it can identify the top-left corner coordinates (x_0, y_0) and the bottom-right corner coordinates (x_1, y_1) of the target object. For example, if in the preset position information the x and y coordinates are the coordinates of the top-left vertex of the target object, then x_0 is the x coordinate and y_0 is the y coordinate; the bottom-right corner coordinate x_1 equals the x coordinate plus the width w, and y_1 equals the y coordinate minus the height h.
After identifying the top-left corner coordinates (x_0, y_0) and the bottom-right corner coordinates (x_1, y_1) of the target object, the control unit 102 can calculate the center point position (X_c, Y_c) of the target object. The center point position (X_c, Y_c) of the target object can be expressed by the following formulas:
X_c = (x_0 + x_1) / 2
Y_c = (y_0 + y_1) / 2
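A small sketch of steps S2051 and S2052 under the 640*480 example (plain Python). The bottom-right corner is taken here as (x + w, y + h), which assumes a top-left origin with y increasing downward; the text above computes y_1 as the y coordinate minus the height h, so the sign should be adapted to the coordinate convention actually used in Figure 15.

```python
def image_center(width: int = 640, height: int = 480) -> tuple[float, float]:
    """Center point of the environment image (S2051)."""
    return width / 2, height / 2

def target_center(x: float, y: float, w: float, h: float) -> tuple[float, float]:
    """Center point of the target object (S2052) from its top-left corner and size."""
    x0, y0 = x, y
    x1, y1 = x + w, y + h  # assumed bottom-right corner; see the note on y_1 above
    return (x0 + x1) / 2, (y0 + y1) / 2

print(image_center())                   # (320.0, 240.0)
print(target_center(300, 180, 40, 40))  # (320.0, 200.0)
```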
S2053. Adjust the motion state of the mobile robot according to the offset between the center point position of the target object and the center point position of the environment image, so that the mobile robot docks with the target object.
In the embodiments of the present application, the control unit 102 can adjust the motion state of the mobile robot according to the offset between the center point position of the target object and the center point position of the environment image. In the embodiments of the present application, the sign of the offset can indicate the direction of the offset. For example, a positive offset in the x coordinate indicates an offset to the right, and a negative offset in the x coordinate indicates an offset to the left. The magnitude of the offset indicates the degree of the offset, and it can be understood that an offset of 0 means there is no offset. Therefore, according to the offset between the center point position of the target object and the center point position of the environment image, the control unit 102 can determine the direction and magnitude of the target object's offset from the center of the environment image, and thereby drive the mobile robot toward a state with no offset, so that the mobile robot docks with the target object.
In this embodiment, by comparing the center point position of the target object with the center point position of the environment image, the offset between the mobile robot and the target object can be accurately identified, and the motion state of the mobile robot can be adjusted according to that offset so that the mobile robot docks with the target object accurately.
The embodiments of the present application provide a specific implementation that includes the following steps: determine the forward direction and forward distance of the mobile robot according to the offset; adjust the motion attitude direction of the mobile robot according to the forward direction; and control the mobile robot to move the forward distance along the motion attitude direction so that the mobile robot docks with the target object. If the offset indicates an offset to the left, the forward direction of the mobile robot can be determined as forward-left; if the offset indicates an offset to the right, the forward direction can be determined as forward-right. The forward distance can be positively correlated with the magnitude of the offset. In this embodiment, determining the forward direction and forward distance of the mobile robot according to the offset enables more precise control of the mobile robot's motion attitude direction and forward distance, improving the docking between the mobile robot and the target object.
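As a hedged sketch of this offset-based decision, the mapping from the horizontal offset to a forward direction and a forward distance might look as follows; the image center of 320 comes from the 640*480 example used in this embodiment, while the proportional gain is purely an illustrative assumption.

```python
def steering_command(target_cx: float, image_cx: float = 320.0, gain: float = 0.01):
    """Map the horizontal offset between the target center and the image center
    to a forward direction and a forward distance positively correlated with the offset."""
    offset = target_cx - image_cx
    if offset > 0:
        direction = "forward-right"
    elif offset < 0:
        direction = "forward-left"
    else:
        direction = "straight ahead"
    distance = gain * abs(offset)  # illustrative proportional relation
    return direction, distance

print(steering_command(400.0))  # ('forward-right', 0.8)
```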
As an example, consider a mobile robot docking with a charging base. In this example, the mobile robot may be a quadruped mobile robot whose four wheels comprise two driving wheels and two steering wheels located on the left and right sides of the robot. The driving wheels are rear wheels that provide driving force, and the steering wheels are front wheels without independent power, which may also be called driven wheels. For example, an image read by a camera with a specification of 640*480 has a width and height of 640*480, and the center point position of the image can be set to 640/2 = 320. The height of the charging base depends on its actual physical position and should be within the camera's field of view; for example, the height of the charging base may range from 0 to 480, and in this example it is set to 200. When the center point position of the charging base equals 320, that is, the charging base is exactly at the center of the image, the quadruped robot is driven straight ahead until it touches the charging base. When the center point position of the charging base is less than 320, that is, the charging base is to the left of the image center, the quadruped robot is driven to move to the left. The specific driving method includes: the right driving wheel starts to drive while the left driving wheel is given no power, so that the two steering wheels turn to the left and the robot turns left; the robot keeps turning left until the center point position of the charging base equals the center point position of the image, at which point it stops turning and goes straight ahead. If the center point position of the charging base is greater than 320, that is, the charging base is to the right of the image center, the quadruped robot is controlled to start moving to the right; the steering method is the same as for turning left and is not repeated here.
Figure 15 is the first schematic diagram of the target object in the environment image in an embodiment of the present application. In Figure 15, the origin coordinate is (0, 0) and the top corner coordinate of the environment image is (640, 0). As shown in Figure 15, the center point 1502 of the target object is to the right of the center point 1501 of the environment image, so the control unit 102 can determine that the forward direction of the mobile robot is forward-right; the rightward arrow shown in Figure 15 indicates the mobile robot's rightward forward direction.
Figure 16 is the second schematic diagram of the target object in the environment image in an embodiment of the present application. In Figure 16, the origin coordinate is (0, 0) and the top corner coordinate of the environment image is (640, 0). As shown in Figure 16, the center point 1602 of the target object coincides with the center point 1601 of the environment image, so the control unit 102 can determine that the forward direction of the mobile robot is straight ahead; the forward arrow shown in Figure 16 indicates the mobile robot's straight-ahead direction.
Figure 17 is the third schematic diagram of the target object in the environment image in an embodiment of the present application. In Figure 17, the origin coordinate is (0, 0) and the top corner coordinate of the environment image is (640, 0). As shown in Figure 17, the center point 1702 of the target object is to the left of the center point 1701 of the environment image, so the control unit 102 can determine that the forward direction of the mobile robot is forward-left; the leftward arrow shown in Figure 17 indicates the mobile robot's leftward forward direction.
The control unit 102 can then adjust the motion attitude direction of the mobile robot according to the forward direction. For example, if the forward direction is forward-right, the control unit 102 can make the steering wheels of the mobile robot turn to the right, thereby adjusting the mobile robot's motion attitude direction; if the forward direction is forward-left, the control unit 102 can make the steering wheels of the mobile robot turn to the left, thereby adjusting the mobile robot's motion attitude direction.
It can be understood that the steering wheels of the mobile robot may be the two front wheels of the mobile robot or the two rear wheels of the mobile robot, which is not limited in the embodiments of the present application.
It can be understood that, by adjusting the motion state of the mobile robot 100 in the above manner, the mobile robot can be made to move toward the target object, so that the mobile robot 100 docks with the target object. In the return-to-base charging scenario, the target object may be a target charging base, and the mobile robot 100 moves toward and docks with the target charging base during the return-to-base charging process to achieve charging.
Figure 18 is a schematic diagram of the internal modules of a control unit 102 provided by an embodiment of the present application. The internal modules of the control unit 102 include:
an acquisition module 1801, configured to execute or implement step S201 in the embodiments corresponding to Figure 2 above; and
a processing module 1802, configured to execute or implement steps S202, S203, S204, and S205 in the embodiments corresponding to Figure 2 above.
Figure 19 is a schematic diagram of a terminal device provided by an embodiment of the present application. The terminal device 1900 may be the mobile robot 100 in the above embodiments. The terminal device 1900 includes a memory 1902, a processor 1901, and a computer program 1903 stored in the memory 1902 and executable on the processor 1901. When the processor 1901 executes the computer program 1903, the methods of the embodiments corresponding to Figure 2, Figure 5, Figure 7, Figure 13, or Figure 14 are implemented.
It should be noted that, since the information exchange and execution processes between the above devices/units are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiments section and are not repeated here.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps in each of the above method embodiments.
The embodiments of the present application provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in each of the above method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the steps of each of the above method embodiments can be implemented. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunications signals.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or described in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

Claims (10)

  1. A target docking method based on image recognition, comprising:
    obtaining an environment image while a mobile robot is moving toward a target object;
    extracting an initial feature map of the target object from the environment image;
    performing a preset number of cross-convolution fusions on the initial feature map to extract multiple candidate position parameters of the target object in the environment image and a confidence corresponding to each of the candidate position parameters, wherein the cross-convolution fusion comprises performing different convolution-residual processing on the initial feature map and fusing the results obtained from the different convolution-residual processing;
    determining target position information of the target object in the environment image according to the multiple candidate position parameters and the confidence corresponding to each of the candidate position parameters; and
    controlling a motion state of the mobile robot according to the target position information, so that the mobile robot docks with the target object.
  2. The method of claim 1, wherein performing the preset number of cross-convolution fusions on the initial feature map to extract the plurality of candidate position parameters of the target object in the environment image and the confidence corresponding to each of the candidate position parameters comprises:
    using the initial feature map as an input feature of the first cross-convolution fusion and performing convolution processing to obtain a convolution recognition result;
    performing residual processing on the initial feature map to obtain a residual recognition result;
    fusing the convolution recognition result and the residual recognition result to obtain a fusion result; and
    obtaining, from the fusion result, the plurality of candidate position parameters of the target object in the environment image and the confidence corresponding to each of the candidate position parameters.
  3. The method of claim 2, wherein performing the preset number of cross-convolution fusions on the initial feature map to extract the plurality of candidate position parameters of the target object in the environment image and the confidence corresponding to each of the candidate position parameters further comprises:
    using the (i-1)-th fusion result as an i-th input feature and performing convolution processing to obtain an i-th convolution recognition result;
    performing residual processing on the (i-1)-th fusion result to obtain an i-th residual recognition result;
    fusing the i-th convolution recognition result and the i-th residual recognition result to obtain an i-th fusion result; and
    obtaining, from the i-th fusion result, the plurality of candidate position parameters of the target object in the environment image and the confidence corresponding to each of the candidate position parameters;
    wherein 2 ≤ i ≤ K, and K is the preset number of times.
  4. The method of claim 1, wherein the fusion result comprises a plurality of fused feature maps corresponding to the environment image, and determining the target position information of the target object in the environment image according to the plurality of candidate position parameters and the confidence corresponding to each of the candidate position parameters comprises:
    extracting the candidate position parameters corresponding to a same position in the plurality of fused feature maps and the confidence corresponding to those candidate position parameters, and forming a candidate parameter set for the target object, wherein each same position corresponds to one candidate parameter set;
    selecting a target candidate parameter set from the candidate parameter sets according to a comparison result between the confidence and a preset confidence threshold; and
    determining the target position information of the target object according to the candidate position parameters in the target candidate parameter set.
  5. The method of claim 4, wherein selecting the target candidate parameter set from the candidate parameter sets according to the comparison result between the confidence and the preset confidence threshold comprises:
    when the confidence in a candidate parameter set of the comparison result is the maximum, determining the candidate parameter set corresponding to the maximum confidence as the target candidate parameter set.
  6. The method of any one of claims 1 to 5, wherein controlling the motion state of the mobile robot according to the target position information so that the mobile robot docks with the target object comprises:
    acquiring a center point position of the environment image;
    calculating a center point position of the target object according to the target position information; and
    adjusting the motion state of the mobile robot according to an offset between the center point position of the target object and the center point position of the environment image, so that the mobile robot docks with the target object.
  7. The method of claim 6, wherein the motion state comprises a motion posture direction, and adjusting the motion state of the mobile robot according to the offset between the center point position of the target object and the center point position of the environment image so that the mobile robot docks with the target object comprises:
    determining a forward direction and a forward distance of the mobile robot according to the offset;
    adjusting the motion posture direction of the mobile robot according to the forward direction; and
    controlling the mobile robot to move the forward distance along the motion posture direction, so that the mobile robot docks with the target object.
  8. The method of claim 7, wherein the forward distance is positively correlated with the offset.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1 to 8.
  10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
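
The cross-convolution fusion of claims 1 to 3 can be pictured as a stack of K rounds, each running a convolution branch and a residual branch over the incoming feature map and fusing the two results, with round i consuming the fusion result of round i-1. The sketch below is a minimal PyTorch illustration of that structure only; the layer sizes, the element-wise-addition fusion, and the 5-channel head (four box parameters plus one confidence per spatial location) are assumptions made for illustration, not the patented network.

```python
import torch
import torch.nn as nn

class CrossConvFusion(nn.Module):
    """One fusion round: a convolution branch and a residual branch run in
    parallel over the same input, then fused (here by element-wise addition)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_branch = nn.Sequential(             # produces the "convolution recognition result"
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.res_branch = nn.Sequential(               # produces the "residual recognition result"
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        conv_out = self.conv_branch(x)
        res_out = x + self.res_branch(x)
        return conv_out + res_out                      # fusion result

class CandidateHead(nn.Module):
    """K fusion rounds followed by a 1x1 head that predicts, per spatial
    location, four candidate position parameters and one confidence."""
    def __init__(self, channels: int, k_rounds: int):
        super().__init__()
        self.rounds = nn.ModuleList([CrossConvFusion(channels) for _ in range(k_rounds)])
        self.head = nn.Conv2d(channels, 5, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        for fusion_round in self.rounds:               # round i takes the (i-1)-th fusion result
            feat = fusion_round(feat)
        out = self.head(feat)
        boxes, conf = out[:, :4], out[:, 4:].sigmoid()
        return boxes, conf                             # candidate position parameters, confidences
```

For example, `CandidateHead(channels=64, k_rounds=3)` applied to a 64-channel initial feature map returns an (N, 4, H, W) tensor of candidate position parameters and an (N, 1, H, W) tensor of confidences, one candidate per spatial location.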
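Claims 4 and 5 then keep, among the per-location candidate parameter sets, the one whose confidence both clears a preset threshold and is the maximum. A minimal NumPy sketch of that selection step, assuming the candidates are already laid out as (H, W, 4) box parameters with an (H, W) confidence map; the 0.5 threshold is a placeholder, not a value taken from the application:

```python
import numpy as np

def select_target_candidate(boxes: np.ndarray, conf: np.ndarray, conf_threshold: float = 0.5):
    """boxes: (H, W, 4) candidate position parameters per location.
    conf:  (H, W) confidence per location.
    Returns the candidate position parameters of the highest-confidence
    location that clears the preset confidence threshold, or None."""
    mask = conf >= conf_threshold                  # comparison with the preset confidence threshold
    if not mask.any():
        return None                                # no candidate is confident enough
    conf_masked = np.where(mask, conf, -np.inf)    # suppress below-threshold locations
    y, x = np.unravel_index(np.argmax(conf_masked), conf.shape)
    return boxes[y, x]                             # target candidate parameter set
```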
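Claims 6 to 8 dock the robot by comparing the centre point of the detected target with the centre point of the environment image and turning the offset between them into a heading correction and a forward distance that grows with the offset. The sketch below is one possible proportional realisation under stated assumptions: the (cx, cy, w, h) box convention and the gain value are illustrative, since the claims only require the forward distance to be positively correlated with the offset.

```python
import math

def docking_command(image_size, target_box, gain=0.005):
    """image_size: (width, height) of the environment image in pixels.
    target_box: (cx, cy, w, h) target position information in image coordinates.
    Returns (heading_offset, forward_distance) for the motion controller."""
    img_cx, img_cy = image_size[0] / 2.0, image_size[1] / 2.0
    dx = target_box[0] - img_cx                    # horizontal offset between the two centre points
    dy = target_box[1] - img_cy                    # vertical offset between the two centre points
    heading_offset = gain * dx                     # steer so the target centre moves toward the image centre
    forward_distance = gain * math.hypot(dx, dy)   # positively correlated with the offset (claim 8)
    return heading_offset, forward_distance
```

In a typical loop, the robot would move by the returned heading and distance, re-detect the target, and repeat until the two centre points coincide and docking is complete.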
PCT/CN2022/132656 2022-04-22 2022-11-17 Target docking method based on image recognition and terminal device and medium thereof WO2023202062A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210437107.7 2022-04-22
CN202210437107.7A CN114789440B (en) 2022-04-22 2022-04-22 Target docking method, device, equipment and medium based on image recognition

Publications (1)

Publication Number Publication Date
WO2023202062A1 true WO2023202062A1 (en) 2023-10-26

Family

ID=82460837

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/132656 WO2023202062A1 (en) 2022-04-22 2022-11-17 Target docking method based on image recognition and terminal device and medium thereof

Country Status (2)

Country Link
CN (1) CN114789440B (en)
WO (1) WO2023202062A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114789440B (en) * 2022-04-22 2024-02-20 深圳市正浩创新科技股份有限公司 Target docking method, device, equipment and medium based on image recognition

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163197A (en) * 2018-08-24 2019-08-23 腾讯科技(深圳)有限公司 Object detection method, device, computer readable storage medium and computer equipment
CN110751134A (en) * 2019-12-23 2020-02-04 长沙智能驾驶研究院有限公司 Target detection method, storage medium and computer device
CN111136648A (en) * 2019-12-27 2020-05-12 深圳市优必选科技股份有限公司 Mobile robot positioning method and device and mobile robot
CN111694358A (en) * 2020-06-19 2020-09-22 北京海益同展信息科技有限公司 Method and device for controlling transfer robot, and storage medium
CN112307853A (en) * 2019-08-02 2021-02-02 成都天府新区光启未来技术研究院 Detection method of aerial image, storage medium and electronic device
CN113989616A (en) * 2021-10-26 2022-01-28 北京锐安科技有限公司 Target detection method, device, equipment and storage medium
CN114789440A (en) * 2022-04-22 2022-07-26 深圳市正浩创新科技股份有限公司 Target docking method, device, equipment and medium based on image recognition

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866526A (en) * 2018-08-28 2020-03-06 北京三星通信技术研究有限公司 Image segmentation method, electronic device and computer-readable storage medium
CN109506628A (en) * 2018-11-29 2019-03-22 东北大学 Object distance measuring method under a kind of truck environment based on deep learning
CN109740463A (en) * 2018-12-21 2019-05-10 沈阳建筑大学 A kind of object detection method under vehicle environment
CN110032196B (en) * 2019-05-06 2022-03-29 北京云迹科技股份有限公司 Robot recharging method and device
CN110660082B (en) * 2019-09-25 2022-03-08 西南交通大学 Target tracking method based on graph convolution and trajectory convolution network learning
CN111598950A (en) * 2020-04-23 2020-08-28 四川省客车制造有限责任公司 Automatic passenger train hinging method and system based on machine vision
CN113989753A (en) * 2020-07-09 2022-01-28 浙江大华技术股份有限公司 Multi-target detection processing method and device
CN112767443A (en) * 2021-01-18 2021-05-07 深圳市华尊科技股份有限公司 Target tracking method, electronic equipment and related product

Also Published As

Publication number Publication date
CN114789440B (en) 2024-02-20
CN114789440A (en) 2022-07-26

Similar Documents

Publication Publication Date Title
US20230154015A1 (en) Virtual teach and repeat mobile manipulation system
CN113108771B (en) Movement pose estimation method based on closed-loop direct sparse visual odometer
Zhang et al. Robotic grasp detection based on image processing and random forest
US11887363B2 (en) Training a deep neural network model to generate rich object-centric embeddings of robotic vision data
CN111368759B (en) Monocular vision-based mobile robot semantic map construction system
CN112507918B (en) Gesture recognition method
CN111310631A (en) Target tracking method and system for rotor operation flying robot
CN110146080B (en) SLAM loop detection method and device based on mobile robot
WO2023202062A1 (en) Target docking method based on image recognition and terminal device and medium thereof
JP2022553356A (en) Data processing method and related device
WO2022127814A1 (en) Method and apparatus for detecting salient object in image, and device and storage medium
WO2023173950A1 (en) Obstacle detection method, mobile robot, and machine readable storage medium
CN112639874A (en) Object following method, object following apparatus, removable device, and storage medium
CN114283294A (en) Neural network point cloud feature extraction method, system, equipment and storage medium
Liu et al. A deep Q-learning network based active object detection model with a novel training algorithm for service robots
WO2021203368A1 (en) Image processing method and apparatus, electronic device and storage medium
CN112529917A (en) Three-dimensional target segmentation method, device, equipment and storage medium
CN112614161A (en) Three-dimensional object tracking method based on edge confidence
KR20220055072A (en) Method for indoor localization using deep learning
CN115272275A (en) Tray, obstacle detection positioning system and method based on RGB-D camera and neural network model
Wong et al. Ant Colony Optimization and image model-based robot manipulator system for pick-and-place tasks
CN112699800A (en) Vehicle searching method and device, storage medium and terminal
CN112036466A (en) Mixed terrain classification method
Guo et al. 3D Lidar SLAM Based on Ground Segmentation and Scan Context Loop Detection
Nakashima et al. Sir-net: scene-independent end-to-end trainable visual relocalizer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22938281

Country of ref document: EP

Kind code of ref document: A1