CN112906691B - Distance measurement method and device, storage medium and electronic equipment

Info

Publication number: CN112906691B
Application number: CN202110139380.7A
Authority: CN (China)
Prior art keywords: ranging, region, image, interest, target
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112906691A
Inventor: 蒋海滨
Current Assignee: Shenzhen Anngic Technology Co., Ltd.
Original Assignee: Shenzhen Anngic Technology Co., Ltd.
Application filed by Shenzhen Anngic Technology Co., Ltd.; priority to CN202110139380.7A; publication of CN112906691A; application granted and published as CN112906691B.

Classifications

    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI] (image preprocessing for image or video recognition or understanding)
    • G01S 7/41: Radar-type systems using analysis of echo signal for target characterisation; target signature; target cross-section
    • G01S 7/417: Target characterisation involving the use of neural networks
    • G06F 18/253: Fusion techniques of extracted features (pattern recognition)
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06N 3/08: Learning methods (neural networks)
    • G06V 2201/07: Target detection (indexing scheme for image or video recognition or understanding)

Abstract

The application relates to the technical field of distance measurement, and provides a distance measurement method and device, a storage medium and electronic equipment. The ranging method comprises the following steps: detecting targets in an original image by using a target detection model and extracting a full-image feature of the original image; acquiring a region-of-interest image according to the position, in the original image, of the region of interest corresponding to a target to be measured, wherein the target to be measured is among the detected targets, the region of interest is determined according to the detection frame of the target to be measured, and the region-of-interest image is the part of the original image within the region of interest; and obtaining a ranging result between the vehicle and the target to be measured by using a ranging model based on the position of the region of interest, the region-of-interest image and the full-image feature. The method is a monocular ranging method with low implementation cost, a small amount of computation and high ranging accuracy.

Description

Distance measurement method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of distance measurement, in particular to a distance measurement method and device, a storage medium and electronic equipment.
Background
In the field of advanced driver assistance systems (Advanced Driver Assistance System, abbreviated as ADAS) and autonomous driving, it is necessary to perceive objects in the surrounding environment, which involves detecting and ranging surrounding targets. Current ranging methods are mainly divided into monocular ranging and binocular ranging. Binocular ranging is costly to implement because it requires specific hardware to provide support. Monocular ranging, although easy to implement, suffers from a large amount of computation and low ranging accuracy.
Disclosure of Invention
An objective of the embodiments of the present application is to provide a ranging method and apparatus, a storage medium, and an electronic device, so as to address the above technical problems.
In order to achieve the above purpose, the present application provides the following technical solutions:
In a first aspect, an embodiment of the present application provides a ranging method, including: detecting targets in an original image by using a target detection model and extracting a full-image feature of the original image, the target detection model being a convolutional neural network model; acquiring a region-of-interest image according to the position, in the original image, of the region of interest corresponding to a target to be measured, wherein the target to be measured is among the detected targets, the region of interest is determined according to the detection frame of the target to be measured, and the region-of-interest image is the part of the original image within the region of interest; and obtaining a ranging result between a vehicle and the target to be measured by using a ranging model based on the position of the region of interest, the region-of-interest image and the full-image feature, the ranging model being a convolutional neural network model.
Compared with existing ranging methods, the above ranging method has the following advantages. Firstly, it is a monocular ranging method: ranging can be completed using only an original image acquired by an ordinary camera, so the implementation cost is low. Secondly, the ranging model calculates the ranging result mainly based on the region-of-interest image rather than on the whole original image, so the required amount of computation is low. Thirdly, the region-of-interest image mainly contains a single target to be measured and little other content, so high ranging accuracy can be achieved even with few training samples. Fourthly, although the ranging model can in principle predict the distance between the target to be measured and the vehicle from only two pieces of information, the position of the region of interest and the region-of-interest image, the full-image feature also contains background information of the original image, so using all three pieces of information to calculate the ranging result further improves ranging accuracy.
In an implementation manner of the first aspect, after the detecting the targets in the original image by using the target detection model and extracting the full-image feature of the original image, and before the acquiring the region-of-interest image according to the position of the region of interest corresponding to the target to be measured, the method further includes: screening, from the detected targets, targets that are at risk of collision with the vehicle, to serve as the targets to be measured.
In practical applications, one of the main reasons for ranging is to avoid collision between the vehicle and a target that is closer to the front. However, there may be many targets in front of the vehicle, not all of which are at risk of collision with the host vehicle, for example, targets that are farther apart may not collide with the host vehicle in a short time, and thus it is not necessary to range all of the detected targets at the time of real-time ranging. In the implementation mode, the distance measurement is only carried out on the screened targets with collision risk with the vehicle, so that the distance measurement efficiency is improved, and the computing resources are saved.
In an implementation manner of the first aspect, after the detecting the targets in the original image by using the target detection model and extracting the full-image feature of the original image, and before the acquiring the region-of-interest image according to the position of the region of interest corresponding to the target to be measured, the method further includes: determining, as the region of interest corresponding to the target to be measured in the original image, the region formed by enlarging its detection frame by a preset ratio about its original position.
The detection frame of the object to be detected can be directly determined as the corresponding region of interest in the original image. However, in consideration of the fact that the position of the detection frame of some targets may not be predicted accurately, in the implementation manner, the size of the detection frame is properly enlarged to obtain the region of interest, so that the region of interest contains the complete target to be detected.
In an implementation manner of the first aspect, the obtaining, by using the ranging model, the ranging result between the vehicle and the target to be measured based on the position of the region of interest, the region-of-interest image and the full-image feature includes: normalizing the numerical values representing the position of the region of interest and expanding each of them into a feature map, where each normalized numerical value is expanded into a corresponding feature map whose pixel values all take that normalized value; scaling the region-of-interest image to a preset size and then normalizing it to obtain a normalized region-of-interest image; and inputting the feature maps generated by expansion, the normalized region-of-interest image and the full-image feature into the ranging model for forward propagation, to obtain the ranging result between the vehicle and the target to be measured output by the ranging model.
In the above implementation, the normalization operations allow the model to converge faster during training and can also improve model accuracy (although the steps here belong to the inference phase rather than training, if a normalization operation is performed during training it should be preserved during inference). In addition, the objects processed by a convolutional neural network model are typically images (including the original image and feature maps), so the individual numerical values representing the position of the region of interest (e.g., the coordinates of the region center, its width and its height) should be converted into image form before being input into the ranging model so that the model can process them.
In an implementation manner of the first aspect, the inputting the feature maps generated by expansion, the normalized region-of-interest image and the full-image feature into the ranging model for forward propagation to obtain the ranging result between the vehicle and the target to be measured output by the ranging model includes: performing feature extraction on the normalized region-of-interest image by using the ranging model to obtain a region-of-interest feature; fusing the region-of-interest feature, the feature maps generated by expansion and the full-image feature by using the ranging model to obtain a fusion feature; and predicting the distance between the vehicle and the target to be measured based on the fusion feature by using the ranging model to obtain the ranging result.
The above implementation describes three main functions of the ranging model: firstly, calculating the characteristics of the region of interest based on the normalized region of interest image; secondly, fusing the region of interest characteristics, the characteristic map generated by expansion and the full map characteristics to obtain fusion characteristics; thirdly, the distance between the vehicle and the target to be measured is predicted based on the fusion characteristics, so that the network structure of the ranging model can be designed aiming at the three functions. In addition, three items of input information are fused deeply, so that the distance measurement accuracy of the model is improved.
In an implementation manner of the first aspect, the performing feature extraction on the normalized region-of-interest image by using the ranging model to obtain the region-of-interest feature includes: performing feature extraction and downsampling on the normalized region-of-interest image by using at least one first convolution unit at the beginning of the ranging model to obtain a first intermediate feature, the first convolution unit including a convolution layer and a pooling layer; sequentially performing further feature extraction on the first intermediate feature by using a plurality of residual blocks in the ranging model, and summing, by using the shortcut structure of the last residual block, the feature extracted by the last residual block with the feature extracted by at least one other of the residual blocks, to obtain a second intermediate feature; and further fusing the second intermediate feature by using a second convolution unit in the ranging model to obtain the region-of-interest feature, the second convolution unit including two convolution layers.
The above implementation gives one possible network structure for computing the region-of-interest feature. The first convolution unit, which has a downsampling function, is arranged at the beginning of the ranging model so that the size of the input image is reduced quickly, saving computation. The residual blocks extract features of different depths, whose receptive fields and emphases differ: shallower features contain rich detail information of the input image, while deeper features contain more semantic information. Features of different depths are therefore fused by summation (via the shortcut structure) followed by convolution (via the second convolution unit), so that the resulting region-of-interest feature has better expressive power.
In an implementation manner of the first aspect, the fusion feature includes a plurality of channels, each channel corresponding to a preset distance range, and the predicting, by using the ranging model, the distance between the vehicle and the target to be measured based on the fusion feature to obtain the ranging result includes: predicting, by using at least one fully connected layer arranged for each channel in the ranging model and based on the fusion feature of that channel, a distance between the vehicle and the target to be measured lying within the distance range corresponding to the channel, together with its confidence. After the ranging result is obtained, the method further includes: determining the distance with the highest confidence as the final distance between the vehicle and the target to be measured.
In this implementation manner, the target ranging tasks of different distance ranges correspond to different feature maps (i.e., different channels of the fusion feature), so a set of dedicated model parameters can be trained for the targets in each distance range to perform distance prediction, which is beneficial to ranging accuracy.
In an implementation manner of the first aspect, the fusion feature includes a plurality of channels, each channel corresponding to a distance range, and the predicting, by using the ranging model, the distance between the vehicle and the target to be measured based on the fusion feature to obtain the ranging result includes: predicting, by using at least one fully connected layer arranged for each channel in the ranging model and based on the fusion feature of that channel, a distance coefficient valid within the distance range corresponding to the channel, together with its confidence. After the ranging result is obtained, the method further includes: determining the distance coefficient with the highest confidence as the final distance coefficient; and calculating the product of the final distance coefficient and the reference distance of the range in which it is valid, to obtain the distance between the vehicle and the target to be measured.
In this implementation manner, too, the target ranging tasks of different distance ranges correspond to different feature maps (i.e., different channels of the fusion feature), so a set of dedicated model parameters can be trained for the targets in each distance range, which is beneficial to ranging accuracy. Moreover, this implementation does not predict the distance directly but predicts a distance coefficient and then calculates the distance. The advantage is that the distance itself fluctuates over a wide range, so predicting it directly may be inaccurate, whereas if the reference distance is set properly the distance coefficient fluctuates over a small range and its prediction is more accurate; this further improves ranging accuracy and also speeds up convergence of the model during training.
In one implementation of the first aspect, the reference distance within any one of the distance ranges is a median value of the distance ranges.
If the reference distance is set to the median value of the corresponding distance range, the distance coefficient can be maintained within the interval of [0,2], with less fluctuation.
In an implementation manner of the first aspect, after the obtaining the region of interest feature, the method further includes: and re-predicting the category and/or detection frame of the target to be detected based on the region of interest characteristics by using the ranging model.
Although the target detection model has already detected the targets, it works on the original image, which is likely to be scaled down before detection; as a result, some targets become very small in the scaled image and the accuracy of the corresponding detection results drops. The region-of-interest image, by contrast, is cut directly out of the (unscaled) original image using the position of the region of interest, so the target in it is larger; even if the region-of-interest image is scaled to some extent before being fed into the ranging model, the target does not become too small. Predicting the detection frame and/or classification of the target to be measured once more, based on the region-of-interest feature, may therefore yield a more accurate result.
In an implementation manner of the first aspect, before the detecting the targets in the original image using the target detection model and extracting the full-image feature of the original image, the method further includes: training an initial target detection model to obtain a preliminarily trained target detection model; training an initial ranging model to obtain a preliminarily trained ranging model, where during this training the ranging model uses only the position of the region of interest and the region-of-interest image in a training image, without using the full-image feature, and the required position of the region of interest and region-of-interest image are obtained based on the target detection result of the preliminarily trained target detection model on the training image; and continuing to train the preliminarily trained target detection model and the preliminarily trained ranging model to obtain the trained target detection model and the trained ranging model, where during this training the ranging model uses the position of the region of interest, the region-of-interest image and the full-image feature of the training image, the required position of the region of interest and region-of-interest image are obtained based on the target detection result of the current target detection model on the training image, and the required full-image feature is extracted from the training image by the current target detection model.
This three-stage training scheme helps speed up model convergence and improve ranging accuracy. In the third stage, the weight of the distance-prediction loss in the total loss can be increased appropriately, because the distance-prediction loss is usually small, and increasing its weight is beneficial to the ranging accuracy of the model.
In a second aspect, an embodiment of the present application provides a ranging apparatus, including: a target detection module, configured to detect targets in an original image by using a target detection model and extract a full-image feature of the original image, the target detection model being a convolutional neural network model; a region-of-interest processing module, configured to acquire a region-of-interest image according to the position, in the original image, of the region of interest corresponding to a target to be measured, wherein the target to be measured is among the detected targets, the region of interest is determined according to the detection frame of the target to be measured, and the region-of-interest image is the part of the original image within the region of interest; and a ranging module, configured to obtain a ranging result between a vehicle and the target to be measured by using a ranging model based on the position of the region of interest, the region-of-interest image and the full-image feature, the ranging model being a convolutional neural network model.
In a third aspect, embodiments of the present application provide a computer readable storage medium having stored thereon computer program instructions which, when read and executed by a processor, perform the method provided by the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor, the memory having stored therein computer program instructions which, when read and executed by the processor, perform the method of the first aspect or any one of the possible implementations of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a flow of a ranging method according to an embodiment of the present application;
FIG. 2 shows a structure of an object detection model provided by an embodiment of the present application;
FIG. 3 shows one configuration of a ranging model provided by an embodiment of the present application;
fig. 4 shows a structure of a distance measuring device according to an embodiment of the present application;
fig. 5 shows a structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. It should be noted that like reference numerals and letters denote like items in the figures, so once an item is defined in one figure, it need not be further defined or explained in subsequent figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. The terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Fig. 1 shows a flow of a ranging method according to an embodiment of the present application, where the distance to be measured (or predicted) by the ranging method is the distance between the host vehicle and other objects. The method may be, but is not limited to being, performed by an electronic device, one possible configuration of which is shown in fig. 5, see in particular the description of fig. 5 below. Referring to fig. 1, the method includes:
step S110: and detecting the target in the original image by using the target detection model and extracting the full-image characteristics of the original image.
The original image may be an image acquired by an ordinary camera, for example a vehicle-mounted camera mounted at the front of the vehicle to capture images ahead of the vehicle.
The original image may contain a plurality of objects that need attention, such as vehicles, pedestrians, riders, animals, obstacles and traffic signs; these objects are called targets and can be detected with the target detection model. The detection result is a set of targets, each of which has at least two attributes: a category, such as vehicle or pedestrian; and a corresponding detection frame, whose position can be expressed by four numerical values, for example:
x: abscissa of detection frame center in original image
y: ordinate of detection frame center in original image
w: width of detection frame
h: height of detection frame
Of course, the position of the detection frame can be expressed in other ways; for example, the coordinates of the top-left vertex of the detection frame together with its width and height also describe its position, and the method is not limited to the expression above. In addition, the target detection model may also output information such as confidences associated with the target category and the detection frame.
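As a concrete illustration (not part of the patent's disclosure), one detection result could be held in a simple structure such as the following Python sketch; the field names are hypothetical.

from dataclasses import dataclass

@dataclass
class Detection:
    """One detected target: a category label plus a center-format detection frame."""
    category: str      # e.g. "vehicle" or "pedestrian"
    confidence: float  # detection confidence output by the model
    x: float           # abscissa of the detection frame center in the original image
    y: float           # ordinate of the detection frame center in the original image
    w: float           # width of the detection frame
    h: float           # height of the detection frame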
In some implementations, the object detection model is a convolutional neural network model whose specific structure is not limited; for example, network structures such as YOLOv3 or SSD+MobileNet may be used, and fig. 2 shows another network structure that may be used by the object detection model. Some of the English terms in fig. 2 have the following meanings:
conv: representing the convolution layer, followed by a number, e.g., 3*3/1, where 3*3 represents the convolution kernel size and 1 represents the convolution step size;
concat: representing a splice structure, multiple features input to the structure will be spliced together and output.
shortcut: the short circuit structure is generally arranged at the tail end of the residual block and is used for outputting the added characteristics of the two branches in the residual block. The residual block may refer to a convolution module having the structure: after the characteristics input into the convolution module are processed by zero or more convolution layers, two branches are respectively input, wherein one branch utilizes at least one convolution layer to process the input characteristics of the branches, the processing result is output to a short circuit structure, and the other branch directly outputs the input characteristics of the branches to the short circuit structure (also called identity mapping). For example, according to this definition, conv3 x 3/2, conv3 x 3/1, and shortcut below the input image in fig. 2 form a residual block, the input features of the residual block are processed by conv3 x 3/2 and then are respectively input into two branches, one of the branches processes the input features of the branches by conv3 x 3/1, the processing result is output to the shortcut, the other branch directly outputs the input features of the branches to the shortcut, and the shortcut adds the two branches and outputs the two.
upsample: the upsampling layer is represented, and the specific upsampling mode may be deconvolution, interpolation, etc.
Referring to fig. 2, the input image in fig. 2 may be an original image or a result of preprocessing the original image, for example, the size of the original image may be relatively large, and in order to reduce the network operation amount, the original image may be reduced to a smaller size and then input into the target detection model.
The target detection model comprises 3 target analysis layers, which, from left to right, are mainly used for detecting large-size, medium-size and small-size targets respectively (e.g., outputting the category and detection frame of a target); the parts other than the target analysis layers are mainly used for feature extraction and fusion. The input feature of the target analysis layer near the left side is deeper and has a larger receptive field and richer semantic information, which helps in detecting larger targets; the input feature of the target analysis layer near the right side is shallower and has a smaller receptive field and less semantic information, so it is mainly used for detecting smaller targets. Of course, in the object detection model, the input feature of the target analysis layer near the right side is fused with deeper features, which enhances its receptive field and semantic information and thus improves accuracy when detecting smaller targets.
In the scheme of the application, besides being used for target detection, the target detection model also extracts the full-image feature, and the full-image feature can refer to a feature of global property of the whole original image, and the feature has a larger receptive field. According to this definition, the global feature of the deep layer in the target detection model may be selected, for example, in fig. 2, the feature output by the next-to-last convolution layer (conv 3×3/1) above the leftmost target analysis layer is processed by a residual block (composed of conv3×3/1, conv1×1/1, conv3×3/1, and shortcut) to output the global feature.
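As an illustration of this dual role (an assumed interface, not code from the patent), the detection model's forward pass can be organized so that it returns both the per-scale detection outputs and the deep full-image feature tapped from a late layer:

import torch.nn as nn

class DetectorWithGlobalFeature(nn.Module):
    """Hypothetical wrapper: the backbone produces multi-scale features, the three target
    analysis heads produce detections, and a late residual block produces the full-image
    feature that is reused later by the ranging model."""
    def __init__(self, backbone: nn.Module, heads: nn.ModuleList, feature_tap: nn.Module):
        super().__init__()
        self.backbone = backbone        # feature extraction and fusion part of the detector
        self.heads = heads              # the three target analysis layers
        self.feature_tap = feature_tap  # residual block that outputs the full-image feature

    def forward(self, image):
        pyramid = self.backbone(image)                     # assumed to return deep-to-shallow features
        detections = [head(f) for head, f in zip(self.heads, pyramid)]
        full_image_feature = self.feature_tap(pyramid[0])  # tapped from the deepest scale
        return detections, full_image_feature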
Step S120: and acquiring an interested region image according to the position of the interested region corresponding to the target to be detected in the original image.
The target to be measured is a target to be subjected to distance prediction in the subsequent step, and the target to be measured is included in the targets detected in S110.
In some implementations, all the targets detected in step S110 may be regarded as targets to be measured. However, the inventors have found that in practical applications, one of the main reasons for ranging is to avoid collision between the vehicle and a target that is relatively close in front (or in another direction). There may be many targets in front of the vehicle, and not all of them pose a collision risk to the host vehicle; for example, targets that are far away may not collide with the host vehicle in the short term, and targets that are not on the road surface (such as pedestrians on an overpass) are also unlikely to collide with it. It is therefore not necessary to range all detected targets during real-time ranging, and since ranging consumes computation, reducing the number of targets to be ranged improves ranging efficiency and saves computing resources. Based on this consideration, in other implementations, targets that pose a collision risk to the vehicle are screened out of all detected targets, and only these screened targets are taken as the targets to be measured; their number is typically much smaller than the total number of detected targets.
As to how to screen out targets that are at risk of collision, there are a number of possible implementations: for example, some conventional monocular ranging methods can be used for rough ranging, and a target with a relatively close distance is taken as a target to be measured according to a ranging result; for another example, the vehicle radar can be used for ranging, and the target with a relatively close distance is used as the target to be measured according to the ranging result; for another example, lane lines in the original image may be detected, and then an object that is in the same lane as the host vehicle and is not blocked by other objects may be used as an object to be detected (unblocked indicates that collision may occur directly with the host vehicle), and so on.
Each target to be measured corresponds to a region of interest (Region of Interest, ROI for short) in the original image, and ideally, the region of interest should contain the corresponding target to be measured and only contain the target to be measured as much as possible.
The region of interest is determined according to the detection frame of the target to be measured. In some implementations, the detection frame of the target to be measured can be used directly as the corresponding region of interest, which is a simple and direct way to determine it. However, the inventors found that the positions of the detection frames of some targets may not be predicted accurately; for example, the original image may be reduced to a smaller size before being input into the target detection model, and if the original image contains small targets, after scaling they correspond to only a few pixels, making them difficult to detect precisely. The size of the detection frame of the target to be measured can therefore be enlarged appropriately to obtain the region of interest, which helps the region of interest contain the complete target. The specific method is: determining, as the corresponding region of interest in the original image, the region formed by enlarging the detection frame of the target to be measured by a predetermined ratio about its original position. For example, following the detection frame, the position of the region of interest is also represented by four values:
Roi_x: abscissa of the center of the region of interest in the original image
Roi_y: ordinate of the center of the region of interest in the original image
Roi_w: region of interest width
Roi_h: region of interest height
Then Roi_x = x, Roi_y = y, Roi_w = 1.3 × w and Roi_h = 1.3 × h, where x, y, w, h represent the detection frame position of the target to be measured and 1.3 is the predetermined ratio. The predetermined ratio is greater than 1 but should not be too large, so that the region of interest contains as few objects other than the target to be measured as possible; for example, a value of not more than 1.5 is suitable.
The region of interest image is defined as the portion of the original image within the region of interest, and is naturally determined after the position of the region of interest is obtained.
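A minimal Python sketch of deriving the region of interest from a detection frame and cropping the region-of-interest image from the original image follows; the 1.3 ratio matches the example above, and the clipping to the image borders is an added assumption.

def get_roi(x, y, w, h, image, ratio=1.3):
    """Enlarge the detection frame (center x, y, width w, height h) by `ratio` about its
    original center, clip it to the image, and return the region-of-interest position
    together with the region-of-interest image cropped from the original image (a NumPy array)."""
    roi_x, roi_y = x, y                  # the center stays at the original position
    roi_w, roi_h = ratio * w, ratio * h  # width and height enlarged by the predetermined ratio

    img_h, img_w = image.shape[:2]
    left = max(int(roi_x - roi_w / 2), 0)
    top = max(int(roi_y - roi_h / 2), 0)
    right = min(int(roi_x + roi_w / 2), img_w)
    bottom = min(int(roi_y + roi_h / 2), img_h)

    roi_image = image[top:bottom, left:right]  # the part of the original image within the region of interest
    return (roi_x, roi_y, roi_w, roi_h), roi_image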
Step S130: and obtaining a ranging result between the vehicle and the target to be measured by using a ranging model based on the position of the region of interest, the region of interest image and the full-image characteristic.
The ranging model may be a convolutional neural network model, and the specific structure is not limited, and fig. 3 shows a network structure that may be used, which will be described later. The ranging model takes three information of the position of the region of interest, the region of interest image and the full-image feature obtained in the previous step as input, and according to different implementation manners of the ranging model, the three information may be input into the model together after being fused, or the three information may be sequentially input into the model from different positions (in a manner adopted in fig. 3), and so on. The output of the ranging model is a ranging result between the vehicle and the target to be measured, which may be a distance between the vehicle and the target to be measured, but may be an intermediate result, and the distance between the vehicle and the target to be measured may be obtained only after simple processing, which will be described in the description of fig. 3.
For the situation that a plurality of targets to be measured exist, the position of the region of interest, the region of interest image and the full-image feature of one target to be measured are input into a ranging model each time, the ranging model outputs the ranging result between the vehicle and the target to be measured, and the process is repeated for a plurality of times, so that the ranging result of all the targets to be measured can be obtained. Since a similar processing manner is adopted for each object to be measured, the case of one object to be measured is mainly taken as an example in the description.
It is also possible to perform some pre-processing of one or several of the three pieces of information before entering them into the ranging model:
for example, for the region of interest image, the region of interest image may be scaled to a preset size (for example, to an input size required by the model), then the scaling result is normalized (the value range of the pixel value is converted to between 0 and 1), and finally the normalized region of interest image is input to the ranging model.
For another example, for the location of the region of interest, according to the above description, it may generally be represented by four values, where each value may be normalized (the value range of the value is converted to between 0 and 1), then the normalized value is expanded into a corresponding feature map, the pixel values in the feature map take the normalized value (for example, the normalized value is 0.5, the size of the expanded feature map is 64×64, and then the 64×64 pixel values take 0.5), so as to finally obtain four feature maps, and then the four feature maps generated by expansion are input into the ranging model. The reason for expanding the values into feature maps is that: the object processed by the convolutional neural network model is typically an image (including the original image and the feature map) rather than a single value, so the single value can be converted for ease of unified processing of the network.
The normalization operation in the pretreatment mode is beneficial to faster convergence of the model during training, and meanwhile, the model precision can be improved. It should be noted that although the steps in fig. 1 are steps of the model inference phase rather than the model training phase, the processing of the training image during the model training phase is similar to the processing of the original image during the model inference phase, and should be preserved during the inference phase if the normalization operation described above is performed during the training phase.
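The two preprocessing operations can be sketched as below; the 64×64 size, the use of OpenCV for resizing and the normalizers for the position values (dividing by the image width and height) are assumptions for illustration only.

import cv2
import numpy as np
import torch

def preprocess_roi_image(roi_image, size=(64, 64)):
    """Scale the region-of-interest image to the preset size and normalize pixel values to [0, 1]."""
    resized = cv2.resize(roi_image, size)
    normalized = resized.astype(np.float32) / 255.0
    return torch.from_numpy(normalized).permute(2, 0, 1).unsqueeze(0)  # 1 x C x H x W tensor

def expand_position(roi_x, roi_y, roi_w, roi_h, img_w, img_h, size=(64, 64)):
    """Normalize each value describing the region-of-interest position to [0, 1] and expand it
    into a feature map whose pixel values all take that normalized value."""
    values = [roi_x / img_w, roi_y / img_h, roi_w / img_w, roi_h / img_h]
    maps = [torch.full((1, 1, size[1], size[0]), v) for v in values]
    return torch.cat(maps, dim=1)  # four constant feature maps, stacked along the channel dimension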
It will be appreciated that the manner of preprocessing the input information is not limited to that mentioned in the example above. For simplicity, however, the description of step S130 below assumes that the preprocessing operations mentioned above have been performed on the region-of-interest image and on the position of the region of interest; that is, the three pieces of information input into the ranging model are the normalized region-of-interest image, the feature maps generated by expansion and the full-image feature. These three pieces of information are propagated forward through the ranging model, which finally outputs the ranging result.
In some implementations, step S130 may further include the sub-steps of:
step A: and extracting features of the normalized region of interest image by using a ranging model to obtain the features of the region of interest.
And (B) step (B): and fusing the characteristics of the region of interest, the characteristic map generated by expansion and the full map characteristics by using a ranging model to obtain fused characteristics.
Step C: and predicting the distance between the vehicle and the target to be measured based on the fusion characteristic by using the ranging model to obtain a ranging result.
The ranging model can be divided into three parts corresponding to steps A, B and C, with a network structure designed for each part. In step B the three pieces of input information are fused deeply, and in step C the distance is predicted based on the fused information, which helps improve the ranging accuracy of the model; the fusion operations include concatenation (splicing), summation and the like.
For example, in fig. 3 the label "(3)" marks the region-of-interest feature, and the network before this location implements the function of step A; the label "(4)" marks the fusion feature, the network between labels "(3)" and "(4)" implements the function of step B, and the network after label "(4)" implements the function of step C.
Step A, B, C will be further described below in conjunction with fig. 3, it being understood that fig. 3 is merely an example, and that other network structures may be employed to implement the functionality of step A, B, C. Some of the english terms in fig. 3 have the following meanings:
conv: representing the convolution layer, followed by a number, e.g., 3*3/1, where 3*3 represents the convolution kernel size and 1 represents the convolution step size;
concat: representing a splice structure;
shortcut: representing a shortcut (short-circuit) structure;
max pooling: representing a max pooling layer;
fc: representing a fully connected layer.
For step a, in some implementations, it may further comprise the sub-steps of:
step A1: and performing feature extraction and downsampling on the normalized region of interest image by using at least one first convolution unit at the beginning of the ranging model to obtain a first intermediate feature.
Wherein the first convolution unit comprises a convolution layer and a pooling layer. In fig. 3, the first convolution unit comprises two layers conv3 x 3 and maxpooling, although in the alternative, the maximum pooling may be replaced by other pooling methods, such as average pooling (average pooling). Since the first convolution unit having the downsampling function is provided at the beginning of the ranging model, the size of the input image (i.e., the normalized region of interest image) can be rapidly reduced to save the amount of computation in the subsequent steps. The first intermediate feature is the feature in fig. 3 at the label "(1)".
Step A2: and sequentially carrying out further feature extraction on the first intermediate feature by utilizing a plurality of residual blocks in the ranging model, and summing the extracted feature of the last residual block with the feature extracted by at least one residual block in the residual blocks by utilizing a short circuit structure of the last residual block to obtain a second intermediate feature.
The definition of the residual block has already been explained in the foregoing description of fig. 2 and is not repeated here. Fig. 3 shows 3 residual blocks, each consisting of conv3 x 3, conv1 x 1, conv3 x 3 and shortcut. Wherein the shorting structure of the 3 rd residual block sums the extracted feature with the feature extracted by the first convolution layer (conv 3 x 3/2) of the 2 nd residual block, and the second intermediate feature is the feature marked "(2)" in fig. 3.
Step A3: and further fusing the second intermediate features by using a second convolution unit in the ranging model to obtain the features of the region of interest.
Wherein the second convolution unit comprises two convolution layers; in fig. 3, these are the layers conv3×3 and conv1×1. In step A2, although the addition performed by the shortcut structure can be regarded as a fusion operation, the resulting fusion of the features is shallow (similarly, the fusion performed by a concatenation structure is also shallow), and setting the second convolution unit allows the features to be fused more deeply.
Summarizing steps A2 and A3, a plurality of residual blocks can extract features with different depths, the receptive fields and the content emphasis points of the features are different, shallower features contain abundant detail information in an input image, deeper features contain more semantic information in the input image, and therefore the features with different depths are fused in a mode of summing (utilizing a short circuit structure) and then convolving (utilizing a second convolution module), and the obtained features of the region of interest have better expressive capacity.
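Steps A1-A3 can be pieced together roughly as in the following sketch; the channel counts, strides and number of stages are assumptions and do not reproduce fig. 3 exactly.

import torch.nn as nn

class RoiFeatureExtractor(nn.Module):
    """A1: first convolution unit (convolution + pooling) for fast downsampling;
    A2: several convolution stages whose final output is summed, via a shortcut, with a
    shallower feature to form the second intermediate feature;
    A3: second convolution unit (two convolution layers) for deeper fusion."""
    def __init__(self, ch=32):
        super().__init__()
        self.first_unit = nn.Sequential(                  # step A1
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.stage1 = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.stage2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.stage3 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.second_unit = nn.Sequential(                 # step A3
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 1))

    def forward(self, roi_image):
        first_intermediate = self.first_unit(roi_image)   # feature at label "(1)"
        f1 = self.stage1(first_intermediate)
        f3 = self.stage3(self.stage2(f1))
        second_intermediate = f3 + f1                     # shortcut sums features of different depths, label "(2)"
        return self.second_unit(second_intermediate)      # region-of-interest feature, label "(3)"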
Optionally, after the region of interest feature is obtained in step a, the ranging model may also be used to re-predict the class and/or detection frame of the target to be detected based on the region of interest feature. The corresponding network branches are shown at the label "(3)" in fig. 3. The significance of performing this re-prediction operation is briefly analyzed as follows:
although the object detection model has already detected the object, which also includes the object to be detected, in step S110, it is mentioned that, as a relatively common implementation, the original image is first reduced to a smaller size and then input into the object detection model, which results in that the size of some objects (especially small-sized objects) in the original image becomes very small, and the accuracy of the corresponding detection result is reduced. The image of the region of interest is directly captured from the original image by using the position of the region of interest, and is not scaled, so that the size of the target is larger, even if the image of the region of interest is scaled to a certain extent before being sent into the ranging model, the size of the target does not become too small, so that the detection frame and/or classification of the target to be detected is predicted once again based on the characteristics of the region of interest (extracted from the image of the region of interest), and more accurate results than those of the target detection model may be obtained.
For step B, two splice (concatenation) structures are set in the ranging network shown in fig. 3: the first splice structure splices in the feature maps generated by expansion, and a second convolution module is arranged after it to deepen the degree of fusion of the features; the second splice structure splices in the full-image feature, and a second convolution module is likewise arranged after it to deepen the degree of fusion of the features. Finally, a second convolution module is connected to extract the fusion feature.
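Step B then amounts to two concatenations, each followed by convolution, as in the sketch below; the channel counts are assumptions, and the full-image feature is assumed to have been brought to the same spatial size as the region-of-interest feature.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Step B: splice the region-of-interest feature with the expanded position maps and convolve,
    splice the result with the full-image feature and convolve again, then produce a fusion
    feature with one channel per preset distance range."""
    def __init__(self, roi_ch=32, pos_ch=4, full_ch=32, num_ranges=16):
        super().__init__()
        self.fuse_position = nn.Sequential(
            nn.Conv2d(roi_ch + pos_ch, roi_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse_full_image = nn.Sequential(
            nn.Conv2d(roi_ch + full_ch, roi_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.to_ranges = nn.Conv2d(roi_ch, num_ranges, 1)  # one channel per distance range

    def forward(self, roi_feature, position_maps, full_image_feature):
        x = self.fuse_position(torch.cat([roi_feature, position_maps], dim=1))
        x = self.fuse_full_image(torch.cat([x, full_image_feature], dim=1))
        return self.to_ranges(x)  # fusion feature, label "(4)"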
For step C, a relatively simple implementation is to predict the distance between a vehicle and the object to be measured directly based on the fusion feature, but fig. 3 does not take this approach, and is specifically described below:
In some implementations, the fusion feature obtained in step B includes a plurality of channels, each channel corresponding to a preset distance range. For example, in fig. 3 the fusion feature includes 16 channels, and the correspondence between channels and distance ranges is:
channel 1: 0-10 meters;
channel 2: 10-20 meters;
……
channel 16: 150-160 meters.
Here 0-160 meters covers the positions where targets to be measured may appear; targets beyond this range need not be ranged for the time being, since they pose no collision risk to the vehicle in the short term. The number of channels of the fusion feature (16) is preset, and this requirement is easily met by designing a corresponding network structure; for example, when setting the hyperparameters of the network, the number of output channels of the conv1×1 at label "(4)" is set to 16.
In the ranging model, a group of fully connected layers (each group containing at least one fully connected layer) is arranged for each channel of the fusion feature; based on the fusion feature of its channel, each group predicts a distance between the vehicle and the target to be measured lying within the distance range corresponding to that channel, together with its confidence. In this way a distance and its confidence are predicted for each preset distance range, and these can be regarded as the ranging result. Finally, the distance with the highest confidence in the ranging result can be determined as the final distance between the vehicle and the target to be measured, i.e., the ranging result to be presented in the end.
For example, in fig. 3 a group of 2 fully connected layers is provided for each channel of the fusion feature, giving 16 groups of fully connected layers in total. The group set for channel 1 predicts a distance in the range of 0-10 meters and its confidence based on the fusion feature of channel 1, denoted "distance 1" and "confidence 1" in fig. 3; the group set for channel 2 predicts a distance in the range of 10-20 meters and its confidence based on the fusion feature of channel 2, denoted "distance 2" and "confidence 2", and so on, predicting 16 groups of distances and confidences in total. Assuming that "confidence 5", corresponding to "distance 5", is the highest, the final distance output is "distance 5".
In the above implementation manner, the target ranging tasks of different distance ranges correspond to different feature maps (i.e., different channels of the fusion feature), so for the targets in each distance range a set of dedicated model parameters (for example, the feature-extraction parameters of the conv1×1 and the parameters of the fully connected layers) can be trained to perform distance prediction, instead of using the same set of parameters for targets in all distance ranges, which is beneficial to ranging accuracy.
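The per-range prediction heads and the final selection can be sketched as follows; the hidden layer size and the 8×8 spatial size of the fusion feature are assumptions.

import torch
import torch.nn as nn

class PerRangeDistanceHeads(nn.Module):
    """One small group of fully connected layers per distance range; each group predicts,
    from its own channel of the fusion feature, a distance within its range and the
    confidence of that distance."""
    def __init__(self, feat_hw=8 * 8, num_ranges=16):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_hw, 32), nn.ReLU(inplace=True), nn.Linear(32, 2))
            for _ in range(num_ranges)])

    def forward(self, fusion_feature):        # fusion_feature: 1 x num_ranges x H x W
        flat = fusion_feature.flatten(2)      # 1 x num_ranges x (H*W)
        outputs = [head(flat[:, i]) for i, head in enumerate(self.heads)]
        return torch.stack(outputs, dim=1)    # 1 x num_ranges x 2, i.e. (distance, confidence) per range

def final_distance(predictions):
    """Pick the distance whose confidence is the highest across all distance ranges."""
    distances, confidences = predictions[0, :, 0], predictions[0, :, 1]
    return distances[torch.argmax(confidences)].item()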
Further, the inventors have studied and found that in the above implementation, the distance value is directly predicted, but the prediction result may be inaccurate due to a large fluctuation range of the distance value. Thus, in other implementations, the following alternatives may also be employed:
a reference distance within the range is set for each distance range, for example, the reference distance may be, but is not limited to being, set as the median of the distance ranges. For example, for the 16 distance ranges in fig. 3, the reference distances may be set to 5 meters, 15 meters, …, 155 meters, respectively. After the reference distance is defined, any distance value in any distance range can be calculated by multiplying the reference distance by a certain distance coefficient, and if the median value is defined as the reference distance, the value of the distance coefficient is necessarily located in the interval [0,2], and the fluctuation is small.
And setting a group of full-connection layers (each group comprises at least one full-connection layer) for each channel of the fusion characteristics, wherein the group of full-connection layers predicts a distance coefficient and a confidence coefficient between a vehicle and a target to be detected, which are effective in a distance range corresponding to the channel, based on the fusion characteristics of the corresponding channel. In this way, a distance coefficient and its confidence level are predicted for each preset distance range, which can be regarded as a ranging result. Then, the distance coefficient with the highest confidence in the ranging result is determined as the final distance coefficient. And finally, calculating the product of the final distance coefficient and the reference distance in the effective distance range, and obtaining the distance between the vehicle and the target to be measured, namely the distance measurement result to be finally presented.
For example, keeping the network structure in fig. 3 unchanged, the fully connected layer set for channel 1 predicts a distance coefficient and its confidence level valid in the range of 0-10 meters based on the fusion characteristics of channel 1, which may be referred to as "distance coefficient 1" and "confidence level 1"; the fully connected layer set for the channel 2 predicts one distance coefficient and its confidence coefficient valid in the range of 10-20 meters based on the fusion characteristics of the channel 2, and can be marked as distance coefficient 2 and confidence coefficient 2, and so on, to predict 16 groups of distance coefficients and their confidence coefficients in total. Assuming that "confidence 2" corresponding to "distance coefficient 2" is highest, the final distance coefficient is "distance coefficient 2", and assuming that its value is 1.2. The effective distance range of the distance coefficient 2 is 10-20 meters, and the reference distance of the distance range is 15 meters, so that the distance between the vehicle and the target to be measured can be calculated to be 15 x 1.2=18 meters.
In the above implementation, instead of predicting the distance directly, a distance coefficient is predicted first and the distance is then calculated from it. With the reference distances set appropriately, the distance coefficient fluctuates within a small range (for example, within [0,2]), so its prediction is more accurate and the ranging accuracy can be improved. In addition, this also increases the convergence speed of the model during training.
In summary, compared with existing ranging methods, the ranging method provided by the embodiments of the present application has the following advantages:
First, the method is a monocular ranging method: ranging can be completed using only an original image acquired by an ordinary camera, so the implementation cost is low.
Second, the ranging model in the method calculates the ranging result mainly from the region-of-interest image rather than from the whole original image, so the required amount of computation is small and the method can meet the needs of real-time applications. In some implementations, only targets posing a collision risk are screened from the detected targets for ranging; since the number of dangerous targets is small, ranging efficiency is further improved.
It should be noted that although the full-image feature (extracted from the original image) is also used to calculate the ranging result in step S130 of the method, this feature is extracted incidentally during target detection, an operation that the current in-vehicle system must perform regardless of whether ranging is carried out; it should therefore not be regarded as a computational burden introduced by the ranging method.
Third, the region-of-interest image mainly contains a single target to be measured and little other content, so a high ranging accuracy can be reached even when training samples are relatively few.
Fourth, in principle the ranging model can already predict the distance between the target to be measured and the vehicle from two pieces of information: the position of the region of interest and the region-of-interest image. Since the full-image feature additionally contains the background information of the original image, having the ranging model calculate the ranging result from all three pieces of information at the same time can further improve ranging accuracy.
How the neural network models used in the method of the present application are trained is briefly described below. From the foregoing, the ranging method mainly uses two neural network models, namely the target detection model and the ranging model, which can be, but are not limited to being, trained with the following three-stage method:
Stage one: train an initial target detection model using the training images to obtain a preliminarily trained target detection model.
Stage two: train an initial ranging model using the training images to obtain a preliminarily trained ranging model. In this stage, the ranging model is trained using only the position of the region of interest and the region-of-interest image in the training image, without the full-image feature; the required position of the region of interest and the region-of-interest image are obtained based on the target detection result produced for the training image by the preliminarily trained target detection model from stage one. For the specific procedure, reference may be made to the related steps in fig. 1.
Stage three: continue training, with the training images, the preliminarily trained target detection model from stage one and the preliminarily trained ranging model from stage two (i.e., train the two models jointly) to obtain the trained target detection model and the trained ranging model (the steps of fig. 1 may then be performed using the trained models). In this stage, three pieces of information are used when training the ranging model: the position of the region of interest in the training image, the region-of-interest image and the full-image feature. The required position of the region of interest and the region-of-interest image are obtained based on the target detection result of the current target detection model on the training image, and the required full-image feature is extracted from the training image by the current target detection model; for the specific procedure, reference may be made to the related steps in fig. 1. The term "current target detection model" is used because, in stage three, the parameters of the target detection model are continually updated.
The three-stage training mode helps accelerate the convergence of the models and improve ranging accuracy. Several points about the training process should be noted. First, when calculating the loss of the target detection model (in stages one and three), the loss must be computed over all targets, not only over the target to be measured. Second, if the ranging model re-predicts the detection frame and/or classification of the target to be measured, the corresponding target detection loss must also be included when calculating the loss of the ranging model (in stages two and three). Third, in stage three the ratio of the distance-prediction loss to the total loss can be appropriately increased, because the inventors found through research that the distance-prediction loss value is generally small (especially in implementations that predict a distance coefficient), so enlarging its proportion helps improve the ranging accuracy of the model.
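To make the third point concrete, the sketch below shows one way the stage-three joint loss could be assembled. It is only an illustration under assumptions: the function name, the breakdown into three loss terms and the weight value 5.0 are not specified in the application.

```python
# Sketch of a stage-three (joint training) loss composition; all names and values are
# illustrative assumptions, not figures taken from the application.
import torch

def stage_three_total_loss(det_loss: torch.Tensor,
                           ranging_det_loss: torch.Tensor,
                           distance_loss: torch.Tensor,
                           distance_weight: float = 5.0) -> torch.Tensor:
    """Combine the losses for joint (stage-three) training.

    det_loss:         detection loss computed over ALL targets (first point above)
    ranging_det_loss: loss for the re-predicted class/box inside the ranging model (second point)
    distance_loss:    distance (or distance-coefficient) prediction loss
    distance_weight:  > 1 so the distance loss gets a larger share of the total (third point)
    """
    return det_loss + ranging_det_loss + distance_weight * distance_loss

# Example usage with dummy scalar loss values:
total = stage_three_total_loss(torch.tensor(1.3), torch.tensor(0.4), torch.tensor(0.05))
print(total)  # tensor(1.9500)
```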
Fig. 4 shows a functional block diagram of a ranging apparatus 200 according to an embodiment of the present application. Referring to fig. 4, the ranging apparatus 200 includes:
the target detection module 210 is configured to detect targets in an original image by using a target detection model and to extract the full-image feature of the original image; the target detection model is a convolutional neural network model;
the region of interest processing module 220 is configured to obtain a region of interest image according to the position, in the original image, of the region of interest corresponding to the target to be detected; the target to be detected is included in the detected targets, the region of interest is determined according to a detection frame of the target to be detected, and the region of interest image is the part of the original image within the region of interest;
a ranging module 230, configured to obtain a ranging result between the vehicle and the target to be measured using a ranging model based on the position of the region of interest, the region of interest image and the full-image feature; wherein, the ranging model is a convolutional neural network model.
In one implementation of ranging device 200, the device further comprises:
the target screening module is configured to screen, from the detected targets, a target that has a collision risk with the vehicle as the target to be detected, after the target detection module 210 detects the targets in the original image using the target detection model and extracts the full-image feature of the original image, and before the region-of-interest processing module 220 obtains the region-of-interest image according to the position of the region of interest corresponding to the target to be detected.
In one implementation of ranging device 200, the device further comprises:
the region of interest determining module is configured to determine a region, formed by expanding the detection frame of the target to be detected at its original position according to a preset ratio, as the region of interest corresponding to the target to be detected in the original image, after the target detection module 210 detects the targets in the original image using the target detection model and extracts the full-image feature of the original image, and before the region-of-interest processing module 220 obtains the region-of-interest image according to the position of the region of interest corresponding to the target to be detected.
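As an illustration of the expansion just described, the sketch below enlarges a detection frame about its own center by a preset ratio. The (x1, y1, x2, y2) box format, the ratio value 1.2 and the clipping to the image boundary are assumptions made only for this example; the application does not fix them.

```python
# Sketch of expanding a detection frame in place by a preset ratio to form the region of interest.
def expand_box(box, image_w, image_h, ratio=1.2):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0        # keep the original center ("in place")
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio      # enlarge width and height by the preset ratio
    nx1, ny1 = max(0.0, cx - w / 2.0), max(0.0, cy - h / 2.0)
    nx2, ny2 = min(float(image_w), cx + w / 2.0), min(float(image_h), cy + h / 2.0)
    return nx1, ny1, nx2, ny2

print(expand_box((100, 200, 300, 400), 1280, 720))    # (80.0, 180.0, 320.0, 420.0)
```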
In one implementation of the ranging apparatus 200, the ranging module 230 obtaining a ranging result between the vehicle and the target to be measured using a ranging model based on the position of the region of interest, the region-of-interest image and the full-image feature includes: normalizing the numerical values representing the position of the region of interest and expanding them into feature maps, where each normalized value is expanded to generate a corresponding feature map whose pixel values all equal that normalized value; scaling the region-of-interest image to a preset size and then normalizing it to obtain a normalized region-of-interest image; and inputting the feature maps generated by the expansion, the normalized region-of-interest image and the full-image feature into the ranging model for forward propagation, to obtain the ranging result between the vehicle and the target to be measured output by the ranging model.
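The normalization-and-expansion sub-step can be pictured with the short sketch below. Normalizing the coordinates by the image width and height and the 16×16 map size are assumptions made only for illustration; the application does not specify them here.

```python
# Sketch of turning the ROI position values into constant-valued feature maps, as described above.
import numpy as np

def position_to_feature_maps(box, image_w, image_h, map_size=(16, 16)):
    x1, y1, x2, y2 = box
    normalized = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    # Each normalized value is broadcast into its own map whose pixels all equal that value.
    return np.stack([np.full(map_size, v, dtype=np.float32) for v in normalized])

maps = position_to_feature_maps((80, 180, 320, 420), 1280, 720)
print(maps.shape)        # (4, 16, 16)
print(maps[0, 0, 0])     # 0.0625  (= 80 / 1280)
```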
In one implementation of the ranging apparatus 200, the ranging module 230 inputting the feature maps generated by the expansion, the normalized region-of-interest image and the full-image feature into the ranging model for forward propagation to obtain the ranging result between the vehicle and the target to be measured output by the ranging model includes: extracting features from the normalized region-of-interest image using the ranging model to obtain a region-of-interest feature; fusing the region-of-interest feature, the feature maps generated by the expansion and the full-image feature using the ranging model to obtain a fusion feature; and predicting the distance between the vehicle and the target to be measured based on the fusion feature using the ranging model to obtain the ranging result.
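The fusion operation itself is not spelled out in this passage. One plausible reading, sketched below purely as an assumption, is channel-wise concatenation of the three inputs followed by the conv1×1 mentioned earlier, which maps the result to one channel per distance range; all tensor sizes here are invented for the example.

```python
# Hypothetical fusion sketch: concatenate along channels, then a conv1x1 to 16 range channels.
import torch
import torch.nn as nn

roi_features = torch.randn(1, 64, 16, 16)            # from the ROI feature extractor
position_maps = torch.randn(1, 4, 16, 16)             # expanded from the normalized ROI position
full_image_features = torch.randn(1, 128, 16, 16)     # extracted by the target detection model

concatenated = torch.cat([roi_features, position_maps, full_image_features], dim=1)
fusion_conv = nn.Conv2d(64 + 4 + 128, 16, kernel_size=1)   # one output channel per distance range
fused = fusion_conv(concatenated)
print(fused.shape)                                     # torch.Size([1, 16, 16, 16])
```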
In one implementation of the ranging apparatus 200, the ranging module 230 performing feature extraction on the normalized region-of-interest image using the ranging model to obtain a region-of-interest feature includes: performing feature extraction and downsampling on the normalized region-of-interest image using at least one first convolution unit at the beginning of the ranging model to obtain a first intermediate feature, wherein the first convolution unit comprises a convolution layer and a pooling layer; sequentially performing feature extraction on the first intermediate feature using a plurality of residual blocks in the ranging model, and summing the feature extracted by the last residual block with the feature extracted by at least one of the residual blocks via the short-circuit (shortcut) structure of the last residual block, to obtain a second intermediate feature; and further fusing the second intermediate feature using a second convolution unit in the ranging model to obtain the region-of-interest feature, wherein the second convolution unit comprises two convolution layers.
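A minimal PyTorch sketch of such an extractor is given below. The channel counts, kernel sizes, number of residual blocks and the choice of which earlier feature feeds the last block's shortcut are all assumptions, since the passage only fixes the overall structure.

```python
# Sketch of the ROI feature extractor described above, under assumed layer sizes.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
    def forward(self, x, shortcut=None):
        # The shortcut adds either the block's own input or, for the last block,
        # the feature produced by an earlier residual block.
        return torch.relu(self.body(x) + (x if shortcut is None else shortcut))

class RoiFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # First convolution unit: convolution + pooling, extracting and downsampling.
        self.first_unit = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.block1 = ResidualBlock(32)
        self.block2 = ResidualBlock(32)
        self.block3 = ResidualBlock(32)
        # Second convolution unit: two convolution layers that further fuse the features.
        self.second_unit = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1))
    def forward(self, x):
        x = self.first_unit(x)                # first intermediate feature
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2, shortcut=f1)     # last block sums with an earlier block's feature
        return self.second_unit(f3)           # region-of-interest feature

print(RoiFeatureExtractor()(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 64, 16, 16])
```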
In one implementation of the ranging apparatus 200, the fusion feature comprises a plurality of channels, each channel corresponding to a preset distance range, and the ranging module 230 predicting the distance between the vehicle and the target to be measured based on the fusion feature using the ranging model to obtain the ranging result includes: predicting, with at least one fully connected layer arranged for each channel in the ranging model and based on the fusion feature of that channel, the distance between the vehicle and the target to be measured that lies within the distance range corresponding to the channel, together with its confidence;
the ranging module 230 is further configured to: after the ranging result is obtained, determine the distance with the highest confidence as the final distance between the vehicle and the target to be measured.
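The per-channel prediction heads and the final selection by confidence can be sketched as follows. The 16-channel fusion feature, the 16×16 spatial size and the single-linear-layer heads are assumptions for illustration only.

```python
# Sketch of per-channel heads: each channel of the fused feature gets its own small fully
# connected head producing (distance, confidence); the most confident channel is selected.
import torch
import torch.nn as nn

class PerChannelDistanceHeads(nn.Module):
    def __init__(self, num_ranges=16, spatial=16):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(spatial * spatial, 2) for _ in range(num_ranges))

    def forward(self, fused):                    # fused: (batch, num_ranges, H, W)
        outputs = []
        for i, head in enumerate(self.heads):
            channel = fused[:, i].flatten(1)     # features of the i-th channel only
            distance, confidence = head(channel).unbind(dim=1)
            outputs.append((distance, confidence))
        distances = torch.stack([d for d, _ in outputs], dim=1)
        confidences = torch.stack([c for _, c in outputs], dim=1)
        best = confidences.argmax(dim=1)         # the most confident range wins
        return distances.gather(1, best.unsqueeze(1)).squeeze(1)

heads = PerChannelDistanceHeads()
print(heads(torch.randn(2, 16, 16, 16)).shape)   # torch.Size([2])
```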
In one implementation of the ranging apparatus 200, the fusion feature comprises a plurality of channels, each channel corresponding to a distance range, and the ranging module 230 predicting the distance between the vehicle and the target to be measured based on the fusion feature using the ranging model to obtain the ranging result includes: predicting, with at least one fully connected layer arranged for each channel in the ranging model and based on the fusion feature of that channel, a distance coefficient between the vehicle and the target to be measured valid in the distance range corresponding to the channel, together with its confidence;
the ranging module 230 is further configured to: after the ranging result is obtained, determine the distance coefficient with the highest confidence as the final distance coefficient, and calculate the product of the final distance coefficient and the reference distance of its valid distance range to obtain the distance between the vehicle and the target to be measured.
In one implementation of ranging device 200, the reference distance within any range of distances is the median of that range of distances.
In one implementation of the ranging apparatus 200, the ranging module 230 is further configured to: after obtaining the region-of-interest feature, re-predict the category and/or detection frame of the target to be detected based on the region-of-interest feature using the ranging model.
In one implementation of ranging device 200, the device further comprises:
a training module, configured to perform the following steps to train the target detection model and the ranging model before the target detection module 210 detects the targets in the original image using the target detection model and extracts the full-image feature of the original image: training an initial target detection model to obtain a preliminarily trained target detection model; training an initial ranging model to obtain a preliminarily trained ranging model, wherein when the ranging model is trained only the position of the region of interest in the training image and the region-of-interest image are used, without the full-image feature, and the required position of the region of interest and the region-of-interest image are obtained based on the target detection result of the preliminarily trained target detection model on the training image; and continuing to train the preliminarily trained target detection model and the preliminarily trained ranging model to obtain the trained target detection model and the trained ranging model, wherein when the ranging model is trained the position of the region of interest in the training image, the region-of-interest image and the full-image feature are all used, the required position of the region of interest and the region-of-interest image are obtained based on the target detection result of the current target detection model on the training image, and the required full-image feature is extracted from the training image by the current target detection model.
The implementation details of the ranging apparatus 200 according to the embodiment of the present application have been described in the foregoing method embodiment; for brevity, where the apparatus embodiment does not mention a point, reference may be made to the corresponding content of the method embodiment.
Fig. 5 shows a possible structure of an electronic device 300 according to an embodiment of the present application. Referring to fig. 5, the electronic device 300 includes: processor 310, memory 320, and communication interface 330, which are interconnected and communicate with each other by a communication bus 340 and/or other forms of connection mechanisms (not shown).
The memory 320 includes one or more memories (only one is shown in the figure), which may be, but are not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), and the like. The processor 310, as well as other possible components, may access the memory 320 to read and/or write data in it.
The processor 310 includes one or more processors (only one is shown), which may be an integrated circuit chip having signal processing capability. The processor 310 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a micro control unit (Micro Controller Unit, MCU), a network processor (Network Processor, NP) or other conventional processors; it may also be a special-purpose processor, including a graphics processing unit (Graphics Processing Unit, GPU), a neural-network processing unit (Neural-network Processing Unit, NPU), a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. Moreover, when there are multiple processors 310, some of them may be general-purpose processors and the others may be special-purpose processors.
The communication interface 330 includes one or more interfaces (only one is shown), which may be used to communicate, directly or indirectly, with other devices for data interaction. The communication interface 330 may include interfaces for wired and/or wireless communication.
One or more computer program instructions may be stored in the memory 320 and may be read and executed by the processor 310 to implement the ranging method provided by the embodiments of the present application.
It is to be understood that the configuration shown in fig. 5 is merely illustrative, and the electronic device 300 may also include more or fewer components than those shown in fig. 5, or have a configuration different from that shown in fig. 5. The components shown in fig. 5 may be implemented in hardware, software, or a combination thereof. The electronic device 300 may be a physical device such as an in-vehicle device, a PC, a notebook computer, a tablet, a mobile phone or a server, or a virtual device such as a virtual machine or a virtualized container. The electronic device 300 is not limited to a single device and may also be a combination of multiple devices or a cluster consisting of a large number of devices.
The embodiments of the present application also provide a computer-readable storage medium storing computer program instructions which, when read and executed by a processor of a computer, perform the ranging method provided by the embodiments of the present application. For example, the computer-readable storage medium may be implemented as the memory 320 in the electronic device 300 in fig. 5.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A ranging method, comprising:
detecting a target in an original image by using a target detection model and extracting full-image features of the original image; the target detection model is a convolutional neural network model;
acquiring a region-of-interest image according to the position of a region of interest corresponding to the target to be detected in the original image; the target to be detected is included in the detected target, the region of interest is determined according to a detection frame of the target to be detected, and the region-of-interest image is a part of the original image in the region of interest;
based on the position of the region of interest, the region of interest image and the full-image feature, a ranging result between a vehicle and the target to be measured is obtained by using a ranging model; wherein the ranging model is a convolutional neural network model;
The obtaining, by using a ranging model, a ranging result between a vehicle and the target to be measured based on the position of the region of interest, the region of interest image and the full-view feature includes:
normalizing the numerical values representing the position of the region of interest and expanding them into feature maps; each normalized numerical value is expanded to generate a corresponding feature map, and the pixel values in that feature map are all equal to the normalized numerical value;
scaling the region-of-interest image to a preset size and then normalizing to obtain a normalized region-of-interest image;
and inputting the feature map generated by expansion, the normalized region-of-interest image and the full-map feature into the ranging model for forward propagation, and obtaining the ranging result between the vehicle and the target to be measured, which is output by the ranging model.
2. The ranging method according to claim 1, wherein after the detecting the target in the original image using the target detection model and extracting the full-image feature of the original image, and before the acquiring the region-of-interest image according to the position of the region-of-interest corresponding to the target to be detected, the method further comprises:
And screening the targets with collision risk with the vehicle from the detected targets to serve as the targets to be detected.
3. The ranging method according to claim 1, wherein after the detecting the target in the original image using the target detection model and extracting the full-image feature of the original image, and before the acquiring the region-of-interest image according to the position of the region-of-interest corresponding to the target to be detected, the method further comprises:
and determining a region formed by expanding the detection frame of the target to be detected at the original position according to a preset proportion as a region of interest corresponding to the detection frame in the original image.
4. The ranging method according to claim 1, wherein the inputting the feature map generated by the expansion, the normalized region-of-interest image, and the full-map feature into the ranging model for forward propagation, to obtain the ranging result between the vehicle and the target to be measured output by the ranging model, includes:
extracting features of the normalized region-of-interest image by using the ranging model to obtain region-of-interest features;
fusing the region of interest features, the feature map generated by expansion and the full map features by using the ranging model to obtain fused features;
And predicting the distance between the vehicle and the target to be measured based on the fusion characteristic by using the ranging model to obtain the ranging result.
5. The ranging method according to claim 4, wherein the feature extraction of the normalized region of interest image by using the ranging model to obtain a region of interest feature comprises:
performing feature extraction and downsampling on the normalized region-of-interest image by using at least one first convolution unit at the beginning of the ranging model to obtain a first intermediate feature; wherein the first convolution unit comprises a convolution layer and a pooling layer;
sequentially extracting the first intermediate feature by using a plurality of residual blocks in the ranging model, and summing the extracted feature of the last residual block with the feature extracted by at least one residual block in the residual blocks by using a short-circuit structure of the last residual block to obtain a second intermediate feature;
further fusing the second intermediate features by using a second convolution unit in the ranging model to obtain the region-of-interest features; wherein the second convolution unit comprises two convolution layers.
6. The ranging method as defined in claim 4, wherein the fusion feature comprises a plurality of channels, each channel corresponding to a preset distance range, the predicting the distance between the vehicle and the target to be measured based on the fusion feature using the ranging model, and obtaining the ranging result comprises:
predicting the distance between the vehicle and the target to be detected and positioned in the distance range corresponding to the channel and the confidence thereof based on the fusion characteristics of the channel by using at least one full connection layer arranged for each channel in the ranging model;
after the obtaining the ranging result, the method further comprises:
and determining the distance with the highest confidence as the final distance between the vehicle and the target to be detected.
7. The ranging method as defined in claim 4, wherein the fusion feature comprises a plurality of channels, each channel corresponding to a range of distances, wherein predicting the distance between the vehicle and the target under test based on the fusion feature using the ranging model to obtain the ranging result comprises:
predicting a distance coefficient and a confidence coefficient between the vehicle and the target to be detected, which are effective in a distance range corresponding to the channel, based on fusion characteristics of the channel by using at least one full connection layer arranged for each channel in the ranging model;
After the obtaining the ranging result, the method further comprises:
determining the distance coefficient with the highest confidence as a final distance coefficient;
and calculating the product of the final distance coefficient and the reference distance in the effective distance range to obtain the distance between the vehicle and the target to be detected.
8. The ranging method as recited in claim 7 wherein the reference distance within any one of the distance ranges is the median of that distance range.
9. Ranging method according to any of claims 4-8, characterized in that after said obtaining the region of interest feature, the method further comprises:
and re-predicting the category and/or detection frame of the target to be detected based on the region of interest characteristics by using the ranging model.
10. The ranging method as defined in claim 1 wherein prior to said detecting an object in an original image using an object detection model and extracting full-image features of said original image, said method further comprises:
training the initial target detection model to obtain a preliminarily trained target detection model;
training the initial ranging model to obtain a preliminarily trained ranging model; wherein, when the ranging model is trained, only the position of a region of interest in a training image and a region-of-interest image are used, without the full-image feature, and the required position of the region of interest and the region-of-interest image are obtained based on a target detection result of the preliminarily trained target detection model on the training image;
continuing to train the preliminarily trained target detection model and the preliminarily trained ranging model to obtain the trained target detection model and the trained ranging model; wherein, when the ranging model is trained, the position of the region of interest in the training image, the region-of-interest image and the full-image feature are used, the required position of the region of interest and the region-of-interest image are obtained based on the target detection result of the current target detection model on the training image, and the required full-image feature is extracted from the training image by the current target detection model.
11. A ranging apparatus, comprising:
the target detection module is used for detecting a target in an original image by using a target detection model and extracting the full-image characteristics of the original image; the target detection model is a convolutional neural network model;
the region of interest processing module is used for acquiring a region-of-interest image according to the position of a region of interest corresponding to the target to be detected in the original image; the target to be detected is included in the detected target, the region of interest is determined according to a detection frame of the target to be detected, and the region-of-interest image is a part of the original image in the region of interest;
The ranging module is used for obtaining a ranging result between a vehicle and the target to be measured by using a ranging model based on the position of the region of interest, the region of interest image and the full-image feature; wherein the ranging model is a convolutional neural network model;
the ranging module obtains a ranging result between a vehicle and the target to be measured by using a ranging model based on the position of the region of interest, the region-of-interest image and the full-image feature, which includes: normalizing the numerical values representing the position of the region of interest and expanding them into feature maps; each normalized numerical value is expanded to generate a corresponding feature map, and the pixel values in that feature map are all equal to the normalized numerical value; scaling the region-of-interest image to a preset size and then normalizing it to obtain a normalized region-of-interest image; and inputting the feature maps generated by expansion, the normalized region-of-interest image and the full-image feature into the ranging model for forward propagation, to obtain the ranging result between the vehicle and the target to be measured output by the ranging model.
12. A computer readable storage medium, having stored thereon computer program instructions which, when read and executed by a processor, perform the method of any of claims 1-10.
13. An electronic device comprising a memory and a processor, the memory having stored therein computer program instructions that, when read and executed by the processor, perform the method of any of claims 1-10.
CN202110139380.7A 2021-01-29 2021-01-29 Distance measurement method and device, storage medium and electronic equipment Active CN112906691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110139380.7A CN112906691B (en) 2021-01-29 2021-01-29 Distance measurement method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110139380.7A CN112906691B (en) 2021-01-29 2021-01-29 Distance measurement method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112906691A CN112906691A (en) 2021-06-04
CN112906691B true CN112906691B (en) 2023-11-24

Family

ID=76121145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110139380.7A Active CN112906691B (en) 2021-01-29 2021-01-29 Distance measurement method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112906691B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783654A (en) * 2020-06-30 2020-10-16 苏州科达科技股份有限公司 Vehicle weight identification method and device and electronic equipment
CN112101361A (en) * 2020-11-20 2020-12-18 深圳佑驾创新科技有限公司 Target detection method, device and equipment for fisheye image and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286990A1 (en) * 2018-03-19 2019-09-19 AI Certain, Inc. Deep Learning Apparatus and Method for Predictive Analysis, Classification, and Feature Detection
CN110472627B (en) * 2019-07-02 2022-11-08 五邑大学 End-to-end SAR image recognition method, device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783654A (en) * 2020-06-30 2020-10-16 苏州科达科技股份有限公司 Vehicle weight identification method and device and electronic equipment
CN112101361A (en) * 2020-11-20 2020-12-18 深圳佑驾创新科技有限公司 Target detection method, device and equipment for fisheye image and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research Progress of Convolutional Neural Networks in Radar Automatic Target Recognition; He Fengshou; He You; Liu Zhunga; Xu Cong'an; Journal of Electronics & Information Technology (01); full text *

Also Published As

Publication number Publication date
CN112906691A (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
US11783568B2 (en) Object classification using extra-regional context
CN113264066B (en) Obstacle track prediction method and device, automatic driving vehicle and road side equipment
CN113221677B (en) Track abnormality detection method and device, road side equipment and cloud control platform
CN111178245A (en) Lane line detection method, lane line detection device, computer device, and storage medium
US20220156483A1 (en) Efficient three-dimensional object detection from point clouds
CN113591872A (en) Data processing system, object detection method and device
CN113947188A (en) Training method of target detection network and vehicle detection method
CN112906691B (en) Distance measurement method and device, storage medium and electronic equipment
CN113706705B (en) Image processing method, device, equipment and storage medium for high-precision map
CN115527187A (en) Method and device for classifying obstacles
CN114677662A (en) Method, device, equipment and storage medium for predicting vehicle front obstacle state
CN113901903A (en) Road identification method and device
CN113705432A (en) Model training and three-dimensional target detection method, device, equipment and medium
CN116168366B (en) Point cloud data generation method, model training method, target detection method and device
CN116663650B (en) Training method of deep learning model, target object detection method and device
CN111338336B (en) Automatic driving method and device
CN113963027B (en) Uncertainty detection model training method and device, and uncertainty detection method and device
CN114882243B (en) Target detection method, electronic device, and computer-readable storage medium
CN114407916B (en) Vehicle control and model training method and device, vehicle, equipment and storage medium
CN115431968B (en) Vehicle controller, vehicle and vehicle control method
CN114092739B (en) Image processing method, apparatus, device, storage medium, and program product
US20220318456A1 (en) Simulation method based on three-dimensional contour, storage medium, computer equipment
CN117911671A (en) Target detection method and device, electronic equipment and storage medium
CN117726801A (en) Target detection method, training method and device of target detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant