WO2021083126A1 - Target detection and intelligent driving methods and apparatuses, device, and storage medium - Google Patents


Info

Publication number
WO2021083126A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
different scales
similarity
maps
Prior art date
Application number
PCT/CN2020/123918
Other languages
French (fr)
Chinese (zh)
Inventor
吕书畅
程光亮
石建萍
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201911063316.4A (external priority, patent CN112749602A)
Priority claimed from CN201911054823.1A (external priority, patent CN112749710A)
Application filed by 北京市商汤科技开发有限公司
Priority to JP2021539414A (patent JP2022535473A)
Priority to KR1020217020811A (patent KR20210098515A)
Publication of WO2021083126A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformation in the plane of the image
    • G06T3/40: Scaling the whole image or part thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • This application relates to the field of image processing, and in particular to a method, apparatus, device, and storage medium for target detection and intelligent driving.
  • Single-sample semantic segmentation is an emerging problem in the fields of computer vision and intelligent image processing.
  • Single-sample semantic segmentation aims to use a single training sample of a certain category to give the segmentation model the ability to recognize the pixels of that category.
  • The proposal of single-sample semantic segmentation can effectively reduce the sample collection and annotation costs of traditional image semantic segmentation.
  • Single-sample image semantic segmentation aims to train on only a single sample of a certain category of objects, so that the segmentation model has the ability to recognize all pixels of objects of that category.
  • A target query can locate the target contained in an image by means of image semantic segmentation.
  • Image semantic segmentation includes single-sample image semantic segmentation. Traditional image semantic segmentation requires a large number of training images for all categories of objects to ensure model performance, which brings extremely high labeling costs.
  • The purpose of this application is to provide a target detection and intelligent driving method, apparatus, device, and storage medium, to solve the existing technical problem of low target detection accuracy.
  • A target detection method is provided, which includes: performing feature extraction at a plurality of different scales on a first image and a second image, respectively, to obtain a plurality of first feature maps of different scales and a plurality of second feature maps of different scales; and determining the target to be queried in the second image according to the plurality of first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales. The label of the first image is the result of labeling the target to be queried contained in the first image.
  • An intelligent driving method is provided, which includes: collecting road images; using the target detection method described above to query the collected road images for the target to be queried according to a support image and the label of the support image, wherein the label of the support image is the result of labeling the target contained in the support image that is of the same category as the target to be queried; and controlling, according to the query result, the intelligent driving device that collects the road images.
  • A target detection apparatus is provided, which includes a feature extraction module and a determination module. The feature extraction module is used to perform feature extraction at a plurality of different scales on a first image and a second image, respectively, to obtain a plurality of first feature maps of different scales and a plurality of second feature maps of different scales. The determination module is used to determine the target to be queried in the second image according to the plurality of first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales; the label of the first image is the result of labeling the target to be queried contained in the first image.
  • An intelligent driving apparatus is provided, which includes: a collection module for collecting road images; a query module for using the target detection method described above to query the collected road images for the target to be queried according to a support image and the label of the support image, wherein the label of the support image is the result of labeling the target contained in the support image that is of the same category as the target to be queried; and a control module for controlling, according to the query result, the intelligent driving device that collects the road images.
  • A target detection device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the target detection method described above when the program is executed.
  • A smart driving device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the smart driving method described above when the program is executed.
  • A computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the steps of the target detection method described above are realized, or the steps of the smart driving method described above are realized.
  • A chip for running instructions is provided, including a memory and a processor; the memory stores code and data and is coupled with the processor, and the processor runs the code in the memory so that the chip executes the steps of the target detection method described above, or the steps of the smart driving method described above.
  • A program product containing instructions is provided; when the program product runs on a computer, the computer is made to execute the steps of the target detection method described above, or the steps of the smart driving method described above.
  • A computer program is provided; when the computer program is executed by a processor, it executes the steps of the target detection method described above, or the steps of the smart driving method described above.
  • The feature expression ability of the first image and the second image is improved, so that more information for judging the similarity between the first image and the second image can be obtained. Subsequent target detection thus has a richer feature input when facing a single sample, which improves the segmentation accuracy of single-sample semantic segmentation and thereby the target detection accuracy.
  • FIG. 1 is a flowchart of a target detection method provided by an embodiment of the application.
  • FIG. 2 is a schematic structural diagram of a target detection model provided by an embodiment of the application.
  • FIG. 3 is a flowchart of a target detection method provided by an embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a symmetric cascade structure provided by an embodiment of the application.
  • FIG. 5 is a flowchart of a target detection method provided by an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of a target detection model provided by another embodiment of this application.
  • FIG. 7 is a schematic flowchart of a target query method provided by another embodiment of this application.
  • FIG. 8 is a schematic flowchart of a target query method provided by another embodiment of this application.
  • FIG. 9 is a schematic flowchart of a target query method provided by still another embodiment of this application.
  • FIG. 10 is a schematic flowchart of a target query method provided by another embodiment of this application.
  • FIG. 11 is a schematic flowchart of a smart driving method provided by an embodiment of the application.
  • FIG. 12 is a schematic diagram of a target detection process provided by an embodiment of the application.
  • FIG. 13 is a schematic diagram of a generation module and an aggregation module provided by an embodiment of the application.
  • FIG. 14 is a schematic diagram comparing the similarity feature extraction method in the target query method provided by the embodiment of the application with the extraction method in the related technology.
  • FIG. 15 is a schematic structural diagram of a target detection device provided by an embodiment of the application.
  • FIG. 16 is a schematic structural diagram of a smart driving device provided by an embodiment of the application.
  • FIG. 17 is a schematic structural diagram of a target detection device provided by an embodiment of the application.
  • FIG. 18 is a schematic structural diagram of a smart driving device provided by an embodiment of the application.
  • The single-sample image semantic segmentation deep learning model performs feature extraction on the query set image and the support set image respectively, where the query set image is the image that needs to be queried, and the support set image contains the target to be queried.
  • the target to be queried in the support set image is labeled in advance to obtain the label information.
  • Combining the label information, the target in the query set image is determined by the similarity between the features of the support set image and the feature of the query set image.
  • However, the deep learning model expresses the support set image as a single feature vector, so the feature expression ability of the support set image is limited. This leads to an insufficient ability of the model to describe the similarity between the support set image features and the query image pixel features, resulting in low accuracy of the target query.
  • the first image may be the above-mentioned support set image
  • the second image may be the above-mentioned query set image.
  • In the embodiments of this application, features at multiple different scales are extracted from the first image and the second image. The first image and the second image are thus expressed as multiple features of different scales, which improves their feature expression ability, so that more information for judging the similarity between the first image and the second image can be obtained, thereby improving the accuracy of the target query.
  • FIG. 1 is a flowchart of a target detection method provided by an embodiment of the application.
  • the embodiments of the present application provide a target detection method. The specific steps of the method are as follows:
  • Step 101 Perform multiple feature extractions of different scales on the first image and the second image, respectively, to obtain multiple first feature maps of different scales and multiple second feature maps of different scales.
  • the second image is an image for which a target query needs to be performed.
  • Through the target query, the pixel area where the target to be queried is located in the second image can be detected.
  • the target to be queried can be determined according to actual conditions, for example, it can be an animal, plant, person, vehicle, etc., which is not limited here.
  • The label of the first image may be contour information, pixel information, etc. of the target to be queried in the first image, which is not limited here.
  • The label may be a binarized label, in which the pixel value of the area where the target is located is different from the pixel values of other areas in the image.
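  As an illustrative sketch (the array and function names here are hypothetical, not from the application), a binarized label of this kind can be built by setting the pixels of the target area to 1 and all other pixels to 0:

```python
import numpy as np

def binarize_label(annotation: np.ndarray, target_class: int) -> np.ndarray:
    """Return a binarized label: 1 where the target class lies, 0 elsewhere."""
    return (annotation == target_class).astype(np.float32)

# Toy 4x4 annotation map in which class 5 marks the target to be queried.
annotation = np.array([
    [0, 0, 5, 5],
    [0, 0, 5, 5],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])
label = binarize_label(annotation, target_class=5)
```

  The resulting map distinguishes the target region from the background purely by pixel value, matching the binarized-label description above.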
  • the target detection method of this embodiment can be applied to the target detection process of a vehicle.
  • The vehicle can be an autonomous vehicle or a vehicle equipped with an Advanced Driver Assistance System (ADAS). It is understandable that the target detection method can also be applied to robots.
  • the first image and the second image may be acquired by an image acquisition device on the vehicle, and the image acquisition device may be a camera, such as a monocular camera, a binocular camera, and the like.
  • Multiple features of different scales can be extracted from the first image through a feature extraction algorithm to obtain multiple first feature maps of different scales; likewise, multiple features of different scales can be extracted from the second image to obtain multiple second feature maps of different scales.
  • The feature extraction algorithm can be a convolutional neural network (CNN), a local binary pattern (LBP) algorithm, a scale-invariant feature transform (SIFT) algorithm, a histogram of oriented gradients (HOG) algorithm, etc., which is not limited here.
  • the target detection method of this embodiment can be applied to the target detection model shown in FIG. 2.
  • the target detection model 20 includes: a feature extraction network 21, a scale transformation module 22 and a convolution network 23.
  • The feature extraction network 21 is a neural network, and it can adopt an existing network architecture, such as a VGG (Visual Geometry Group) network, a ResNet network, or another general image feature extraction network.
  • The first image and the second image can be input into the feature extraction network 21 at the same time for feature extraction at multiple different scales; alternatively, two feature extraction networks 21 with the same network architecture and network parameters can be set up, and the first image and the second image are respectively input into the two networks for feature extraction at multiple different scales.
  • multiple different scales can be pre-designated, and for each scale, feature extraction of the scale is performed on the first image and the second image respectively to obtain the first feature map and the second feature map of the scale.
  • Step 102 Determine the target to be queried in the second image according to the first feature maps of multiple different scales, the label of the first image, and the second feature maps of corresponding scales; the label of the first image is the result of labeling the target to be queried contained in the first image.
  • For the first feature map and the second feature map of each scale, the label information of the first image can be combined to obtain a similarity map that characterizes the similarity between the first feature map and the second feature map at that scale. Then, through the similarity maps of the different scales, the target to be queried in the second image can be determined.
  • In summary, multiple first feature maps of different scales and multiple second feature maps of different scales are obtained; the target to be queried in the second image is determined according to the multiple first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales, where the label of the first image is the result of labeling the target to be queried contained in the first image. The feature expression ability of the first image and the second image is thereby improved, so that more information for judging the similarity between the first image and the second image can be obtained.
  • The first image contains a target of the same category as the target to be queried; the posture, texture, color, and other attributes of that target in the first image may differ from those of the target to be queried in the second image. For example, if the target to be queried is a traffic light, the traffic lights contained in the first image may be arranged vertically while the traffic lights in the second image are arranged horizontally; that is, the states of the targets in the first image and the second image can be inconsistent.
  • multiple feature extractions of different scales are performed on the first image and the second image respectively to obtain multiple first feature maps of different scales and multiple second feature maps of different scales, including:
  • Step 301 Perform feature extraction on the first image and the second image respectively to obtain a first feature map and a second feature map.
  • the feature extraction network 21 includes a first convolution module 211, a second convolution module 212, and a third convolution module 213.
  • the first convolution module 211 includes three convolution layers connected in sequence
  • the second convolution module 212 and the third convolution module 213 each include one convolution layer.
  • The first image and the second image can be simultaneously input into the first convolution module 211 shown in FIG. 2, and the first convolution module 211 outputs a feature extraction result for each of the first image and the second image. These results are then input into the second convolution module 212, which outputs its own feature extraction results for the two images and passes them to the third convolution module 213. The third convolution module 213 continues feature extraction according to the results output by the second convolution module 212 and outputs the feature extraction result of the first image and that of the second image, which are the first feature map and the second feature map, respectively.
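  The chained modules above can be sketched as follows. This is a toy NumPy illustration, not the application's network: the random 3x3 kernels stand in for learned weights, and the same kernel list is reused for both images to mimic shared network parameters.

```python
import numpy as np

def conv2d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive same-padded 2D convolution on a single-channel map."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def extract_features(image: np.ndarray, kernels: list) -> np.ndarray:
    """Module 211 (three conv layers), then module 212 (one), then module 213 (one)."""
    x = image
    for k in kernels:  # 5 kernels in total: 3 + 1 + 1
        x = np.maximum(conv2d(x, k), 0.0)  # convolution + ReLU
    return x

rng = np.random.default_rng(0)
kernels = [rng.standard_normal((3, 3)) * 0.1 for _ in range(5)]
first_image = rng.random((8, 8))
second_image = rng.random((8, 8))
first_feature_map = extract_features(first_image, kernels)
second_feature_map = extract_features(second_image, kernels)
```

  Both outputs keep the spatial size of the inputs here; in the actual model the subsequent scale transformation module produces the different sizes.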
  • Step 302 Perform multiple scale transformations on the first feature map and the second feature map to obtain multiple first feature maps of different scales and multiple second feature maps of different scales.
  • The first feature map and the second feature map are respectively input into the scale transformation module 22, which performs multiple scale conversions on each of them, so that the first image and the second image are respectively expressed as multiple feature maps of different sizes.
  • performing multiple scale conversions on the first feature map and the second feature map respectively includes: performing down-sampling on the first feature map and the second feature map at least twice, respectively.
  • Performing down-sampling on the first feature map and the second feature map at least twice each includes: down-sampling the first feature map at a first sampling rate to obtain a first feature map down-sampled by a first multiple relative to the first image, and then down-sampling that result at a second sampling rate to obtain a first feature map down-sampled by a second multiple relative to the first image, where the second multiple is greater than the first multiple.
  • Correspondingly, the second feature map is down-sampled at the first sampling rate to obtain a second feature map down-sampled by the first multiple relative to the second image, and that result is then down-sampled at the second sampling rate to obtain a second feature map down-sampled by the second multiple relative to the second image.
  • After the first feature map down-sampled by the second multiple relative to the first image and the second feature map down-sampled by the second multiple relative to the second image have been obtained, the method of the embodiment of the present application further includes: down-sampling these two feature maps at a third sampling rate to obtain a first feature map down-sampled by a third multiple relative to the first image and a second feature map down-sampled by a third multiple relative to the second image, where the third multiple is greater than the second multiple.
  • the first multiple, the second multiple, and the third multiple are 8 times, 16 times, and 32 times, respectively.
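  A minimal sketch of the three chained down-sampling steps, using average pooling as a stand-in for whatever sampling operation the units actually implement: the first step down-samples by 8x, and each of the next two down-samples the previous output by a further 2x, yielding 16x and 32x overall.

```python
import numpy as np

def downsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (H, W) map by `factor` along both axes."""
    h, w = x.shape[0] // factor, x.shape[1] // factor
    return x[:h * factor, :w * factor].reshape(h, factor, w, factor).mean(axis=(1, 3))

def cascade(feature_map: np.ndarray):
    """Three sampling steps in sequence: 8x, then 16x, then 32x overall."""
    s8 = downsample(feature_map, 8)   # first step: 8x
    s16 = downsample(s8, 2)           # second step: 8 * 2 = 16x overall
    s32 = downsample(s16, 2)          # third step: 16 * 2 = 32x overall
    return s8, s16, s32

feature_map = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
f8, f16, f32 = cascade(feature_map)
```

  The same cascade is applied independently to the first and second feature maps, which is what makes the structure of FIG. 4 symmetrical.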
  • the scale conversion module 22 may adopt a symmetrical cascade structure.
  • The symmetrical cascade structure includes two cascade structures arranged symmetrically with each other, where each cascade structure includes three successively connected sampling units.
  • For ease of description, the two cascade structures are referred to as the first cascade structure 41 and the second cascade structure 42, the three sampling units of the first cascade structure are called the first sampling unit, the second sampling unit, and the third sampling unit, and the three sampling units of the second cascade structure are called the fourth sampling unit, the fifth sampling unit, and the sixth sampling unit.
  • The sampling rates of the first sampling unit and the fourth sampling unit are the same, the sampling rates of the second sampling unit and the fifth sampling unit are the same, and the sampling rates of the third sampling unit and the sixth sampling unit are the same.
  • The first sampling unit and the fourth sampling unit respectively use the first sampling rate to sample the first feature map and the second feature map, thereby outputting a first feature map and a second feature map that are down-sampled by 8 times relative to the first image and the second image, respectively.
  • the symmetric cascade structure shown in FIG. 4 may be used to perform multiple scale conversions on the first feature map and the second feature map respectively.
  • The first feature map is input into the first sampling unit, the second sampling unit, and the third sampling unit sequentially, and these units perform down-sampling at different sampling rates, thereby outputting first feature maps down-sampled 8 times, 16 times, and 32 times relative to the size of the first image.
  • Likewise, the second feature map is input into the fourth sampling unit, the fifth sampling unit, and the sixth sampling unit sequentially, and these units perform down-sampling at different sampling rates, thereby outputting second feature maps down-sampled 8 times, 16 times, and 32 times relative to the size of the second image.
  • The first cascade structure 41 and the second cascade structure 42 may also be two-level cascade structures, in which case the first cascade structure 41 and the second cascade structure 42 each include two sampling units connected in sequence.
  • Determining the target to be queried in the second image according to the plurality of first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales includes: determining multiple first feature vectors of different scales according to the multiple first feature maps of different scales and the label of the first image; calculating the multiple first feature vectors of different scales and the second feature maps of corresponding scales according to a preset calculation rule to obtain a calculation result; determining the mask image of the second image according to the calculation result; and determining the target to be queried in the second image according to the mask image.
  • the preset calculation rules include: inner product calculation rules, or cosine distance calculation rules.
  • the label of the first image refers to information indicating the target or the category of the object in the image.
  • The first feature map of each scale and the label of the first image can be combined to form a feature vector. For example, the first feature maps down-sampled 8, 16, and 32 times relative to the first image are each combined with the correspondingly interpolated label of the first image to form feature vectors, hereinafter referred to as the first feature vector, the second feature vector, and the third feature vector. Then an inner product operation is performed between the first feature vector and the second feature map down-sampled 8 times relative to the second image, between the second feature vector and the second feature map down-sampled 16 times relative to the second image, and between the third feature vector and the second feature map down-sampled 32 times relative to the second image, to obtain three probability maps of different scales.
  • The sizes of the three probability maps of different scales are the same as those of the first, second, and third feature vectors, respectively; equivalently, they are the same as the sizes of the first or second feature maps down-sampled 8, 16, and 32 times relative to the first image or the second image. After that, these three probability maps are input into the convolutional network 23, which concatenates the three probability maps and convolves the concatenated result, so as to output the mask image of the second image and achieve the target detection effect on the second image.
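  The per-scale computation can be sketched as follows. This is an assumed realization with illustrative names: the feature vector is obtained by masked average pooling of the support feature map with the per-scale binarized label, and each probability map is the inner product of that vector with every pixel feature of the query map.

```python
import numpy as np

def masked_prototype(support_feat: np.ndarray, label: np.ndarray) -> np.ndarray:
    """Pool a (C, H, W) support feature map over the labeled target area."""
    masked = support_feat * label[None]          # zero out non-target pixels
    return masked.sum(axis=(1, 2)) / (label.sum() + 1e-8)

def probability_map(query_feat: np.ndarray, proto: np.ndarray) -> np.ndarray:
    """Inner product of the feature vector with every query pixel feature."""
    return np.einsum('chw,c->hw', query_feat, proto)

rng = np.random.default_rng(1)
channels = 4
scales = [(8, 8), (4, 4), (2, 2)]  # shapes of the 8x, 16x, 32x feature maps
prob_maps = []
for h, w in scales:
    support_feat = rng.random((channels, h, w))
    query_feat = rng.random((channels, h, w))
    label = np.zeros((h, w))
    label[: h // 2] = 1.0              # toy binarized label, resized per scale
    proto = masked_prototype(support_feat, label)
    prob_maps.append(probability_map(query_feat, proto))
# In the model, the three probability maps are then concatenated and
# convolved (convolutional network 23) to produce the mask image.
```

  A cosine-distance rule would only add per-pixel normalization of `query_feat` and `proto` before the same inner product.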
  • In an embodiment, determining the target to be queried in the second image includes: using the first feature maps of multiple different scales, the label of the first image, and the second feature maps of corresponding scales as guidance information for third feature maps of corresponding scales, to determine the target to be queried in the second image; the third feature maps are determined according to the second image, and the second feature map and the third feature map of the same scale are different.
  • Compared with the foregoing embodiment, this embodiment adds a third feature map that is guided by the inner product operation results of different scales obtained in the foregoing embodiment, thereby further improving the accuracy of subsequent target detection.
• the third feature map can be extracted using a feature extraction network other than the feature extraction network 21 shown in FIG. 2.
• the feature extraction network for the third feature map differs from that for the first and second feature maps in network architecture and network parameters; for example, the convolution kernels are different.
  • FIG. 5 is a flowchart of a target detection method provided by another embodiment of this application.
  • the target detection method provided in this embodiment specifically includes the following steps:
• Step 501: Determine multiple first feature vectors of different scales according to the multiple first feature maps of different scales and the label of the first image.
  • Step 502 Calculate multiple first feature vectors of different scales and second feature maps of corresponding scales according to a preset calculation rule to obtain multiple mask images of different scales.
  • the mask image obtained in this step will be used as guidance information to guide the third feature map.
  • Step 503 Determine the target to be queried in the second image according to the multiplication result of the multiple mask images of different scales and the third feature map of corresponding scales.
• multiplying the multiple mask images of different scales with the third feature maps of the corresponding scales means that, for a mask image and a third feature map of the same scale, the value (a scalar) of the mask image at each position is multiplied by the value (a vector) of the third feature map at the same position.
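the scalar-times-vector multiplication described above amounts to broadcasting the mask over the channel dimension. A minimal NumPy sketch, where the array sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random((16, 16))               # (H, W): one scalar per position
third_feat = rng.random((64, 16, 16))     # (C, H, W): one vector per position

# broadcast the mask scalar over the channel dimension at every position
guided = mask[None, :, :] * third_feat    # (C, H, W)
```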
  • the method of this embodiment can be applied to the detection model shown in FIG. 6.
• the detection model shown in FIG. 6 differs from the detection model shown in FIG. 2 in that some convolutional layers are added on the basis of the feature extraction network 21 shown in FIG. 2, and a third cascade structure is added on the basis of the symmetric cascade structure shown in FIG. 2.
  • the structure of the third cascade structure is the same as the structure of the first cascade structure or the second cascade structure, and its implementation principle can be referred to the introduction of the foregoing embodiment.
  • the detection model 60 includes a feature extraction network 61, a scale conversion module 62 and a convolutional network 63.
  • the feature extraction network 61 includes a fourth convolution module 611, a fifth convolution module 612, a sixth convolution module 613, a seventh convolution module 614, an eighth convolution module 615, a ninth convolution module 616, and a Ten convolution module 617.
  • the sixth convolution module 613 (the third convolution module 213 in FIG. 2) is also connected to the seventh convolution module 614 and the fourth convolution module.
  • the eighth convolution module 615, the ninth convolution module 616, and the tenth convolution module 617 are sequentially connected.
  • the outputs of the sixth convolution module 613 and the seventh convolution module 614 are also used as the input of the eighth convolution module 615 and the ninth convolution module 616, respectively.
  • the output of the tenth convolution module 617 is used as the input of the third cascade structure 33.
• the seventh convolution module 614 performs feature extraction according to the output result of the sixth convolution module 613 to obtain the first feature map and the second feature map, which are then input to the scale conversion module 62.
• the scale conversion module 62 has the same structure and principle as the scale conversion module 22 shown in FIG. 2.
  • the scale conversion module 62 performs different scale conversions on the first feature map and the second feature map.
  • the label information of the first image is also input into the scale conversion module 62.
  • the scale conversion module 62 outputs a plurality of mask images mask32x, mask16x, and mask8x of different scales according to the first feature map, the second feature map of different scales, and the label information of the first image.
• Mask32x, mask16x, and mask8x respectively represent mask images downsampled by 32, 16, and 8 times relative to the first feature map or the second feature map.
• the mask images mask32x, mask16x, and mask8x output by the scale conversion module 62 are then multiplied, at corresponding pixel positions, with the second feature maps output by the third cascade structure that are downsampled by 8, 16, and 32 times relative to the second image, resulting in three probability maps.
• after that, the three probability maps are input into the convolutional network for convolution and related operations, so as to realize target detection on the second image.
  • the feature map extracted by the sixth convolution module 613 can also be directly input into the third cascade structure.
  • this embodiment may also directly input the feature map for the first image and the feature map for the second image output by the sixth convolution module 613 into the first cascade structure and the second cascade structure, respectively.
• the first convolution module, the second convolution module, and the third convolution module shown in FIG. 2 form a standard VGG network architecture; those skilled in the art can increase or decrease the number of convolution modules in the VGG network architecture shown in FIG. 2 according to actual needs.
• a plurality of first feature vectors of different scales are determined according to a plurality of first feature maps of different scales and the label of the first image; the plurality of first feature vectors of different scales are then calculated with the second feature maps of corresponding scales according to a preset calculation rule to obtain a calculation result; a mask image of the second image is determined according to the calculation result; and the target to be queried in the second image is determined according to the mask image.
• the multiple mask images at different scales can guide the segmentation of the second feature maps at the corresponding scales (the mask images mask32x, mask16x, and mask8x output by the scale conversion module 62 are multiplied, at corresponding pixel positions, with the second feature maps output by the third cascade structure that are downsampled by 8, 16, and 32 times relative to the second image).
• since the output result of the fifth convolution module 612 for the second image is input to the sixth convolution module, the sixth convolution module can fuse the output result of the fifth convolution module with the output result for the second image and then perform feature extraction again. In this way, richer feature information can be extracted, and during backpropagation the fed-back loss function also carries richer information, so the network parameters of each convolution module in the feature extraction network can be adjusted more effectively. Therefore, in the subsequent target detection process, the detection accuracy of the detection model can be further improved.
  • FIG. 7 is a schematic flowchart of a target detection method provided by another embodiment of this application. This embodiment describes in detail the specific implementation process of determining the target to be queried in the second image based on multiple first feature maps of different scales and label information of the first image, and second feature maps of corresponding scales. As shown in Figure 7, the method includes:
  • S701 Perform feature extraction of multiple different scales on the first image and the second image respectively, and generate multiple first feature maps of different scales and multiple second feature maps of different scales.
  • S701 is similar to S101 in the embodiment of FIG. 1, and will not be repeated here.
• S702. Determine multiple similarity maps of different scales according to the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales; a similarity map of one scale represents the similarity between the first feature map and the second feature map of that scale.
  • the similarity map of each scale contains the similarity information of the features between the first feature map and the second feature map of the scale.
• S702 may include: determining a plurality of first feature vectors of different scales according to the plurality of first feature maps of different scales and the label information of the first image; and multiplying the plurality of first feature vectors of different scales element by element with the second feature maps of corresponding scales to obtain multiple similarity maps of different scales.
• for each scale, the first feature map of the scale and the label information of the first image may be multiplied to obtain the first feature vector of the scale; the first feature vector of this scale and the second feature map of this scale are then multiplied element by element to obtain the similarity map of this scale.
• in the similarity map of this scale, a vector at each pixel location expresses the similarity between the first feature vector and the second feature map at that location.
• This embodiment generates similarity maps of different scales by multiplying multiple first feature vectors of different scales element by element with second feature maps of corresponding scales. Replacing the inner product or cosine distance with element-wise multiplication allows the similarity map of each scale to contain multi-channel similarity information, expresses the similarity features more fully, and further improves the accuracy of the target query.
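the contrast between the two operations can be sketched in a few lines of NumPy; the channel count and spatial sizes are assumptions for illustration:

```python
import numpy as np

C, H, W = 64, 16, 16
rng = np.random.default_rng(0)
vec = rng.random(C)              # first feature vector at this scale
feat2 = rng.random((C, H, W))    # second feature map at this scale

# inner product collapses the channel dimension: single-channel similarity map
inner = np.einsum("c,chw->hw", vec, feat2)    # (H, W)

# element-wise multiplication keeps every channel: multi-channel similarity map
elementwise = vec[:, None, None] * feat2      # (C, H, W)
```

summing the element-wise result over channels recovers the inner product, which shows that the multi-channel map retains strictly more information for the subsequent convolutions to exploit.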
  • similarity maps of different scales can be converted into similarity maps of the same scale through upsampling, and then integrated to obtain an integrated similarity map.
  • it can be implemented by either of the following two implementation manners, which will be described separately below.
• S703 may include: up-sampling the multiple similarity maps of different scales to obtain multiple similarity maps of the same scale; and adding the multiple similarity maps of the same scale to obtain the integrated similarity map.
• the multiple similarity maps of different scales may each be up-sampled to the same scale and then added, so as to obtain the integrated similarity map.
• take three similarity maps A, B, and C as an example, with scales m1, m2, and m3, where m1 > m2 > m3. B and C can be up-sampled separately to scale m1, and then A and the up-sampled B and C are added to obtain the integrated similarity map; at this time, the scale of the integrated similarity map is m1.
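a minimal NumPy sketch of this first integration manner; the channel count, toy scales, and nearest-neighbour up-sampling are assumptions (a real implementation might use bilinear interpolation):

```python
import numpy as np

def upsample_nearest(x, factor):
    # nearest-neighbour up-sampling of the two spatial axes
    return x.repeat(factor, axis=-2).repeat(factor, axis=-1)

rng = np.random.default_rng(0)
A = rng.random((4, 32, 32))   # similarity map at scale m1
B = rng.random((4, 16, 16))   # similarity map at scale m2
C = rng.random((4, 8, 8))     # similarity map at scale m3

# raise B and C to scale m1, then add all three
integrated = A + upsample_nearest(B, 2) + upsample_nearest(C, 4)
```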
• in other embodiments, S703 may include: up-sampling the similarity map of the smallest scale to the next-smallest scale, adding the two, and repeating until a single similarity map remains.
• take three similarity maps A, B, and C as an example, with scales m1, m2, and m3, where m1 > m2 > m3.
• C can be up-sampled first to scale m2, and B and the up-sampled C are added to obtain a new similarity map D, whose scale is m2. D is then up-sampled to scale m1, and A and the up-sampled D are added to obtain the final integrated similarity map.
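the progressive pairwise integration can be sketched the same way; the channel count, toy scales, and nearest-neighbour up-sampling are assumptions for illustration:

```python
import numpy as np

def upsample_nearest(x, factor):
    # nearest-neighbour up-sampling of the two spatial axes
    return x.repeat(factor, axis=-2).repeat(factor, axis=-1)

rng = np.random.default_rng(0)
A = rng.random((4, 32, 32))   # scale m1
B = rng.random((4, 16, 16))   # scale m2
C = rng.random((4, 8, 8))     # scale m3

D = B + upsample_nearest(C, 2)            # m3 -> m2, add to B
integrated = A + upsample_nearest(D, 2)   # m2 -> m1, add to A
```

with nearest-neighbour up-sampling the progressive route gives the same result as up-sampling everything to m1 in one shot; with bilinear interpolation the two manners can differ slightly.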
  • S704 Determine the target to be queried in the second image according to the integrated similarity map.
  • S704 is similar to S102 in the embodiment of FIG. 1, and will not be repeated here.
• multiple similarity maps of different scales are determined based on the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales; the multiple similarity maps of different scales are then integrated to obtain an integrated similarity map, from which the target to be queried in the second image is determined. Integrating similarities at multiple scales allows the integrated similarity map to carry information at multiple scales, further improving the accuracy of the target query.
  • FIG. 8 is a schematic flowchart of a target detection method provided by another embodiment of this application.
• the difference between this embodiment and the embodiment in FIG. 7 is that, after determining the multiple similarity maps of different scales in S702 and before integrating them in S703, the multiple similarity maps of different scales are multiplied element by element with the third feature maps of the corresponding scales to obtain multiple processed similarity maps of different scales.
  • the method includes:
  • S801 Perform feature extraction of multiple different scales on the second image and the first image respectively, and generate multiple first feature maps of different scales and multiple second feature maps of different scales.
  • S801 is similar to S101 in the embodiment of FIG. 1, and will not be repeated here.
  • S802 is similar to S702 in the embodiment of FIG. 7, and will not be repeated here.
  • S804 is similar to S704 in the embodiment of FIG. 7, and will not be repeated here.
  • S805 Determine the target to be queried in the second image according to the integrated similarity map.
• the plurality of similarity maps of different scales, determined according to the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales, are multiplied element by element with the third feature maps of the second image; the multiple similarity maps of different scales can thus guide the segmentation of the second image, thereby further improving the accuracy of the target query.
  • Fig. 9 is a flowchart of a target detection method provided by an embodiment of the present application.
  • the target detection method of the foregoing embodiment is executed by a neural network, which is trained by the following steps:
• Step 901: Perform feature extraction of a plurality of different scales on the first sample image and the second sample image respectively to obtain a plurality of fourth feature maps of different scales and a plurality of fifth feature maps of different scales; wherein both the first sample image and the second sample image contain objects of the first category.
• Step 902: Determine the object of the first category in the second sample image according to the fourth feature maps of multiple different scales, the label of the first sample image, and the fifth feature maps of the corresponding scales; the label of the first sample image is the result of labeling the objects of the first category contained in the first sample image.
• Step 903: Adjust the network parameters of the neural network according to the difference between the determined object of the first category in the second sample image and the label of the second sample image; the label of the second sample image is the result of labeling the objects of the first category contained in the second sample image.
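the adjustment in Step 903 can be sketched as computing a pixel-wise loss between the predicted mask and the second sample image's label. The loss function, array sizes, and helper names below are assumptions for illustration; the application does not specify a particular loss:

```python
import numpy as np

def binary_cross_entropy(pred, label, eps=1e-7):
    # pixel-wise BCE between the predicted first-category mask and the label
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(label * np.log(pred) + (1 - label) * np.log(1 - pred)).mean())

rng = np.random.default_rng(0)
pred = rng.random((32, 32))                           # predicted object mask
label = (rng.random((32, 32)) > 0.5).astype(float)    # second sample image's label
loss = binary_cross_entropy(pred, label)
# in a real framework, loss.backward() / optimizer.step() would then adjust
# the network parameters of each module in the feature extraction network
```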
  • the above-mentioned target query method is realized by a neural network, and the neural network may be trained first before the target query is performed.
  • a first sample image and a second sample image containing objects of the same category can be obtained from a training set containing multiple sample images, and this object is the target to be queried in the training process.
  • the training set may include multiple subsets, and the sample images in each subset contain objects of the same category.
• the categories may include vehicles, pedestrians, traffic lights (i.e., traffic signal lights), and the like.
• the acquired first sample image and second sample image may both include traffic lights, which serve as the target to be queried during this training. The traffic lights in the first sample image are labeled to obtain the label of the first sample image, and the traffic lights in the second sample image are labeled to obtain the label of the second sample image.
  • the training process of this embodiment is similar to the process of the target detection method of the foregoing embodiment, and the specific implementation process can refer to the introduction of the foregoing embodiment.
• during training, the first sample image and the second sample image need to contain objects of the same category, so that the neural network can learn the association between images of the same category.
• the category used in training need not be the category used later; for example, traffic lights can be used to train the neural network, while street lights can be used to test the neural network or in applying the neural network.
  • FIG. 10 is a schematic flowchart of a target detection method provided by still another embodiment of this application.
• in this embodiment, the method of testing the neural network trained in the embodiment of FIG. 9 is described in detail.
  • the method may further include:
  • test images including objects of the same category may be pre-formed into a test image set, and multiple test image sets may be formed into a total test set.
  • the first test image and the second test image are selected from a set of test images, and the neural network is tested through the first test image and the second test image.
  • the neural network can be tested through the first test image and the second test image containing street lights.
  • one sample can be selected as the first test image for each test category in the test image set.
  • one image is selected as the first test image for each category (a total of 20 categories).
• the test images are then input into the model shown in FIG. 2 or FIG. 5 for evaluation, where the test images in a test data pair contain targets of the same type.
  • the test may be performed after 100 trainings, or the test may be performed after 120 trainings.
• the target detection method of this embodiment can still detect such targets accurately.
• the method of randomly selecting test data pairs in the embodiments of the present application can also reduce the task's strong dependence on samples, accurately detect categories whose samples are difficult to collect in actual application scenarios, avoid the problem of uneven category selection caused by traditional randomly selected test pairs, and solve the problem of floating evaluation indicators caused by the varying quality of support samples. For example, in the target detection task in automatic driving, a target category for which the scene does not provide a large number of training samples can still be accurately detected.
  • FIG. 11 is a schematic flowchart of a smart driving method provided by an embodiment of the application. As shown in Figure 11, the method may include:
  • S1103 Control the smart driving device that collects road images according to the query result.
  • the smart driving device may include an autonomous vehicle, a vehicle equipped with an Advanced Driving Assistant System (ADAS), a robot, and the like.
• the road image is used as the above-mentioned second image, and the support image is used as the above-mentioned first image; the intelligent driving device is then controlled according to the target detection result.
• the control may include controlling intelligent driving equipment such as an autonomous vehicle or robot to perform operations such as deceleration, braking, and steering, or sending instructions such as deceleration, braking, and steering to the driver of an ADAS-equipped vehicle. For example, if the query result shows that the traffic light in front of the smart driving device is red, the smart driving device is controlled to decelerate and stop; if the query result shows that there is a pedestrian in front of the smart driving device, the smart driving device is controlled to brake.
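a hedged sketch of how a query result might drive the control decision; the category strings and command names are assumptions for illustration, not part of the application:

```python
def control_command(query_result):
    """Map a target-query result to a control command for the smart
    driving device. Strings here are hypothetical examples."""
    if query_result == "red_light":
        return "decelerate_and_stop"   # red traffic light ahead
    if query_result == "pedestrian":
        return "brake"                 # pedestrian ahead
    return "continue"                  # no relevant target detected
```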
  • FIG. 12 is a schematic diagram of a target detection process provided by an embodiment of this application.
• the first image is input to the first convolutional neural network to obtain multiple first feature maps of different scales of the first image, and the second image is input to the second convolutional neural network to obtain multiple second feature maps of different scales of the second image.
  • the second feature map of the second image, the first feature map of the first image, and the label information of the first image are input to the generating module to obtain similarity maps of multiple scales.
  • the similarity maps of multiple scales are input to the aggregation module to obtain the integrated similarity map.
  • Input the integrated similarity map to the third convolutional neural network to obtain the semantic segmentation map of the second image, so as to realize the target detection of the second image.
  • FIG. 13 is a schematic diagram of a generation module and an aggregation module provided by an embodiment of the application.
  • conv represents the convolutional layer
  • pool represents the pooling process.
• the feature map of the first image is input to the first convolution channel of the generating module 131 to obtain multiple first feature maps of different scales, which are then multiplied and pooled with the label information of the first image to obtain multiple feature vectors of different scales of the first image.
• the feature map of the second image is input to the second convolution channel of the generating module 131 to obtain a plurality of second feature maps of different scales.
  • Multiple feature maps of different scales of the second image are respectively multiplied element by element with feature vectors of corresponding scales to obtain multiple similarity maps of different scales.
  • the generating module 131 outputs multiple similarity maps of different scales to the aggregation module 132, and the aggregation module 132 integrates the multiple similarity maps of different scales, and outputs the integrated similarity maps.
• FIG. 14 is a schematic diagram comparing the similarity feature extraction method in the target detection method provided by an embodiment of the application with similarity feature extraction through inner product or cosine distance.
  • the left part of the figure is a schematic diagram of similarity features extracted by inner product or cosine distance.
  • the right part of the figure is a schematic diagram of extracting similarity features by multiplying the vectors of corresponding pixel positions.
• the method proposed in the embodiment of the present application uses element-wise multiplication to change the output similarity map from single-channel to multi-channel, which retains the channel information of the similarity, and can be combined with subsequent convolution and nonlinear operations to express similarity features more reasonably, thereby further improving the accuracy of target detection.
  • FIG. 15 is a schematic structural diagram of a target detection device provided by an embodiment of the application.
  • the target detection device provided by the embodiment of the present application can execute the processing flow provided in the embodiment of the target detection method.
  • the target detection device 150 provided in this embodiment includes: a feature extraction module 151 and a determination module 152;
• the feature extraction module 151 is configured to perform feature extraction of multiple different scales on the first image and the second image to obtain multiple first feature maps of different scales and multiple second feature maps of different scales.
• the determining module 152 is configured to determine the target to be queried in the second image according to the first feature maps of multiple different scales, the label of the first image, and the second feature maps of the corresponding scales; the label of the first image is the result of labeling the target to be queried contained in the first image.
• when the feature extraction module 151 performs feature extraction of multiple different scales on the first image and the second image respectively to obtain multiple first feature maps of different scales and multiple second feature maps of different scales, it specifically includes: performing feature extraction on the first image and the second image respectively to obtain a first feature map and a second feature map; and performing multiple scale transformations on the first feature map and the second feature map respectively to obtain multiple first feature maps of different scales and multiple second feature maps of different scales.
• when the feature extraction module 151 performs multiple scale transformations on the first feature map and the second feature map respectively, it specifically includes: performing down-sampling on the first feature map and the second feature map at least twice, respectively.
  • the determining module 152 determines the target to be queried in the second image according to multiple first feature maps of different scales and labels of the first image, and second feature maps of corresponding scales, it specifically includes: The first feature maps of different scales and the labels of the first image determine multiple first feature vectors of different scales; the multiple first feature vectors of different scales and the second feature maps of corresponding scales are combined according to a preset calculation rule The calculation is performed to obtain the calculation result; the mask image of the second image is determined according to the calculation result; the target to be queried in the second image is determined according to the mask image.
• when the determining module 152 determines the target to be queried in the second image according to the first feature maps of multiple different scales, the label of the first image, and the second feature maps of corresponding scales, it may also specifically include: using the first feature maps of different scales, the label of the first image, and the second feature maps of the corresponding scales as the guidance information of the third feature maps of the corresponding scales to determine the target to be queried in the second image; wherein the third feature map is determined according to the second image, and the second feature map and the third feature map of the same scale are different.
• when the determining module 152 uses the first feature maps of multiple different scales, the label of the first image, and the second feature maps of corresponding scales as the guidance information of the third feature maps of corresponding scales to determine the target to be queried in the second image, it specifically includes: determining a plurality of first feature vectors of different scales according to the first feature maps of different scales and the label of the first image; calculating the first feature vectors of different scales with the second feature maps of corresponding scales according to preset calculation rules to obtain multiple mask images of different scales; and determining the target to be queried in the second image according to the result of multiplying the multiple mask images of different scales with the third feature maps of the corresponding scales.
  • the preset calculation rules include: inner product calculation rules, or cosine distance calculation rules.
  • the determining module 152 determines the target to be queried in the second image according to multiple first feature maps of different scales and label information of the first image, and second feature maps of corresponding scales, which specifically includes: The first feature map of different scales, the label information of the first image, and the second feature map of the corresponding scale determine multiple similarity maps of different scales; a similarity map of one scale represents the first feature map and the second feature of the scale Similarity of the graphs; integrate multiple similarity graphs of different scales to obtain an integrated similarity graph; determine the target to be queried in the second image according to the integrated similarity graph.
  • the determining module 152 determines multiple similarity maps of different scales according to multiple first feature maps of different scales, label information of the first image, and second feature maps of corresponding scales, which specifically includes: The first feature map and the label information of the first image are determined to determine multiple first feature vectors of different scales; the multiple first feature vectors of different scales and the second feature maps of corresponding scales are multiplied element by element to obtain multiple Similarity graphs of different scales.
• the determining module 152 integrates multiple similarity maps of different scales to obtain an integrated similarity map, which specifically includes: up-sampling the multiple similarity maps of different scales to obtain multiple similarity maps of the same scale; and adding the multiple similarity maps of the same scale to obtain the integrated similarity map.
• alternatively, the determining module 152 integrates a plurality of similarity maps of different scales to obtain an integrated similarity map, which specifically includes: forming a similarity map set from the plurality of similarity maps of different scales; up-sampling the similarity map of the smallest scale to obtain a similarity map of the same scale as the second-smallest similarity map; adding the obtained similarity map to the second-smallest similarity map to obtain a new similarity map; forming a new similarity map set from the similarity maps that have not been up-sampled or added, together with the new similarity map; and repeating the up-sampling step and the adding step until the last similarity map is obtained, which is the integrated similarity map.
  • the determining module 152 is further configured to: multiply a plurality of similarity maps of different scales and a third feature map of corresponding scales element by element to obtain a plurality of processed similarity maps of different scales; wherein, the third The feature map is determined according to the second image, and the first feature map and the third feature map of the same scale are different; the processed similarity maps of different scales are integrated to obtain an integrated similarity map.
  • the target detection device is implemented by a neural network
• the device further includes: a training module 153, configured to obtain the neural network by training with the following steps: performing feature extraction of multiple different scales on the first sample image and the second sample image respectively to obtain multiple fourth feature maps of different scales and multiple fifth feature maps of different scales, wherein the first sample image and the second sample image both contain objects of the first category; determining the objects of the first category in the second sample image according to the fourth feature maps of different scales, the label of the first sample image, and the fifth feature maps of the corresponding scales, the label of the first sample image being the result of labeling the objects of the first category contained in the first sample image; and adjusting the network parameters of the neural network according to the difference between the determined objects of the first category in the second sample image and the label of the second sample image, the label of the second sample image being the result of labeling the objects of the first category contained in the second sample image.
• the device further includes: a testing module 154 for testing the trained neural network. The testing module specifically uses the following steps to test the trained neural network: performing feature extraction of multiple different scales on the first test image and the second test image respectively to obtain multiple first test feature maps of different scales and multiple second test feature maps of different scales, wherein the first test image and the second test image are derived from a test image set, and each test image in the test image set includes objects of the same category; and determining the target to be queried in the second test image according to the first test feature maps of multiple different scales, the label of the first test image, and the second test feature maps of corresponding scales; the label of the first test image is the result of labeling the target to be queried contained in the first test image.
  • the target detection device provided in the embodiments of the present application can be used to implement the above target detection method embodiments; its implementation principles and technical effects are similar and are not repeated here.
  • FIG. 16 is a schematic structural diagram of a smart driving device provided by an embodiment of the application.
  • the intelligent driving device 160 provided in this embodiment includes: an acquisition module 161, a query module 162, and a control module 163. The acquisition module 161 is configured to collect road images; the query module 162 is configured to query the collected road images for the target to be queried according to a support image and the label of the support image, by using the target detection method provided in the embodiments of the present application, where the label of the support image is the result of labeling the target contained in the support image that belongs to the same category as the target to be queried; and the control module 163 is configured to control, according to the query result, the intelligent driving device that collects the road images.
  • for the implementation of the smart driving device provided in the embodiment of the present application, reference may be made to the foregoing smart driving method; the implementation principles and technical effects are similar and are not repeated here.
  • FIG. 17 is a schematic diagram of the hardware structure of a target detection device provided by an embodiment of the application.
  • the target detection device provided in the embodiment of the present application can execute the processing flow provided in the embodiment of the target detection method.
  • the target detection device 170 provided in this embodiment includes: at least one processor 171 and a memory 172.
  • the target detection device 170 also includes a communication component 173. The processor 171, the memory 172, and the communication component 173 are connected by a bus 174.
  • At least one processor 171 executes the computer-executable instructions stored in the memory 172, so that the at least one processor 171 executes the above target detection method.
  • FIG. 18 is a schematic diagram of the hardware structure of a smart driving device provided by an embodiment of the application.
  • the smart driving device provided in the embodiment of the present application can execute the processing flow provided in the smart driving method embodiment.
  • the smart driving device 180 provided in this embodiment includes: at least one processor 181 and a memory 182.
  • the smart driving device 180 also includes a communication component 183. The processor 181, the memory 182, and the communication component 183 are connected by a bus 184.
  • At least one processor 181 executes the computer-executable instructions stored in the memory 182, so that the at least one processor 181 executes the above intelligent driving method.
  • the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in this application may be directly performed by a hardware processor, or performed by a combination of hardware and software modules in the processor.
  • the memory may include a high-speed RAM, and may also include a non-volatile memory (NVM), such as at least one disk memory.
  • the bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on.
  • the buses in the drawings of this application are not limited to only one bus or one type of bus.
  • the embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the target detection method or the intelligent driving method are implemented.
  • an embodiment of the present application further provides a chip for executing instructions.
  • the chip includes a memory and a processor.
  • the memory stores code and data.
  • the memory is coupled with the processor.
  • the processor runs the code in the memory so that the chip is used to execute the steps of the above-mentioned target detection method or smart driving method.
  • the embodiment of the present application further provides a program product containing instructions which, when the program product runs on a computer, cause the computer to execute the steps of the above target detection method or smart driving method.
  • the embodiment of the present application further provides a computer program which, when executed by a processor, executes the steps of the above target detection method or smart driving method.
  • the disclosed device and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the above-mentioned integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium.
  • the above-mentioned software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

Target detection and intelligent driving methods and apparatuses, a device, and a storage medium. The target detection method comprises: performing feature extraction at multiple different scales on a first image and a second image respectively, to obtain first feature maps of multiple different scales and second feature maps of multiple different scales (S101); and determining a target to be queried in the second image according to the first feature maps of multiple different scales, a label of the first image, and the second feature maps of the corresponding scales, the label of the first image being the result of labeling the target to be queried comprised in the first image (S102). Expressing the first image and the second image as features of multiple different scales improves the feature expression capability of the two images, thereby improving the accuracy of target detection.

Description

Target detection and intelligent driving methods, apparatuses, devices, and storage media

This application claims priority to the Chinese invention patent application filed with the Chinese Patent Office on October 31, 2019 under application No. 201911054823.1 and entitled "Target detection, intelligent driving method, device, equipment and storage medium", and to the Chinese invention patent application filed with the Chinese Patent Office on October 31, 2019 under application No. 201911063316.4 and entitled "Target query method, device, equipment and storage medium", both of which are incorporated into this application by reference in their entirety.
Technical field

This application relates to the field of image processing, and in particular to target detection and intelligent driving methods, apparatuses, devices, and storage media.

Background

Single-sample semantic segmentation is an emerging problem in the fields of computer vision and intelligent image processing. It aims to give a segmentation model the ability to recognize the pixels of a category from a single training sample of that category. Single-sample semantic segmentation can effectively reduce the sample collection and annotation costs of traditional image semantic segmentation.

Single-sample image semantic segmentation aims to enable a segmentation model to recognize all pixels of an object of a certain category after training on only a single sample of that category. A target query can locate the target contained in an image by means of image semantic segmentation, and image semantic segmentation includes single-sample image semantic segmentation. Traditional image semantic segmentation requires a large number of training images for every object category to guarantee model performance, which incurs extremely high annotation costs.
Summary

The purpose of this application is to provide target detection and intelligent driving methods, apparatuses, devices, and storage media, so as to solve the existing technical problem of low target detection accuracy.

To solve the above technical problem, the technical solution of this application is implemented as follows:

In one embodiment, a target detection method is provided, including: performing feature extraction at multiple different scales on a first image and a second image respectively, to obtain multiple first feature maps of different scales and multiple second feature maps of different scales; and determining a target to be queried in the second image according to the multiple first feature maps of different scales, a label of the first image, and the second feature maps of the corresponding scales, the label of the first image being the result of labeling the target to be queried contained in the first image.

In another embodiment, an intelligent driving method is provided, including: collecting a road image; querying the collected road image for a target to be queried according to a support image and a label of the support image, by using the target detection method described above, where the label of the support image is the result of labeling the target contained in the support image that belongs to the same category as the target to be queried; and controlling, according to the query result, the intelligent driving device that collects the road image.
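The collect-query-control cycle of the intelligent driving method can be illustrated with the following stub pipeline. Every name here (QueryResult, control_decision, drive_step, the detector callable, and the 5% area threshold) is hypothetical; the detector stands in for the target detection method, and the policy is a toy decision rule, not the claimed control logic:

```python
from dataclasses import dataclass

@dataclass
class QueryResult:
    found: bool          # whether the target to be queried appears in the road image
    area_ratio: float    # fraction of road-image pixels classified as the target

def control_decision(result: QueryResult, slow_down_ratio: float = 0.05) -> str:
    """Toy control policy: decelerate when the queried target (e.g. a pedestrian)
    occupies a noticeable part of the road image, otherwise keep cruising."""
    if result.found and result.area_ratio >= slow_down_ratio:
        return "decelerate"
    return "cruise"

def drive_step(road_image, support_image, support_label, detector) -> str:
    """One perception-decision cycle: query the road image for the target
    described by (support_image, support_label), then derive a control action."""
    result = detector(road_image, support_image, support_label)
    return control_decision(result)
```

A real system would replace `detector` with the neural-network target detection method and map the decision onto actuator commands.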
In another embodiment, a target detection apparatus is provided, including a feature extraction module and a determination module. The feature extraction module is configured to perform feature extraction at multiple different scales on a first image and a second image respectively, to obtain multiple first feature maps of different scales and multiple second feature maps of different scales. The determination module is configured to determine a target to be queried in the second image according to the multiple first feature maps of different scales, a label of the first image, and the second feature maps of the corresponding scales, the label of the first image being the result of labeling the target to be queried contained in the first image.

In another embodiment, an intelligent driving apparatus is provided, including: an acquisition module configured to collect road images; a query module configured to query the collected road images for a target to be queried according to a support image and a label of the support image, by using the target detection method described above, where the label of the support image is the result of labeling the target contained in the support image that belongs to the same category as the target to be queried; and a control module configured to control, according to the query result, the intelligent driving device that collects the road images.

In another embodiment, a target detection device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the target detection method described above when executing the program.

In another embodiment, a smart driving device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the smart driving method described above when executing the program.

In another embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the steps of the target detection method or the steps of the smart driving method.

In yet another embodiment, a chip for running instructions is provided. The chip includes a memory and a processor; the memory stores code and data and is coupled with the processor; the processor runs the code in the memory so that the chip executes the steps of the target detection method described above, or the steps of the smart driving method described above.

In yet another embodiment, a program product containing instructions is provided; when the program product runs on a computer, the computer executes the steps of the target detection method described above, or the steps of the smart driving method described above.

In yet another embodiment, a computer program is provided; when executed by a processor, the computer program executes the steps of the target detection method described above, or the steps of the smart driving method described above.
As can be seen from the above technical solutions, in the above embodiments, obtaining first feature maps and second feature maps of different scales improves the feature expression capability of the first image and the second image, so that more information for judging the similarity between the first image and the second image can be obtained. Subsequent target detection therefore has richer feature input when facing a single sample, which improves the segmentation accuracy of single-sample semantic segmentation and thus the target detection accuracy.
Description of the drawings

The following drawings only schematically illustrate and explain the application, and do not limit its scope:

FIG. 1 is a flowchart of a target detection method provided by an embodiment of the application;

FIG. 2 is a schematic structural diagram of a target detection model provided by an embodiment of the application;

FIG. 3 is a flowchart of a target detection method provided by an embodiment of the application;

FIG. 4 is a schematic structural diagram of a symmetric cascade structure provided by an embodiment of the application;

FIG. 5 is a flowchart of a target detection method provided by an embodiment of the application;

FIG. 6 is a schematic structural diagram of a target detection model provided by another embodiment of the application;

FIG. 7 is a schematic flowchart of a target query method provided by another embodiment of the application;

FIG. 8 is a schematic flowchart of a target query method provided by another embodiment of the application;

FIG. 9 is a schematic flowchart of a target query method provided by still another embodiment of the application;

FIG. 10 is a schematic flowchart of a target query method provided by yet another embodiment of the application;

FIG. 11 is a schematic flowchart of a smart driving method provided by an embodiment of the application;

FIG. 12 is a schematic diagram of a target detection process provided by an embodiment of the application;

FIG. 13 is a schematic diagram of a generation module and an aggregation module provided by an embodiment of the application;

FIG. 14 is a schematic comparison of the similarity feature extraction manner in the target query method provided by an embodiment of the application with the extraction manner in the related art;

FIG. 15 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the application;

FIG. 16 is a schematic structural diagram of a smart driving apparatus provided by an embodiment of the application;

FIG. 17 is a schematic structural diagram of a target detection device provided by an embodiment of the application;

FIG. 18 is a schematic structural diagram of a smart driving device provided by an embodiment of the application.
Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them.
In the prior art, a deep learning model for single-sample image semantic segmentation performs feature extraction on a query set image and a support set image separately, where the query set image is the image on which a target query needs to be performed, the support set image contains the target to be queried, and the target to be queried in the support set image is labeled in advance to obtain label information. Combining the label information, the target in the query set image is determined through the similarity between the features of the support set image and the features of the query set image.

However, in the prior art, the deep learning model expresses the support set image as a single feature vector, which limits the feature expression capability for the support set image. This in turn leaves the model insufficiently able to describe the similarity between the support set image features and the pixel features of the query image, resulting in low target query accuracy.
In the embodiments of the present application, the first image may be the above support set image, and the second image may be the above query set image. By performing feature extraction at multiple different scales on the first image and the second image, the two images are expressed as features of multiple different scales, which improves their feature expression capability. More information for judging the similarity between the first image and the second image can thus be obtained, improving the accuracy of the target query.

The technical solutions of the present application and how they solve the above technical problems are described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present application are described below in conjunction with the accompanying drawings.
FIG. 1 is a flowchart of a target detection method provided by an embodiment of the application. In view of the above technical problems in the prior art, the embodiments of the present application provide a target detection method with the following specific steps:

Step 101: Perform feature extraction at multiple different scales on a first image and a second image respectively, to obtain multiple first feature maps of different scales and multiple second feature maps of different scales.

In this embodiment, the second image is the image on which a target query needs to be performed; through the target query, the pixel region where the target to be queried is located in the second image can be detected. The target to be queried can be determined according to the actual situation and may be, for example, an animal, a plant, a person, a vehicle, etc., which is not limited here. The label information may be contour information, pixel information, etc. of the target to be queried in the first image, which is also not limited here. Optionally, the label information may be a binarized label, in which the pixel values of the region where the target is located differ from those of the other regions in the image.

The target detection method of this embodiment can be applied to the target detection process of a vehicle; the vehicle may be an autonomous vehicle or a vehicle equipped with an Advanced Driver Assistance System (ADAS). It is understandable that the target detection method can also be applied to robots. Taking a vehicle as an example, the first image and the second image may be acquired by an image acquisition device on the vehicle, and the image acquisition device may be a camera, such as a monocular camera or a binocular camera.
In this embodiment, a feature extraction algorithm can be used to perform feature extraction at multiple different scales on the first image to obtain multiple first feature maps of different scales, and on the second image to obtain multiple second feature maps of different scales. The feature extraction algorithm may be a CNN (Convolutional Neural Network) algorithm, an LBP (Local Binary Pattern) algorithm, a SIFT (Scale-Invariant Feature Transform) algorithm, a HOG (Histogram of Oriented Gradients) algorithm, etc., which is not limited here.

In this embodiment, when the feature extraction algorithm is a CNN algorithm, the target detection method of this embodiment can be applied to the target detection model shown in FIG. 2. As shown in FIG. 2, the target detection model 20 includes a feature extraction network 21, a scale transformation module 22, and a convolution network 23. The feature extraction network 21 is a neural network and may adopt an existing network architecture, such as a VGG (Visual Geometry Group) network, a ResNet network, or another general image feature extraction network. For example, the first image and the second image can be input into the feature extraction network 21 simultaneously for feature extraction at multiple different scales; alternatively, two feature extraction networks 21 with the same network architecture and the same network parameters can be set up, and the first image and the second image are input into the two networks respectively for feature extraction at multiple different scales. For example, multiple different scales can be specified in advance; for each scale, feature extraction at that scale is performed on the first image and the second image respectively, to obtain the first feature map and the second feature map of that scale.
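The multi-scale extraction described above can be sketched as follows. This is an illustrative substitute, not the embodiment's network: 2x2 average pooling stands in for the learned feature extraction network 21 (VGG/ResNet in the embodiment), and what the sketch shows is only the production of feature maps at several scales by one shared function applied identically to both images (mirroring the two weight-shared networks):

```python
def avg_pool2(img):
    """Halve spatial resolution by 2x2 average pooling (img: H x W list of lists,
    H and W even). Stands in for one downsampling stage of a real CNN."""
    h, w = len(img), len(img[0])
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1] +
              img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w // 2)] for i in range(h // 2)]

def multi_scale_features(img, num_scales=3):
    """Return feature maps at `num_scales` different scales for one image.
    Applying this same function to the first and the second image mirrors
    the shared-parameter (Siamese) setup described above."""
    feats, cur = [], [row[:] for row in img]
    for _ in range(num_scales):
        feats.append(cur)
        cur = avg_pool2(cur)
    return feats
```

Because the function (and, in a real model, the network weights) is identical for both inputs, identical images yield identical feature pyramids.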
Step 102: Determine the target to be queried in the second image according to the multiple first feature maps of different scales, the label of the first image, and the second feature maps of the corresponding scales; the label of the first image is the result of labeling the target to be queried contained in the first image.

In this embodiment, for the first feature map and the second feature map of each scale, the label information of the first image can be combined to obtain a similarity map characterizing the similarity between the first feature map and the second feature map of that scale. The target to be queried in the second image can then be determined through the similarity maps of the different scales.

In this embodiment, feature extraction at multiple different scales is performed on the first image and the second image respectively, to obtain multiple first feature maps of different scales and multiple second feature maps of different scales; the target to be queried in the second image is determined according to the multiple first feature maps of different scales, the label of the first image, and the second feature maps of the corresponding scales, the label of the first image being the result of labeling the target to be queried contained in the first image. Since first and second feature maps of different scales are obtained, the feature expression capability of the first image and the second image is improved, so that more information for judging the similarity between the two images can be obtained. Subsequent target detection therefore has richer feature input when facing a single sample, which improves the segmentation accuracy of single-sample semantic segmentation and thus the target detection accuracy.
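The per-scale similarity maps of Step 102 can be combined as sketched below. Nearest-neighbour upsampling followed by averaging and thresholding is an assumed fusion rule for illustration only; in the embodiment the actual combination is left to the network (e.g. the convolution network 23):

```python
def upsample_nearest(m, factor):
    """Nearest-neighbour upsampling of a 2-D similarity map (list of lists)."""
    out = []
    for row in m:
        wide = [v for v in row for _ in range(factor)]   # repeat columns
        out.extend([wide[:] for _ in range(factor)])     # repeat rows
    return out

def fuse_similarity_maps(sim_maps, out_size):
    """Bring every per-scale similarity map to the output resolution and average
    them into one fused map (assumed fusion rule)."""
    fused = [[0.0] * out_size for _ in range(out_size)]
    for m in sim_maps:
        up = upsample_nearest(m, out_size // len(m))
        for i in range(out_size):
            for j in range(out_size):
                fused[i][j] += up[i][j] / len(sim_maps)
    return fused

def predict_target_mask(fused, threshold=0.5):
    """Pixels whose fused similarity exceeds the threshold are taken as the
    target to be queried in the second image (threshold is illustrative)."""
    return [[v > threshold for v in row] for row in fused]
```

The key point mirrored from the text: several scales contribute evidence, and the final per-pixel decision is made on the fused map.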
In the embodiments of the present application, if the first image contains a target of the same category as the target to be queried, the pose, texture, color, and other attributes of that target in the first image may differ from those of the corresponding target in the second image. For example, if the target to be queried is a traffic light and the traffic light contained in the first image is arranged vertically, then a traffic light contained in the second image may be arranged horizontally, and the states of the traffic lights in the first image and the second image may be inconsistent.
As shown in FIG. 3, performing feature extraction at multiple different scales on the first image and the second image respectively to obtain multiple first feature maps of different scales and multiple second feature maps of different scales includes the following steps:
Step 301: Perform feature extraction on the first image and the second image respectively to obtain a first feature map and a second feature map.
As shown in FIG. 2, the feature extraction network 21 includes a first convolution module 211, a second convolution module 212, and a third convolution module 213, where the first convolution module 211 includes three convolutional layers connected in sequence, and the second convolution module 212 and the third convolution module 213 each include one convolutional layer.
For example, the first image and the second image may both be input into the first convolution module 211 shown in FIG. 2, which outputs a feature extraction result for each image. These results are then input into the second convolution module 212, which outputs a further feature extraction result for each image. The outputs of the second convolution module 212 are in turn input into the third convolution module 213, which continues feature extraction and finally outputs the feature extraction result of the first image and the feature extraction result of the second image, namely the first feature map and the second feature map, respectively.
Step 302: Perform multiple scale transformations on the first feature map and the second feature map respectively to obtain multiple first feature maps of different scales and multiple second feature maps of different scales.
As shown in FIG. 2, the first feature map and the second feature map are respectively input into the scale transformation module 22, which performs multiple scale transformations on each of them, so that the first image and the second image are each expressed as multiple feature maps of different sizes.
Optionally, performing multiple scale transformations on the first feature map and the second feature map respectively includes: performing down-sampling at least twice on each of the first feature map and the second feature map.
Optionally, performing down-sampling at least twice on each of the first feature map and the second feature map includes: down-sampling the first feature map and the second feature map at a first sampling rate, to obtain a first feature map down-sampled by a first factor relative to the first image and a second feature map down-sampled by the first factor relative to the second image; and then down-sampling, at a second sampling rate, the first feature map down-sampled by the first factor relative to the first image and the second feature map down-sampled by the first factor relative to the second image, to obtain a first feature map down-sampled by a second factor relative to the first image and a second feature map down-sampled by the second factor relative to the second image, where the second factor is greater than the first factor.
For example, the first feature map is down-sampled at the first sampling rate to obtain a first feature map down-sampled by the first factor relative to the first image; this result is then down-sampled at the second sampling rate to obtain a first feature map down-sampled by the second factor relative to the first image, where the second factor is greater than the first factor. Similarly, the second feature map is down-sampled at the first sampling rate to obtain a second feature map down-sampled by the first factor relative to the second image; this result is then down-sampled at the second sampling rate to obtain a second feature map down-sampled by the second factor relative to the second image.
Optionally, after the first feature map and the second feature map are down-sampled at the first sampling rate to obtain the first feature map down-sampled by the first factor relative to the first image and the second feature map down-sampled by the first factor relative to the second image, the method of the embodiments of the present application further includes: down-sampling, at a third sampling rate, the first feature map down-sampled by the second factor relative to the first image and the second feature map down-sampled by the second factor relative to the second image, to obtain a first feature map down-sampled by a third factor relative to the first image and a second feature map down-sampled by the third factor relative to the second image, where the third factor is greater than the second factor. Optionally, the first factor, the second factor, and the third factor are 8, 16, and 32, respectively.
In an optional example, the scale transformation module 22 may adopt a symmetric cascade structure. As shown in FIG. 4, the symmetric cascade structure includes two cascade structures arranged symmetrically with each other, where each cascade structure includes three sampling units connected in sequence. For ease of understanding, the two cascade structures are hereinafter referred to as a first cascade structure 41 and a second cascade structure 42; the three sampling units included in the first cascade structure are referred to as a first sampling unit, a second sampling unit, and a third sampling unit, and the three sampling units included in the second cascade structure are referred to as a fourth sampling unit, a fifth sampling unit, and a sixth sampling unit. The sampling rates of the first sampling unit and the fourth sampling unit are the same, the sampling rates of the second sampling unit and the fifth sampling unit are the same, and the sampling rates of the third sampling unit and the sixth sampling unit are the same. For example, the first sampling unit and the fourth sampling unit sample the first feature map and the second feature map at the first sampling rate respectively, thereby outputting a first feature map and a second feature map down-sampled by a factor of 8 relative to the first image and the second image; the second sampling unit and the fifth sampling unit continue sampling the outputs of the first sampling unit and the fourth sampling unit at the second sampling rate respectively, thereby outputting a first feature map and a second feature map down-sampled by a factor of 16 relative to the first image and the second image; and the third sampling unit and the sixth sampling unit continue sampling the outputs of the second sampling unit and the fifth sampling unit at the third sampling rate respectively, thereby outputting a first feature map and a second feature map down-sampled by a factor of 32 relative to the first image and the second image.
In this embodiment, the symmetric cascade structure shown in FIG. 4 may be used to perform multiple scale transformations on the first feature map and the second feature map respectively. For example, when the first cascade structure 41 is used to transform the first feature map to different scales, the first feature map is passed in sequence through the first sampling unit, the second sampling unit, and the third sampling unit, which perform down-sampling at different sampling rates, thereby outputting first feature maps down-sampled by factors of 8, 16, and 32 relative to the size of the first image. Similarly, when the second cascade structure 42 is used to transform the second feature map to different scales, the second feature map is passed in sequence through the fourth sampling unit, the fifth sampling unit, and the sixth sampling unit, which perform down-sampling at different sampling rates, thereby outputting second feature maps down-sampled by factors of 8, 16, and 32 relative to the size of the second image.
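As a rough illustration of the cascade described above, the following NumPy sketch uses strided slicing as a stand-in for the sampling units (in the patent these are learned network layers, and the 256×256×64 input size is an assumption chosen only for illustration):

```python
import numpy as np

def downsample(feat, stride):
    """Down-sample an (H, W, C) feature map by keeping every `stride`-th
    pixel -- a crude stand-in for a learned sampling unit."""
    return feat[::stride, ::stride, :]

def cascade(feat):
    """Cascade of three sampling units: each unit further down-samples the
    previous unit's output, yielding 8x, 16x and 32x maps overall."""
    f8 = downsample(feat, 8)   # first sampling unit:  8x overall
    f16 = downsample(f8, 2)    # second sampling unit: 16x overall
    f32 = downsample(f16, 2)   # third sampling unit:  32x overall
    return f8, f16, f32

# The two cascade structures use the same sampling rates, so the same
# function is applied to both the first and the second feature map.
first = np.random.rand(256, 256, 64)
second = np.random.rand(256, 256, 64)
for f in (first, second):
    s8, s16, s32 = cascade(f)
    print(s8.shape, s16.shape, s32.shape)  # (32, 32, 64) (16, 16, 64) (8, 8, 64)
```

The symmetry of the two cascade structures is what allows first and second feature maps of matching scales to be compared later.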
It should be understood that the first cascade structure 41 and the second cascade structure 42 described above may also each be a two-level cascade structure; for example, the first cascade structure 41 and the second cascade structure 42 may each include two sampling units connected in sequence.
Optionally, determining the target to be queried in the second image according to the multiple first feature maps of different scales, the label of the first image, and the second feature maps of the corresponding scales includes: determining multiple first feature vectors of different scales according to the multiple first feature maps of different scales and the label of the first image; computing the multiple first feature vectors of different scales against the second feature maps of the corresponding scales according to a preset calculation rule to obtain a calculation result; determining a mask image of the second image according to the calculation result; and determining the target to be queried in the second image according to the mask image. Optionally, the preset calculation rule includes an inner-product calculation rule or a cosine-distance calculation rule. Here, the label of the first image refers to information indicating the category of a target or object in the image.
Taking the inner product as the preset calculation rule as an example, as shown in FIG. 2, the first feature map of each scale and the label of the first image can together form a feature vector. For example, the first feature maps down-sampled by factors of 8, 16, and 32 relative to the first image are each combined with the label of the first image through an interpolation operation to form a feature vector, hereinafter referred to as the first feature vector, the second feature vector, and the third feature vector. An inner-product operation is then performed between the first feature vector and the second feature map down-sampled by a factor of 8 relative to the second image, between the second feature vector and the second feature map down-sampled by a factor of 16 relative to the second image, and between the third feature vector and the second feature map down-sampled by a factor of 32 relative to the second image, yielding three probability maps of different scales. The sizes of the three probability maps are the same as those of the first feature vector, the second feature vector, and the third feature vector, respectively; equivalently, they are the same as those of the first or second feature maps down-sampled by factors of 8, 16, and 32 relative to the first or second image. After that, the three probability maps are input into the convolutional network 23, which concatenates them and convolves the concatenated result, thereby outputting the mask image of the second image and achieving target detection on the second image.
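The per-pixel inner product between a support feature vector and a query feature map can be sketched as follows (a minimal NumPy illustration; the channel count and map sizes are hypothetical, and in the patent these maps are produced inside a trained network):

```python
import numpy as np

def probability_map(support_vec, query_feat):
    """Per-pixel inner product between a support feature vector of shape (C,)
    and a query feature map of shape (H, W, C): one scalar per pixel."""
    return np.einsum("hwc,c->hw", query_feat, support_vec)

# Hypothetical sizes for the three scales (8x, 16x, 32x of a 256x256 image).
C = 64
vecs = [np.random.rand(C) for _ in range(3)]
feats = [np.random.rand(32, 32, C),
         np.random.rand(16, 16, C),
         np.random.rand(8, 8, C)]
prob_maps = [probability_map(v, f) for v, f in zip(vecs, feats)]
print([p.shape for p in prob_maps])  # [(32, 32), (16, 16), (8, 8)]
```

Each probability map has the spatial size of its scale's feature map, matching the text's observation that the probability maps share the sizes of the down-sampled feature maps.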
Optionally, determining the target to be queried in the second image according to the multiple first feature maps of different scales, the label of the first image, and the second feature maps of the corresponding scales includes: using the multiple first feature maps of different scales, the label of the first image, and the second feature maps of the corresponding scales as guidance information for third feature maps of the corresponding scales, to determine the target to be queried in the second image, where the third feature maps are determined according to the second image, and, for the same scale, the second feature map and the third feature map are different. Compared with the above embodiment, this embodiment adds a process of using the third feature maps to guide the inner-product results of different scales obtained in the above embodiment, thereby further improving the accuracy of subsequent target detection. The third feature maps may be extracted by a feature extraction network other than the feature extraction network 21 shown in FIG. 2; the network architecture and network parameters of that feature extraction network differ from those used for the first and second feature maps, for example, in the convolution kernels.
FIG. 5 is a flowchart of a target detection method provided by another embodiment of this application. On the basis of the above embodiment, the target detection method provided in this embodiment specifically includes the following steps:
Step 501: Determine multiple first feature vectors of different scales according to the multiple first feature maps of different scales and the label of the first image.
Step 502: Compute the multiple first feature vectors of different scales against the second feature maps of the corresponding scales according to a preset calculation rule, to obtain mask images at multiple different scales.
The mask images obtained in this step serve as guidance information for the third feature maps.
Step 503: Determine the target to be queried in the second image according to the results of multiplying the mask images of the multiple different scales with the third feature maps of the corresponding scales.
In this embodiment, multiplying the mask images of multiple different scales with the third feature maps of the corresponding scales means that, at each position shared by a mask image and a third feature map of the same scale, the value of the mask image (a scalar) is multiplied by the value of the third feature map (a vector).
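The scalar-times-vector multiplication described above amounts to broadcasting the single-channel mask over the channels of the feature map. A minimal NumPy sketch, with assumed sizes:

```python
import numpy as np

def guide(mask, feat):
    """Multiply a single-channel mask (H, W) into a feature map (H, W, C):
    at every pixel, the scalar mask value scales the C-dimensional
    feature vector at the same position."""
    return mask[:, :, None] * feat

mask8 = np.random.rand(32, 32)       # mask at the 8x-down-sampled scale
feat8 = np.random.rand(32, 32, 64)   # third feature map at the same scale
guided = guide(mask8, feat8)
print(guided.shape)  # (32, 32, 64)
```

The guided map keeps the feature map's channel dimension, so the mask acts as a spatial weighting rather than replacing the features.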
The method of this embodiment can be applied to the detection model shown in FIG. 6. The detection model shown in FIG. 6 differs from the detection model shown in FIG. 2 in that some convolutional layers are added on the basis of the feature extraction network 21 shown in FIG. 2, and a third cascade structure is added on the basis of the symmetric cascade structure shown in FIG. 2. The third cascade structure has the same structure as the first cascade structure or the second cascade structure, and its implementation principle can be found in the description of the above embodiments.
As shown in FIG. 6, the detection model 60 includes a feature extraction network 61, a scale transformation module 62, and a convolutional network 63. The feature extraction network 61 includes a fourth convolution module 611, a fifth convolution module 612, a sixth convolution module 613, a seventh convolution module 614, an eighth convolution module 615, a ninth convolution module 616, and a tenth convolution module 617. The fourth convolution module 611, the fifth convolution module 612, and the sixth convolution module 613 have the same network architecture and network parameters as the first convolution module 211, the second convolution module 212, and the third convolution module 213 shown in FIG. 2; for their function and principle, reference may be made to the description of the embodiment shown in FIG. 2, and this embodiment mainly details the differences between FIG. 6 and FIG. 2. It can be seen that, on the basis of the feature extraction network 21 shown in FIG. 2, the seventh convolution module 614 is connected after the sixth convolution module 613 (the third convolution module 213 in FIG. 2), and the eighth convolution module 615, the ninth convolution module 616, and the tenth convolution module 617 are connected in sequence after the fourth convolution module 611 (the first convolution module 211 in FIG. 2). The outputs of the sixth convolution module 613 and the seventh convolution module 614 also serve as the inputs of the eighth convolution module 615 and the ninth convolution module 616, respectively, and the output of the tenth convolution module 617 serves as the input of the third cascade structure 33. The seventh convolution module 614 performs feature extraction on the outputs of the sixth convolution module 613 to obtain the first feature map and the second feature map, which are then input into the scale transformation module 62. The scale transformation module 62 has the same structure and principle as the scale transformation module 22 shown in FIG. 2, and performs transformations to different scales on the first feature map and the second feature map; at the same time, the label information of the first image is also input into the scale transformation module 62. According to the first feature maps and second feature maps of multiple different scales and the label information of the first image, the scale transformation module 62 outputs multiple mask images of different scales, mask32x, mask16x, and mask8x, which represent mask images down-sampled by factors of 32, 16, and 8 relative to the first feature map or the second feature map, respectively. The mask images mask32x, mask16x, and mask8x output by the scale transformation module 62 are then multiplied, at corresponding pixel positions, with the second feature maps output by the third cascade structure for the second image, which are down-sampled by factors of 8, 16, and 32 relative to the second image, yielding three probability maps. After that, the three probability maps are input into the convolutional network for convolution and other operations, thereby achieving target detection on the second image.
Optionally, in this embodiment, the feature map extracted by the sixth convolution module 613 may also be input directly into the third cascade structure.
Optionally, in this embodiment, the feature map for the first image and the feature map for the second image output by the sixth convolution module 613 may also be input directly into the first cascade structure and the second cascade structure, respectively.
Optionally, the first convolution module, the second convolution module, and the third convolution module shown in FIG. 2 form a standard VGG network architecture, and those skilled in the art can increase or decrease the number of convolution modules according to actual needs, on the basis of the VGG network architecture shown in FIG. 2 and the fourth, fifth, sixth, and seventh convolution modules in FIG. 6. In the embodiments of the present application, multiple first feature vectors of different scales are determined according to the multiple first feature maps of different scales and the label of the first image; the multiple first feature vectors of different scales are then computed against the second feature maps of the corresponding scales according to a preset calculation rule to obtain a calculation result; a mask image of the second image is determined according to the calculation result; and the target to be queried in the second image is determined according to the mask image. The mask images at multiple different scales can provide similarity guidance for the segmentation of the second feature maps at the corresponding scales (the mask images mask32x, mask16x, and mask8x output by the scale transformation module 62 are multiplied, at corresponding pixel positions, with the second feature maps output by the third cascade structure for the second image, which are down-sampled by factors of 8, 16, and 32 relative to the second image). In addition, taking the sixth convolution module as an example, since the output of the fifth convolution module 612 for the second image is also input into the sixth convolution module, the sixth convolution module can fuse the output of the fifth convolution module with the output for the second image and then perform feature extraction again. In this way, richer feature information can be extracted, and during back-propagation the loss fed back also carries richer information, so that the network parameters of each convolution module in the feature extraction network are better adjusted. Therefore, the detection accuracy of the detection model can be further improved in the subsequent target detection process.
FIG. 7 is a schematic flowchart of a target detection method provided by yet another embodiment of this application. This embodiment describes in detail a specific implementation process of determining the target to be queried in the second image according to the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of the corresponding scales. As shown in FIG. 7, the method includes:
S701: Perform feature extraction at multiple different scales on the first image and the second image respectively, to generate multiple first feature maps of different scales and multiple second feature maps of different scales.
In this embodiment, S701 is similar to S101 in the embodiment of FIG. 1, and will not be repeated here.
S702: Determine multiple similarity maps of different scales according to the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of the corresponding scales, where the similarity map of a given scale characterizes the similarity between the first feature map and the second feature map of that scale.
In this embodiment, the similarity map of each scale contains similarity information between the features of the first feature map and the second feature map of that scale.
Optionally, S702 may include: determining multiple first feature vectors of different scales according to the multiple first feature maps of different scales and the label information of the first image; and multiplying the multiple first feature vectors of different scales element-wise with the second feature maps of the corresponding scales to obtain multiple similarity maps of different scales.
In this embodiment, for the first feature map of each scale, a multiplication operation may be performed between the first feature map of that scale and the label information of the first image to obtain the first feature vector of that scale. The first feature vector of that scale is then multiplied element-wise with the second feature map of that scale to obtain the similarity map of that scale. In the similarity map of a given scale, the similarity between the first feature vector and the second feature map at each pixel position is expressed by a vector.
Consider, by contrast, using the inner product or the cosine distance to express the similarity between two feature maps as a single-channel similarity map, and then performing semantic segmentation on that single-channel similarity map to implement the target query. Taking the inner product as an example, computing the inner product of the two feature vectors located at the same position in the two feature maps yields one value per pixel position. Since each pixel position in the resulting similarity map corresponds to only one value, only single-channel feature information can be characterized; single-channel feature information cannot fully express the features of the support-set image, resulting in insufficient capability to describe the similarity between feature maps and thus lower accuracy of the target query. In this embodiment, similarity maps of different scales are generated by multiplying the multiple first feature vectors of different scales element-wise with the second feature maps of the corresponding scales. Replacing the inner product or the cosine distance with element-wise multiplication allows the similarity map of each scale to contain multi-channel similarity information, so that the similarity features are expressed more fully, further improving the accuracy of the target query.
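The contrast between the single-channel inner-product similarity and the multi-channel element-wise similarity can be sketched as follows (NumPy, hypothetical sizes). Note that summing the element-wise similarity over channels recovers the inner product, which illustrates why the element-wise form retains strictly more information:

```python
import numpy as np

def inner_product_similarity(support_vec, query_feat):
    """Single-channel similarity: one scalar per pixel position."""
    return np.einsum("hwc,c->hw", query_feat, support_vec)

def elementwise_similarity(support_vec, query_feat):
    """Multi-channel similarity: broadcasting the support vector over the
    query map keeps one similarity value per channel at every pixel."""
    return query_feat * support_vec  # shape (H, W, C)

vec = np.random.rand(64)
feat = np.random.rand(16, 16, 64)
print(inner_product_similarity(vec, feat).shape)  # (16, 16)
print(elementwise_similarity(vec, feat).shape)    # (16, 16, 64)
```

A channel-wise sum of the element-wise map reproduces the inner-product map, so the element-wise similarity is a strict refinement of the single-channel one.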
S703: Integrate the multiple similarity maps of different scales to obtain an integrated similarity map.
In this embodiment, the similarity maps of different scales can be converted into similarity maps of the same scale through up-sampling and then integrated to obtain the integrated similarity map. This can be achieved by either of the following two implementations, which are described separately below.
In the first implementation, S703 may include: upsampling the multiple similarity maps of different scales to obtain multiple similarity maps of the same scale; and adding the multiple similarity maps of the same scale to obtain the integrated similarity map.
In this embodiment, the multiple similarity maps of different scales may each be upsampled to the same scale and then added together to obtain the integrated similarity map. For example, suppose there are three similarity maps A, B, and C with scales m1, m2, and m3 respectively, where m1 > m2 > m3. B and C can each be upsampled so that their scales become m1, and A is then added to the upsampled B and C to obtain the integrated similarity map, whose scale is m1. Alternatively, a scale m4 with m4 > m1 can be specified; A, B, and C are each upsampled to scale m4, and the upsampled A, B, and C are then added to obtain the integrated similarity map, whose scale is m4.
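The first implementation (upsample every map to a common scale, then sum) can be sketched as below. Nearest-neighbour upsampling via `np.repeat` stands in for whatever interpolation the actual network would use, and the map sizes are assumptions for illustration:

```python
import numpy as np

def upsample_nearest(x, target):
    """Nearest-neighbour upsampling of a square (s, s) map to (target, target).
    Assumes target is an integer multiple of the input size."""
    factor = target // x.shape[0]
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

# Three single-channel similarity maps A, B, C with scales m1 > m2 > m3
A = np.ones((8, 8))   # m1 = 8
B = np.ones((4, 4))   # m2 = 4
C = np.ones((2, 2))   # m3 = 2

# Upsample B and C to m1, then add all three maps
integrated = A + upsample_nearest(B, 8) + upsample_nearest(C, 8)
print(integrated.shape)  # (8, 8)
```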
In the second implementation, S703 may include:
forming a similarity map set from the multiple similarity maps of different scales;
upsampling the smallest-scale similarity map in the set to obtain a similarity map with the same scale as the second-smallest similarity map;
adding the resulting similarity map to the second-smallest similarity map to obtain a new similarity map; and
forming a new similarity map set from the new similarity map and the similarity maps in the set that have not yet undergone upsampling or addition, and repeating the upsampling and addition steps until a final similarity map is obtained; this final similarity map is the integrated similarity map.
This implementation is illustrated with three similarity maps. Suppose there are three similarity maps A, B, and C with scales m1, m2, and m3 respectively, where m1 > m2 > m3. C is first upsampled so that its scale becomes m2, and B is added to the upsampled C to obtain a new similarity map D with scale m2. D is then upsampled so that its scale becomes m1, and A is added to the upsampled D to obtain the final integrated similarity map.
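The second, coarse-to-fine implementation can be sketched with the same hypothetical maps: repeatedly upsample the current smallest map, add it to the next-larger one, and continue until a single map remains.

```python
import numpy as np

def upsample_nearest(x, target):
    # Nearest-neighbour upsampling; assumes target is a multiple of the size
    factor = target // x.shape[0]
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

# Similarity maps sorted from largest scale (A) to smallest (C)
maps = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]  # A, B, C

# Walk from the smallest map upward: C is upsampled and added to B,
# giving D; D is upsampled and added to A, giving the integrated map.
acc = maps[-1]
for larger in reversed(maps[:-1]):
    acc = larger + upsample_nearest(acc, larger.shape[0])

print(acc.shape)  # (8, 8)
```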
S704: Determine the target to be queried in the second image according to the integrated similarity map.
In this embodiment, S704 is similar to S102 in the embodiment of FIG. 1 and is not described again here.
In this embodiment, multiple similarity maps of different scales are determined according to the first feature maps of multiple different scales, the label information of the first image, and the second feature maps of the corresponding scales; the multiple similarity maps of different scales are then integrated to obtain an integrated similarity map, and the target to be queried in the second image is determined according to the integrated similarity map. By integrating similarity maps at multiple different scales, the integrated similarity map contains multi-scale feature information, further improving the accuracy of the target query.
FIG. 8 is a schematic flowchart of a target detection method provided by another embodiment of this application. This embodiment differs from the embodiment of FIG. 7 in that, after the multiple similarity maps of different scales are determined in S702 and before they are integrated in S703, each similarity map is further multiplied element by element with the third feature map of the corresponding scale to obtain processed similarity maps of multiple different scales. As shown in FIG. 8, the method includes:
S801: Perform feature extraction at multiple different scales on the second image and the first image respectively, generating multiple first feature maps of different scales and multiple second feature maps of different scales.
In this embodiment, S801 is similar to S101 in the embodiment of FIG. 1 and is not described again here.
S802: Determine multiple similarity maps of different scales according to the first feature maps of multiple different scales, the label information of the first image, and the second feature maps of the corresponding scales; the similarity map at each scale characterizes the similarity between the first feature map and the second feature map at that scale.
In this embodiment, S802 is similar to S702 in the embodiment of FIG. 7 and is not described again here.
S803: Multiply the multiple similarity maps of different scales element by element with the third feature maps of the corresponding scales to obtain processed similarity maps of multiple different scales, where the third feature maps are determined according to the second image, and the second feature map and the third feature map at the same scale are different.
S804: Integrate the processed similarity maps of multiple different scales to obtain an integrated similarity map.
In this embodiment, S804 is similar to S704 in the embodiment of FIG. 7 and is not described again here.
In this embodiment, when feature extraction is performed on the second image, not only are multiple second feature maps of different scales extracted, but multiple third feature maps of different scales are extracted as well. For each scale, different feature extraction methods, such as two neural networks with different network parameters, can be applied to the second image to obtain the second feature map and the third feature map at that scale respectively.
After the multiple similarity maps of different scales are determined according to the first feature maps of multiple different scales, the label information of the first image, and the second feature maps of the corresponding scales, the similarity map at each scale is multiplied element by element with the third feature map at that scale to obtain the processed similarity map at that scale. The processed similarity maps of the multiple different scales are then integrated to obtain the integrated similarity map.
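Under the same assumed shapes used in the earlier sketches, S803 amounts to one additional element-wise product per scale before integration; the third feature map must match the similarity map's shape at that scale:

```python
import numpy as np

C, H, W = 8, 4, 4
# Similarity map and third feature map at one scale (values assumed)
similarity = np.random.rand(C, H, W)
third_feat = np.random.rand(C, H, W)   # extracted from the second image

# S803: element-wise multiplication lets the multi-scale similarity
# information guide the query image's own features
processed = similarity * third_feat
print(processed.shape)  # (8, 4, 4)
```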
S805: Determine the target to be queried in the second image according to the integrated similarity map.
In this embodiment, the multiple similarity maps of different scales, determined according to the first feature maps of multiple different scales, the label information of the first image, and the second feature maps of the corresponding scales, are multiplied element by element with the third feature maps of the second image. In this way, the similarity maps at multiple different scales guide the segmentation of the second image, further improving the accuracy of the target query.
FIG. 9 is a flowchart of a target detection method provided by an embodiment of the present application.
As shown in FIG. 9, the target detection method of the foregoing embodiments is executed by a neural network, and the neural network is trained through the following steps:
Step 901: Perform feature extraction at multiple different scales on a first sample image and a second sample image respectively to obtain multiple fourth feature maps of different scales and multiple fifth feature maps of different scales, where both the first sample image and the second sample image contain objects of a first category.
Step 902: Determine the objects of the first category in the second sample image according to the fourth feature maps of multiple different scales, the label of the first sample image, and the fifth feature maps of the corresponding scales, where the label of the first sample image is the result of annotating the objects of the first category contained in the first sample image.
Step 903: Adjust the network parameters of the neural network according to the difference between the determined objects of the first category in the second sample image and the label of the second sample image, where the label of the second sample image is the result of annotating the objects of the first category contained in the second sample image.
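Steps 901 to 903 together describe a standard supervised training loop: predict, measure the difference from the label, adjust parameters. A deliberately toy sketch of that loop (a one-parameter "network" on synthetic data; none of the names or values come from the patent) is:

```python
import numpy as np

# Toy stand-in for steps 901-903: a one-parameter "network" predicts a
# label from a feature; the parameter is adjusted from the difference
# between prediction and label. All data and the model form are invented.
rng = np.random.default_rng(0)
features = rng.random(100)          # stands in for extracted features
labels = 2.0 * features             # stands in for second-sample labels

w, lr = 0.0, 0.1
for _ in range(200):
    pred = w * features                             # step 902: predict
    grad = 2 * np.mean((pred - labels) * features)  # step 903: difference
    w -= lr * grad                                  # adjust parameters

print(round(w, 3))  # converges toward 2.0
```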
In this embodiment, the target query described above is implemented by a neural network, which may be trained before target queries are performed. Specifically, a first sample image and a second sample image containing objects of the same category may be obtained from a training set containing multiple sample images; these objects are the targets to be queried during this round of training. The training set may include multiple subsets, and the sample images in each subset all contain objects of the same category. For example, the categories may include vehicles, pedestrians, and traffic lights, and the first sample image and the second sample image obtained may both contain traffic lights, which then serve as the target to be queried during this round of training. The traffic lights in the first sample image are annotated to obtain the label of the first sample image, and the traffic lights in the second sample image are annotated to obtain the label of the second sample image.
The training process of this embodiment is similar to the target detection method of the foregoing embodiments; for the specific implementation, refer to the description of those embodiments. It should be noted that in this embodiment, the first sample image and the second sample image must contain objects of the same category, so that the neural network is trained to recognize the association between images of the same category. For example, traffic lights may be used to train the neural network in the training phase, while street lights may be used to test or apply the neural network in the testing or application phase.
FIG. 10 is a schematic flowchart of a target detection method provided by yet another embodiment of this application. This embodiment describes in detail how the neural network trained in the embodiment of FIG. 9 is tested. As shown in FIG. 10, the method may further include:
S1001: Perform feature extraction at multiple different scales on a first test image and a second test image respectively to obtain multiple first test feature maps of different scales and multiple second test feature maps of different scales, where the first test image and the second test image come from one test image set, and each test image in the test image set includes objects of the same category.
S1002: Determine the target to be queried in the second test image according to the first test feature maps of multiple different scales, the label of the first test image, and the second test feature maps of the corresponding scales, where the label of the first test image is the result of annotating the target to be queried contained in the first test image.
In this embodiment, test images that include objects of the same category may be grouped in advance into one test image set, and multiple test image sets form an overall test set. When testing the neural network, a first test image and a second test image are selected from one test image set, and the neural network is tested with them. For example, the neural network can be tested with a first test image and a second test image that both contain street lights.
In one example, one sample may be selected from the test image set as the first test image for each test category. For example, in the PASCAL VOC test image set, one image is selected as the first test image for each of the 20 categories. During testing, each sample in the test image set is paired with the first test image of its corresponding category to form a test data pair, which is then input into the model shown in FIG. 2 or FIG. 5 for evaluation; the test images in a test data pair contain targets of the same category. This avoids the uneven category selection caused by traditionally selecting test data pairs at random, and also resolves the fluctuation of evaluation metrics caused by varying sample quality. Optionally, a test may be run after every 100 training iterations, or after every 120 training iterations; those skilled in the art can adjust this according to actual needs, which is not specifically limited in this embodiment.
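The deterministic pairing described above — one fixed first test image per category, paired with every test sample of that category — can be sketched as follows (the category names, file names, and data structures are all invented for illustration):

```python
# One fixed "first test image" per category (assumed identifiers)
first_test_images = {"cat": "support_cat.jpg", "dog": "support_dog.jpg"}

# The remaining test samples, each tagged with its category
test_samples = [("q1.jpg", "cat"), ("q2.jpg", "dog"), ("q3.jpg", "cat")]

# Pair every test sample with its category's fixed support image,
# instead of sampling support images at random
test_pairs = [(first_test_images[cat], img) for img, cat in test_samples]
print(test_pairs)
```

Fixing the support image per category removes the variance that random support selection would otherwise introduce into the evaluation metrics.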
With the trained neural network of the embodiments of this application, even a category whose corresponding training images account for a low proportion of the training image set, or a category that has never been learned, can be accurately detected by the target detection method of this embodiment. In addition, the method of constructing test data pairs in the embodiments of this application reduces the task's strong dependence on samples, enables accurate detection of category samples that are difficult to collect in practical application scenarios, avoids the uneven category selection caused by traditional randomly selected test pairs, and resolves the fluctuation of evaluation metrics caused by varying support-sample quality. For example, in a target detection task in autonomous driving, a target category in the scene for which a large number of training samples is not available can still be accurately detected.
FIG. 11 is a schematic flowchart of an intelligent driving method provided by an embodiment of this application. As shown in FIG. 11, the method may include:
S1101: Collect a road image.
S1102: Using the target detection method described above, query the collected road image for the target to be queried according to a support image and the label of the support image, where the label of the support image is the result of annotating the targets in the support image that belong to the same category as the target to be queried.
S1103: Control the intelligent driving device that collects the road image according to the query result.
In this embodiment, the intelligent driving device may include an autonomous vehicle, a vehicle equipped with an Advanced Driving Assistant System (ADAS), a robot, and the like. For example, a road image collected by the intelligent driving device while driving or while stopped may be obtained, and target detection may then be performed on the road image using the target detection method described above. When the target detection method is applied, the road image serves as the second image and the support image serves as the first image. The intelligent driving device is then controlled according to the target detection result. For example, an intelligent driving device such as an autonomous vehicle or a robot can be directly controlled to decelerate, brake, or steer, or deceleration, braking, or steering instructions can be sent to the driver of a vehicle equipped with ADAS. For example, if the query result shows that the traffic light in front of the intelligent driving device is red, the device is controlled to decelerate and stop; if the query result shows that a pedestrian appears in front of the device, the device is controlled to brake.
FIG. 12 is a schematic diagram of a target detection process provided by an embodiment of this application. The first image is input into a first convolutional neural network to obtain multiple first feature maps of the first image at different scales, and the second image is input into a second convolutional neural network to obtain multiple second feature maps of the second image at different scales. The second feature maps of the second image, the first feature maps of the first image, and the label information of the first image are input into a generation module to obtain similarity maps at multiple scales. The similarity maps at multiple scales are input into an aggregation module to obtain an integrated similarity map. The integrated similarity map is input into a third convolutional neural network to obtain a semantic segmentation map of the second image, thereby realizing target detection on the second image.
FIG. 13 is a schematic diagram of the generation module and the aggregation module provided by an embodiment of this application. In the figure, conv denotes a convolutional layer and pool denotes pooling. The feature map of the first image is input into the first convolution channel of the generation module 131 to obtain multiple first feature maps of different scales, which are then multiplied with the label information of the first image and pooled to obtain multiple feature vectors of the first image at different scales. The feature map of the second image is input into the second convolution channel of the generation module 131 to obtain multiple second feature maps of different scales. The multiple feature maps of the second image at different scales are each multiplied element by element with the feature vector of the corresponding scale to obtain multiple similarity maps of different scales. The generation module 131 outputs the multiple similarity maps of different scales to the aggregation module 132, which integrates them and outputs the integrated similarity map.
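Reading the generation module end to end, one scale of its pipeline can be sketched as below (a NumPy sketch under assumed shapes, with masked average pooling standing in for the pooling in the figure): the support feature map is masked by the label, pooled into a per-scale vector, and broadcast-multiplied with the query feature map of that scale.

```python
import numpy as np

C, H, W = 8, 4, 4
support_feat = np.random.rand(C, H, W)   # first feature map at one scale
query_feat = np.random.rand(C, H, W)     # second feature map, same scale
label_mask = np.zeros((H, W))            # support label: 1 inside the target
label_mask[1:3, 1:3] = 1.0

# Multiply by the label mask, then average-pool over the masked region
# to obtain the support feature vector for this scale
masked = support_feat * label_mask[None, :, :]
support_vec = masked.sum(axis=(1, 2)) / label_mask.sum()

# Broadcast element-wise multiplication -> multi-channel similarity map
similarity = support_vec[:, None, None] * query_feat
print(similarity.shape)  # (8, 4, 4)
```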
FIG. 14 is a schematic comparison between the similarity-feature extraction method in the target detection method provided by an embodiment of this application and extraction of similarity features through an inner product or cosine distance. The left part of the figure illustrates similarity-feature extraction through an inner product or cosine distance; the right part illustrates similarity-feature extraction by multiplying the vectors at corresponding pixel positions. Compared with the inner product or cosine distance, the element-wise multiplication proposed in the embodiments of this application changes the output similarity map from single-channel to multi-channel. This preserves the channel information of the similarity, and subsequent convolution and nonlinear operations can further express the similarity features appropriately, thereby further improving the accuracy of target detection.
FIG. 15 is a schematic structural diagram of a target detection apparatus provided by an embodiment of this application. The target detection apparatus provided by this embodiment of this application can execute the processing flow provided by the embodiments of the target detection method. As shown in FIG. 15, the target detection apparatus 150 provided by this embodiment includes a feature extraction module 151 and a determination module 152. The feature extraction module 151 is configured to perform feature extraction at multiple different scales on a first image and a second image respectively to obtain multiple first feature maps of different scales and multiple second feature maps of different scales. The determination module 152 is configured to determine the target to be queried in the second image according to the first feature maps of multiple different scales, the label of the first image, and the second feature maps of the corresponding scales, where the label of the first image is the result of annotating the target to be queried contained in the first image.
Optionally, when the feature extraction module 151 performs feature extraction at multiple different scales on the first image and the second image respectively to obtain the multiple first feature maps of different scales and the multiple second feature maps of different scales, it specifically: performs feature extraction on the first image and the second image respectively to obtain a first feature map and a second feature map; and performs multiple scale transformations on the first feature map and the second feature map respectively to obtain the multiple first feature maps of different scales and the multiple second feature maps of different scales.
Optionally, when the feature extraction module 151 performs multiple scale transformations on the first feature map and the second feature map respectively, it specifically downsamples the first feature map and the second feature map at least twice each.
Optionally, when the determination module 152 determines the target to be queried in the second image according to the first feature maps of multiple different scales, the label of the first image, and the second feature maps of the corresponding scales, it specifically: determines multiple first feature vectors of different scales according to the first feature maps of multiple different scales and the label of the first image; computes the multiple first feature vectors of different scales with the second feature maps of the corresponding scales according to a preset calculation rule to obtain a calculation result; determines a mask image of the second image according to the calculation result; and determines the target to be queried in the second image according to the mask image.
Optionally, when the determination module 152 determines the target to be queried in the second image according to the first feature maps of multiple different scales, the label of the first image, and the second feature maps of the corresponding scales, it specifically: determines the image to be queried in the second image using the first feature maps of multiple different scales, the label of the first image, and the second feature maps of the corresponding scales as guidance information for the third feature maps of the corresponding scales, where the third feature maps are determined according to the second image, and the second feature map and the third feature map at the same scale are different.
Optionally, when the determination module 152 determines the image to be queried in the second image using the first feature maps of multiple different scales, the label of the first image, and the second feature maps of the corresponding scales as guidance information for the third feature maps of the corresponding scales, it specifically: determines multiple first feature vectors of different scales according to the first feature maps of multiple different scales and the label of the first image; computes the multiple first feature vectors of different scales with the second feature maps of the corresponding scales according to a preset calculation rule to obtain mask images at multiple different scales; and determines the target to be queried in the second image according to the result of multiplying the mask images of multiple different scales with the third feature maps of the corresponding scales.
Optionally, the preset calculation rule includes an inner-product calculation rule or a cosine-distance calculation rule.
Optionally, when the determination module 152 determines the target to be queried in the second image according to the first feature maps of multiple different scales, the label information of the first image, and the second feature maps of the corresponding scales, it specifically: determines multiple similarity maps of different scales according to the first feature maps of multiple different scales, the label information of the first image, and the second feature maps of the corresponding scales, where the similarity map at each scale characterizes the similarity between the first feature map and the second feature map at that scale; integrates the multiple similarity maps of different scales to obtain an integrated similarity map; and determines the target to be queried in the second image according to the integrated similarity map.
Optionally, when the determination module 152 determines the multiple similarity maps of different scales according to the first feature maps of multiple different scales, the label information of the first image, and the second feature maps of the corresponding scales, it specifically: determines multiple first feature vectors of different scales according to the first feature maps of multiple different scales and the label information of the first image; and multiplies the multiple first feature vectors of different scales element by element with the second feature maps of the corresponding scales to obtain the multiple similarity maps of different scales.
Optionally, when the determination module 152 integrates the multiple similarity maps of different scales to obtain the integrated similarity map, it specifically: upsamples the multiple similarity maps of different scales to obtain multiple similarity maps of the same scale; and adds the multiple similarity maps of the same scale to obtain the integrated similarity map.
Optionally, when the determination module 152 integrates the multiple similarity maps of different scales to obtain the integrated similarity map, it specifically: forms a similarity map set from the multiple similarity maps of different scales; upsamples the smallest-scale similarity map in the set to obtain a similarity map with the same scale as the second-smallest similarity map; adds the resulting similarity map to the second-smallest similarity map to obtain a new similarity map; and forms a new similarity map set from the new similarity map and the similarity maps that have not yet undergone upsampling or addition, repeating the upsampling and addition steps until a final similarity map is obtained, where the final similarity map is the integrated similarity map.
可选的，确定模块152还用于：将多个不同尺度的相似度图和相应尺度的第三特征图逐元素相乘，得到处理后的多个不同尺度的相似度图；其中，第三特征图根据第二图像确定，且同一尺度的第一特征图和第三特征图不同；将处理后的多个不同尺度的相似度图整合，得到整合后的相似度图。Optionally, the determining module 152 is further configured to: multiply, element by element, the multiple similarity maps of different scales by third feature maps of corresponding scales to obtain multiple processed similarity maps of different scales, where the third feature maps are determined according to the second image, and the first feature map and the third feature map of the same scale are different; and integrate the multiple processed similarity maps of different scales to obtain the integrated similarity map.
可选的，目标检测装置由神经网络实现，该装置还包括：训练模块153，用于采用以下步骤训练得到神经网络，该步骤包括：分别对第一样本图像和第二样本图像进行多个不同尺度的特征提取，得到多个不同尺度的第四特征图和多个不同尺度的第五特征图；其中，第一样本图像和第二样本图像均包含第一类别的对象；根据多个不同尺度的第四特征图和第一样本图像的标签，以及相应尺度的第五特征图，确定第二样本图像中的第一类别的对象；第一样本图像的标签是对第一样本图像中包含的第一类别的对象进行标注的结果；根据确定的第二样本图像中的第一类别的对象以及第二样本图像的标签之间的差异，调整神经网络的网络参数；第二样本图像的标签是对第二样本图像中包含的第一类别的对象进行标注的结果。Optionally, the target detection apparatus is implemented by a neural network, and the apparatus further includes a training module 153 configured to obtain the neural network by training with the following steps: performing feature extraction of multiple different scales on a first sample image and a second sample image respectively to obtain multiple fourth feature maps of different scales and multiple fifth feature maps of different scales, where both the first sample image and the second sample image contain objects of a first category; determining the objects of the first category in the second sample image according to the multiple fourth feature maps of different scales, the label of the first sample image, and the fifth feature maps of corresponding scales, the label of the first sample image being the result of labeling the objects of the first category contained in the first sample image; and adjusting the network parameters of the neural network according to the difference between the determined objects of the first category in the second sample image and the label of the second sample image, the label of the second sample image being the result of labeling the objects of the first category contained in the second sample image.
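The adjust-by-difference training step can be illustrated with a deliberately tiny stand-in: a single weight vector `w` plays the role of the network parameters, a label-guided support feature replaces the multi-scale feature maps, and a squared error between the predicted query response and the query label plays the role of the difference. Every name and the linear model here are assumptions for illustration, not the claimed architecture.

```python
import numpy as np

def train_step(w, support_feat, support_label, query_feat, query_label, lr=0.1):
    guide = support_feat * support_label          # label-guided support features
    pred = float(np.dot(w * guide, query_feat))   # predicted response on the query
    err = pred - query_label                      # difference from the query label
    grad = guide * query_feat * err               # gradient of 0.5 * err**2 w.r.t. w
    return w - lr * grad                          # adjust the "network parameters"

# toy episode: both images contain the same category, label = 1.0
w = np.zeros(4)
s_feat = np.array([1.0, 0.5, 0.2, 0.8])
q_feat = np.array([0.9, 0.4, 0.3, 0.7])
for _ in range(50):
    w = train_step(w, s_feat, 1.0, q_feat, 1.0)
final_pred = float(np.dot(w * s_feat, q_feat))
```

After repeated steps the predicted query response approaches the query label, mirroring how the parameter adjustment reduces the difference used as the training signal.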
可选的，该装置还包括：测试模块154，用于对训练完成的神经网络进行测试；测试模块具体采用以下步骤对训练完成的神经网络进行测试：分别对第一测试图像和第二测试图像进行多个不同尺度的特征提取，得到多个不同尺度的第一测试特征图和多个不同尺度的第二测试特征图；其中，第一测试图像和第二测试图像来源于一个测试图像集，测试图像集中的各个测试图像均包括同一类别的对象；根据多个不同尺度的第一测试特征图和第一测试图像的标签，以及相应尺度的第二测试特征图，确定第二测试图像中的待查询目标；第一测试图像的标签是对第一测试图像中包含的待查询目标进行标注的结果。Optionally, the apparatus further includes a testing module 154 configured to test the trained neural network, specifically by the following steps: performing feature extraction of multiple different scales on a first test image and a second test image respectively to obtain multiple first test feature maps of different scales and multiple second test feature maps of different scales, where the first test image and the second test image come from one test image set, and each test image in the test image set includes objects of the same category; and determining the target to be queried in the second test image according to the multiple first test feature maps of different scales, the label of the first test image, and the second test feature maps of corresponding scales, the label of the first test image being the result of labeling the target to be queried contained in the first test image.
本申请实施例提供的目标检测装置,可用于执行上述的目标检测方法实施例,其实现原理和技术效果类似,本实施例此处不再赘述。The target detection device provided in the embodiment of the present application can be used to implement the above-mentioned target detection method embodiment, and its implementation principles and technical effects are similar, and will not be repeated here in this embodiment.
图16为本申请一实施例提供的智能行驶装置的结构示意图。如图16所示，本实施例提供的智能行驶装置160包括：采集模块161、查询模块162和控制模块163；其中，采集模块161，用于采集道路图像；查询模块162，用于采用本申请实施例提供的目标检测方法，根据支持图像以及支持图像的标签对采集到的道路图像进行待查询目标的查询；其中，支持图像的标签是对支持图像中包含的与待查询目标同一类别的目标进行标注的结果；控制模块163，用于根据查询结果对采集道路图像的智能行驶设备进行控制。FIG. 16 is a schematic structural diagram of an intelligent driving apparatus provided by an embodiment of the application. As shown in FIG. 16, the intelligent driving apparatus 160 provided in this embodiment includes: an acquisition module 161, a query module 162, and a control module 163. The acquisition module 161 is configured to collect road images; the query module 162 is configured to query the collected road images for a target to be queried according to a support image and the label of the support image, using the target detection method provided by the embodiments of the application, where the label of the support image is the result of labeling the targets contained in the support image that belong to the same category as the target to be queried; and the control module 163 is configured to control, according to the query result, the intelligent driving device that collects the road images.
本申请实施例提供的智能行驶装置的实施可以参考前述的智能行驶方法,其实现原理和技术效果类似,本实施例此处不再赘述。The implementation of the smart driving device provided in the embodiment of the present application can refer to the foregoing smart driving method, and the implementation principle and technical effect are similar, and the details are not described herein again in this embodiment.
图17为本申请一实施例提供的目标检测设备的硬件结构示意图。本申请实施例提供的目标检测设备可以执行目标检测方法实施例提供的处理流程,如图17所示,本实施例提供的目标检测设备170包括:至少一个处理器171和存储器172。该目标检测设备170还包括通信部件173。其中,处理器171、存储器172以及通信部件173通过总线174连接。FIG. 17 is a schematic diagram of the hardware structure of a target detection device provided by an embodiment of the application. The target detection device provided in the embodiment of the present application can execute the processing flow provided in the embodiment of the target detection method. As shown in FIG. 17, the target detection device 170 provided in this embodiment includes: at least one processor 171 and a memory 172. The target detection device 170 also includes a communication component 173. Among them, the processor 171, the memory 172, and the communication component 173 are connected by a bus 174.
在具体实现过程中,至少一个处理器171执行所述存储器172存储的计算机执行指令,使得至少一个处理器171执行如上的目标检测方法。In a specific implementation process, at least one processor 171 executes the computer-executable instructions stored in the memory 172, so that the at least one processor 171 executes the above target detection method.
处理器171的具体实现过程可参见上述目标检测方法实施例，其实现原理和技术效果类似，本实施例此处不再赘述。For the specific implementation process of the processor 171, refer to the foregoing target detection method embodiment; the implementation principles and technical effects are similar, and will not be repeated here in this embodiment.
图18为本申请一实施例提供的智能行驶设备的硬件结构示意图。本申请实施例提供的智能行驶设备可以执行智能行驶方法实施例提供的处理流程,如图18所示,本实施例提供的智能行驶设备180包括:至少一个处理器181和存储器182。该智能行驶设备180还包括通信部件183。其中,处理器181、存储器182以及通信部件183通过总线184连接。FIG. 18 is a schematic diagram of the hardware structure of a smart driving device provided by an embodiment of the application. The smart driving device provided in the embodiment of the present application can execute the processing flow provided in the smart driving method embodiment. As shown in FIG. 18, the smart driving device 180 provided in this embodiment includes: at least one processor 181 and a memory 182. The smart driving device 180 also includes a communication component 183. Among them, the processor 181, the memory 182, and the communication component 183 are connected by a bus 184.
在具体实现过程中,至少一个处理器181执行所述存储器182存储的计算机执行指令,使得至少一个处理器181执行如上的智能行驶方法。In a specific implementation process, at least one processor 181 executes the computer-executable instructions stored in the memory 182, so that the at least one processor 181 executes the above intelligent driving method.
处理器181的具体实现过程可参见上述智能行驶方法实施例,其实现原理和技术效果类似,本实施例此处不再赘述。For the specific implementation process of the processor 181, refer to the foregoing embodiment of the smart driving method, and its implementation principles and technical effects are similar, and will not be repeated here in this embodiment.
在上述的图17和图18所示的实施例中，应理解，处理器可以是中央处理单元（英文：Central Processing Unit，简称：CPU），还可以是其他通用处理器、数字信号处理器（英文：Digital Signal Processor，简称：DSP）、专用集成电路（英文：Application Specific Integrated Circuit，简称：ASIC）等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合申请所公开的方法的步骤可以直接体现为硬件处理器执行完成，或者用处理器中的硬件及软件模块组合执行完成。In the embodiments shown in FIG. 17 and FIG. 18 above, it should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in this application may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
存储器可能包含高速RAM存储器,也可能还包括非易失性存储NVM,例如至少一个磁盘存储器。The memory may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory.
总线可以是工业标准体系结构（Industry Standard Architecture，ISA）总线、外部设备互连（Peripheral Component Interconnect，PCI）总线或扩展工业标准体系结构（Extended Industry Standard Architecture，EISA）总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示，本申请附图中的总线并不限定仅有一根总线或一种类型的总线。The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, the buses in the drawings of this application are not limited to only one bus or one type of bus.
在另一个实施例中,本申请实施例中还提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现所述目标检测方法或智能行驶方法的步骤。In another embodiment, the embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the target detection method or the intelligent driving method are realized.
在再一个实施例中,本申请实施例还提供一种运行指令的芯片,所述芯片包括存储器、处理器,所述存储器中存储代码和数据,所述存储器与所述处理器耦合,所述处理器运行所述存储器中的代码使得所述芯片用于执行上述目标检测方法或智能行驶方法的步骤。In still another embodiment, an embodiment of the present application further provides a chip for executing instructions. The chip includes a memory and a processor. The memory stores code and data. The memory is coupled with the processor. The processor runs the code in the memory so that the chip is used to execute the steps of the above-mentioned target detection method or smart driving method.
在又一个实施例中,本申请实施例还提供一种包含指令的程序产品,当所述程序产品在计算机上运行时,使得所述计算机执行上述目标检测方法或智能行驶方法的步骤。In yet another embodiment, the embodiment of the present application further provides a program product containing instructions, which when the program product runs on a computer, causes the computer to execute the steps of the above-mentioned target detection method or smart driving method.
在又一个实施例中,本申请实施例还提供了一种计算机程序,当所述计算机程序被处理器执行时,用于执行上述的目标检测方法或智能行驶方法的步骤。In yet another embodiment, the embodiment of the present application further provides a computer program, when the computer program is executed by a processor, it is used to execute the steps of the above-mentioned target detection method or smart driving method.
在本申请所提供的几个实施例中，应该理解到，所揭露的装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
上述以软件功能单元的形式实现的集成的单元，可以存储在一个计算机可读取存储介质中。上述软件功能单元存储在一个存储介质中，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）或处理器（processor）执行本申请各个实施例所述方法的部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器（Read-Only Memory，ROM）、随机存取存储器（Random Access Memory，RAM）、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit implemented in the form of a software functional unit as described above may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of this application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
本领域技术人员可以清楚地了解到，为描述的方便和简洁，仅以上述各功能模块的划分进行举例说明，实际应用中，可以根据需要而将上述功能分配由不同的功能模块完成，即将装置的内部结构划分成不同的功能模块，以完成以上描述的全部或者部分功能。上述描述的装置的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the division of the above functional modules is used as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. For the specific working process of the apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, which will not be repeated here.
最后应说明的是：以上各实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述各实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分或者全部技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or equivalently replace some or all of the technical features therein; and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.

Claims (36)

  1. 一种目标检测方法,其特征在于,包括:A target detection method is characterized in that it comprises:
    分别对第一图像和第二图像进行多个不同尺度的特征提取,得到多个不同尺度的第一特征图和多个不同尺度的第二特征图;Performing multiple feature extractions of different scales on the first image and the second image, respectively, to obtain multiple first feature maps of different scales and multiple second feature maps of different scales;
    根据多个不同尺度的第一特征图和所述第一图像的标签,以及相应尺度的所述第二特征图,确定所述第二图像中的待查询目标;所述第一图像的标签是对所述第一图像中包含的待查询目标进行标注的结果。According to a plurality of first feature maps of different scales and labels of the first image, and the second feature maps of corresponding scales, the target to be queried in the second image is determined; the label of the first image is The result of marking the target to be queried contained in the first image.
  2. 根据权利要求1所述的方法，其特征在于，所述分别对第一图像和第二图像进行多个不同尺度的特征提取，得到多个不同尺度的第一特征图和多个不同尺度的第二特征图，包括：The method according to claim 1, wherein the performing feature extraction of multiple different scales on the first image and the second image respectively to obtain multiple first feature maps of different scales and multiple second feature maps of different scales comprises:
    分别对所述第一图像和所述第二图像进行特征提取,得到第一特征图和第二特征图;Performing feature extraction on the first image and the second image respectively to obtain a first feature map and a second feature map;
    分别对所述第一特征图和所述第二特征图进行多次尺度变换,得到多个不同尺度的第一特征图和多个不同尺度的第二特征图。Perform multiple scale transformations on the first feature map and the second feature map, respectively, to obtain multiple first feature maps of different scales and multiple second feature maps of different scales.
  3. 根据权利要求2所述的方法,其特征在于,所述分别对所述第一特征图和所述第二特征图进行多次尺度变换,包括:The method according to claim 2, wherein said performing multiple scale transformations on said first feature map and said second feature map respectively comprises:
    对所述第一特征图和所述第二特征图分别进行至少两次降采样。The first feature map and the second feature map are down-sampled at least twice, respectively.
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述根据多个不同尺度的第一特征图和所述第一图像的标签,以及相应尺度的所述第二特征图,确定所述第二图像中的待查询目标,包括:The method according to any one of claims 1 to 3, wherein the first feature map and the label of the first image according to a plurality of different scales, and the second feature map of corresponding scales, Determining the target to be queried in the second image includes:
    根据多个不同尺度的第一特征图和所述第一图像的标签,确定多个不同尺度的第一特征向量;Determine a plurality of first feature vectors of different scales according to a plurality of first feature maps of different scales and labels of the first image;
    将所述多个不同尺度的第一特征向量与相应尺度的所述第二特征图按照预设计算规则进行计算,得到计算结果;Calculating the plurality of first feature vectors of different scales and the second feature map of corresponding scales according to a preset calculation rule to obtain a calculation result;
    根据所述计算结果,确定所述第二图像的掩码图像;Determine the mask image of the second image according to the calculation result;
    根据所述掩码图像,确定所述第二图像中的待查询目标。According to the mask image, the target to be queried in the second image is determined.
  5. 根据权利要求1-3任一项所述的方法,其特征在于,所述根据多个不同尺度的第一特征图和所述第一图像的标签,以及相应尺度的所述第二特征图,确定所述第二图像中的待查询目标,包括:The method according to any one of claims 1 to 3, wherein the first feature map and the label of the first image according to a plurality of different scales, and the second feature map of corresponding scales, Determining the target to be queried in the second image includes:
    以多个不同尺度的第一特征图、所述第一图像的标签以及相应尺度的所述第二特征图作为相应尺度的第三特征图的指导信息，确定所述第二图像中的待查询图像；using multiple first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales as guidance information for third feature maps of corresponding scales, determining the image to be queried in the second image;
    其中,所述第三特征图根据所述第二图像确定,且同一尺度的第二特征图和第三特征图不同。Wherein, the third feature map is determined according to the second image, and the second feature map and the third feature map of the same scale are different.
  6. 根据权利要求5所述的方法，其特征在于，所述以多个不同尺度的第一特征图、所述第一图像的标签以及相应尺度的所述第二特征图作为相应尺度的第三特征图的指导信息，确定所述第二图像中的待查询图，包括：The method according to claim 5, wherein the using multiple first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales as guidance information for third feature maps of corresponding scales to determine the image to be queried in the second image comprises:
    根据多个不同尺度的第一特征图和所述第一图像的标签,确定多个不同尺度的第一特征向量;Determine a plurality of first feature vectors of different scales according to a plurality of first feature maps of different scales and labels of the first image;
    将所述多个不同尺度的第一特征向量与相应尺度的所述第二特征图按照预设计算规则进行计算,得到多个不同尺度下的掩码图像;Calculating the plurality of first feature vectors of different scales and the second feature map of corresponding scales according to a preset calculation rule to obtain a plurality of mask images at different scales;
    根据多个不同尺度的掩码图像和相应尺度的所述第三特征图相乘的结果,确定所述第二图像中的待查询目标。According to a multiplication result of a plurality of mask images of different scales and the third feature map of corresponding scales, the target to be queried in the second image is determined.
  7. 根据权利要求4或6所述的方法,其特征在于,所述预设计算规则包括:内积的计算规则,或者余弦距离的计算规则。The method according to claim 4 or 6, wherein the preset calculation rules include: inner product calculation rules or cosine distance calculation rules.
  8. 根据权利要求1所述的方法，其特征在于，所述根据所述多个不同尺度的第一特征图和所述第一图像的标签信息，以及相应尺度的第二特征图，确定所述第二图像中的待查询目标，包括：The method according to claim 1, wherein the determining the target to be queried in the second image according to the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales comprises:
    根据多个不同尺度的所述第一特征图、所述第一图像的标签信息和对应尺度的第二特征图确定多个不同尺度的相似度图；一个尺度的相似度图表征该尺度的第一特征图和第二特征图的相似性；determining multiple similarity maps of different scales according to the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales, where a similarity map of one scale represents the similarity between the first feature map and the second feature map of that scale;
    将多个不同尺度的相似度图整合,得到整合后的相似度图;Integrate multiple similarity maps of different scales to obtain an integrated similarity map;
    根据整合后的相似度图,确定所述第二图像中的待查询目标。According to the integrated similarity map, the target to be queried in the second image is determined.
  9. 根据权利要求8所述的方法，其特征在于，所述根据多个不同尺度的所述第一特征图、所述第一图像的标签信息和对应尺度的第二特征图确定多个不同尺度的相似度图，包括：The method according to claim 8, wherein the determining multiple similarity maps of different scales according to the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales comprises:
    根据多个不同尺度的第一特征图和所述第一图像的标签信息,确定多个不同尺度的第一特征向量;Determine a plurality of first feature vectors of different scales according to the plurality of first feature maps of different scales and the label information of the first image;
    将所述多个不同尺度的第一特征向量与相应尺度的所述第二特征图逐元素相乘,得到多个不同尺度的相似度图。The multiple first feature vectors of different scales and the second feature map of corresponding scales are multiplied element by element to obtain multiple similarity maps of different scales.
  10. 根据权利要求8或9所述的方法，其特征在于，所述将多个不同尺度的相似度图整合，得到整合后的相似度图，包括：The method according to claim 8 or 9, wherein the integrating multiple similarity maps of different scales to obtain an integrated similarity map comprises:
    对多个不同尺度的相似度图进行上采样,得到多个尺度相同的相似度图;Up-sampling multiple similarity maps of different scales to obtain multiple similarity maps of the same scale;
    对多个尺度相同的相似度图相加,得到整合后的相似度图。Add multiple similarity maps with the same scale to obtain an integrated similarity map.
  11. 根据权利要求8或9所述的方法,其特征在于,所述将多个不同尺度的相似度图整合,得到整合后的相似度图,包括:The method according to claim 8 or 9, wherein the integrating a plurality of similarity maps of different scales to obtain an integrated similarity map comprises:
    所述多个不同尺度的相似度图构成相似度图集合;The multiple similarity graphs of different scales constitute a similarity graph set;
    对所述相似度图集合中尺度最小的相似度图进行上采样,得到与尺度第二小的相似度图相同尺度的相似度图;Up-sampling the similarity map with the smallest scale in the set of similarity maps to obtain a similarity map with the same scale as the second-smallest similarity map;
    将得到的相似度图与尺度第二小的相似度图相加,得到新的相似度图;Add the obtained similarity map to the second-smallest similarity map to obtain a new similarity map;
    将所述相似度图集合中未经过上采样处理或者相加处理的相似度图与新的相似度图构成新的相似度图集合，重复执行上采样的步骤和相加的步骤，直至得到最后一个相似度图，所得到的最后一个相似度图为整合后的相似度图。forming a new similarity map set from the new similarity map and the similarity maps in the similarity map set that have not undergone the up-sampling or adding processing, and repeating the up-sampling step and the adding step until the last similarity map is obtained, the last similarity map obtained being the integrated similarity map.
  12. 根据权利要求8-11任一项所述的方法，其特征在于，所述根据多个不同尺度的所述第一特征图、所述第一图像的标签信息和对应尺度的第二特征图确定多个不同尺度的相似度图之后，将多个不同尺度的相似度图整合，得到整合后的相似度图之前，所述方法还包括：The method according to any one of claims 8-11, wherein after the determining multiple similarity maps of different scales according to the multiple first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales, and before the integrating the multiple similarity maps of different scales to obtain the integrated similarity map, the method further comprises:
    将多个不同尺度的相似度图和相应尺度的第三特征图逐元素相乘，得到处理后的多个不同尺度的相似度图；其中，所述第三特征图根据所述第二图像确定，且同一尺度的第一特征图和第三特征图不同；multiplying, element by element, the multiple similarity maps of different scales by third feature maps of corresponding scales to obtain multiple processed similarity maps of different scales; wherein the third feature maps are determined according to the second image, and the first feature map and the third feature map of the same scale are different;
    将多个不同尺度的相似度图整合,得到整合后的相似度图,包括:Integrate multiple similarity maps of different scales to obtain an integrated similarity map, including:
    将处理后的多个不同尺度的相似度图整合,得到整合后的相似度图。The processed similarity maps of different scales are integrated to obtain an integrated similarity map.
  13. 根据权利要求1-12任一项所述的方法,其特征在于,所述目标检测方法由神经网络执行,所述神经网络采用以下步骤训练得到:The method according to any one of claims 1-12, wherein the target detection method is executed by a neural network, and the neural network is trained by the following steps:
    分别对第一样本图像和第二样本图像进行多个不同尺度的特征提取，得到多个不同尺度的第四特征图和多个不同尺度的第五特征图；其中，所述第一样本图像和所述第二样本图像均包含第一类别的对象；performing feature extraction of multiple different scales on a first sample image and a second sample image respectively to obtain multiple fourth feature maps of different scales and multiple fifth feature maps of different scales; wherein both the first sample image and the second sample image contain objects of a first category;
    根据多个不同尺度的第四特征图和所述第一样本图像的标签,以及相应尺度的所述第五特征图,确定所述第二样本图像中的所述第一类别的对象;所述第一样本图像的标签是对所述第一样本图像中包含的所述第一类别的对象进行标注的结果;Determine the object of the first category in the second sample image according to a plurality of fourth feature maps of different scales and labels of the first sample image, and the fifth feature map of corresponding scales; The label of the first sample image is a result of labeling the objects of the first category contained in the first sample image;
    根据确定的所述第二样本图像中的所述第一类别的对象以及所述第二样本图像的标签之间的差异,调整所述神经网络的网络参数;所述第二样本图像的标签是对所述第二样本图像中包含的所述第一类别的对象进行标注的结果。Adjust the network parameters of the neural network according to the determined difference between the object of the first category in the second sample image and the label of the second sample image; the label of the second sample image is The result of labeling the objects of the first category included in the second sample image.
  14. 根据权利要求13所述的方法,其特征在于,在所述神经网络训练完成后,所述方法还包括:对训练完成的神经网络进行测试;The method according to claim 13, characterized in that, after the neural network training is completed, the method further comprises: testing the trained neural network;
    采用以下步骤对训练完成的神经网络进行测试:Use the following steps to test the trained neural network:
    分别对第一测试图像和第二测试图像进行多个不同尺度的特征提取,得到多个不同尺度的第一测试特征图和多个不同尺度的第二测试特征图;Performing multiple feature extractions of different scales on the first test image and the second test image, respectively, to obtain multiple first test feature maps of different scales and multiple second test feature maps of different scales;
    其中,所述第一测试图像和所述第二测试图像来源于一个测试图像集,所述测试图像集中的各个测试图像均包括同一类别的对象;Wherein, the first test image and the second test image are derived from a test image set, and each test image in the test image set includes objects of the same category;
    根据多个不同尺度的第一测试特征图和所述第一测试图像的标签，以及相应尺度的所述第二测试特征图，确定所述第二测试图像中的待查询目标；所述第一测试图像的标签是对所述第一测试图像中包含的待查询目标进行标注的结果。determining the target to be queried in the second test image according to multiple first test feature maps of different scales, the label of the first test image, and the second test feature maps of corresponding scales; the label of the first test image being the result of labeling the target to be queried contained in the first test image.
  15. 一种智能行驶方法,其特征在于,包括:An intelligent driving method, characterized in that it includes:
    采集道路图像;Collect road images;
    采用如权利要求1-14任一项所述的方法根据支持图像以及所述支持图像的标签对采集到的道路图像进行待查询目标的查询；其中，所述支持图像的标签是对所述支持图像中包含的与所述待查询目标同一类别的目标进行标注的结果；using the method according to any one of claims 1-14 to query the collected road images for a target to be queried according to a support image and the label of the support image; wherein the label of the support image is the result of labeling the targets contained in the support image that belong to the same category as the target to be queried;
    根据查询结果对采集道路图像的智能行驶设备进行控制。According to the query results, the intelligent driving equipment that collects road images is controlled.
  16. 一种目标检测装置,其特征在于,包括:特征提取模块和确定模块;A target detection device is characterized by comprising: a feature extraction module and a determination module;
    所述特征提取模块,用于分别对第一图像和第二图像进行多个不同尺度的特征提取,得到多个不同尺度的第一特征图和多个不同尺度的第二特征图;The feature extraction module is configured to perform feature extraction of a plurality of different scales on the first image and the second image respectively to obtain a plurality of first feature maps of different scales and a plurality of second feature maps of different scales;
    所述确定模块,用于根据多个不同尺度的第一特征图和所述第一图像的标签,以及相应尺度的所述第二特征图,确定所述第二图像中的待查询目标;所述第一图像的标签是对所述第一图像中包含的待查询目标进行标注的结果。The determining module is configured to determine the target to be queried in the second image according to a plurality of first feature maps of different scales and labels of the first image, and the second feature maps of corresponding scales; The label of the first image is a result of labeling the target to be queried contained in the first image.
  17. 根据权利要求16所述的装置，其特征在于，所述特征提取模块在分别对第一图像和第二图像进行多个不同尺度的特征提取，得到多个不同尺度的第一特征图和多个不同尺度的第二特征图时，具体包括：The device according to claim 16, wherein when the feature extraction module performs feature extraction of multiple different scales on the first image and the second image respectively to obtain multiple first feature maps of different scales and multiple second feature maps of different scales, it specifically comprises:
    分别对所述第一图像和所述第二图像进行特征提取,得到第一特征图和第二特征图;Performing feature extraction on the first image and the second image respectively to obtain a first feature map and a second feature map;
    分别对所述第一特征图和所述第二特征图进行多次尺度变换,得到多个不同尺度的第一特征图和多个不同尺度的第二特征图。Perform multiple scale transformations on the first feature map and the second feature map, respectively, to obtain multiple first feature maps of different scales and multiple second feature maps of different scales.
  18. 根据权利要求17所述的装置,其特征在于,所述特征提取模块在分别对所述第一特征图和所述第二特征图进行多次尺度变换时,具体包括:The device according to claim 17, wherein when the feature extraction module performs multiple scale transformations on the first feature map and the second feature map respectively, it specifically comprises:
    对所述第一特征图和所述第二特征图分别进行至少两次降采样。The first feature map and the second feature map are down-sampled at least twice, respectively.
  19. The apparatus according to any one of claims 16-18, wherein the determining module determining the target to be queried in the second image according to the plurality of first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales specifically comprises:
    determining a plurality of first feature vectors of different scales according to the plurality of first feature maps of different scales and the label of the first image;
    calculating with the plurality of first feature vectors of different scales and the second feature maps of corresponding scales according to a preset calculation rule, to obtain a calculation result;
    determining a mask image of the second image according to the calculation result;
    determining the target to be queried in the second image according to the mask image.
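Claim 19's pipeline — pooling the label-marked region of the first feature map into a feature vector, scoring every position of the second feature map against it with the preset rule, and turning the result into a mask image — can be sketched as follows. The [channel][row][col] nested-list layout, the averaging, the fixed threshold, and the function names are illustrative assumptions, not taken from the application:

```python
def masked_average_vector(feat, label_mask):
    """First feature vector: average the first feature map over the positions
    the label marks as the query target (one value per channel)."""
    c = len(feat)
    total = [0.0] * c
    count = 0
    for i in range(len(label_mask)):
        for j in range(len(label_mask[0])):
            if label_mask[i][j]:
                count += 1
                for k in range(c):
                    total[k] += feat[k][i][j]
    return [t / count for t in total]

def similarity_mask(vec, feat2, threshold=0.5):
    """Inner product of the feature vector with every spatial position of the
    second feature map; thresholding the scores yields a binary mask image."""
    h, w = len(feat2[0]), len(feat2[0][0])
    mask = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            score = sum(vec[k] * feat2[k][i][j] for k in range(len(vec)))
            mask[i][j] = 1 if score > threshold else 0
    return mask
```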
  20. The apparatus according to any one of claims 16-18, wherein the determining module determining the target to be queried in the second image according to the plurality of first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales specifically comprises:
    determining the target to be queried in the second image by using the plurality of first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales as guidance information for third feature maps of corresponding scales;
    wherein the third feature maps are determined according to the second image, and a second feature map and a third feature map of a same scale are different.
  21. The apparatus according to claim 20, wherein the determining module determining the target to be queried in the second image by using the plurality of first feature maps of different scales, the label of the first image, and the second feature maps of corresponding scales as guidance information for the third feature maps of corresponding scales specifically comprises:
    determining a plurality of first feature vectors of different scales according to the plurality of first feature maps of different scales and the label of the first image;
    calculating with the plurality of first feature vectors of different scales and the second feature maps of corresponding scales according to a preset calculation rule, to obtain a plurality of mask images at different scales;
    determining the target to be queried in the second image according to a result of multiplying the plurality of mask images of different scales with the third feature maps of corresponding scales.
  22. The apparatus according to claim 19, wherein the preset calculation rule comprises: a calculation rule of an inner product, or a calculation rule of a cosine distance.
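The two preset calculation rules named in claim 22 differ only in normalisation: the cosine rule divides the inner product by the vector magnitudes, so the score depends on direction rather than feature magnitude. A small sketch (the claim says "cosine distance"; the snippet computes the closely related cosine similarity, a common implementation choice and an assumption here):

```python
import math

def inner_product(u, v):
    """Preset rule 1: plain inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Preset rule 2 (as similarity): inner product normalised by magnitudes."""
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    return inner_product(u, v) / (norm_u * norm_v)
```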
  23. The apparatus according to claim 16, wherein the determining module determining the target to be queried in the second image according to the plurality of first feature maps of different scales, label information of the first image, and the second feature maps of corresponding scales specifically comprises:
    determining a plurality of similarity maps of different scales according to the plurality of first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales; a similarity map of one scale represents the similarity between the first feature map and the second feature map of that scale;
    integrating the plurality of similarity maps of different scales, to obtain an integrated similarity map;
    determining the target to be queried in the second image according to the integrated similarity map.
  24. The apparatus according to claim 23, wherein the determining module determining the plurality of similarity maps of different scales according to the plurality of first feature maps of different scales, the label information of the first image, and the second feature maps of corresponding scales specifically comprises:
    determining a plurality of first feature vectors of different scales according to the plurality of first feature maps of different scales and the label information of the first image;
    multiplying the plurality of first feature vectors of different scales with the second feature maps of corresponding scales element by element, to obtain the plurality of similarity maps of different scales.
  25. The apparatus according to claim 23 or 24, wherein the determining module integrating the plurality of similarity maps of different scales to obtain the integrated similarity map specifically comprises:
    up-sampling the plurality of similarity maps of different scales, to obtain a plurality of similarity maps of a same scale;
    adding the plurality of similarity maps of the same scale, to obtain the integrated similarity map.
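Claim 25's integration can be sketched as: bring every similarity map to a common (here, the largest) scale, then add element-wise. Nearest-neighbour up-sampling, square maps, and integer scale factors are illustrative assumptions; the claim does not specify the up-sampling method:

```python
def upsample_nearest(sim, factor):
    """Nearest-neighbour up-sampling of a 2D similarity map."""
    return [
        [sim[i // factor][j // factor] for j in range(len(sim[0]) * factor)]
        for i in range(len(sim) * factor)
    ]

def integrate_all(sim_maps):
    """Up-sample every similarity map to the largest scale, then add."""
    target_h = max(len(s) for s in sim_maps)
    scaled = [upsample_nearest(s, target_h // len(s)) if len(s) < target_h else s
              for s in sim_maps]
    h, w = len(scaled[0]), len(scaled[0][0])
    return [[sum(s[i][j] for s in scaled) for j in range(w)] for i in range(h)]
```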
  26. The apparatus according to claim 23 or 24, wherein the determining module integrating the plurality of similarity maps of different scales to obtain the integrated similarity map specifically comprises:
    the plurality of similarity maps of different scales constituting a similarity map set;
    up-sampling the similarity map with the smallest scale in the similarity map set, to obtain a similarity map of the same scale as the similarity map with the second smallest scale;
    adding the obtained similarity map to the similarity map with the second smallest scale, to obtain a new similarity map;
    forming a new similarity map set from the new similarity map and the similarity maps in the similarity map set that have not undergone up-sampling or addition, and repeating the up-sampling step and the addition step until a last similarity map is obtained, the last obtained similarity map being the integrated similarity map.
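Claim 26 integrates progressively instead: repeatedly up-sample the smallest-scale map, add it to the next smallest, and carry the sum forward until one map remains. A sketch under the same illustrative assumptions as before (square maps, nearest-neighbour up-sampling, scales related by integer factors):

```python
def upsample_nearest(sim, factor):
    """Nearest-neighbour up-sampling of a 2D similarity map."""
    return [[sim[i // factor][j // factor] for j in range(len(sim[0]) * factor)]
            for i in range(len(sim) * factor)]

def integrate_coarse_to_fine(sim_maps):
    """Up-sample the smallest-scale map to the next scale, add, and repeat
    with the result until a single integrated similarity map remains."""
    maps = sorted(sim_maps, key=len)   # ascending by scale
    acc = maps[0]
    for nxt in maps[1:]:
        up = upsample_nearest(acc, len(nxt) // len(acc))
        acc = [[up[i][j] + nxt[i][j] for j in range(len(nxt[0]))]
               for i in range(len(nxt))]
    return acc
```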
  27. The apparatus according to any one of claims 23-26, wherein the determining module is further configured to:
    multiply the plurality of similarity maps of different scales with third feature maps of corresponding scales element by element, to obtain a plurality of processed similarity maps of different scales; wherein the third feature maps are determined according to the second image, and a first feature map and a third feature map of a same scale are different;
    integrate the plurality of processed similarity maps of different scales, to obtain the integrated similarity map.
  28. The apparatus according to any one of claims 16-27, wherein the target detection apparatus is implemented by a neural network, and the apparatus further comprises: a training module, configured to train the neural network by the following steps:
    performing feature extraction at a plurality of different scales on a first sample image and a second sample image respectively, to obtain a plurality of fourth feature maps of different scales and a plurality of fifth feature maps of different scales; wherein the first sample image and the second sample image both contain objects of a first category;
    determining the objects of the first category in the second sample image according to the plurality of fourth feature maps of different scales, a label of the first sample image, and the fifth feature maps of corresponding scales; the label of the first sample image is a result of annotating the objects of the first category contained in the first sample image;
    adjusting network parameters of the neural network according to a difference between the determined objects of the first category in the second sample image and a label of the second sample image; the label of the second sample image is a result of annotating the objects of the first category contained in the second sample image.
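Claim 28's training loop adjusts the network parameters from the disagreement between the predicted first-category objects and the second sample image's label. As a toy stand-in for that update (a real implementation would back-propagate a segmentation loss through the network), the sketch below treats a single similarity threshold as the only learnable parameter; the update rule and all names are illustrative assumptions:

```python
def iou(pred, label):
    """Overlap between predicted and labelled binary masks."""
    inter = sum(p & l for pr, lr in zip(pred, label) for p, l in zip(pr, lr))
    union = sum(p | l for pr, lr in zip(pred, label) for p, l in zip(pr, lr))
    return inter / union if union else 1.0

def train_threshold(score_map, label_mask, steps=50, lr=0.05):
    """Toy analogue of claim 28's parameter adjustment: nudge a similarity
    threshold to reduce the disagreement between the predicted mask and the
    second sample image's label."""
    t = 0.0
    for _ in range(steps):
        pred = [[1 if s > t else 0 for s in row] for row in score_map]
        false_pos = sum(p and not l for pr, lr in zip(pred, label_mask)
                        for p, l in zip(pr, lr))
        false_neg = sum(l and not p for pr, lr in zip(pred, label_mask)
                        for p, l in zip(pr, lr))
        t += lr * (false_pos - false_neg)   # raise t on false positives
    return t
```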
  29. The apparatus according to claim 28, wherein the apparatus further comprises:
    a test module, configured to test the trained neural network;
    the test module specifically tests the trained neural network by the following steps:
    performing feature extraction at a plurality of different scales on a first test image and a second test image respectively, to obtain a plurality of first test feature maps of different scales and a plurality of second test feature maps of different scales;
    wherein the first test image and the second test image are derived from a test image set, and each test image in the test image set includes objects of a same category;
    determining the target to be queried in the second test image according to the plurality of first test feature maps of different scales, a label of the first test image, and the second test feature maps of corresponding scales; the label of the first test image is a result of annotating the target to be queried contained in the first test image.
  30. An intelligent driving apparatus, comprising:
    an acquisition module, configured to acquire road images;
    a query module, configured to query an acquired road image for a target to be queried according to a support image and a label of the support image by using the method according to any one of claims 1-14; wherein the label of the support image is a result of annotating a target contained in the support image that belongs to a same category as the target to be queried;
    a control module, configured to control, according to a query result, an intelligent driving device that acquires the road images.
  31. A target detection device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1-14.
  32. An intelligent driving device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to claim 15.
  33. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the target detection method according to any one of claims 1-14, or the program, when executed by a processor, implements the intelligent driving method according to claim 15.
  34. A chip for running instructions, wherein the chip comprises a memory and a processor, the memory stores code and data, the memory is coupled to the processor, and the processor runs the code in the memory such that the chip executes the target detection method according to any one of claims 1-14, or the processor runs the code in the memory such that the chip executes the intelligent driving method according to claim 15.
  35. A program product containing instructions, wherein when the program product runs on a computer, the computer is caused to execute the target detection method according to any one of claims 1-14, or when the program product runs on a computer, the computer is caused to execute the intelligent driving method according to claim 15.
  36. A computer program, wherein when the computer program is executed by a processor, it is used to execute the target detection method according to any one of claims 1-14, or when the computer program is executed by a processor, it is used to execute the intelligent driving method according to claim 15.
PCT/CN2020/123918 2019-10-31 2020-10-27 Target detection and intelligent driving methods and apparatuses, device, and storage medium WO2021083126A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021539414A JP2022535473A (en) 2019-10-31 2020-10-27 Target detection, intelligent driving methods, devices, equipment and storage media
KR1020217020811A KR20210098515A (en) 2019-10-31 2020-10-27 Target detection, intelligent driving method, apparatus, device and storage medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201911063316.4 2019-10-31
CN201911054823.1 2019-10-31
CN201911063316.4A CN112749602A (en) 2019-10-31 2019-10-31 Target query method, device, equipment and storage medium
CN201911054823.1A CN112749710A (en) 2019-10-31 2019-10-31 Target detection and intelligent driving method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021083126A1 true WO2021083126A1 (en) 2021-05-06

Family

ID=75715793

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123918 WO2021083126A1 (en) 2019-10-31 2020-10-27 Target detection and intelligent driving methods and apparatuses, device, and storage medium

Country Status (3)

Country Link
JP (1) JP2022535473A (en)
KR (1) KR20210098515A (en)
WO (1) WO2021083126A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109255352A (en) * 2018-09-07 2019-01-22 北京旷视科技有限公司 Object detection method, apparatus and system
CN109886286A (en) * 2019-01-03 2019-06-14 武汉精测电子集团股份有限公司 Object detection method, target detection model and system based on cascade detectors

Non-Patent Citations (1)

Title
GREGORY KOCH; RICHARD ZEMEL; RUSLAN SALAKHUTDINOV: "Siamese Neural Networks for One-shot Image Recognition", Proceedings of the ICML Deep Learning Workshop, vol. 2, Lille, France, 10-11 July 2015, pages 1-8, XP055445904 *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN113313662A (en) * 2021-05-27 2021-08-27 北京沃东天骏信息技术有限公司 Image processing method, device, equipment and storage medium
CN113643239A (en) * 2021-07-15 2021-11-12 上海交通大学 Abnormity detection method, device and medium based on memory mechanism
CN113643239B (en) * 2021-07-15 2023-10-27 上海交通大学 Abnormality detection method, device and medium based on memory mechanism
CN113642415A (en) * 2021-07-19 2021-11-12 南京南瑞信息通信科技有限公司 Face feature expression method and face recognition method

Also Published As

Publication number Publication date
KR20210098515A (en) 2021-08-10
JP2022535473A (en) 2022-08-09

Similar Documents

Publication Publication Date Title
JP7289918B2 (en) Object recognition method and device
Kamal et al. Automatic traffic sign detection and recognition using SegU-Net and a modified Tversky loss function with L1-constraint
WO2021083126A1 (en) Target detection and intelligent driving methods and apparatuses, device, and storage medium
WO2022126377A1 (en) Traffic lane line detection method and apparatus, and terminal device and readable storage medium
CN112528878A (en) Method and device for detecting lane line, terminal device and readable storage medium
US20230076266A1 (en) Data processing system, object detection method, and apparatus thereof
Wang et al. Centernet-auto: A multi-object visual detection algorithm for autonomous driving scenes based on improved centernet
JP2016062610A (en) Feature model creation method and feature model creation device
WO2022237139A1 (en) Lanesegnet-based lane line detection method and system
CN110781744A (en) Small-scale pedestrian detection method based on multi-level feature fusion
US11340700B2 (en) Method and apparatus with image augmentation
CN110956119B (en) Method for detecting target in image
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN116783620A (en) Efficient three-dimensional object detection from point clouds
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN112395962A (en) Data augmentation method and device, and object identification method and system
CN116188999A (en) Small target detection method based on visible light and infrared image data fusion
Muthalagu et al. Vehicle lane markings segmentation and keypoint determination using deep convolutional neural networks
WO2022217434A1 (en) Cognitive network, method for training cognitive network, and object recognition method and apparatus
Al Mamun et al. Efficient lane marking detection using deep learning technique with differential and cross-entropy loss.
CN112749602A (en) Target query method, device, equipment and storage medium
CN114627183A (en) Laser point cloud 3D target detection method
CN112749710A (en) Target detection and intelligent driving method, device, equipment and storage medium

Legal Events

Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application — Ref document number: 20881806; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase — Ref document number: 20217020811; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase — Ref document number: 2021539414; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase — Ref country code: DE
122 EP: PCT application non-entry in European phase — Ref document number: 20881806; Country of ref document: EP; Kind code of ref document: A1
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established — Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05/09/2022)
122 EP: PCT application non-entry in European phase — Ref document number: 20881806; Country of ref document: EP; Kind code of ref document: A1