CN113514053A - Method and device for generating sample image pair and method for updating high-precision map


Info

Publication number: CN113514053A (granted as CN113514053B)
Application number: CN202110793044.4A
Authority: CN (China)
Legal status: Active (granted)
Other languages: Chinese (zh)
Inventors: 何雷, 宋适宇
Assignee: Apollo Intelligent Technology (Beijing) Co., Ltd.

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/38 Electronic maps specially adapted for navigation; Updating thereof
    • G01C21/3804 Creation or updating of map data

Abstract

The present disclosure provides a method, apparatus, electronic device, and storage medium for generating a sample image pair, relating to the field of artificial intelligence, and in particular to the fields of computer vision, intelligent transportation, automatic driving, and deep learning. The method of generating a sample image pair includes: determining position information of a target object included in the live-action image based on first map information corresponding to the live-action image; determining a random category combination of the target objects included in the live-action image based on predetermined categories; determining position information of a deletion object for the live-action image based on the position information of the target object included in the live-action image; and generating a first image to be updated for the live-action image based on the position information of the deletion object, the position information of the target object included in the live-action image, and the random category combination, and adding a label to an image pair composed of the first image to be updated and the live-action image. The predetermined categories include an addition category and a no-change category.

Description

Method and device for generating sample image pair and method for updating high-precision map
Technical Field
The present disclosure relates to the technical field of artificial intelligence, specifically to the technical fields of computer vision, intelligent transportation, automatic driving, and deep learning, and more specifically to a method, an apparatus, an electronic device, and a storage medium for generating sample image pairs, and a method for updating a high-precision map.
Background
With the development of computer technology and network technology, automatic driving technology and intelligent navigation technology have gradually matured. Both technologies rely on high-precision maps (High Definition Maps) that contain rich environmental information. A high-precision map can represent a traffic topology consisting of roads, traffic lights, and the like.
With the development of cities and changes in traffic planning, a generated high-precision map needs to be continuously updated so that it can represent the actual traffic topology, provide support for the services offered by automatic driving technology and intelligent navigation technology, and improve the user experience of those services.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device, and storage medium for generating a sample image pair to train a target detection model based on the generated sample image pair so that the target detection model can detect changes in the image.
According to an aspect of the present disclosure, there is provided a method of generating a sample image pair, comprising: determining position information of a target object included in the target live-action image based on first map information corresponding to the target live-action image; determining a random category combination of the target object included in the target live-action image based on the predetermined category; determining position information of a deletion object for the target live-action image based on position information of the target object included in the target live-action image; and generating a first image to be updated for the target live-action image based on the position information of the deletion object, the position information of the target object included in the target live-action image, and the random category combination, and adding a label to an image pair composed of the first image to be updated and the target live-action image, wherein the label indicates actual update information of the target live-action image relative to the first image to be updated, and the predetermined category includes an addition category and a no-change category.
According to another aspect of the present disclosure, there is provided an apparatus for generating a sample image pair, comprising: the first position determining module is used for determining the position information of a target object included in the target live-action image based on first map information corresponding to the target live-action image; the category combination determining module is used for determining a random category combination of the target object included in the target live-action image based on a preset category; a second position determination module configured to determine position information of a deletion object for the target live-action image based on position information of the target object included in the target live-action image; the first image generation module is used for generating a first image to be updated aiming at the target live-action image based on the position information of the deleted object, the position information of the target object included in the target live-action image and the random category combination; and a first label adding module, configured to add a label to an image pair formed by the first image to be updated and the target live-action image, where the label indicates actual update information of the target live-action image with respect to the first image to be updated, and the predetermined category includes an addition category and a no-change category.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of generating a sample image pair provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of generating a sample image pair provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of generating a sample image pair provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a method of updating a high-precision map, including: determining an image corresponding to the acquired live-action image in the high-precision map, and acquiring a third image to be updated; inputting the collected live-action image into a first feature extraction network of a target detection model to obtain first feature data; inputting a third image to be updated into a second feature extraction network of the target detection model to obtain second feature data; inputting the first characteristic data and the second characteristic data into a target detection network of a target detection model to obtain update information of the acquired live-action image relative to a third image to be updated; and updating the high-precision map based on the update information. The target detection model is trained based on the sample image pair generated by the method for generating the sample image pair provided by the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow diagram of a method of generating a sample image pair according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of generating a first image to be updated for a live-action image according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating the principle of determining location information of a deleted object according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a method for generating a sample image pair based on live-action video, according to an embodiment of the present disclosure;
FIG. 5 is a flow diagram of a method of training a target detection model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a structure of a target detection model according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a structure of an object detection model according to another embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an object detection model according to another embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of an object detection model according to another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a parallel cross-point difference unit, according to an embodiment of the present disclosure;
FIG. 11 is a flow diagram of a method of determining updated information for an image using a target detection model according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of an apparatus for generating a sample image pair according to an embodiment of the present disclosure;
FIG. 13 is a block diagram of an apparatus for training a target detection model according to an embodiment of the present disclosure;
FIG. 14 is a block diagram of an apparatus for determining updated information of an image using a target detection model according to an embodiment of the present disclosure; and
FIG. 15 is a block diagram of an electronic device for implementing the methods provided by embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Both automatic driving technology and intelligent navigation technology need to rely on high-precision maps containing rich environmental information. The environmental information includes, for example, lane information, crosswalk information, position information of traffic lights, position information of intersections, and the like. High-precision maps are an important source of prior knowledge and must maintain their ability to reflect recent changes in the real world through continuous update iterations. Real-world changes may include, for example, the installation or removal of indicator lights, movement of the position of portable traffic lights, and the like.
In the related art, tasks related to map updating include detecting scene changes, and the technical schemes adopted are mainly divided into three categories. The first category compares a 3D model with another 3D model, using a pre-constructed 3D CAD model and a model reconstructed by classical Multi-View Stereo (MVS) methods. This approach is time-consuming and only suitable for offline scenarios. The second category infers changes in the scene by comparing a newly acquired image with the original three-dimensional model; specifically, the change is inferred by comparing the voxel colors of the 3D voxel model with the pixel colors of the corresponding image. A related alternative identifies changes by re-projecting the new image onto the old image with the help of a given 3D model and comparing the disparity information between the new image and the old image. The third category performs a two-dimensional comparison between an image representing the old state of the scene and an image representing the new state of the scene; this method requires preparing the 2D images in advance.
In addition to detecting changes in a scene, the change detection task for a high-precision map should also identify which elements of the high-precision map have changed and the type of change. A simple approach is to use a standard object detector to identify map elements in the image, project the map elements onto the image, associate the projections with the detections, and finally obtain the corresponding changes by cross-comparison. Object detection itself is a classic problem in computer vision, and solutions are mainly divided into two categories: two-stage methods and one-stage methods. As can be seen, this simple approach involves a number of steps, each with its own optimization objective, so the overall change detection method can hardly reach a globally optimal solution. For example, the object detector typically trades off precision against recall by setting a threshold on the detection confidence score and running Non-Maximum Suppression (NMS). Such an approach also ignores the important prior information contained in the high-precision map.
In order to realize the change detection task for a high-precision map, the present disclosure provides an end-to-end learning method that directly detects image changes. More specifically, a target detection model is used to detect missing or redundant elements in the high-precision map. In order to incorporate the prior information of the high-precision map, the elements in the map can be projected onto the image to obtain an image to be updated. The image to be updated and the live-action image are taken as input to the target detection model, which detects the difference between the features extracted from the two images and, based on this difference, predicts the missing or redundant elements in the map. The method of the present disclosure can be generalized to the task of detecting changes of any object with a regular shape, and the present disclosure is not limited in this respect.
This is because the form of the HD Map Change Detection (HMCD) task is similar to the object detection problem. The goal is to identify changes in predefined object classes (e.g., traffic lights, road guide signs, speed limit signs, etc.). The position of a detected object in the image can be described by a two-dimensional Bounding Box and assigned the correct change category, which may include addition, deletion, no change, etc. For example, an object with the to_add attribute is an object missing from the high-precision map (it should be added), an object with the to_del attribute is an object redundant in the high-precision map (it should be deleted), and an object with the correct attribute is an object that is unchanged in the high-precision map. Formally, for an online HMCD task using a single image as input, the problem to be solved can be represented by the following formula:
$$D_k = f_\theta(M, I_k, T_k, K)$$
where $I_k$ is the $k$-th image frame in the video stream, $T_k$ is the global camera pose estimated by the positioning system of the autonomous vehicle, $K$ is the intrinsic parameter matrix of the image acquisition device, and $M$ is the high-precision map. $D_k$ is the set of two-dimensional bounding boxes with corresponding change categories predicted by the HMCD predictor $f_\theta$ based on a set of learnable parameters $\theta$. The HMCD predictor may be the target detection model provided by the present disclosure.
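For illustration, the formulation above can be read as a simple interface. The following Python sketch only mirrors the inputs $M$, $I_k$, $T_k$, $K$ and the output $D_k$; all names (HMCDPredictor, BoundingBox2D, etc.) are hypothetical and not taken from the patent.

```python
# Illustrative sketch of the online HMCD predictor D_k = f_theta(M, I_k, T_k, K).
from dataclasses import dataclass
from typing import List

@dataclass
class BoundingBox2D:
    cx: float           # bounding box center x in pixels
    cy: float           # bounding box center y in pixels
    width: float
    height: float
    change_class: str   # "to_add", "to_del" or "correct"

class HMCDPredictor:
    def __init__(self, theta):
        self.theta = theta  # learnable parameters of the detection model

    def predict(self, hd_map, image_k, pose_k, intrinsics_k) -> List[BoundingBox2D]:
        """Return D_k: 2D bounding boxes with change categories for frame k."""
        raise NotImplementedError  # provided by the trained target detection model
```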
The generation method of the sample image pair used to train the target detection model will be described below with reference to fig. 1 to 4.
Fig. 1 is a flow diagram of a method of generating a sample image pair according to an embodiment of the disclosure.
As shown in fig. 1, the method 100 of generating a sample image pair of this embodiment may include operations S110 to S150.
In operation S110, position information of a target object included in a target live-action image is determined based on first map information corresponding to the target live-action image.
According to an embodiment of the present disclosure, the target live-action image may be, for example, an image captured in real time while the vehicle is traveling. The vehicle may be, for example, an autonomous vehicle, which may be equipped with an image capture device (e.g., a camera), for example, via which real-time images are captured. The real-time image may be a separate image or a key frame in the captured video data.
According to an embodiment of the present disclosure, the vehicle may also be configured with a Global Navigation Satellite System (GNSS), for example. The high-precision map image positioned by the global navigation satellite system when the image acquisition device acquires the target live-action image can be used as the first map information. Wherein the high-precision map may be a newly updated map. The high-precision map has location information of target objects (e.g., the aforementioned predefined object classes) therein. The position information may be global positioning information, or may be two-dimensional plane coordinate information with respect to the high-precision map image converted from the global positioning information. The conversion rule is similar to the conversion rule adopted by the navigation system in the related art, and is not described in detail herein.
In an embodiment, taking the target object as a traffic light as an example, the position information of the traffic light in the collected target live-action image may be obtained through operation S110. In an embodiment, the position information may be represented by position information of a bounding box for the target object, for example. For example, the position information of the target object includes coordinates of a center point of the bounding box, and the width and height of the bounding box.
According to the embodiment of the disclosure, a live-action image meeting a predetermined position constraint condition can be selected from a plurality of collected live-action images as the target live-action image for generating a sample. The sample may also be generated based on a pre-acquired target live-action image. For example, an image satisfying the predetermined position constraint condition may be acquired from a live-action image library as the target live-action image. The predetermined position constraint condition may include that the distance between the image acquisition device and a target object in the real scene is less than or equal to a predetermined distance. For example, the constraint may be that the distance between a target object in the real scene and the center of the image acquisition device is 100 m or less. Alternatively, the predetermined position constraint may include that the angle between the viewing direction of the image acquisition device and the reverse normal direction of the target object is not larger than a predetermined angle. The predetermined angle may be a small value such as 30°. Restricting the samples with the predetermined position constraint condition improves the clarity of the target objects in the target live-action images used to generate samples. Accordingly, the accuracy of the target detection model trained on the generated samples can be improved, and the updating accuracy of the high-precision map is improved.
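As an illustration of the predetermined position constraint, the sketch below checks the example thresholds mentioned above (100 m distance, 30° angle). It assumes the camera pose and the object's center and normal are available in a common world coordinate frame, which is an assumption rather than something the patent prescribes.

```python
# Minimal sketch of the predetermined position constraint check.
import numpy as np

def satisfies_position_constraint(cam_center, cam_forward, obj_center, obj_normal,
                                  max_distance=100.0, max_angle_deg=30.0):
    # Distance constraint: object must be within max_distance of the camera center.
    distance = np.linalg.norm(obj_center - cam_center)
    if distance > max_distance:
        return False
    # Angle between the viewing direction and the reverse normal of the object.
    reverse_normal = -obj_normal / np.linalg.norm(obj_normal)
    viewing_dir = cam_forward / np.linalg.norm(cam_forward)
    cos_angle = np.clip(np.dot(viewing_dir, reverse_normal), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)) <= max_angle_deg
```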
According to an embodiment of the present disclosure, the first map information corresponding to the target live-action image in an offline map (e.g., a high-precision map) may be determined based on the pose of the image acquisition device when capturing the target live-action image. For example, a Region of Interest (ROI) may be located in the high-precision map based on the global pose (including, for example, position and orientation) of the image acquisition device at the time the target live-action image was captured, and a two-dimensional image of the region of interest may be taken as the first map information. In this way, the obtained first map information matches the target live-action image, which improves the accuracy of the obtained position information of the target object.
In operation S120, a random category combination of the target objects included in the target live-action image is determined based on predetermined categories.
According to the embodiment of the present disclosure, the target object elements may be queried based on the first map information determined above. When the high-precision map is the latest updated map, the queried target object elements represent the target objects included in the target live-action image, so that the position information and the number of the target objects can be obtained. In order to generate the sample image pair, the category of any target object in the target live-action image may be set: the target object may be one that is added relative to the non-updated high-precision map, or one that is unchanged relative to the non-updated high-precision map.
In the HMCD task, the predetermined categories may include an addition category and a no-change category. If there is one target object in the target live-action image, the random category combination may be: the target object is of the addition category, or the target object is of the no-change category. If the number of target objects in the target live-action image is n, where n is an integer greater than 1, there are $2^n$ possible category combinations for the n target objects, and the embodiment may randomly select one of these $2^n$ combinations as the random category combination.
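A minimal sketch of sampling the random category combination described above; independent per-object sampling is one simple way (among others) to draw uniformly from the $2^n$ combinations.

```python
# Sketch: assign each target object either the addition or the no-change category,
# which is equivalent to drawing uniformly from the 2^n possible combinations.
import random

def random_category_combination(num_targets: int):
    return [random.choice(["to_add", "correct"]) for _ in range(num_targets)]
```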
In operation S130, position information of a deletion object for the target live-action image is determined based on the position information of the target objects included in the target live-action image.
According to the embodiment of the present disclosure, any area in the target live-action image other than the areas indicated by the position information of the target objects may be used as the position of the deletion object, and the position information of that area may be used as the position information of the deletion object. Specifically, the position coordinates of any area capable of accommodating one target object, excluding the areas indicated by the position information of the target objects, may be used as the position information of the deletion object for the target live-action image. Similarly to the position information of the target object described above, the position information of the deletion object may include the center coordinates of that area together with its height and width. The area may be the area occupied by the bounding box of the deletion object.
In operation S140, a first image to be updated for the target live-action image is generated based on the position information of the deletion object, the position information of the target object included in the target live-action image, and the random category combination.
In operation S150, a label is added to an image pair composed of the first image to be updated and the target live-action image.
According to the embodiment of the present disclosure, a base image of the same size as the target live-action image with all pixel values set to 0 may be generated. The target objects of the no-change category are determined based on the random category combination. Then, regions of interest indicating object positions are added to the base image based on the position information of the deletion object and the position information of the target objects of the no-change category, resulting in a Mask image, which is taken as the first image to be updated. Thus, a target object of the addition category in the target live-action image is an object added relative to the first image to be updated, and a target object of the no-change category in the target live-action image is an object that is unchanged relative to the first image to be updated. The region of interest in the first image to be updated that indicates the position of the deletion object corresponds to an object that has been deleted in the target live-action image relative to the first image to be updated.
Based on this, the embodiment may form an image pair from the target live-action image and the first image to be updated for the target live-action image, thereby obtaining the live-action image and the image to be updated that are input to the target detection model. Subsequently, the image pair is annotated based on the position information of the target objects of each category: bounding boxes indicating the target objects are added to the image pair, and the category of each target object is correspondingly annotated as the category of its bounding box, completing the annotation and yielding a labeled image pair. The labeled image pair can be used as a sample image pair. For example, the added label may be embodied by the bounding boxes added to the live-action image in the image pair and the categories added for the bounding boxes, so that the label indicates the actual update information of the live-action image relative to the image to be updated. By adding the label, supervised training of the target detection model can be achieved. Specifically, the target detection model may be trained according to the difference between the predicted update information output by the target detection model and the actual update information.
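The mask construction described above may be sketched as follows; the box format (cx, cy, w, h), the function names, and the label structure are illustrative assumptions rather than the patent's exact implementation.

```python
# Sketch: build the first image to be updated as a mask and the corresponding label.
import numpy as np

def build_image_to_update(image_shape, target_boxes, categories, deletion_box):
    # Start from an all-zero image the size of the live-action image, then fill the
    # regions of the no-change targets and of the deletion object with 255.
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    boxes_to_draw = [b for b, c in zip(target_boxes, categories) if c == "correct"]
    boxes_to_draw.append(deletion_box)  # the deletion object exists only on the map side
    for cx, cy, w, h in boxes_to_draw:  # boxes are assumed to lie inside the image
        x0, x1 = int(cx - w / 2), int(cx + w / 2)
        y0, y1 = int(cy - h / 2), int(cy + h / 2)
        mask[max(y0, 0):y1, max(x0, 0):x1] = 255
    return mask

def build_label(target_boxes, categories, deletion_box):
    # The label records, for every object, its bounding box and change category.
    label = [{"box": b, "class": c} for b, c in zip(target_boxes, categories)]
    label.append({"box": deletion_box, "class": "to_del"})
    return label
```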
In summary, when a sample image pair is constructed, the position information of the target object is located and obtained based on the high-precision map, the position information of the deleted object is calculated based on the positions of the target objects of the added type and the unchanged type, and the type combination of the target objects is randomly determined, so that the automatic generation of the image to be updated and the label can be realized, manual pre-labeling is not needed, and the labor cost for generating the sample image pair can be reduced. Moreover, the image to be updated in the sample image pair can be automatically generated without being recalled in advance, so that the situation that the sample image pair is difficult to collect due to sample sparsity can be avoided, the difficulty of generating the sample image pair is reduced, and the training precision of the target detection model is improved conveniently.
Fig. 2 is a schematic diagram of generating a first image to be updated for a live view image according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, the first image to be updated can be obtained by converting the high-precision map image into the raster image. Since each pixel position of the high-precision map image is the prior information, compared with the technical scheme of obtaining the first image to be updated based on the live-action image, the embodiment can improve the efficiency and the precision of generating the first image to be updated.
For example, the embodiment may first generate a first raster image for the target live-action image based on the first map information and the position information of the target object included in the target live-action image. The first raster image may indicate a location of a target object included in the target live-action image. Specifically, as shown in fig. 2, in this embodiment 200, a high-precision map image 210 as first map information may be first converted into a first raster image 220 based on the position information of the target object. In the first raster image 220, the other regions except the region corresponding to the target object are filled with black having pixel values of 0. The region corresponding to the target object is filled with white having pixel values of 255.
After obtaining the first raster image 220, the operation of generating the first image to be updated may include: adjusting the first raster image based on the position information of the deletion object and the random category combination to obtain the first image to be updated. For example, based on the position information 230 of the deletion object, the pixel color of the area corresponding to the deletion object in the first raster image may be changed to white to indicate the position of the deletion object. The target objects of the addition category may be determined according to the random category combination, and based on the position information of the target objects of the addition category (i.e., the position information 240 of the added object), the pixel color of the area corresponding to the added object in the first raster image is changed from white to black, removing the indication of the position of the added object and resulting in the first image to be updated 250. For example, if the third white region from left to right in the first raster image 220 indicates the position of an added object, the corresponding region in the resulting first image to be updated 250 is black. If the area corresponding to the deletion object is located to the right of the fifth white region from left to right in the first raster image 220, a new white area is added to the right of that fifth white region in the resulting first image to be updated 250. The resulting first image to be updated 250 indicates the position of the deletion object in the target live-action image and the positions of the target objects of the no-change category in the target live-action image.
Fig. 3 is a schematic diagram illustrating the principle of determining position information of a deletion object according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, before generating the sample image pairs, the position information of the target objects included in a plurality of target live-action images of equal size may be aggregated to obtain the position distribution information of the target objects as predetermined position distribution information. When a sample pair is generated, the position of the deletion object is located based on this predetermined position distribution information. In this way, the obtained position information of the deletion object better matches the actual scene, which facilitates improving the learning capability and accuracy of the target detection model.
As shown in fig. 3, the embodiment 300 may recall a plurality of target live-action images from the live-action image library based on the predetermined position constraint condition. In order to facilitate feature extraction of the target detection model and comparison of the live-action images with the images to be updated, the sizes of the target live-action images are equal to each other, and the sizes of the target live-action images are equal to the sizes of the corresponding images to be updated. Based on the foregoing method for determining the position information of the target object, the map information corresponding to each of the plurality of target live-action images may be obtained, and the position information of the target object included in each of the plurality of target live-action images may be obtained. Based on this, a raster image for each live view image may be generated, resulting in a plurality of raster images 310. By counting the position information of the target object included in the plurality of target live-action images, for example, the position distribution density map 320 of the target object may be generated. The position distribution density map 320 is used as predetermined position distribution information.
The embodiment may determine the position information of the deletion object by determining an area having a distribution density greater than a predetermined density based on the predetermined position distribution information. Specifically, a region having a distribution density greater than a predetermined density may be determined based on the predetermined position distribution information, and mapped to a first region in the first raster image for the target live-view image 340, the first region being a candidate region.
Illustratively, the candidate region may be represented based on the background image 330 as shown in fig. 3, considering that the sizes of the target live-view images are equal. The size of the background image 330 is equal to the size of the first raster image. The candidate region 331 is obtained by mapping the region having the distribution density greater than the predetermined density into the background image 330.
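The statistics and candidate-region steps above may be sketched as follows, assuming all raster images share the same size; the density normalization and the threshold value are illustrative choices, not values from the patent.

```python
# Sketch: accumulate target-object boxes from many equally sized images into a
# position distribution density map, then threshold it to obtain the candidate region.
import numpy as np

def position_density_map(image_shape, boxes_per_image):
    density = np.zeros(image_shape[:2], dtype=np.float32)
    for boxes in boxes_per_image:            # one list of (cx, cy, w, h) boxes per image
        for cx, cy, w, h in boxes:           # boxes are assumed to lie inside the image
            x0, x1 = int(cx - w / 2), int(cx + w / 2)
            y0, y1 = int(cy - h / 2), int(cy + h / 2)
            density[max(y0, 0):y1, max(x0, 0):x1] += 1.0
    return density / max(len(boxes_per_image), 1)

def candidate_region(density, predetermined_density=0.05):
    return density > predetermined_density   # boolean mask of the candidate region
```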
When generating a sample image pair based on a target live-action image 340 of a plurality of target live-action images, the position information 350 of the target object may be obtained according to the first map information corresponding to the target live-action image 340. Based on the position information, a position distribution map 360 of the target object in the target live-action image may be obtained. The histogram 360 is equal in size to both the background image 330 and the target live-action image 340.
After the candidate region is obtained, the second region may be removed from the candidate region, and the position information of the deletion object may be determined based on other regions outside the second region. Wherein the second region is a region indicating a position of the target object included in the target live-view image 340. For example, as shown in fig. 3, as for the candidate area 331, an area overlapping with an area representing the target object A, B, C in the histogram 360 may be set as the second area. After the second region is removed from the candidate region, a background image 370 indicating the other region may be obtained. This embodiment may select an arbitrary area capable of accommodating one target object from the other areas and use the position information of the arbitrary area as the position information of the deletion object for the target live-view image 340. For example, the position of the dot 371 in the background image 370 may be the center position of the arbitrary region.
According to the embodiment of the present disclosure, the size of the deletion object may be determined based on the average of the sizes of the target objects included in the target live view image. This is because the size of the deletion object is not specified directly based on the first map information corresponding to the target live-action image in the newly updated high-precision map. In this manner, an arbitrary area capable of accommodating the deletion object can be determined based on the size of the deletion object. By this means, the accuracy of the determined position information of the deletion object can be improved.
For example, the target size may be determined based on the size of the second region and the number of target objects included in the target live-action image. Any area within the other regions whose size is equal to the target size is then determined, giving the position information of the deletion object. The target size is the size of the deletion object, i.e., the average size of the target objects included in the target live-action image. For example, in embodiment 300, the target objects included in the target live-action image are A, B, and C. The size of the second region is given by the sum of the widths and the sum of the heights of the regions of A, B, and C that overlap the candidate region 331. Dividing the width sum and the height sum by the number of target objects, 3, yields the width and the height of the target size.
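The average-size computation above may be sketched as follows; it assumes at least one target object overlaps the candidate region, and the (cx, cy, w, h) box format is the same illustrative convention used earlier.

```python
# Sketch: deletion-object size as the average size of the target objects that
# overlap the candidate region (the "second region"), e.g. A, B, C above.
def deletion_object_size(overlapping_boxes):
    n = len(overlapping_boxes)               # assumed to be at least 1
    width = sum(b[2] for b in overlapping_boxes) / n
    height = sum(b[3] for b in overlapping_boxes) / n
    return width, height
```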
Based on the above flow, the position information of the target objects and the position information of the deletion object included in the target live-action image 340 can be obtained. For ease of understanding, this embodiment provides a position image 380, which may indicate the positions of the target objects A, B, and C and the position of the deletion object D. Subsequently, the first raster image for the target live-action image 340 may be adjusted according to the categories of the target objects in the target live-action image and the deletion category of the deletion object. For example, among A, B, C, and D, B is of the addition category, A and C are of the no-change category, and D is of the deletion category. The embodiment should therefore delete the information indicating the position of B from the first raster image and add information indicating the position of D, resulting in the first image to be updated 390 for the target live-action image 340. The first image to be updated 390 and the target live-action image 340 may constitute an image pair. Subsequently, based on the positions of A, B, C, and D, their positions may be marked in the first image to be updated 390 and the target live-action image 340, and a label indicating the object category may be added for each position, completing the label addition and obtaining a sample image pair.
Fig. 4 is a schematic diagram illustrating the principle of a method for generating a sample image pair based on live-action video according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, video frames in the captured video data may also be employed to generate sample image pairs. Each video frame in the video data may be used as a target live-action image to generate a sample image pair including the each video frame by the method for generating the sample image pair described above.
According to the embodiment of the present disclosure, when generating a pair of sample images by using video data, for example, a target frame may be extracted from the video data first, and the target frame may be used as a target live-action image to generate a pair of sample images including the target frame. Then, based on the position information and the category of the target object included in the target frame, the position information and the category of the target object included in the video frames other than the target frame in the video data are presumed. This is because the video data is composed of a plurality of consecutive video frames, which are not independent from each other. Accordingly, the target live-action image described above may include a target frame in the live-action video. The target Frame may be, for example, a Key Frame (Key Frame) in the live-action video. For convenience of understanding, the video data may be segmented in advance, so that each video frame in each segmented live-action video includes the same target object.
For example, as shown in fig. 4, the process of generating sample image pairs based on a live-action video in this embodiment 400 may include operations S410 to S440. In operation S410, a target frame is first selected, and a label 401 for the target frame is generated. The label 401 may include labels for target objects of at least one of three categories: the deletion category, the addition category, and the no-change category described above. In one embodiment, the live-action video may include M frames, M being a natural number greater than 1, and the k-th video frame $O_k$ is selected from the live-action video as the target frame. Suppose the target frame includes a target object $A_1$ of the addition category and a target object $A_2$ of the no-change category, and that, relative to the first image to be updated, there is a deleted target object $A_3$ of the deletion category. The resulting label 401 includes bounding boxes indicating the target object $A_1$ of the addition category, the target object $A_2$ of the no-change category, and the target object $A_3$ of the deletion category.
Taking the target object $A_1$ of the addition category and the target object $A_2$ of the no-change category as reference, operation S420 may be performed to add labels for no-change objects and labels for added objects to the frames of the live-action video other than the target frame. For example, using the aforementioned method of determining the position information of the target objects included in the target live-action image, the position information of the target objects included in the other frames may be determined based on second map information corresponding to the other frames. The second map information is similar to the first map information described above and is not described again here. Subsequently, the categories of the target objects included in the other frames are determined based on the categories of the target objects included in the target frame. Based on the categories and position information of the target objects included in the other frames, labels for no-change objects and labels for added objects may be added to the other frames.
Illustratively, based on the conversion relationship between the two-dimensional space and the three-dimensional space adopted when generating the high-precision map, the position information of the target objects $A_1$ and $A_2$ in the three-dimensional space of the high-precision map can be projected into the video frames to obtain the position information of $A_1$ and $A_2$ in the video frames.
Since the newly updated high-precision map contains no deleted objects, the embodiment may perform three-dimensional reconstruction of the deletion object through operation S430 to obtain the three-dimensional spatial information of the deletion object. For example, the three-dimensional spatial information of the deletion object may be determined based on the depth of the target objects included in the target frame and the position information of the deletion object, so as to implement the three-dimensional reconstruction of the deletion object based on this three-dimensional spatial information. The average depth of the target objects included in the target frame may be used as the depth of the deletion object. The three-dimensional spatial information of the deletion object is then obtained based on the depth of the deletion object, the position information of the bounding box of the deletion object, and the pose of the image acquisition device at the time the target frame was captured. When obtaining the three-dimensional spatial information, the intrinsic parameter matrix of the image acquisition device also needs to be considered, for example.
In one embodiment, the three-dimensional spatial information $P_{to\_del}$ can be obtained using the following formula:
$$P_{to\_del} = R^{-1}\left[K^{-1}\left(d_{to\_del} \times p_{to\_del}\right) - t\right]$$
where $R$ is the rotation matrix of the image acquisition device, $t$ is the translation vector of the image acquisition device, $d_{to\_del}$ is the depth of the deletion object, and $p_{to\_del}$ is the position information of the bounding box of the deletion object.
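A NumPy sketch of this back-projection, assuming a pinhole model of the form $x = K(RX + t)$ and taking $p_{to\_del}$ as pixel coordinates of the bounding box (e.g., its center); the variable names are illustrative.

```python
# Sketch: P_to_del = R^{-1} [ K^{-1} (d_to_del * p_to_del) - t ]
import numpy as np

def reconstruct_deleted_object(R, t, K, depth, pixel_xy):
    p_h = np.array([pixel_xy[0], pixel_xy[1], 1.0])   # homogeneous pixel coordinates
    cam_point = np.linalg.inv(K) @ (depth * p_h)       # point in camera coordinates
    return np.linalg.inv(R) @ (cam_point - t)          # point in world coordinates
```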
After obtaining the three-dimensional spatial information, the embodiment may generate a second image to be updated for the other frame based on the three-dimensional spatial information, the position information and the category of the target object of the other frame, and add a label to an image pair composed of the second image to be updated and the other frame.
According to an embodiment of the present disclosure, the method of generating the second image to be updated is similar to the method of generating the first image to be updated described earlier. For example, before generating the second image to be updated, a second raster image for the other frame may be generated based on the second map information and the position information of the target objects included in the other frame. The second raster image indicates the positions of the target objects included in the other frame. The second raster image is generated in a manner similar to that of the first raster image described above, and its generation may be performed before or after operation S430. After the second raster image is obtained, it may be adjusted based on the three-dimensional spatial information and the categories of the target objects included in the other frame to obtain the second image to be updated. This is similar to the method of adjusting the first raster image described above, except that the embodiment requires a two-dimensional projection of the reconstructed three-dimensional deletion object (operation S440) before adjusting the second raster image, which yields the position information of the deletion object in the other frames. The obtained second image to be updated indicates the position of the deletion object in the other frame and the positions of the target objects of the no-change category in the other frame.
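Correspondingly, the two-dimensional projection in operation S440 may be sketched as follows, again assuming the pinhole model used above and that the pose (R_f, t_f) of the other frame is known; this is an illustrative sketch, not the patented implementation.

```python
# Sketch: project the reconstructed 3D deletion object into another frame.
import numpy as np

def project_to_frame(P_world, R_f, t_f, K):
    cam_point = R_f @ P_world + t_f
    p = K @ cam_point
    return p[:2] / p[2]    # pixel coordinates of the deletion object in that frame
```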
Similarly, a method similar to the aforementioned method of adding a label to an image pair composed of the first image to be updated and the target live-action image may be adopted to add a label to an image pair composed of the second image to be updated and the other frame based on the obtained position information of the deletion object, the position information of the target object included in the other frame, and the category.
When the video data includes a plurality of video frames in addition to the target frame, for each video frame, a sample image pair may be obtained by the foregoing method. In this way, based on one video data, a sample image pair set composed of a plurality of sample image pairs can be obtained.
According to the embodiment of the disclosure, the deleted object is three-dimensionally reconstructed based on the deleted object of the target frame, and the sample image pair is generated based on the obtained three-dimensional spatial information, so that the consistency of the position of the deleted object in each video frame in the live-action video can be ensured, the efficiency of generating the label of the deleted object is improved to a certain extent, and the cost of generating the sample image pair based on the live-action video is reduced.
According to an embodiment of the present disclosure, two sample sets may be generated based on the foregoing methods: one including a plurality of sample image sets constructed based on single-frame live-action images (SICD), and one including a plurality of sample image sets constructed based on video data (VSCD). The live-action images and video data can be collected under conditions of little environmental change so as to comprehensively verify the differences in detection results obtained with different target detection models. In order to test the target detection model, a test set can be formed based on the images to be updated and the live-action images used in the existing high-precision map updating process. The labels of each image pair in the test set may be manually annotated. To further increase the diversity of samples, the embodiment may collect live-action images and video data in multiple cities.
According to the embodiment of the disclosure, when labeling the image pair based on the position information and the category of the target object and the deletion object, for example, bounding boxes with different colors may be added to the image pair for different types of objects, so as to generalize the method of the embodiment of the disclosure to change detection of multiple types of objects. Experiments show that when the image is rasterized in other detection tasks, the selection of the bounding boxes with different colors does not have obvious influence on the detection result.
After the sample set is constructed, the target detection model may be trained based on the sample image pairs in the sample set.
The training method of the object detection model provided by the present disclosure will be described in detail below with reference to fig. 5 to 9.
Fig. 5 is a flow chart diagram of a training method of an object detection model according to an embodiment of the present disclosure.
As shown in fig. 5, the training method 500 of the object detection model of this embodiment may include operations S510 to S540. The target detection model comprises a first feature extraction network, a second feature extraction network and a target detection network.
In operation S510, a first live-action image in the sample image pair is input to a first feature extraction network, and first feature data is obtained.
The sample image pair is generated by the method described above and includes a live-action image and an image to be updated. The sample image pair has a label indicating the actual update information of the live-action image relative to the image to be updated. The actual update information includes an update category and update position information. The update category includes an addition category, a deletion category, and a no-change category, and the update position information is represented by the position of the bounding box of the target object.
According to an embodiment of the present disclosure, the first feature extraction network may include a convolutional neural network or a deep residual network, or the like. For example, the convolutional neural network may include the DarkNet-53 network from You Only Look Once v3 (YOLO v3), so as to better balance accuracy and inference speed. It is to be understood that the type of the first feature extraction network described above is merely an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
In operation S520, a first image to be updated in the sample image pair is input to a second feature extraction network, so as to obtain second feature data.
According to an embodiment of the present disclosure, the second feature extraction network may include a convolutional neural network. Considering that the first image to be updated is a raster image, which is simpler than the live-action image, the network structure of the second feature extraction network may be simpler than that of the first feature extraction network. For example, the second feature extraction network may be a shallow convolutional neural network with a stride of 2 and 11 layers, and the convolution kernels in the network may be 3 × 3 in size, so as to increase the receptive field on the raster image.
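A hedged PyTorch sketch of such a shallow raster-image branch follows; the channel widths and the number of stride-2 stages are illustrative assumptions and not taken from the patent.

```python
# Sketch: shallow convolutional encoder for the raster (mask) image,
# using 3x3 kernels and stride-2 downsampling stages.
import torch
import torch.nn as nn

class RasterEncoder(nn.Module):
    def __init__(self, in_channels=1, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        layers, c_in = [], in_channels
        for c_out in widths:
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.LeakyReLU(0.1, inplace=True)]
            c_in = c_out
        self.body = nn.Sequential(*layers)

    def forward(self, raster):          # raster: (B, 1, H, W) mask image
        return self.body(raster)        # second feature data
```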
In operation S530, the first feature data and the second feature data are input into the target detection network, so as to obtain the predicted update information of the first live-action image relative to the first image to be updated.
According to an embodiment of the present disclosure, the target detection network may, for example, use an anchor-based detection method to obtain the predicted update information. For example, the target detection network may first concatenate the first feature data and the second feature data, and then obtain the predicted update information based on the concatenated feature vector. Apart from the feature extraction structure, the target detection network may be a one-stage detection network or a two-stage detection network.
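A minimal sketch of how the two feature maps might be fused before an anchor-based head; the single 1×1 convolution shown is only a placeholder for the detection network, and the channel and anchor counts are assumptions.

```python
# Sketch: concatenate the live-action and raster feature maps along the channel
# axis and feed them to a placeholder anchor-based prediction layer.
import torch
import torch.nn as nn

class ChangeDetectionHead(nn.Module):
    def __init__(self, live_channels, raster_channels, num_anchors=3, num_classes=3):
        super().__init__()
        fused = live_channels + raster_channels
        # per anchor: 4 box offsets + 1 objectness confidence + class probabilities
        self.pred = nn.Conv2d(fused, num_anchors * (5 + num_classes), kernel_size=1)

    def forward(self, live_feat, raster_feat):
        fused = torch.cat([live_feat, raster_feat], dim=1)
        return self.pred(fused)          # raw predicted update information
```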
The prediction update information may include predicted position information of the target object (which may be represented by the position of the bounding box of the target object) and a category of the target object. Wherein the class of the target object may be represented by a probability that the target object belongs to an add class, a delete class, and a no change class.
In operation S540, the target detection model is trained based on the actual update information and the predicted update information.
According to embodiments of the present disclosure, a gradient descent algorithm or a back propagation algorithm, or the like, may be employed to train the target detection model based on the difference between the actual update information and the predicted update information. For example, a value of the predetermined loss function may be determined based on the actual update information and the predicted update information, and then the target detection model may be trained based on the value of the predetermined loss function. Specifically, the value of a parameter in the target detection model when the predetermined loss function is minimum may be determined, and the target detection model may be adjusted based on the value of the parameter.
The predetermined loss function may, for example, be composed of a category loss function and a localization loss function. The category loss function may be a cross-entropy loss function or a cross-entropy loss function modified with Focal Loss, and the localization loss function may be a mean absolute error function, a mean squared error loss function, or the like.
In one embodiment, the target detection network may adopt a YOLO v3 network, and the obtained prediction update information may further include a confidence level indicating a probability that the bounding box has an object. The predetermined loss function may include not only a category loss function and a localization loss function, but also a confidence loss function in order to improve the performance of the complex classification. The confidence loss function is divided into two parts, one part is the confidence loss with a target, and the other part is the confidence loss without a target. Thus, when determining the value of the predetermined loss function, the value of the positioning loss function may be determined based on the actual position information and the predicted position information indicated by the label of the sample image pair, so as to obtain a first value. And determining the value of the confidence coefficient loss function based on the actual position information and the confidence coefficient to obtain a second value. And determining the value of the category loss function based on the actual category and the prediction category to obtain a third value. And finally, taking the weighted sum of the first value, the second value and the third value as the value of the predetermined loss function.
Illustratively, the predetermined loss function L may be expressed as:

L = λ1·L_GIoU + λ2·L_conf + λ3·L_prob

where L_GIoU is the localization loss function, L_conf is the confidence loss function, and L_prob is the category loss function; λ1, λ2, and λ3 are the weights assigned to L_GIoU, L_conf, and L_prob, respectively. The values of the weights may be set according to actual requirements, for example all set to 1, which is not limited in this disclosure.
In one embodiment, the localization loss function may use, for example, an intersection-over-union based (GIoU) loss as the localization metric, so as to improve the localization accuracy of the predicted bounding boxes and, in particular, to increase attention to non-overlapping bounding boxes. For example, the localization loss function may be expressed as follows:
L_GIoU = Σ_{i=1..Q} [ 1 − I_i/U_i + (|C_i| − U_i)/|C_i| ]

where Q is the number of bounding boxes; D_i denotes the i-th bounding box in the prediction update information and D̂_i denotes the i-th bounding box in the actual update information; C_i denotes the minimum enclosing convex hull of D_i and D̂_i, with |C_i| its area; I_i denotes the area of the intersection of D_i and D̂_i; and U_i denotes the area of their union.
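A minimal sketch of such a GIoU-based localization loss for axis-aligned boxes is given below, assuming boxes are stored as (x1, y1, x2, y2) tensors; the parameterization and framework are illustrative assumptions, not details taken from the disclosure.

```python
import torch

def giou_loss(pred_boxes: torch.Tensor, true_boxes: torch.Tensor) -> torch.Tensor:
    """GIoU localization loss for Q matched box pairs of shape (Q, 4) as (x1, y1, x2, y2)."""
    # Areas of predicted and ground-truth boxes.
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_t = (true_boxes[:, 2] - true_boxes[:, 0]) * (true_boxes[:, 3] - true_boxes[:, 1])

    # Intersection area I.
    lt = torch.max(pred_boxes[:, :2], true_boxes[:, :2])
    rb = torch.min(pred_boxes[:, 2:], true_boxes[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]

    # Union area U and minimum enclosing (convex hull) box C.
    union = area_p + area_t - inter
    lt_c = torch.min(pred_boxes[:, :2], true_boxes[:, :2])
    rb_c = torch.max(pred_boxes[:, 2:], true_boxes[:, 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[:, 0] * wh_c[:, 1]

    iou = inter / union.clamp(min=1e-7)
    giou = iou - (area_c - union) / area_c.clamp(min=1e-7)
    return (1.0 - giou).sum()  # summed over the Q bounding boxes
```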
In one embodiment, the confidence loss function may be expressed using the following formula:
L_conf = Σ_{i=1..S²} Σ_{j=1..B} [ 1_ij^obj · α·(1 − C_i^j)^γ · f_ce(C_i^j, Ĉ_i^j) + 1_ij^noobj · (1 − α)·(C_i^j)^γ · f_ce(C_i^j, Ĉ_i^j) ]

where S² is the number of grid cells in the live-action image and B is the number of anchor boxes within each grid cell. 1_ij^obj indicates whether a predicted target object exists in the j-th bounding box of the i-th grid cell: its value is 1 if so and 0 otherwise. 1_ij^noobj takes the opposite value of 1_ij^obj: it is 1 when no target object is predicted and 0 otherwise. f_ce() denotes the cross entropy, C_i^j denotes the confidence of the j-th bounding box in the i-th grid cell, and Ĉ_i^j denotes the ground-truth confidence of the j-th bounding box in the i-th grid cell. The ground-truth confidence may be determined based on the actual position information of the target object: if the i-th grid cell contains a target object according to the actual position information, the ground-truth confidence is 1, otherwise it is 0. α and γ are focal loss parameters whose values may be set according to actual requirements, for example α = 0.5 and γ = 2, which is not limited in this disclosure.
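A focal-weighted binary cross-entropy of the kind described could be sketched as follows; the object/no-object split and the α, γ parameters follow the description above, while the exact weighting is a common focal-loss form offered only as an assumption, not the precise formula used in the disclosure.

```python
import torch

def focal_confidence_loss(pred_conf, obj_mask, alpha=0.5, gamma=2.0):
    """Confidence loss split into with-object and no-object parts.

    pred_conf: predicted confidences, shape (S*S, B).
    obj_mask: 1 where an object is assigned to the anchor box, 0 otherwise
              (this also plays the role of the ground-truth confidence).
    """
    eps = 1e-7
    p = pred_conf.clamp(eps, 1.0 - eps)

    # Focal modulation: down-weight easy examples via (1 - p_t) ** gamma.
    ce_obj = -torch.log(p)          # cross entropy when the true confidence is 1
    ce_noobj = -torch.log(1.0 - p)  # cross entropy when the true confidence is 0
    loss_obj = alpha * (1.0 - p) ** gamma * ce_obj
    loss_noobj = (1.0 - alpha) * p ** gamma * ce_noobj

    noobj_mask = 1.0 - obj_mask
    return (obj_mask * loss_obj + noobj_mask * loss_noobj).sum()
```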
In one embodiment, the class loss function may be represented, for example, by the following equation:
L_prob = Σ_{i=1..S²} 1_i^obj · Σ_{b ∈ classes} f_ce( p_i(b), p̂_i(b) )

where classes may include correct, representing the no-change category, to_add, representing the add category, and to_del, representing the delete category. 1_i^obj indicates whether a target object appears in the i-th grid cell: its value is 1 if so and 0 otherwise. p_i(b) denotes the predicted probability that the target object in the i-th grid cell belongs to the b-th category, and p̂_i(b) denotes the actual probability that the target object in the i-th grid cell belongs to the b-th category. If the i-th grid cell contains a target object of the b-th category, the actual probability is 1; otherwise it is 0.
In summary, training the target detection model on the sample image pairs gives the model the ability to detect image changes. The input of the trained target detection model is a live-action image and a raster image converted from the high-precision map to be updated, and the model can directly output the predicted update information, so that end-to-end detection of image changes is achieved. Compared with the related art, this allows the whole detection task to be optimized jointly, thereby improving detection accuracy. Moreover, because the input includes the raster image obtained from the high-precision map, the prior information of the high-precision map is effectively taken into account, which can further improve the accuracy of the trained target detection model.
Fig. 6 is a schematic structural diagram of an object detection model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the target detection network may predict the update information based on the difference between the first feature data and the second feature data. Compared with predicting the update information from a concatenation of the two feature data, this can improve prediction accuracy to a certain extent.
Accordingly, the aforementioned object detection network may comprise a parallel cross-difference unit and a feature detection unit. The parallel cross-difference unit is used for calculating the difference between the first characteristic data and the second characteristic data. The feature detection unit is used for predicting the change of the target object of the live-action image compared with the target object indicated by the raster image according to the difference between the two feature data.
Illustratively, as shown in fig. 6, the target detection model in this embodiment 600 includes a first feature extraction network 610, a second feature extraction network 620, and a target detection network 630. The first feature extraction network 610 may, for example, consist of three first convolution layers with 32, 64, and 128 channels, respectively, which sequentially process the first live-action image 601 to obtain its feature data. The second feature extraction network 620 may, for example, consist of three second convolution layers with 32, 64, and 128 channels, respectively, which sequentially process the first image to be updated 602 to obtain its feature data. The convolution kernels of the first convolution layers may be larger than those of the second convolution layers: the first image to be updated is a raster image, so the first live-action image 601 contains more information than the first image to be updated, and the larger kernels help ensure the accuracy of feature extraction for the first live-action image 601. The target detection network 630 may include a Parallel Cross Difference (PCD) unit 631 and a feature detection unit 632, and the feature detection unit 632 may, for example, adopt the Feature Decoder (FD) of a target detection model in the related art.
Illustratively, the PCD unit 631 may include, for example, an inversion layer and a fusion layer. The inversion layer negates one of the first feature data and the second feature data. The fusion layer adds the other of the two feature data to the negated data, so as to obtain the parallel cross-difference feature. The FD unit 632 may process the fused feature using an anchor-based detection method to obtain the prediction update information. It is to be understood that this structure of the PCD unit 631 is merely an example to facilitate understanding of the present disclosure. The PCD unit 631 may, for example, also take the first feature data minus the second feature data as a first feature difference and the second feature data minus the first feature data as a second feature difference, and then splice the two feature differences to obtain the parallel cross-difference data, for example using the concat() function.
Based on this, after the first feature data and the second feature data are obtained, this embodiment may obtain the parallel cross-difference data from them using the PCD unit 631. The obtained parallel cross-difference data is then input into the FD unit 632 to obtain the prediction update information.
For example, the first characteristic data and the second characteristic data may be input to the PCD unit 631, and the parallel cross-over difference data may be output by the PCD unit 631. The parallel cross-over difference data is input to the FD unit 632, and the FD unit 632 outputs the prediction update information 603.
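To make the two PCD variants just described concrete, a minimal sketch is given below; it shows both the negate-and-add form and the double-subtraction-plus-concatenation form. The module structure and framework are illustrative assumptions rather than code from the disclosure.

```python
import torch
import torch.nn as nn

class SimplePCD(nn.Module):
    """Parallel cross-difference between live-action features f1 and raster features f2."""

    def __init__(self, mode: str = "negate_add"):
        super().__init__()
        self.mode = mode

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        if self.mode == "negate_add":
            # Inversion layer negates one feature map, fusion layer adds the other,
            # i.e. the result is the signed difference f1 - f2.
            return f1 + (-f2)
        # Double subtraction: both difference directions, concatenated along channels.
        return torch.cat([f1 - f2, f2 - f1], dim=1)
```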
Fig. 7 is a schematic structural diagram of an object detection model according to another embodiment of the present disclosure.
According to the embodiment of the present disclosure, not only the feature extraction unit for extracting features may be provided in the first feature extraction network and the second feature extraction network, but also N sequentially connected feature projection layers may be provided after the feature extraction unit to project the extracted features to N different dimensions, and prediction of update information may be performed based on the N different-dimensional features. In this way, the target detection model can learn features of different resolutions of the image conveniently. Meanwhile, the updated information is predicted based on the features with different resolutions, so that the accuracy of the prediction result can be improved. Wherein N is an integer greater than 1.
Accordingly, in this embodiment, inputting the first live-action image into the first feature extraction network to obtain the first feature data may include inputting the first live-action image into the feature extraction unit of the first feature extraction network and using the obtained feature data as first initial data. It may further include the following operations: the first initial data is input into the first of the N feature projection layers of the first feature extraction network, the data output by the first projection layer is input into the second projection layer, and so on, until the data output by the (N-1)-th feature projection layer is input into the N-th feature projection layer; each of the N projection layers outputs feature data of the first live-action image, giving N data. For example, the i-th projection layer outputs the i-th data of the first live-action image. The N data constitute the first feature data.
Similarly, inputting the first image to be updated into the second feature extraction network to obtain the second feature data may include inputting the first image to be updated into the feature extraction unit of the second feature extraction network and using the obtained feature data as second initial data. It may also include the following operations: the second initial data is input into the first of the N feature projection layers of the second feature extraction network, the data output by the first projection layer is input into the second projection layer, and so on, until the data output by the (N-1)-th feature projection layer is input into the N-th projection layer; each of the N projection layers outputs feature data of the first image to be updated, giving N data. For example, the j-th projection layer outputs the j-th data of the first image to be updated. The N data constitute the second feature data.
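A sketch of such a feature extraction unit followed by sequentially connected projection layers is shown below; the channel counts follow the 128/256/512/1024 progression described later for fig. 7, while the strides and activation choices are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    """Feature extraction unit followed by N sequentially connected feature projection layers."""

    def __init__(self, in_channels=3, n_proj=3):
        super().__init__()
        # Feature extraction unit: three convolution layers with 32, 64 and 128 channels,
        # downsampling a 608x608 input to 76x76.
        self.extract = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # N projection layers; channel progression 128 -> 256 -> 512 -> 1024.
        chans = [128 * 2 ** k for k in range(n_proj + 1)]
        strides = [1] + [2] * (n_proj - 1)        # 76x76, 38x38, 19x19 feature maps
        self.proj = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[k], chans[k + 1], 3,
                                    stride=strides[k], padding=1), nn.ReLU())
            for k in range(n_proj)
        ])

    def forward(self, x):
        feats = []
        x = self.extract(x)        # the initial data
        for layer in self.proj:    # the i-th projection layer outputs the i-th data
            x = layer(x)
            feats.append(x)
        return feats               # the N data together form the feature data
```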
Illustratively, as shown in fig. 7, in this embodiment 700, taking N as 3 as an example, the first feature extraction network of the target detection model includes a feature extraction unit 711, a feature projection layer 712, a feature projection layer 713, and a feature projection layer 714. The structure of the feature extraction unit 711 is similar to that of the first feature extraction network described for fig. 6 and includes three convolution layers with 32, 64, and 128 channels, which extract the feature data of the first live-action image 701. This feature data, after processing by the feature projection layer 712, yields feature data of size 76 × 76 × 256, for example; the latter, after processing by the feature projection layer 713, yields feature data of size 38 × 38 × 512, for example; and that, after processing by the feature projection layer 714, yields feature data of size 19 × 19 × 1024, for example. These three feature data together constitute the first feature data. It is to be understood that the sizes of the feature data are only examples given to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
Similarly, the second feature extraction network of the target detection model includes a feature extraction unit 721, a feature projection layer 722, a feature projection layer 723, and a feature projection layer 724. The structure of the feature extraction unit 721 is similar to that of the second feature extraction network described for fig. 6 and includes three convolution layers with 32, 64, and 128 channels, which extract the feature data of the first image to be updated 702. This feature data, after being processed in sequence by the feature projection layer 722, the feature projection layer 723, and the feature projection layer 724, yields three feature data that together constitute the second feature data. It will be appreciated that the size of each feature projection layer in the second feature extraction network may be equal to the size of the corresponding feature projection layer in the first feature extraction network, so that the feature data of the first image to be updated have the same sizes as the corresponding feature data of the first live-action image.
According to an embodiment of the present disclosure, after obtaining N feature data of different dimensions, the target detection network in the target detection model may, for example, compute a difference for data of the same dimension in the first feature data and the second feature data, obtaining N differences, and finally predict the update information based on the data obtained by splicing the N differences. In this way, the predicted update information fully takes into account features of a plurality of different dimensions, which alleviates problems such as missing elements, redundant elements, and sensitivity to localized noise that arise when the update information is predicted from a single pair of feature data.
According to the embodiment of the disclosure, after N pieces of feature data with different dimensions are obtained, the target detection network in the target detection model may perform target detection once, for example, on the basis of data with the same dimension in the first feature data and the second feature data, and perform target detection N times in total. And finally, screening the results of the N times of target detection as candidate prediction information to obtain final prediction updating information. Accordingly, the target detection network may include N parallel cross-difference units, N feature detection units, and an information screening subnetwork, and one parallel cross-difference unit and one feature detection unit may form one detection subnetwork, resulting in N detection subnetworks in total. Each detection subnetwork performs target detection once to obtain candidate prediction information. And the information screening sub-network screens out the final predicted information from the N candidate predicted information.
Specifically, the (N-i +1) th parallel cross difference data may be obtained by using the parallel cross difference unit in the (N-i +1) th detection sub-network based on the ith data of the first live view image and the ith data of the first image to be updated. The (N-i +1) th parallel cross-over difference data is input into the feature detection unit in the (N-i +1) th detection sub-network, and the (N-i +1) th candidate prediction information (i.e. the detection result obtained by detection) can be obtained. The obtained N candidate prediction information is input into an information screening subnetwork, so that the prediction updating information of the first live-action image relative to the first image to be updated can be obtained.
As shown in fig. 7, taking N as 3 as an example, 3 PCD units and 3 FD units may constitute a first detection sub-network 731, a second detection sub-network 732, and a third detection sub-network 733. The data output by the feature projection layer 724 and the data output by the feature projection layer 714 are input to the PCD unit 7311 in the first detection sub-network 731 to obtain the first parallel cross-difference data. The first parallel cross-difference data is input to the FD unit 7312 in the first detection sub-network 731 to obtain the first candidate prediction information. Similarly, the data output by the feature projection layer 723 and the data output by the feature projection layer 713 are processed by the PCD unit 7321 and the FD unit 7322 in the second detection sub-network 732 to obtain the second candidate prediction information. The data output by the feature projection layer 722 and the data output by the feature projection layer 712 are processed by the PCD unit 7331 and the FD unit 7332 in the third detection sub-network 733 to obtain the third candidate prediction information. The information screening sub-network 734 may, for example, employ an NMS method to screen the final prediction update information 703 from the three candidate prediction information.
Fig. 8 is a schematic structural diagram of an object detection model according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, when the update information is predicted based on features of different dimensions, the parallel cross-difference data obtained from lower-dimensional features may, for example, be transferred to the parallel cross-difference unit that obtains parallel cross-difference data from higher-dimensional features. This makes fuller use of the representational power of the deep learning network for complex problems and improves model accuracy.
Illustratively, the 1 st to (N-1) th sub-networks of the aforementioned N sub-networks may further include a feature propagation unit to propagate the respective resulting cross-difference data to the parallel cross-difference unit in the next sub-network.
In an embodiment, for the case where i is smaller than N, when obtaining the aforementioned (N-i+1)-th parallel cross-difference data, the (N-i)-th parallel cross-difference data and the i-th data of the first live-action image may be input into the feature propagation unit of the (N-i)-th detection sub-network to obtain fused data. The fused data and the i-th data of the first image to be updated are then input into the parallel cross-difference unit of the (N-i+1)-th detection sub-network, which outputs the (N-i+1)-th parallel cross-difference data. For the case where i is equal to N, the i-th data of the first live-action image and the i-th data of the first image to be updated may be input into the parallel cross-difference unit of the 1st detection sub-network to obtain the 1st parallel cross-difference data, because the 1st detection sub-network has no preceding detection sub-network. The feature data of the live-action image is fused with the parallel cross-difference data because it reflects the fine-grained features of the target objects in the real scene; the present disclosure is, however, not limited to this. For example, in another embodiment, the input data of the feature propagation unit may be the feature data of the image to be updated and the parallel cross-difference data, so that these two are fused first.
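The index bookkeeping above (i running from N down to 1, with detection sub-network N−i+1 handling the i-th data) may be easier to follow as code. The sketch below assumes `pcd`, `fd`, and `fp` are lists of callables standing in for the units of the detection sub-networks; the names are illustrative only.

```python
def multi_scale_detect(live_feats, raster_feats, pcd, fd, fp):
    """Run the N detection sub-networks from the coarsest scale to the finest.

    live_feats[i-1], raster_feats[i-1]: i-th data of the live-action image and
    of the image to be updated. pcd/fd: units of the 1st..N-th sub-networks;
    fp has N-1 entries, since the last sub-network propagates nothing further.
    """
    n = len(live_feats)
    candidates = []
    prev_diff = None
    for k in range(1, n + 1):          # k-th detection sub-network uses i = n - k + 1
        i = n - k + 1
        if k == 1:
            diff = pcd[k - 1](live_feats[i - 1], raster_feats[i - 1])
        else:
            # Fuse the previous (coarser) cross-difference with the live-action
            # feature, then cross-difference against the raster feature at this scale.
            fused = fp[k - 2](prev_diff, live_feats[i - 1])
            diff = pcd[k - 1](fused, raster_feats[i - 1])
        candidates.append(fd[k - 1](diff))   # candidate prediction of sub-network k
        prev_diff = diff
    return candidates                         # screened afterwards, e.g. by NMS
```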
For example, as shown in fig. 8, in this embodiment 800, the first live-action image 801 with a size of 608 × 608 is processed by the feature extraction unit 821 and the three feature projection layers 822-824 to obtain its 3rd data, and the first image to be updated 802 with a size of 608 × 608 is processed by the feature extraction unit 811 and the three feature projection layers 812-814 to obtain its 3rd data. The two 3rd data are input to the PCD unit 8311 in the first detection sub-network 831; the output of the PCD unit 8311 is input to the FD unit 8312 and the first Feature Propagation (FP) unit 8313, and the candidate update information output after processing by the FD unit 8312 is input to the information screening sub-network 834. Meanwhile, the feature data output by the feature projection layer 813 serves as another input of the FP unit 8313, and the output obtained after the FP unit 8313 processes that feature data together with the data output by the PCD unit 8311 serves as an input of the PCD unit 8321 in the second detection sub-network 832. Meanwhile, the feature data output by the feature projection layer 823 serves as another input of the PCD unit 8321, and the output obtained after the PCD unit 8321 processes that feature data together with the data output by the FP unit 8313 serves as an input of the FD unit 8322 and the FP unit 8323. By analogy, the candidate update information output by the FD unit 8322 can be obtained, as can the candidate update information output by the FD unit 8332 of the third detection sub-network 833 after processing the data output by the PCD unit 8331. After processing the candidate update information based on the NMS method, the information screening sub-network 834 may output the prediction update information 803.
Through this embodiment, the features output by the PCD unit at the coarser granularity can be propagated through the FP unit, enlarging the feature scale to a finer granularity, and then spliced with the live-action image features of matching scale. Fusion of features of different dimensions can thus be achieved, improving the accuracy of the obtained prediction update information.
According to an embodiment of the present disclosure, the effect of the PCD units and FP units on the detection results was studied on a VSCD data set, based on the target detection model described for fig. 6. As shown in Table 1 below, Diff-Net refers to the model of fig. 6 with the PCD unit removed; features extracted from the different branches (the live-action image branch and the image-to-be-updated branch) are simply concatenated and fed to the downstream units. The downstream units may include neither a PCD unit nor an FP unit, only a PCD unit (forming the model of fig. 7), or both a PCD unit and an FP unit (forming the model of fig. 8), so that the differences between the features extracted by the different branches are computed by the downstream units. The performance of these downstream unit structures was evaluated using Mean Average Precision (MAP) as the metric, and the evaluation results are shown in Table 1. As can be seen from Table 1, introducing the PCD unit improves MAP by 8%. Introducing FP units on top of the PCD units, to propagate features from coarser layers to finer layers, further improves MAP by 7.6%.
TABLE 1 — MAP evaluation of the downstream unit structures (feature concatenation only; with PCD unit; with PCD and FP units)
Fig. 9 is a schematic structural diagram of an object detection model according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, when the live-action image is a video frame of a live-action video, a recurrent neural network unit may further be provided in the target detection model, so as to capture the correlation of the prediction update information over time and improve the accuracy of the target detection model. This is because, in object detection in the traffic field, the collected data is usually a video stream rather than sparse images.
Illustratively, the recurrent neural network unit may form the 1st detection sub-network together with a parallel cross-difference unit, a feature detection unit, and a feature propagation unit. In this case, when obtaining the 1st parallel cross-difference data, the N-th data of the first live-action image and the N-th data of the first image to be updated may be input into the parallel cross-difference unit of the 1st detection sub-network, and its output may be used as initial cross-difference data. The initial cross-difference data and the current state data of the recurrent neural network unit are then input into the recurrent neural network unit, which outputs the 1st parallel cross-difference data. The current state data of the recurrent neural network unit is the state data it produced for the video frame preceding the current video frame.
The recurrent neural network unit may be a long short-term memory (LSTM) network unit. The state data of the LSTM unit includes hidden state data and cell state data, and the hidden state data may be used as the 1st parallel cross-difference data. The LSTM unit may be implemented with an ELU activation function and layer normalization, which is not limited in this disclosure.
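A minimal convolutional LSTM cell of the kind described might look as follows; the ELU activation and layer normalization mentioned above are omitted for brevity, and the gate arrangement is the standard ConvLSTM formulation offered as an assumption rather than a detail from the disclosure.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell; its hidden state serves as the 1st parallel cross-difference data."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=pad)

    def forward(self, x, state):
        h_prev, c_prev = state           # state carried over from the previous video frame
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)                 # h is passed on to the FD and FP units
```

For frame k, the initial cross-difference data is fed in together with the state carried over from frame k−1, and the returned hidden state serves both as the 1st parallel cross-difference data and as part of the state for frame k+1.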
Illustratively, as shown in fig. 9, the target detection model in this embodiment 900 is similar to the target detection model described for fig. 8, except that in this embodiment the PCD unit 9311, the FD unit 9312, the FP unit 9313, and a convolutional long short-term memory network unit (Conv LSTM) 9314 together form the first detection sub-network. It will be understood that like reference numerals in fig. 9 and fig. 8 refer to like elements.
In this embodiment, the sample image pairs include at least two sample image pairs, built from two video frames of the live-action video, namely the (k-1)-th video frame and the k-th video frame. The (k-1)-th video frame serves as live-action image p_{k-1} 902 and, together with the image to be updated 901, forms a first sample image pair. The k-th video frame serves as live-action image p_k 902' and, together with the image to be updated 901', forms a second sample image pair. In the process of training the target detection model, the two images of the first sample image pair may be input into the target detection model in parallel, and the current state data Cp_{k-2} 903 and Hp_{k-2} 904 of the convolutional LSTM unit 9314, together with the data output by the PCD unit 9311, are used as inputs of the convolutional LSTM unit 9314 to obtain the updated state data Cp_{k-1} 903' and Hp_{k-1} 904'. The state data Hp_{k-1} 904' may be input, as the parallel cross-difference data, to the FD unit 9312 and the FP unit 9313, and the prediction update information 905 is obtained after subsequent processing. The two images of the second sample image pair may then be input into the target detection model in parallel, and the current state data Cp_{k-1} 903' and Hp_{k-1} 904' of the convolutional LSTM unit 9314, together with the data output by the PCD unit 9311, are used as inputs of the convolutional LSTM unit 9314 to obtain the updated state data Cp_k 903'' and Hp_k 904''. The state data Hp_k 904'' may be input, as the parallel cross-difference data, to the FD unit 9312 and the FP unit 9313, and the prediction update information 905' is obtained after subsequent processing. During training, the sample image pairs formed from all video frames of the live-action video may be input into the target detection model in sequence to obtain a plurality of prediction update information, and the target detection model is optimized once based on the plurality of prediction update information.
Fig. 10 is a schematic structural diagram of a parallel cross-difference unit according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 10, in this embodiment 1000 the parallel cross-difference unit 1010 may include a first inversion layer 1011, a second inversion layer 1012, a first splicing layer 1013, a second splicing layer 1014, a third splicing layer 1015, and a data fusion layer 1016. Accordingly, when the PCD unit of the aforementioned (N-i+1)-th detection sub-network is used to obtain the (N-i+1)-th parallel cross-difference data, the first data input to the PCD unit may be fed to the first inversion layer 1011 and processed by it to obtain first inverted data. For example, the first data may be the i-th data of the image to be updated extracted by the feature extraction network. Similarly, the second data input to the PCD unit may be fed to the second inversion layer 1012 and processed by it to obtain second inverted data. For example, if i is equal to N, the second data may be the N-th data of the live-action image; if i is smaller than N, the second data may be the fused data obtained by fusing the i-th data of the live-action image with the (N-i)-th parallel cross-difference data. Taking the first data and the second data as the input of the first splicing layer 1013 gives first spliced data. Taking the second data and the first inverted data as the input of the second splicing layer 1014 gives second spliced data. Taking the first data and the second inverted data as the input of the third splicing layer 1015 gives third spliced data. Finally, the first spliced data, the second spliced data, and the third spliced data are input into the data fusion layer 1016 to obtain the (N-i+1)-th parallel cross-difference data.
Illustratively, when the PCD unit forms the first detection subnetwork, the data output by the data fusion layer 1016 may be input into the recurrent neural network unit, so that the data output by the recurrent neural network unit is taken as the 1 st parallel cross-difference data.
Illustratively, as shown in fig. 10, a conversion layer may further be provided after each feature projection layer of the first feature extraction network and of the second feature extraction network, to convert the feature data output by the respective projection layer from c channel dimensions to c/2 channel dimensions. When N is 3, the index d of the converted feature data may take the values 4, 5, and 6. The first data is the converted feature data of the image to be updated, and the second data is either the converted feature data of the live-action image or the fused data obtained from that feature data and the (N-i)-th parallel cross-difference data. The size of the convolution kernels in the two conversion layers may be, for example, 3 × 3.
For example, as shown in fig. 10, the PCD unit 1010 may further include a plurality of sequentially connected convolution layers 1017 that perform channel-number conversion on the data output by the data fusion layer 1016, so that the resulting feature data better expresses the difference between the features of the live-action image and those of the image to be updated. For example, there are four convolution layers 1017 with convolution kernel sizes of 3 × 3, 1 × 1, 3 × 3, and 1 × 1, respectively. The four convolution layers convert the output of the data fusion layer 1016 from 3c/2 channel dimensions to c, then from c to c/2, then from c/2 to c, and finally from c to c/2 channel dimensions, yielding the feature data output by the PCD unit 1010. This feature data may be used as the input of the FD unit 1020 and the FP unit 1030 in the detection sub-network. It will be appreciated that, when the recurrent neural network unit is disposed in the 1st detection sub-network, this feature data serves as the input of the recurrent neural network unit, and the output data of the recurrent neural network unit is used as the input of the FD unit 1020 and the FP unit 1030 in that detection sub-network.
As shown in fig. 10, the FD unit 1020 may be provided with a convolution layer 1021, whose convolution kernel size may be 3 × 3, for raising the channel dimension of the input data from c/2 to c, followed by a convolution layer 1022, whose convolution kernel size may be 1 × 1, for generating the proposed regions. A tensor of shape S × S × [3 × (num_class + 5)] is thus obtained, where num_class is the number of target object categories (for example 3, the categories being the aforementioned add, delete, and no-change categories), 5 corresponds to the 4 channels of the bounding box position plus 1 confidence channel, and 3 is the number of anchor boxes in each of the S² grid cells. S may take the value 7, for example.
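A sketch of such a decoder head is given below; it reproduces the c/2 → c lift and the 1 × 1 proposal layer described above, with the reshape to S × S × 3 × (num_class + 5) made explicit. Layer choices beyond those named are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FDHead(nn.Module):
    """Feature-decoder head: c/2 -> c channels, then a 1x1 proposal layer."""

    def __init__(self, c: int, num_anchors: int = 3, num_classes: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c // 2, c, 3, padding=1), nn.ReLU(),      # convolution layer 1021
            nn.Conv2d(c, num_anchors * (num_classes + 5), 1),   # convolution layer 1022
        )
        self.num_anchors = num_anchors
        self.num_classes = num_classes

    def forward(self, x):
        out = self.body(x)                       # (batch, 3*(num_class+5), S, S)
        b, _, s, _ = out.shape
        # Reshape to (batch, S, S, 3, num_class + 5): 4 box channels, 1 confidence,
        # and num_class change-category channels per anchor box.
        return out.permute(0, 2, 3, 1).reshape(b, s, s, self.num_anchors,
                                               self.num_classes + 5)
```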
Similar to the YOLO v3 model, the FD unit 1020 has two branches for performing change detection of the target object: one branch uses the softmax operation to output the change category of the target object, and the other infers the geometric position (u, v, w, h) of the target object based on the width prior w and the height prior h. Finally, an NMS method may be employed to eliminate redundant detections.
As shown in fig. 10, the FP unit 1030 may include a convolution layer 1031 and an up-sampling layer 1032. Via the convolution layer 1031 and the up-sampling layer 1032, the channel dimension of the input data can be reduced from c/2 to c/4. When the FP unit 1030 belongs to one of the 1st to (N-1)-th detection sub-networks, the feature data with channel dimension c/4 may be fused with the corresponding feature data of the live-action image; where the first and second feature extraction networks are provided with conversion layers after the feature projection layers, this fusion may be performed via the concat() function. As shown in fig. 10, the FP unit 1030 may also be provided with a further convolution layer 1033 to obtain finer-scale feature data to pass to the PCD unit in the next detection sub-network. For example, the convolution layer 1033 has a convolution kernel size of 1 × 1 and converts the channel dimension of the data from 3c/4 to c/2.
It is to be understood that the structural arrangements of the PCD unit, the FD unit, and the FP unit described above are merely examples to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto. For example, the convolution kernel sizes, the number of convolution layers, and the like in each unit may be set according to actual needs.
The target detection models obtained by training based on the model structures of fig. 6, fig. 8, and fig. 10 were tested on the collected SICD data set and VSCD data set, and the performance of these models was evaluated using MAP as the metric; the evaluation results are shown in Table 2 below. According to the test results, all three models perform far better than the traditional method: the end-to-end learning network model jointly optimizes the change detection task and improves the overall performance. The models of fig. 8 and fig. 10 perform significantly better than the model of fig. 6, and the model of fig. 10 performs best on video data, reaching a MAP of 76.1%.
TABLE 2 — MAP evaluation of the models of fig. 6, fig. 8, and fig. 10 on the SICD and VSCD data sets
The present disclosure also introduces a data set R-VSCD acquired in real scenes, whose data indicate changes of the target objects in the real scene. Based on the data set R-VSCD, the performance of the target detection models trained from the model structures of fig. 6, fig. 8, and fig. 10 was evaluated in the high-precision map change detection scenario. Because the amount of data collected in R-VSCD is limited, no meaningful MAP can be computed, so top-1 accuracy is used to evaluate the performance of each model in this scenario. Top-1 accuracy is the rate at which the category with the highest predicted probability matches the actual category: if the most probable category in the prediction result corresponds to the actual category, the prediction is counted as correct, otherwise as wrong. The evaluation results are shown in Table 3 below; the accuracy of the model of fig. 10 reaches 81%. The data set R-VSCD may include data collected from a plurality of different cities. Since the arrangement of target objects such as traffic lights differs markedly between cities, this poses a greater challenge to the target detection model and, to a certain extent, demonstrates the stronger generalization capability of the end-to-end learning network model provided by the present disclosure.
Method               Top-1 accuracy
YOLOv3+D             0.558
Diff-Net             0.725
Diff-Net+ConvLSTM    0.810

TABLE 3
Through testing, the present disclosure finds that features at the coarse scale are more focused on larger objects, while features at the fine scale are more focused on smaller objects, consistent with the design objectives of the present disclosure. By extracting and comparing a plurality of scale features, the accurate identification of objects with different sizes can be realized, and the accuracy of the determined prediction updating information is improved.
According to an embodiment of the present disclosure, the training and use of the target detection model of the present disclosure are implemented based on the PaddlePaddle platform and the TensorFlow framework. When training the target detection model, 8 NVIDIA Tesla P40 graphics processors may be used on a workstation. During training, the Adam optimization algorithm may be adopted, for example, with the learning rate set to 1e-4. The batch size during training is 48, and training is stopped when the number of epochs reaches 40. For training the model shown in fig. 10, the batch size is set to 8.
Based on the training method of the target detection model, the disclosure also provides a method for determining the update information of the image by using the target detection model. The method will be described below with reference to fig. 11.
FIG. 11 is a flow chart of a method of determining updated information for an image using a target detection model according to an embodiment of the disclosure.
As shown in fig. 11, the method 1100 of determining update information of an image using an object detection model of this embodiment may include operations S1110 to S1140. The target detection model is obtained by training using the training method described above.
In operation S1110, a second image to be updated corresponding to the second live view image is determined.
According to an embodiment of the disclosure, an image corresponding to the second live-action image in the offline map may be determined first and used as an initial image. The initial image is then converted into a raster image based on the target objects it includes, giving the second image to be updated. It is understood that the method for determining the second image to be updated is similar to the method for determining the raster image described above and is not repeated here.
It will be appreciated that the image in the off-line map corresponding to the second live view image may be determined based on the pose of the image capture device with respect to the second live view image. The posture of the image acquisition device for the second live-action image is the posture of the image acquisition device when acquiring the second live-action image. The method for obtaining the corresponding image based on the posture of the image capturing device is similar to the method described above, and is not described herein again.
In operation S1120, the second live view image is input to the first feature extraction network of the target detection model, and third feature data is obtained. The operation S1120 is similar to the operation S510 described above, and is not described herein again.
In operation S1130, the second image to be updated is input to the second feature extraction network of the target detection model, so as to obtain fourth feature data. Operation S1130 is similar to operation S520 described above and will not be described herein.
In operation S1140, the third feature data and the fourth feature data are input into the target detection network of the target detection model, so as to obtain update information of the second live-action image relative to the second image to be updated. The operation S1140 is similar to the operation S530 described above and will not be described herein again.
According to an embodiment of the present disclosure, the aforementioned off-line map may be a high-precision map. The method for determining the update information of the image by adopting the target detection model can be applied to a scene for updating the high-precision map.
Correspondingly, the present disclosure also provides a method for updating a high-precision map. An image in the high-precision map corresponding to an acquired live-action image (e.g., a third live-action image) is first determined, giving a third image to be updated. The update information of the third live-action image relative to the third image to be updated is then obtained using the above method of determining update information of an image with the target detection model, so that the high-precision map can be updated based on the update information. For example, if the determined update information includes a target object of the delete category and the position information of that target object, the target object may be located in the high-precision map based on the position information and deleted from the high-precision map. The method of obtaining the third image to be updated is similar to the method of obtaining the second image to be updated described above and is not repeated here.
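The update flow described in this paragraph can be summarized in the sketch below; `determine_corresponding_image`, `rasterize`, and the map-manipulation calls are hypothetical helper names used only for illustration and do not come from the disclosure.

```python
def update_high_precision_map(hd_map, live_image, pose, model):
    """Update a high-precision map from a newly acquired live-action image."""
    # 1. Use the camera pose to find the corresponding map region and rasterize it
    #    (the third image to be updated).
    initial_image = determine_corresponding_image(hd_map, pose)   # hypothetical helper
    image_to_update = rasterize(initial_image)                    # hypothetical helper

    # 2. Run the trained target detection model end to end on the image pair.
    update_info = model(live_image, image_to_update)

    # 3. Apply the predicted changes to the map.
    for obj in update_info:
        if obj.category == "to_del":
            # Locate the object in the map via its position information and delete it.
            hd_map.delete_object_at(obj.position)
        elif obj.category == "to_add":
            hd_map.add_object_at(obj.position)
        # objects of the no-change ("correct") category leave the map untouched
    return hd_map
```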
Based on the foregoing method of generating a sample image pair, the present disclosure also provides an apparatus for generating a sample image pair. The apparatus will be described in detail below with reference to fig. 12.
Fig. 12 is a block diagram of an apparatus for generating a sample image pair according to an embodiment of the present disclosure.
As shown in fig. 12, the apparatus 1200 of generating a sample image pair of this embodiment may include a first position determination module 1210, a category combination determination module 1220, a second position determination module 1230, a first image generation module 1240, and a first label adding module 1250.
The first position determination module 1210 is configured to determine position information of a target object included in a live view image based on first map information corresponding to the live view image. In an embodiment, the first position determining module 1210 may be configured to perform the operation S110 described above, which is not described herein again.
The category combination determination module 1220 is configured to determine a random category combination of the target objects included in the live view image based on the predetermined category. Wherein the predetermined categories may include an add category and a no change category. In an embodiment, the category combination determining module 1220 may be configured to perform the operation S120 described above, and is not described herein again.
The second position determination module 1230 is configured to determine the position information of the deletion object for the live view image based on the position information of the target object included in the live view image. In an embodiment, the second position determining module 1230 may be configured to perform the operation S130 described above, and is not described herein again.
The first image generation module 1240 is configured to generate a first image to be updated for the live-action image based on the position information of the deletion object, the position information of the target object included in the live-action image, and the random category combination. In an embodiment, the first image generation module 1240 may be configured to perform the operation S140 described above, and is not described herein again.
The first labeling module 1250 may be configured to label an image pair consisting of the first image to be updated and the live view image. In an embodiment, the first tag adding module 1250 can be configured to perform the operation S150 described above, and is not described herein again.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair may further include a second image generation module configured to generate a first raster image for the live-action image based on the first map information and the position information of the target object included in the live-action image, where the first raster image indicates a position of the target object included in the live-action image. The first image generation module 1240 is specifically configured to: and adjusting the first raster image based on the position information and the random category combination of the deleted object to obtain a first image to be updated. The first image to be updated indicates the position of the deletion object in the live-action image and the position of the target object of the unchanged category in the live-action image.
According to an embodiment of the present disclosure, the second position determination module includes: a candidate region determination sub-module for determining a first region in the first raster image based on the predetermined position distribution information, the first region including, as a candidate region, a region whose distribution density determined based on the position distribution information is greater than a predetermined density; and a position determination sub-module for determining position information of the deletion object based on other regions of the candidate regions except for a second region including a region indicating a position of the target object included in the live-action image. Wherein the predetermined position distribution information is determined based on position information of a target object included in the plurality of live-action images, the plurality of live-action images being equal in size to each other.
According to an embodiment of the present disclosure, the position determination submodule includes a size determination unit and a position determination unit. The size determination unit is used for determining the target size based on the size of the second area and the number of target objects included in the live-action image. The position determining unit is used for determining any area with the size equal to the target size in other areas to obtain the position information of the deleted object.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair may further include an image obtaining module, configured to obtain an image satisfying a predetermined position constraint condition from a live-action image library, so as to obtain the live-action image.
According to an embodiment of the present disclosure, the live-action image includes a target frame in a live-action video, and the apparatus 1200 for generating a sample image pair may further include a third position determining module, a category determining module, a spatial information determining module, a third image generating module, and a second label adding module. The third position determining module is used for determining the position information of the target object included in other frames except the target frame in the live-action video based on the second map information corresponding to the other frames. The category determination module is used for determining the category of the target object included in other frames based on the category of the target object included in the target frame. The spatial information determination module is used for determining three-dimensional spatial information of the deleted object based on the depth of the target object and the position information of the deleted object included in the target frame. And the third image generation module is used for generating a second image to be updated aiming at other frames based on the three-dimensional space information and the position information and the category of the target object included by other frames. And the second label adding module is used for adding labels to the image pair consisting of the second image to be updated and other frames.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair described above may further include a fourth image generation module configured to generate a second raster image for the other frame based on the second map information and the position information of the target object included in the other frame, the second raster image indicating the position of the target object included in the other frame. The third image generation module is used for adjusting the second raster image based on the three-dimensional space information and the type of the target object included in other frames to obtain a second image to be updated. The second image to be updated indicates the position of the deletion object in the other frame and the position of the target object of the unchanged category in the other frame.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair may further include a map information determining module, configured to determine first map information corresponding to the live-action image in the offline map based on a posture of the image capturing apparatus with respect to the live-action image.
Based on the training method of the target detection model provided by the disclosure, the disclosure also provides a training device of the target detection model. The apparatus will be described in detail below with reference to fig. 13.
Fig. 13 is a block diagram of a structure of a training apparatus of an object detection model according to an embodiment of the present disclosure.
As shown in fig. 13, the training apparatus 1300 of the object detection model of this embodiment may include a first data obtaining module 1310, a second data obtaining module 1320, an update information predicting module 1330, and a model training module 1340. The target detection model comprises a first feature extraction network, a second feature extraction network and a target detection network.
The first data obtaining module 1310 is configured to input a first live-action image in the sample image pair into a first feature extraction network, so as to obtain first feature data. In an embodiment, the first data obtaining module 1310 may be configured to perform the operation S510 described above, and is not described herein again.
The second data obtaining module 1320 is configured to input the first image to be updated in the sample image pair into the second feature extraction network, so as to obtain second feature data. Wherein the sample image pair has a label indicating actual update information of the first live view image relative to the first image to be updated. In an embodiment, the second data obtaining module 1320 may be configured to perform the operation S520 described above, which is not described herein again.
The update information prediction module 1330 is configured to input the first feature data and the second feature data into the target detection network, so as to obtain the prediction update information of the first live-action image relative to the first image to be updated. In an embodiment, the update information prediction module 1330 may be configured to perform the operation S530 described above, which is not described herein again.
The model training module 1340 is configured to train the target detection model based on the actual update information and the predicted update information. In an embodiment, the model training module 1340 may be configured to perform the operation S540 described above, which is not described herein again.
According to an embodiment of the present disclosure, an object detection network includes a parallel cross-difference unit and a feature detection unit. The update information prediction module 1330 may include a cross difference obtaining sub-module and an update prediction sub-module. And the cross difference obtaining submodule is used for obtaining parallel cross difference data by adopting a parallel cross difference unit based on the first characteristic data and the second characteristic data. And the updating prediction submodule is used for inputting the obtained parallel cross difference data into the characteristic detection unit to obtain prediction updating information.
According to the embodiment of the disclosure, each of the first feature extraction network and the second feature extraction network comprises a feature extraction unit and N feature projection layers connected in sequence, wherein N is an integer greater than 1. The first data obtaining module may include a first data obtaining sub-module and a second data obtaining sub-module. The first data obtaining submodule is used for inputting the first real image into a feature extraction unit included in the first feature extraction network to obtain first initial data. The second data obtaining submodule is used for inputting the first initial data into a first projection layer in N feature projection layers included in the first feature extraction network to obtain ith data of the first live-action image output by the ith projection layer. The second data obtaining module comprises a third data obtaining submodule and a fourth data obtaining submodule. The third data obtaining submodule is used for inputting the first image to be updated into a feature extraction unit included in the second feature extraction network to obtain second initial data. The fourth data obtaining submodule is used for inputting the second initial data into a first projection layer in the N feature projection layers included in the second feature extraction network to obtain jth data of the first image to be updated output by the jth projection layer.
According to an embodiment of the present disclosure, the target detection network includes an information screening sub-network, N parallel cross difference units, and N feature detection units, where one parallel cross difference unit and one feature detection unit form a detection sub-network. The cross difference obtaining sub-module is configured to obtain (N-i+1)th parallel cross difference data by using the parallel cross difference unit in the (N-i+1)th detection sub-network based on the ith data of the first live-action image and the ith data of the first image to be updated. The update prediction sub-module includes a candidate information obtaining unit and an information screening unit. The candidate information obtaining unit is configured to input the (N-i+1)th parallel cross difference data into the feature detection unit in the (N-i+1)th detection sub-network to obtain (N-i+1)th candidate prediction information. The information screening unit is configured to input the N pieces of candidate prediction information thus obtained into the information screening sub-network to obtain the prediction update information of the first live-action image relative to the first image to be updated.
According to an embodiment of the present disclosure, each of the 1st to (N-1)th detection sub-networks further includes a feature propagation unit. The cross difference obtaining sub-module may include a data fusion unit and a cross difference obtaining unit. The data fusion unit is configured, in the case that i is smaller than N, to input the (N-i)th parallel cross difference data and the ith data of the first live-action image into the feature propagation unit of the (N-i)th detection sub-network to obtain fused data. The cross difference obtaining unit is configured to input the fused data and the ith data of the first image to be updated into the parallel cross difference unit of the (N-i+1)th detection sub-network to obtain the (N-i+1)th parallel cross difference data. The cross difference obtaining unit is further configured, in the case that i is equal to N, to input the ith data of the first live-action image and the ith data of the first image to be updated into the parallel cross difference unit in the 1st detection sub-network to obtain the 1st parallel cross difference data.
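The coarse-to-fine traversal of the N detection sub-networks described above can be sketched as follows. The upsample-and-concatenate realization of the feature propagation unit, the helper names (FeaturePropagationUnit, run_detection_subnetworks), and the placeholder units in the toy run are all assumptions used only to make the indexing concrete.

```python
import torch
from torch import nn
import torch.nn.functional as F

class FeaturePropagationUnit(nn.Module):
    """Fuses the previous sub-network's parallel cross difference data with the
    current level's live-action feature (upsample-and-concatenate assumption)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, prev_cross_diff, live_feat):
        prev = F.interpolate(prev_cross_diff, size=live_feat.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([prev, live_feat], dim=1))

def run_detection_subnetworks(live_feats, other_feats, cross_diff_units, detect_units, propagation_units):
    """live_feats / other_feats hold the ith data for i = 1..N (list index i-1).
    The kth detection sub-network (k = N - i + 1) runs from coarse to fine."""
    n = len(live_feats)
    candidates, prev = [], None
    for k in range(1, n + 1):
        i = n - k + 1
        live, other = live_feats[i - 1], other_feats[i - 1]
        if k == 1:                                     # i == N: no propagation yet
            cross = cross_diff_units[k - 1](live, other)
        else:                                          # i < N: fuse the previous output first
            fused = propagation_units[k - 2](prev, live)
            cross = cross_diff_units[k - 1](fused, other)
        candidates.append(detect_units[k - 1](cross))  # (N-i+1)th candidate prediction info
        prev = cross
    return candidates  # N candidates, later filtered by the information screening sub-network

# Toy run with placeholder units standing in for the real ones.
feats = [torch.randn(1, 16, s, s) for s in (32, 16, 8)]   # ith data, coarsest last
cands = run_detection_subnetworks(
    feats, [f.clone() for f in feats],
    cross_diff_units=[lambda a, b: a - b] * 3,
    detect_units=[lambda c: c.mean()] * 3,
    propagation_units=[FeaturePropagationUnit(16) for _ in range(2)],
)
```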
According to an embodiment of the present disclosure, the first live-action image comprises a video frame, and the 1st detection sub-network further comprises a recurrent neural network unit. The cross difference obtaining unit is configured to obtain the 1st parallel cross difference data by: inputting the Nth data of the first live-action image and the Nth data of the first image to be updated into the parallel cross difference unit in the 1st detection sub-network to obtain initial cross difference data; and inputting the initial cross difference data and the state data of the recurrent neural network unit into the recurrent neural network unit to obtain the 1st parallel cross difference data, wherein the state data is obtained by the recurrent neural network unit based on the video frame preceding the current video frame.
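For the video-frame case, the recurrent neural network unit can be any stateful cell; the convolutional GRU below is only an assumed stand-in showing how the initial cross difference data of the current frame is combined with state data carried over from the preceding frame.

```python
import torch
from torch import nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU used as a stand-in for the recurrent neural network unit;
    the cell type is not fixed by the text, so this is an assumption."""
    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x, state=None):
        if state is None:                       # first frame: zero state
            state = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, state], dim=1))).chunk(2, dim=1)
        cand = torch.tanh(self.cand(torch.cat([x, r * state], dim=1)))
        return (1 - z) * state + z * cand       # 1st parallel cross difference data / new state

# The initial cross difference data of each frame is combined with the state
# produced from the previous frame.
cell = ConvGRUCell(channels=32)
prev_state = None
for initial_cross_diff in [torch.randn(1, 32, 8, 8) for _ in range(3)]:  # three frames
    prev_state = cell(initial_cross_diff, prev_state)
```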
According to an embodiment of the present disclosure, the parallel cross difference unit may include a first inversion layer, a second inversion layer, a first splicing layer, a second splicing layer, a third splicing layer, and a data fusion layer. The cross difference obtaining sub-module is configured to obtain the (N-i+1)th parallel cross difference data by: taking the first data input into the parallel cross difference unit as the input of the first inversion layer to obtain first inversion data of the first data; taking the second data input into the parallel cross difference unit as the input of the second inversion layer to obtain second inversion data of the second data; taking the first data and the second data as the input of the first splicing layer to obtain first splicing data; taking the second data and the first inversion data as the input of the second splicing layer to obtain second splicing data; taking the first data and the second inversion data as the input of the third splicing layer to obtain third splicing data; and inputting the first splicing data, the second splicing data, and the third splicing data into the data fusion layer to obtain the (N-i+1)th parallel cross difference data.
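A minimal sketch of such a parallel cross difference unit follows, assuming the inversion layers compute element-wise negation and the data fusion layer is a 1x1 convolution over the three concatenated (spliced) tensors; these concrete choices are illustrative assumptions rather than the disclosed implementation.

```python
import torch
from torch import nn

class ParallelCrossDifferenceUnit(nn.Module):
    """Two inversion layers, three splicing (concatenation) layers, and a
    convolutional data fusion layer, as a sketch of the described unit."""
    def __init__(self, channels: int):
        super().__init__()
        # Data fusion layer: merges the three spliced tensors into one output.
        self.fusion = nn.Conv2d(6 * channels, channels, kernel_size=1)

    def forward(self, first_data: torch.Tensor, second_data: torch.Tensor) -> torch.Tensor:
        first_inv = -first_data                                  # first inversion layer
        second_inv = -second_data                                # second inversion layer
        splice_1 = torch.cat([first_data, second_data], dim=1)   # first splicing layer
        splice_2 = torch.cat([second_data, first_inv], dim=1)    # second splicing layer
        splice_3 = torch.cat([first_data, second_inv], dim=1)    # third splicing layer
        return self.fusion(torch.cat([splice_1, splice_2, splice_3], dim=1))

pcd = ParallelCrossDifferenceUnit(channels=32)
out = pcd(torch.randn(1, 32, 8, 8), torch.randn(1, 32, 8, 8))    # -> (1, 32, 8, 8)
```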
According to an embodiment of the present disclosure, model training module 1340 may include a loss determination sub-module and a model training sub-module. The loss determination submodule is used for determining the value of the predetermined loss function based on the actual updating information and the prediction updating information. And the model training submodule is used for training the target detection model based on the value of the predetermined loss function.
According to an embodiment of the present disclosure, the actual update information includes actual position information and an actual category of the target object, and the prediction update information includes predicted position information, a predicted category, and a confidence of the target object. The loss determination sub-module may include a first determination unit, a second determination unit, a third determination unit, and a loss determination unit. The first determination unit is configured to determine a value of a positioning loss function in the predetermined loss function based on the actual position information and the predicted position information, to obtain a first value. The second determination unit is configured to determine a value of a confidence loss function in the predetermined loss function based on the actual position information and the confidence, to obtain a second value. The third determination unit is configured to determine a value of a category loss function in the predetermined loss function based on the actual category and the predicted category, to obtain a third value. The loss determination unit is configured to determine a weighted sum of the first value, the second value, and the third value to obtain the value of the predetermined loss function.
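The weighted-sum loss can be written, for example, as in the sketch below; the concrete positioning, confidence, and category loss terms (smooth L1, binary cross-entropy, cross-entropy) and the weights are assumptions, since the embodiment only requires a weighted sum of the three values.

```python
import torch
from torch import nn

def detection_loss(pred_boxes, pred_conf, pred_logits,
                   true_boxes, objectness_target, true_labels,
                   w_loc=1.0, w_conf=1.0, w_cls=1.0):
    """Weighted sum of positioning, confidence, and category loss values."""
    first_value = nn.functional.smooth_l1_loss(pred_boxes, true_boxes)           # positioning loss
    # The objectness target is derived from the actual position information
    # (1 where a ground-truth object exists, 0 elsewhere).
    second_value = nn.functional.binary_cross_entropy_with_logits(pred_conf, objectness_target)  # confidence loss
    third_value = nn.functional.cross_entropy(pred_logits, true_labels)          # category loss
    return w_loc * first_value + w_conf * second_value + w_cls * third_value

# Toy usage with two predicted boxes and two categories (add / no change).
value = detection_loss(
    pred_boxes=torch.randn(2, 4), pred_conf=torch.randn(2), pred_logits=torch.randn(2, 2),
    true_boxes=torch.rand(2, 4), objectness_target=torch.tensor([1.0, 0.0]),
    true_labels=torch.tensor([0, 1]),
)
print(value.item())
```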
Based on the method for determining update information of an image using the target detection model provided by the present disclosure, the present disclosure further provides an apparatus for determining update information of an image using the target detection model. This apparatus will be described in detail below with reference to Fig. 14.
Fig. 14 is a block diagram of an apparatus for determining update information of an image using an object detection model according to an embodiment of the present disclosure.
As shown in fig. 14, the apparatus 1400 for determining update information of an image using an object detection model of this embodiment may include an image determination module 1410, a third data acquisition module 1420, a fourth data acquisition module 1430, and an update information determination module 1440. The target detection model is obtained by training with the training device of the target detection model described above.
The image determining module 1410 is configured to determine a second image to be updated corresponding to the second live-action image. In an embodiment, the image determining module 1410 may be configured to perform the operation S1110 described above, which is not described herein again.
The third data obtaining module 1420 is configured to input the second live-action image into the first feature extraction network of the target detection model, so as to obtain third feature data. In an embodiment, the third data obtaining module 1420 may be configured to perform the operation S1120 described above, which is not described herein again.
The fourth data obtaining module 1430 is configured to input the second image to be updated into the second feature extraction network of the target detection model, so as to obtain fourth feature data. In an embodiment, the fourth data obtaining module 1430 may be configured to perform the operation S1130 described above, which is not described herein again.
The update information determining module 1440 is configured to input the third feature data and the fourth feature data into the target detection network of the target detection model, so as to obtain update information of the second live-action image relative to the second image to be updated. In an embodiment, the update information determining module 1440 may be configured to perform the operation S1140 described above, which is not described herein again.
According to an embodiment of the present disclosure, the image determination module 1410 may include an image determination sub-module and an image conversion sub-module. The image determination sub-module is configured to determine, in an offline map, an image corresponding to the second live-action image as an initial image. The image conversion sub-module is configured to convert the initial image into a raster image based on a target object included in the initial image, so as to obtain the second image to be updated.
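A toy sketch of such a rasterization step is given below; the per-category channel layout, the rectangular object footprints, and the example objects are assumptions used only to illustrate converting map objects into a raster image.

```python
import numpy as np

def rasterize_objects(objects, height=256, width=256, num_categories=2):
    """Draw each map object as a filled rectangle in the channel of its category."""
    raster = np.zeros((num_categories, height, width), dtype=np.float32)
    for obj in objects:
        x_min, y_min, x_max, y_max = obj["box"]   # pixel coordinates in the image plane
        raster[obj["category"], y_min:y_max, x_min:x_max] = 1.0
    return raster

# Hypothetical example: two map objects of categories 0 and 1.
second_image_to_update = rasterize_objects(
    [{"box": (40, 60, 70, 100), "category": 0},
     {"box": (150, 30, 170, 80), "category": 1}]
)
print(second_image_to_update.shape)  # (2, 256, 256)
```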
According to an embodiment of the disclosure, the image determination sub-module is configured to determine the image in the offline map corresponding to the second live-action image based on a pose of the image capture device for the second live-action image.
It should be noted that, in the technical solutions of the present disclosure, the acquisition, storage, and application of the personal information of the users involved comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 15 shows a schematic block diagram of an example electronic device 1500 that may be used to implement methods of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 15, the device 1500 includes a computing unit 1501, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1502 or a computer program loaded from a storage unit 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data necessary for the operation of the device 1500 can also be stored. The computing unit 1501, the ROM 1502, and the RAM 1503 are connected to each other by a bus 1504. An input/output (I/O) interface 1505 is also connected to the bus 1504.
Various components in device 1500 connect to I/O interface 1505, including: an input unit 1506 such as a keyboard, a mouse, and the like; an output unit 1507 such as various types of displays, speakers, and the like; a storage unit 1508, such as a magnetic disk, optical disk, or the like; and a communication unit 1509 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1509 allows the device 1500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1501 may be various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The computing unit 1501 executes the respective methods and processes described above, for example, at least one of the following methods: a method of generating a sample image pair, a method of training a target detection model, and a method of determining update information for an image using a target detection model. For example, in some embodiments, at least one of these methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1500 via the ROM 1502 and/or the communication unit 1509. When the computer program is loaded into the RAM 1503 and executed by the computing unit 1501, one or more steps of at least one of the above-described method of generating a sample image pair, method of training the target detection model, and method of determining update information of an image using the target detection model may be performed. Alternatively, in other embodiments, the computing unit 1501 may be configured in any other suitable way (e.g., by means of firmware) to perform at least one of the following methods: a method of generating a sample image pair, a method of training a target detection model, and a method of determining update information for an image using a target detection model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (20)

1. A method of generating a sample image pair, comprising:
determining position information of a target object included in a target live-action image based on first map information corresponding to the target live-action image;
determining a random class combination of the target objects included in the target live-action image based on a predetermined class;
determining position information of a deletion object for the target live-action image based on position information of a target object included in the target live-action image; and
generating a first image to be updated for the target live-action image based on the position information of the deletion object, the position information of the target object included in the target live-action image, and the random class combination, and adding a label to an image pair composed of the first image to be updated and the target live-action image,
wherein the tag indicates actual update information of the target live-action image relative to the first image to be updated; the predetermined categories include an add category and a no change category.
2. The method of claim 1, further comprising:
generating a first raster image for the target live-action image based on the first map information and position information of a target object included in the target live-action image, the first raster image indicating a position of the target object included in the target live-action image;
wherein generating a first image to be updated for the target live-action image comprises: adjusting the first raster image based on the position information of the deleted object and the random category combination to obtain the first image to be updated,
wherein the first image to be updated indicates a position of the deletion object in the target live-action image and a position of a target object of an unchanged category in the target live-action image.
3. The method of claim 2, wherein determining location information for a deleted object comprises:
determining a first region in the first raster image as a candidate region based on predetermined position distribution information, the first region including a region whose distribution density determined based on the position distribution information is greater than a predetermined density; and
determining position information of the deletion object based on other areas of the candidate areas except for a second area including an area indicating a position of a target object included in the target live-action image,
wherein the predetermined position distribution information is determined from position information of a target object included in a plurality of target live-action images; the sizes of the plurality of target live-action images are equal to each other.
4. The method of claim 3, wherein the determining location information of the deletion object based on other of the candidate regions except the second region comprises:
determining a target size based on the size of the second area and the number of target objects included in the target live-action image; and
and determining any area with the size equal to the target size in the other areas to obtain the position information of the deleted object.
5. The method of claim 1, further comprising:
and acquiring an image meeting the preset position constraint condition from the live-action image library to obtain the target live-action image.
6. The method of claim 1, wherein the target live-action image comprises a target frame in a live-action video, and the method further comprises, for other frames in the live-action video except the target frame:
determining position information of a target object included in the other frame based on second map information corresponding to the other frame;
determining the category of the target object included in the other frames based on the category of the target object included in the target frame;
determining three-dimensional space information of the deleted object based on the depth of the target object included in the target frame and the position information of the deleted object; and
and generating a second image to be updated for the other frame based on the three-dimensional space information and the position information and the category of the target object included in the other frame, and adding a label to an image pair formed by the second image to be updated and the other frame.
7. The method of claim 6, further comprising:
generating a second raster image for the other frame based on the second map information and position information of a target object included in the other frame, the second raster image indicating a position of the target object included in the other frame;
wherein generating a second image to be updated corresponding to the other frame comprises: adjusting the second raster image based on the three-dimensional spatial information and the category of the target object included in the other frames to obtain the second image to be updated,
wherein the second image to be updated indicates a position of the deletion object in the other frame and a position of a target object of a non-change category in the other frame.
8. The method of claim 1, further comprising:
and determining, in an offline map, first map information corresponding to the target live-action image based on a pose of the image acquisition device for the target live-action image.
9. An apparatus for generating a sample image pair, comprising:
the first position determining module is used for determining the position information of a target object included in a target live-action image based on first map information corresponding to the target live-action image;
a category combination determination module, configured to determine, based on a predetermined category, a random category combination of target objects included in the target live-action image;
a second position determination module configured to determine position information of a deletion object for the target live-action image based on position information of a target object included in the target live-action image;
a first image generation module, configured to generate a first image to be updated for the target live-action image based on the position information of the deleted object, the position information of the target object included in the target live-action image, and the random category combination; and
a first label adding module for adding a label to an image pair composed of the first image to be updated and the target live-action image,
wherein the tag indicates actual update information of the target live-action image relative to the first image to be updated; the predetermined categories include an add category and a no change category.
10. The apparatus of claim 9, further comprising:
a second image generation module to generate a first raster image for the target live-action image based on the first map information and position information of a target object included in the target live-action image, the first raster image indicating a position of the target object included in the target live-action image;
wherein the first image generation module is configured to: adjusting the first raster image based on the position information of the deleted object and the random category combination to obtain the first image to be updated,
wherein the first image to be updated indicates a position of the deletion object in the target live-action image and a position of a target object of an unchanged category in the target live-action image.
11. The apparatus of claim 10, wherein the second position determination module comprises:
a candidate region determination sub-module configured to determine, as a candidate region, a first region in the first raster image based on predetermined position distribution information, the first region including a region in which a distribution density determined based on the position distribution information is greater than a predetermined density; and
a position determination sub-module that determines position information of the deletion object based on other areas of the candidate areas except for a second area including an area indicating a position of a target object included in the target live-action image,
wherein the predetermined position distribution information is determined from position information of a target object included in a plurality of target live-action images; the sizes of the plurality of target live-action images are equal to each other.
12. The apparatus of claim 11, wherein the location determination submodule comprises:
a size determination unit configured to determine a target size based on the size of the second region and the number of target objects included in the target live-action image; and
and the position determining unit is used for determining any area with the size equal to the target size in the other areas to obtain the position information of the deleted object.
13. The apparatus of claim 9, further comprising:
and the image obtaining module is used for obtaining the image meeting the preset position constraint condition from the live-action image library to obtain the target live-action image.
14. The apparatus of claim 9, wherein the target live-action image comprises a target frame in a live-action video; the apparatus further comprises:
a third position determining module, configured to determine, for other frames in the live-action video except the target frame, position information of a target object included in the other frames based on second map information corresponding to the other frames;
a category determination module, configured to determine, based on a category of a target object included in the target frame, a category of a target object included in the other frame;
a spatial information determination module, configured to determine three-dimensional spatial information of the deleted object based on the depth of the target object included in the target frame and the position information of the deleted object;
a third image generation module, configured to generate a second image to be updated for the other frame based on the three-dimensional spatial information and the position information and the category of the target object included in the other frame; and
and the second label adding module is used for adding labels to the image pair formed by the second image to be updated and the other frames.
15. The apparatus of claim 14, further comprising:
a fourth image generation module to generate a second raster image for the other frame based on the second map information and position information of a target object included in the other frame, the second raster image indicating a position of the target object included in the other frame;
wherein the third image generation module is specifically configured to: adjusting the second raster image based on the three-dimensional spatial information and the category of the target object included in the other frames to obtain the second image to be updated,
wherein the second image to be updated indicates a position of the deletion object in the other frame and a position of a target object of a non-change category in the other frame.
16. The apparatus of claim 9, further comprising:
and the map information determination module is used for determining, in an offline map, first map information corresponding to the target live-action image based on a pose of the image acquisition device for the target live-action image.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 8.
20. A method of updating a high-precision map, comprising:
determining an image corresponding to an acquired live-action image in the high-precision map to obtain a third image to be updated;
inputting the acquired live-action image into a first feature extraction network of a target detection model to obtain first feature data;
inputting the third image to be updated into a second feature extraction network of the target detection model to obtain second feature data;
inputting the first characteristic data and the second characteristic data into a target detection network of the target detection model to obtain the update information of the acquired live-action image relative to the third image to be updated; and
updating the high-precision map based on the update information,
wherein the target detection model is trained based on the sample image pair generated by the method of any one of claims 1-8.
CN202110793044.4A 2021-07-13 2021-07-13 Method and device for generating sample image pair and method for updating high-precision map Active CN113514053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110793044.4A CN113514053B (en) 2021-07-13 2021-07-13 Method and device for generating sample image pair and method for updating high-precision map

Publications (2)

Publication Number Publication Date
CN113514053A true CN113514053A (en) 2021-10-19
CN113514053B CN113514053B (en) 2024-03-26

Family

ID=78066711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110793044.4A Active CN113514053B (en) 2021-07-13 2021-07-13 Method and device for generating sample image pair and method for updating high-precision map

Country Status (1)

Country Link
CN (1) CN113514053B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189145A (en) * 2023-02-15 2023-05-30 清华大学 Extraction method, system and readable medium of linear map elements

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190204094A1 (en) * 2017-12-29 2019-07-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, and computer readable storage medium for updating electronic map
CN108446310A (en) * 2018-02-05 2018-08-24 优视科技有限公司 Virtual streetscape map generation method, device and client device
WO2019239402A1 (en) * 2018-06-13 2019-12-19 Ride Vision Ltd. A rider assistance system and method
CN109084732A (en) * 2018-06-29 2018-12-25 北京旷视科技有限公司 Positioning and air navigation aid, device and processing equipment
CN112601928A (en) * 2018-08-23 2021-04-02 日本电信电话株式会社 Position coordinate estimation device, position coordinate estimation method, and program
US20200072610A1 (en) * 2018-08-30 2020-03-05 Mapbox, Inc. Map feature extraction system for computer map visualizations
US20210035314A1 (en) * 2018-10-12 2021-02-04 Tencent Technology (Shenzhen) Company Limited Map element extraction method and apparatus, and server
CN113091757A (en) * 2019-12-23 2021-07-09 百度在线网络技术(北京)有限公司 Map generation method and device
US20210190512A1 (en) * 2019-12-24 2021-06-24 Korea Expressway Corp. System and method of detecting change in object for updating high-definition map
CN111597986A (en) * 2020-05-15 2020-08-28 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for generating information
CN111722245A (en) * 2020-06-22 2020-09-29 北京百度网讯科技有限公司 Positioning method, positioning device and electronic equipment
CN111999752A (en) * 2020-08-25 2020-11-27 北京百度网讯科技有限公司 Method, apparatus and computer storage medium for determining road information data
CN112417953A (en) * 2020-10-12 2021-02-26 腾讯科技(深圳)有限公司 Road condition detection and map data updating method, device, system and equipment
CN112560698A (en) * 2020-12-18 2021-03-26 北京百度网讯科技有限公司 Image processing method, apparatus, device and medium
CN113091737A (en) * 2021-04-07 2021-07-09 阿波罗智联(北京)科技有限公司 Vehicle-road cooperative positioning method and device, automatic driving vehicle and road side equipment

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
万冉冉;陈娟;廖明伟;刘异;庞超;: "Image tile change detection based on a Siamese convolutional neural network", 测绘通报, no. 04 *
万冉冉;陈娟;廖明伟;刘异;庞超;: "Image tile change detection based on a Siamese convolutional neural network", 测绘通报, no. 04, 25 April 2020 (2020-04-25) *
张釜恺,芮挺,何雷,杨成松: "A low-cost method for modeling traversable indoor areas for robots", 北京航空航天大学学报 *
李子月;曾庆化;张庶;刘玉超;刘建业;: "An image recognition algorithm based on a multi-sensor-fusion-assisted AlexNet model", 中国惯性技术学报, no. 02 *
李子月;曾庆化;张庶;刘玉超;刘建业;: "An image recognition algorithm based on a multi-sensor-fusion-assisted AlexNet model", 中国惯性技术学报, no. 02, 15 April 2020 (2020-04-15) *
杨蕴秀: "Research on indoor navigation maps for simultaneous localization and mapping of mobile robots", 科学技术与工程 *

Also Published As

Publication number Publication date
CN113514053B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN108596101B (en) Remote sensing image multi-target detection method based on convolutional neural network
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
Tsai et al. Real-time indoor scene understanding using bayesian filtering with motion cues
CN113902897B (en) Training of target detection model, target detection method, device, equipment and medium
CN111968229A (en) High-precision map making method and device
CN114758337B (en) Semantic instance reconstruction method, device, equipment and medium
Ji et al. An encoder-decoder deep learning method for multi-class object segmentation from 3D tunnel point clouds
Su et al. DLA-Net: Learning dual local attention features for semantic segmentation of large-scale building facade point clouds
Buyukdemircioglu et al. Deep learning for 3D building reconstruction: A review
CN113724388B (en) High-precision map generation method, device, equipment and storage medium
Jung et al. Uncertainty-aware fast curb detection using convolutional networks in point clouds
CN113514053B (en) Method and device for generating sample image pair and method for updating high-precision map
CN113932796A (en) High-precision map lane line generation method and device and electronic equipment
Gomez-Donoso et al. Three-dimensional reconstruction using SFM for actual pedestrian classification
CN113505834A (en) Method for training detection model, determining image updating information and updating high-precision map
Kada 3D reconstruction of simple buildings from point clouds using neural networks with continuous convolutions (convpoint)
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN115546422A (en) Building three-dimensional model construction method and system and electronic equipment
Sun et al. Semantic Segmentation and Roof Reconstruction of Urban Buildings Based on LiDAR Point Clouds
Maithil et al. Semantic Segmentation of Urban Area Satellite Imagery Using DensePlusU-Net
Barowski et al. 6DoF vehicle pose estimation using segmentation-based part correspondences
CN115186047B (en) Traffic flow dynamic diagram reconstruction method, related device and computer program product
CN117036895B (en) Multi-task environment sensing method based on point cloud fusion of camera and laser radar
CN112652059B (en) Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method
Fraser et al. DeepBEV: A Conditional Adversarial Network for Bird's Eye View Generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant