CN113514053B - Method and device for generating sample image pair and method for updating high-precision map


Info

Publication number: CN113514053B
Authority: CN (China)
Prior art keywords: image, target, live-action, information
Legal status: Active
Application number: CN202110793044.4A
Other languages: Chinese (zh)
Other versions: CN113514053A
Inventors: 何雷, 宋适宇
Current Assignee: Apollo Intelligent Technology Beijing Co Ltd
Original Assignee: Apollo Intelligent Technology Beijing Co Ltd


Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/38 - Electronic maps specially adapted for navigation; Updating thereof
    • G01C21/3804 - Creation or updating of map data

Abstract

The present disclosure provides a method, apparatus, electronic device, and storage medium for generating a sample image pair, relating to the field of artificial intelligence, and in particular to the fields of computer vision, intelligent transportation, automatic driving, and deep learning. The method for generating the sample image pair comprises the following steps: determining position information of a target object included in the live-action image based on first map information corresponding to the live-action image; determining a random class combination of target objects included in the live-action image based on predetermined categories; determining position information of a deleted object for the live-action image based on the position information of the target object included in the live-action image; and generating a first image to be updated for the live-action image based on the position information of the deleted object, the position information of the target object included in the live-action image, and the random class combination, and tagging an image pair composed of the first image to be updated and the live-action image. The predetermined categories include an add category and a no-change category.

Description

Method and device for generating sample image pair and method for updating high-precision map
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the fields of computer vision, intelligent transportation, automatic driving, and deep learning, and specifically to a method, apparatus, electronic device, and storage medium for generating a sample image pair, and a method of updating a high-precision map.
Background
With the development of computer technology and network technology, automatic driving technology and intelligent navigation technology are gradually mature. Both techniques rely on high-precision maps (High definition map) that contain rich environmental information. The high-precision map may represent a traffic topology consisting of roads, traffic lights, etc.
With the development of cities and changes in traffic planning, a generated high-precision map needs to be updated continuously so that it can represent the actual traffic topology, provide support for the services offered by automatic driving and intelligent navigation technologies, and improve the user experience of those services.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device, and storage medium for generating a sample image pair, so that a target detection model can be trained based on the generated sample image pairs and thereby acquire the ability to detect changes in an image.
According to one aspect of the present disclosure, there is provided a method of generating a sample image pair, comprising: determining the position information of a target object included in the target live-action image based on first map information corresponding to the target live-action image; determining a random class combination of target objects included in the target live-action image based on the predetermined class; determining the position information of the deleted object aiming at the target live-action image based on the position information of the target object included in the target live-action image; and generating a first image to be updated for the target live-action image based on the position information of the deleted object, the position information of the target object included in the target live-action image and the random class combination, and adding a label to an image pair formed by the first image to be updated and the target live-action image, wherein the label indicates actual update information of the target live-action image relative to the first image to be updated, and the predetermined class comprises an addition class and a non-change class.
According to another aspect of the present disclosure, there is provided an apparatus for generating a sample image pair, comprising: the first position determining module is used for determining the position information of a target object included in the target live-action image based on first map information corresponding to the target live-action image; the category combination determining module is used for determining a random category combination of the target objects included in the target live-action image based on the preset category; the second position determining module is used for determining the position information of the deleted object aiming at the target live-action image based on the position information of the target object included in the target live-action image; the first image generation module is used for generating a first image to be updated aiming at the target live-action image based on the position information of the deleted object, the position information of the target object included in the target live-action image and the random class combination; and a first tag adding module for adding a tag to an image pair composed of a first image to be updated and a target live-action image, wherein the tag indicates actual update information of the target live-action image relative to the first image to be updated, and the predetermined category comprises an adding category and a no-change category.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of generating a sample image pair provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of generating a sample image pair provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of generating a sample image pair provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a method of updating a high-precision map, including: determining an image corresponding to the acquired live-action image in the high-definition map to obtain a third image to be updated; inputting the acquired live-action image into a first feature extraction network of a target detection model to obtain first feature data; inputting the third image to be updated into a second feature extraction network of the target detection model to obtain second feature data; inputting the first characteristic data and the second characteristic data into a target detection network of a target detection model to obtain update information of the acquired live-action image relative to a third image to be updated; and updating the high-precision map based on the update information. The object detection model is trained based on a sample image pair generated by the method of generating a sample image pair provided by the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a method of generating a sample image pair according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of generating a first image to be updated for a live image in accordance with an embodiment of the present disclosure;
fig. 3 is a schematic diagram of determining location information of a deletion object according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a method of generating a sample image pair based on live video in accordance with an embodiment of the present disclosure;
FIG. 5 is a flow diagram of a training method for a target detection model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the structure of an object detection model according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the structure of an object detection model according to another embodiment of the present disclosure;
FIG. 8 is a schematic diagram of the structure of an object detection model according to another embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the structure of an object detection model according to another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a parallel cross-difference unit according to an embodiment of the present disclosure;
FIG. 11 is a flowchart of a method of determining update information for an image using an object detection model according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of an apparatus for generating a sample image pair according to an embodiment of the present disclosure;
FIG. 13 is a block diagram of a training apparatus of an object detection model according to an embodiment of the present disclosure;
FIG. 14 is a block diagram of an apparatus for determining update information of an image using an object detection model according to an embodiment of the present disclosure; and
fig. 15 is a block diagram of an electronic device for implementing the methods provided by embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Both autopilot technology and intelligent navigation technology rely on high-precision maps that contain rich environmental information. The environmental information includes, for example, lane information, crosswalk information, position information of traffic lights, position information of intersections, and the like. High-precision maps are an important source of a priori knowledge, and their ability to reflect up-to-date changes in the real world must be adequately maintained by constantly updating iterations. The real world changes may include, for example, installation or removal of an indicator light, movement of the location of a portable traffic light, etc.
In the related art, tasks related to map updating include scene change detection tasks, and the technical solutions adopted fall mainly into three categories. The first category compares a pre-built 3D CAD model with a reconstructed model built by a classical Multi-View Stereo (MVS) method; this approach is time-consuming and only suitable for offline scenarios. The second category infers scene changes by comparing a newly acquired image with an original three-dimensional model; specifically, the change is inferred by comparing the voxel colors of a 3D voxel model with the pixel colors of the corresponding image. A related alternative identifies changes by re-projecting the new image onto the old image with the help of a given 3D model and comparing the inconsistent information between the new and old images. The third category makes a two-dimensional comparison between an image representing the old state of the scene and an image representing the new state of the scene, which requires preparing the 2D images in advance.
In addition to detecting changes in the scene, the change detection task for a high-precision map should also identify, for example, which elements in the map have changed and the type of each change. A simple method is to use a standard object detector to identify map elements in the image, project the map elements onto the image, associate the projections with the detections, and finally obtain the corresponding changes by cross-comparison. Object detection itself is a classical problem in computer vision, and solutions are largely divided into two categories, namely two-stage methods and one-stage methods. This simple method involves a plurality of steps, each with its own optimization objective, so the overall change detection method has difficulty achieving a globally optimal solution. For example, the object detector typically trades off accuracy against recall by setting a threshold on the detection confidence score and running Non-Maximum Suppression (NMS). Such a method also ignores important prior information from the high-precision map.
In order to accomplish the task of high-precision map change detection, the present disclosure provides an end-to-end learning method to directly detect image changes. More specifically, a target detection model is used to detect missing or redundant elements in a high-precision map. In order to incorporate the prior information of the high-precision map, elements in the map can be projected onto the image to obtain the image to be updated. The image to be updated and the live-action image are taken as inputs to the target detection model, which detects the difference between the features extracted from the two images and predicts the missing or redundant elements in the map based on this difference. The method of the present disclosure can be generalized to the task of detecting changes in any object with a regular shape and is not limited to the task described here.
This is possible because the high-precision map change detection (HD Map Change Detection, HMCD) task takes a form similar to the object detection problem. The goal is to identify changes in predefined classes of objects (e.g., traffic lights, road guide signs, speed limit signs, etc.). The position of a detected object in the image can be described using a two-dimensional bounding box (Bounding Box) and assigned the correct change category, which may include addition, deletion, no change, etc. For example, an object with the to_add attribute is an object missing from (and to be added to) the high-precision map, an object with the to_del attribute is a redundant object in the high-precision map (to be deleted), and an object with the correct attribute is an object that is unchanged in the high-precision map. Formally, for an online HMCD task that uses a single image as input, the problem to be solved can be expressed using the following formula:
D_k = f_θ(M, I_k, T_k, K),
where I_k is the k-th image frame in the video stream, T_k is the global camera pose estimated by the positioning system of the autonomous vehicle, K is the intrinsic matrix of the image acquisition device, and M is the high-precision map. D_k is a set of two-dimensional bounding boxes with corresponding change categories, predicted by the HMCD predictor f_θ based on a set of learnable parameters θ. The HMCD predictor may be the object detection model provided by the present disclosure.
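For illustration only, the formulation above can be read as a function signature. The sketch below is a minimal Python rendering of it; the data structures and names (ChangeBox, hmcd_predict) are assumptions of this sketch rather than part of the disclosure, and the body is left to the target detection model described later.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ChangeBox:
    """A two-dimensional bounding box with its predicted change category."""
    cx: float
    cy: float
    w: float
    h: float
    category: str  # one of "to_add", "to_del", "correct"


def hmcd_predict(hd_map, image_k: np.ndarray, pose_k: np.ndarray,
                 intrinsics: np.ndarray) -> List[ChangeBox]:
    """D_k = f_theta(M, I_k, T_k, K): predict change boxes for frame k.

    hd_map     -- the high-precision map M (hypothetical map object)
    image_k    -- the k-th live-action image frame I_k
    pose_k     -- global camera pose T_k estimated by the localization system
    intrinsics -- 3x3 intrinsic matrix K of the image acquisition device
    """
    raise NotImplementedError  # realized by the trained target detection model
```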
A method of generating a sample image pair for training the object detection model will be described below with reference to fig. 1 to 4.
Fig. 1 is a flow diagram of a method of generating a sample image pair according to an embodiment of the present disclosure.
As shown in fig. 1, the method 100 of generating a sample image pair of this embodiment may include operations S110 to S150.
In operation S110, position information of a target object included in a target live-action image is determined based on first map information corresponding to the target live-action image.
According to an embodiment of the present disclosure, the target live-action image may be, for example, an image photographed in real time during the running of the vehicle. The vehicle may be, for example, an autonomous vehicle, which may be equipped with, for example, an image acquisition device (for example, a camera) via which real-time images are acquired. The real-time image may be a separate image or may be a key frame in the acquired video data.
According to embodiments of the present disclosure, the vehicle may also be configured with a global navigation satellite system (Global Navigation Satellite System, GNSS), for example. When the image acquisition device acquires the target live-action image, the high-precision map image positioned by the global navigation satellite system can be used as the first map information. The high-precision map may be a map updated up to date. The high-definition map has therein location information of a target object, such as the aforementioned predefined object class. The position information may be global positioning information or may be two-dimensional plane coordinate information with respect to a high-precision map image converted from the global positioning information. The conversion rule is similar to that adopted in the navigation system in the related art, and will not be described in detail herein.
In an embodiment, taking the target object as the traffic indicator as an example, through operation S110, the position information of the traffic indicator in the collected target live-action image may be obtained. In an embodiment, the position information may be represented by position information of a bounding box for the target object, for example. For example, the position information of the target object includes the center point coordinates of the bounding box, and the width and height of the bounding box.
According to the embodiment of the disclosure, a live-action image meeting a predetermined position constraint condition can be selected from a plurality of acquired live-action images, and a target live-action image of a generated sample is obtained. The sample may also be generated based on a pre-acquired live-action image of the target. For example, an image satisfying the predetermined position constraint condition may be acquired from a live-action image library as the target live-action image. The predetermined position constraint condition may include that a distance between the image acquisition device and the target object in the real scene is smaller than or equal to a predetermined distance. For example, the constraint condition may be that a distance between a target object in a real scene and a center of the image capturing apparatus is 100m or less. Alternatively, the predetermined position constraint may include that an angle between a direction of the image capturing device and a direction of a reverse normal of the target object is not greater than a predetermined angle. The predetermined angle may be, for example, a small value of 30 ° or the like. By limiting the predetermined position constraint condition, the definition of the target object in the target live-action image of the generated sample can be improved. Therefore, the accuracy of the target detection model obtained based on the generated sample training can be improved, and the updating accuracy of the high-precision map can be improved.
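As a non-authoritative illustration of this selection step, the sketch below checks the two constraints mentioned above for a single target object; the function name, the vector layout, and the default thresholds of 100 m and 30 degrees (taken from the examples above) are assumptions of this sketch.

```python
import numpy as np


def satisfies_position_constraints(camera_center: np.ndarray,
                                   camera_forward: np.ndarray,
                                   object_center: np.ndarray,
                                   object_normal: np.ndarray,
                                   max_distance: float = 100.0,
                                   max_angle_deg: float = 30.0) -> bool:
    """Check the predetermined position constraints for one target object.

    camera_forward and object_normal are unit vectors in world coordinates.
    """
    # Constraint 1: camera-to-object distance no greater than max_distance.
    if np.linalg.norm(object_center - camera_center) > max_distance:
        return False
    # Constraint 2: angle between the camera viewing direction and the
    # reverse normal of the target object no greater than max_angle_deg.
    cos_angle = float(np.dot(camera_forward, -object_normal))
    cos_angle = np.clip(cos_angle, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)) <= max_angle_deg
```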
According to an embodiment of the present disclosure, first map information corresponding to a target live-action image in an offline map (e.g., a high-definition map) may be determined based on a pose of an image capture device for the target live-action image. For example, a region of interest (Region of Interest, ROI) may be located from a high-definition map based on a global pose (e.g., including a position and a direction, etc.) of the image capturing device at the time of capturing a live-action image of a target, and a two-dimensional image of the region of interest may be taken as the first map information. In this way, the obtained first map information can be more matched with the target live-action image, and thus the accuracy of the obtained target object position information is improved.
In operation S120, a random class combination of the target object included in the target live-action image is determined based on the predetermined class.
According to an embodiment of the present disclosure, the target object element may be queried based on the previously determined first map information. When the high-precision map is the latest updated map, the queried target object element can represent the target object included in the target live-action image, so that the position information and the number of the target object can be obtained. For generating the sample image pair, the category of any target object in the target live-action image may be set, and the target object may be added to the non-updated high-precision map or may be unchanged from the non-updated high-precision map.
In the HMCD task, the predetermined categories may include the add category and the no-change category. If there is one target object in the target live-action image, the random class combination may be that the target object is of the add category or that the target object is of the no-change category. If there are n target objects in the target live-action image, n being an integer greater than 1, then there are 2^n possible class combinations of the n target objects, and this embodiment may randomly select one of these 2^n cases as the random class combination.
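A minimal sketch of sampling one of the 2^n class combinations is given below, assuming the categories are encoded as the strings "to_add" and "correct"; the helper name and data layout are illustrative only.

```python
import random
from typing import Dict, List, Optional


def sample_class_combination(target_object_ids: List[str],
                             seed: Optional[int] = None) -> Dict[str, str]:
    """Randomly assign each target object either the add category ("to_add")
    or the no-change category ("correct"); with n objects this samples one of
    the 2**n possible combinations uniformly."""
    rng = random.Random(seed)
    return {obj_id: rng.choice(["to_add", "correct"])
            for obj_id in target_object_ids}
```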
In operation S130, position information of a deletion object for the target live-action image is determined based on position information of the target object included in the target live-action image.
According to the embodiments of the present disclosure, it is possible to take an arbitrary region other than a region represented by the position information of the target object in the target live-action image as the position of the deletion object, and take the position information of the arbitrary region as the position information of the deletion object. Specifically, the position coordinates of any one of the areas other than the area indicated by the position information of the target object can be stored as the position information of the deletion object for the target live-action image. The position information of the deletion object is similar to the position information of the target object described above, and may include the center position coordinates of the arbitrary area and the height and width of the arbitrary area. The arbitrary region may be a region occupied by the bounding box of the deletion object.
In operation S140, a first image to be updated for the target live-action image is generated based on the position information of the deletion object, the position information of the target object included in the target live-action image, and the random category combination.
In operation S150, a label is added to an image pair composed of a first image to be updated and a target live-action image.
According to the embodiments of the present disclosure, a base image having pixel values of 0 each of the same size as the target live-action image can be generated. Based on the random class combination, a target object of a no-change class is determined. And then adding an interested region capable of indicating the position of the object into the basic image based on the position information of the deleted object and the position information of the target object in the unchanged category to obtain a Mask (Mask) image, and taking the Mask image as a first image to be updated. Thus, the object added with the category in the target live-action image is the object added relative to the first image to be updated. The object with no change category in the target live-action image is the object which is unchanged relative to the first image to be updated. The interested area indicating the position of the deleted object in the first image to be updated is the area where the object deleted by the target live-action image relative to the first image to be updated is located.
Based on the above, the embodiment can form an image pair by the target live-action image and the first image to be updated aiming at the target live-action image, and obtain the live-action image and the image to be updated of the input target detection model. And labeling the image pairs based on the position information of the target objects of all the categories to obtain image pairs with bounding boxes indicating the target objects, and correspondingly labeling the categories of the target objects as the categories of the bounding boxes to finish the labeling operation of the image pairs to obtain the image pairs with labels. The labeled image pair may be used as a sample image pair. For example, the added label may be embodied by a bounding box added for the live-action image in the image pair and a category added for the bounding box, such that the added label may indicate actual update information of the live-action image relative to the image to be updated. By adding the tag, supervised training of the target detection model can be achieved. Specifically, the target detection model may be trained based on the difference between the predicted update information obtained by the target detection model and the actual update information.
As can be seen from the foregoing, when the embodiment of the disclosure constructs a sample image pair, the location information of the target object is obtained by locating based on the high-precision map, the location information of the deleted object is calculated based on the locations of the target objects of the added category and the unchanged category, and the category combination of the target object is randomly determined, so that the automatic generation of the image to be updated and the label can be realized, without manual pre-labeling, and the labor cost for generating the sample image pair can be reduced. Moreover, as the image to be updated in the sample image pair can be automatically generated without recall in advance, the situation that the sample image pair is difficult to collect due to sample sparseness can be avoided, the difficulty of generating the sample image pair is reduced, and the training precision of the target detection model is improved conveniently.
Fig. 2 is a schematic diagram of generating a first image to be updated for a live image in accordance with an embodiment of the present disclosure.
According to the embodiment of the present disclosure, the first image to be updated may be obtained in such a manner that the high-precision map image is converted into the raster image. Because each pixel position of the high-precision map image is priori information, compared with the technical scheme of obtaining the first image to be updated based on the live-action image, the embodiment can improve the efficiency and the precision of generating the first image to be updated.
For example, the embodiment may first generate a first raster image for the target live-action image based on the first map information and the position information of the target object included in the target live-action image. The first raster image may indicate a location of a target object included in the target live-action image. Specifically, as shown in fig. 2, in this embodiment 200, a high-definition map image 210, which is first map information, may be first converted into a first raster image 220 based on the position information of the target object. In the first raster image 220, other areas except for the area corresponding to the target object are filled with black having pixel values of 0. The region corresponding to the target object is filled with white with pixel values of 255.
After obtaining the first raster image 220, the operation of generating a first image to be updated may include: and adjusting the first raster image based on the position information of the deleted object and the random class combination to obtain a first image to be updated. For example, the pixel color of the region corresponding to the deletion object in the first raster image may be changed to white based on the position information 230 of the deletion object to indicate the position of the deletion object. The target object of the added category may be determined according to the random category combination, and based on the position information of the target object of the added category (i.e., the position information 240 of the added object), the pixel color of the area corresponding to the added object in the first raster image is changed from white to black, so as to remove the indication information of the position of the added object, and obtain the first image to be updated 250. For example, if the region of the first raster image 220 in which the third pixel color from left to right is white indicates the position of the addition object, the pixel color of the region in the obtained first image 250 to be updated is black. If the region corresponding to the deletion object is located on the right side of the region of white color of the fifth pixel from left to right in the first raster image 220, a white region is newly added to the right side of the region of white color of the fifth pixel in the obtained first image 250 to be updated. The resulting first image to be updated 250 may indicate the location of the deleted object in the target live-action image and the location of the target object of the unchanged object class in the target live-action image.
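The construction and adjustment of the raster image described above can be sketched as follows, assuming bounding boxes given as (cx, cy, w, h) in pixel coordinates; this is an illustrative NumPy sketch, not the exact implementation of the embodiment.

```python
import numpy as np


def _fill_box(img: np.ndarray, box, value: int) -> None:
    """Fill the axis-aligned box (cx, cy, w, h) with the given pixel value."""
    cx, cy, w, h = box
    x0, x1 = int(cx - w / 2), int(cx + w / 2)
    y0, y1 = int(cy - h / 2), int(cy + h / 2)
    img[max(y0, 0):y1, max(x0, 0):x1] = value


def build_first_image_to_update(image_shape, target_boxes, categories,
                                deleted_box) -> np.ndarray:
    """Build the mask-style image to be updated for one live-action image.

    target_boxes -- {object_id: (cx, cy, w, h)} from the HD-map projection
    categories   -- {object_id: "to_add" | "correct"} (the random combination)
    deleted_box  -- (cx, cy, w, h) chosen for the deletion object
    """
    h, w = image_shape
    raster = np.zeros((h, w), dtype=np.uint8)          # base image, all black
    for obj_id, box in target_boxes.items():
        _fill_box(raster, box, 255)                    # first raster image
    for obj_id, box in target_boxes.items():
        if categories[obj_id] == "to_add":
            _fill_box(raster, box, 0)                  # remove added objects
    _fill_box(raster, deleted_box, 255)                # add the deletion object
    return raster
```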
Fig. 3 is a schematic diagram of determining location information of a deletion object according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, before the image sample pair is generated, the position information of the target objects included in the plurality of target live-action images with equal sizes can be counted to obtain the position distribution information of the target objects as the predetermined position distribution information. When a sample pair is generated, the position of the deletion object is located based on the predetermined position distribution information. Therefore, the obtained position information of the deleted object can be more in line with the actual scene, and the learning capacity and model precision of the target detection model can be improved conveniently.
As shown in fig. 3, this embodiment 300 may recall a plurality of target live-action images from a live-action image library based on the aforementioned predetermined location constraints. In order to facilitate feature extraction of the target detection model and comparison of the live-action image and the image to be updated, the sizes of the target live-action images are equal to each other, and the sizes of the target live-action images are equal to the sizes of the corresponding image to be updated. The method for determining the position information of the target object based on the foregoing can obtain map information corresponding to each of a plurality of target live-action images, and obtain the position information of the target object included in each live-action image. Based on this, a raster image for each live image may be generated, resulting in a plurality of raster images 310. By counting the position information of the target object included in the plurality of target live-action images, for example, a position distribution density map 320 of the target object can be generated. The position distribution density map 320 is set as predetermined position distribution information.
The embodiment may determine the position information of the deletion object by determining an area having a distribution density greater than a predetermined density based on the predetermined position distribution information. Specifically, a region having a distribution density greater than a predetermined density may be determined based on the predetermined position distribution information, and mapped to a first region in the first raster image for the target live-action image 340, with the first region being a candidate region.
Illustratively, the candidate region may be represented based on the background image 330 as shown in fig. 3, considering that the respective target live-action images are equal in size. The size of the background image 330 is equal to the size of the first raster image. The candidate region 331 is obtained by mapping the region having the distribution density greater than the predetermined density into the background image 330.
In generating the sample image pair based on the target live-action image 340 of the plurality of target live-action images, the position information 350 of the target object may be obtained first according to the first map information corresponding to the target live-action image 340. Based on the position information, a position profile 360 of the target object in the target live-action image can be obtained. The position profile 360 is equal in size to the background image 330 and the target live-action image 340.
After the candidate region is obtained, a second region may be removed from the candidate region, and the position information of the deletion object may be determined based on the remaining region outside the second region. The second region is a region indicating the positions of the target objects included in the target live-action image 340. For example, as shown in fig. 3, the part of the candidate region 331 that overlaps the regions representing the target objects A, B, C in the position map 360 may be used as the second region. After the second region is removed from the candidate region, a background image 370 indicating the remaining region may be obtained. The embodiment may select, from the remaining region, an arbitrary area capable of accommodating one target object, and take the position information of that area as the position information of the deletion object for the target live-action image 340. For example, the position of the dot 371 in the background image 370 may be the center position of that area.
According to the embodiments of the present disclosure, the size of the deletion object may be based on the average value of the sizes of the target objects included in the target live-action image. This is because the newly updated high-definition map does not have the deletion object, and the size of the deletion object cannot be determined directly based on the first map information corresponding to the target live-action image. In this way, an arbitrary region capable of accommodating the deletion object can be determined based on the size of the deletion object. In this way, the accuracy of the determined position information of the deletion object can be improved.
For example, the target size may be determined based on the size of the aforementioned second region and the number of target objects included in the target live-action image. And determining any area with the size equal to the target size in the other areas to obtain the position information of the deleted object. The target size is the size of the deleted object, and is the average value of the sizes of the target objects included in the target live-action image. For example, in embodiment 300, the target object included in the target live-action image includes A, B, C. The size of the second region is the sum of the widths and the sum of the heights of the candidate region 331 occupied by A, B, C. Dividing the sum of the widths and the sum of the heights by the number 3 of the target objects to obtain the widths and the heights in the target size.
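A possible sketch of this selection of the deletion-object position and size is shown below, assuming the predetermined position distribution is given as a per-pixel density map and boxes as (cx, cy, w, h); the threshold handling and helper names are assumptions.

```python
import numpy as np


def choose_deletion_box(density: np.ndarray, target_boxes,
                        density_threshold: float):
    """Pick a deletion-object box: its size is the mean size of the target
    objects, and its centre lies in a high-density candidate region that does
    not overlap any target object.

    density      -- per-pixel position-distribution density (H x W), assumed
                    normalized to [0, 1]
    target_boxes -- list of (cx, cy, w, h) boxes of the target objects
    """
    h, w = density.shape
    # Target size: average width/height over the target objects in the image.
    mean_w = sum(b[2] for b in target_boxes) / len(target_boxes)
    mean_h = sum(b[3] for b in target_boxes) / len(target_boxes)

    candidate = density > density_threshold            # candidate (first) region
    occupied = np.zeros((h, w), dtype=bool)
    for cx, cy, bw, bh in target_boxes:                # second region to remove
        x0, x1 = int(cx - bw / 2), int(cx + bw / 2)
        y0, y1 = int(cy - bh / 2), int(cy + bh / 2)
        occupied[max(y0, 0):y1, max(x0, 0):x1] = True
    candidate &= ~occupied

    ys, xs = np.nonzero(candidate)
    if len(xs) == 0:
        raise ValueError("no valid region for the deletion object")
    idx = np.random.randint(len(xs))                   # any point in the region
    return float(xs[idx]), float(ys[idx]), mean_w, mean_h
```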
Based on the above-described flow, the position information of the target object and the position information of the deletion object included in the target live-action image 340 can be obtained. For ease of understanding, this embodiment provides a position image 380, which position image 380 may indicate the position of the target object A, B, C and the position of the deletion object D. The first raster image for the target live-action image 340 may then be adjusted based on the category of the target object in the target live-action image and the deletion category of the deleted object. For example, in A, B, C, D, B is an added category, A, C is an unchanged category, and D is a deleted category. This embodiment should delete the information indicating the position of B in the first raster image, add the information indicating the position of D, and thereby obtain the first image 390 to be updated for the target live-action image 340. The first image to be updated 390 and the target live-action image 340 may constitute an image pair. Subsequently, the position of A, B, C, D may be marked in the first image to be updated 390 and the target live-action image 340 based on the position of A, B, C, D, and a label indicating the object class is marked for the position, so that the addition of the label is completed, and a sample image pair is obtained.
Fig. 4 is a schematic diagram of a method of generating a sample image pair based on a live video in accordance with an embodiment of the present disclosure.
Video frames in the acquired video data may also be employed to generate sample image pairs in accordance with embodiments of the present disclosure. Each video frame in the video data may be used as a target live-action image to generate a sample image pair comprising the each video frame using the method of generating a sample image pair described previously.
According to an embodiment of the present disclosure, when the pair of sample images is generated using video data, for example, a target frame may also be extracted from the video data first, and the target frame is used as a target live-action image to generate a pair of sample images including the target frame. Then, based on the position information and the category of the target object included in the target frame, the position information and the category of the target object included in other video frames except the target frame in the video data are deduced. This is because video data is composed of a plurality of video frames that are consecutive, and the plurality of video frames are not independent from each other. Accordingly, the target live-action image described previously may include a target frame in the live-action video. The target Frame may be, for example, a Key Frame (Key Frame) in the live-action video. In order to facilitate understanding, the video data may be segmented in advance, so that the target objects included in each video frame in each live-action video obtained by segmentation are the same.
Illustratively, as shown in fig. 4, the process of generating a sample image pair based on a live-action video in this embodiment 400 may include operations S410 to S440. In operation S410, a target frame is first selected, and a tag 401 for the target frame is generated. The tag 401 may include a tag for a target object, which may belong to at least one of three categories: the deletion category, the addition category, and the no-change category. In an embodiment, the live-action video may include M frames of images, M being a natural number greater than 1, and the k-th video frame O_k is selected from the live-action video as the target frame. Suppose the target frame includes a target object A_1 of the addition category and a target object A_2 of the no-change category, and that, relative to the first image to be updated, a target object A_3 of the deletion category has been deleted. The resulting tag 401 then includes a bounding box indicating the target object A_1 of the addition category, a bounding box indicating the target object A_2 of the no-change category, and a bounding box indicating the target object A_3 of the deletion category.
Taking the target object A_1 of the addition category and the target object A_2 of the no-change category as references, operation S420 may be performed to add a tag for the unchanged object and a tag for the added object to frames in the live-action video other than the target frame. For example, the method described above for determining the position information of the target object included in the target live-action image may be adopted, and the position information of the target objects included in the other frames may be determined based on the second map information corresponding to the other frames. The second map information is similar to the first map information described above and will not be described again. Then, the categories of the target objects included in the other frames are determined based on the categories of the target objects included in the target frame. Based on the categories and position information of the target objects included in the other frames, a tag for the unchanged object and a tag for the added object can be added to the other frames.
Illustratively, based on the conversion relationship between two-dimensional space and three-dimensional space adopted when generating the high-precision map, the three-dimensional target objects A_1 and A_2 in the high-precision map may be projected to obtain the bounding boxes of A_1 and A_2 in the video frame.
Since the newly updated high-precision map contains no deleted objects, this embodiment may perform three-dimensional reconstruction of the deletion object through operation S430 to obtain the three-dimensional spatial information of the deletion object. For example, the three-dimensional spatial information of the deletion object may be determined based on the depths of the target objects included in the target frame and the position information of the deletion object, so as to realize the three-dimensional reconstruction of the deletion object based on this spatial information. The average depth of the target objects included in the target frame may first be taken as the depth of the deletion object. The three-dimensional spatial information of the deletion object is then obtained based on the depth of the deletion object, the position information of the bounding box of the deletion object, and the positioning information of the image acquisition device when the target frame was acquired. When obtaining the three-dimensional spatial information, the intrinsic matrix of the image acquisition device also needs to be considered, for example.
In one embodiment, the three-dimensional spatial information P_to_del may be computed using the following formula:
P_to_del = R^(-1) [ K^(-1) (d_to_del × p_to_del) - t ],
where R is the rotation matrix of the image acquisition device, t is its translation vector, K is its intrinsic matrix, d_to_del is the depth of the deletion object, and p_to_del is the position information of the bounding box of the deletion object.
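A small sketch of this reconstruction and of the subsequent re-projection into another frame is given below, assuming homogeneous pixel coordinates for the bounding-box position, a 3x3 rotation matrix R and a translation vector t; it illustrates the formula above rather than the exact implementation.

```python
import numpy as np


def reconstruct_deletion_object(p_to_del: np.ndarray, d_to_del: float,
                                K: np.ndarray, R: np.ndarray,
                                t: np.ndarray) -> np.ndarray:
    """P_to_del = R^(-1) [ K^(-1) (d_to_del * p_to_del) - t ].

    p_to_del -- homogeneous pixel coordinates (u, v, 1) of the deletion box
    d_to_del -- assumed depth (mean depth of the target objects in the frame)
    K, R, t  -- camera intrinsics, rotation and translation for the target frame
    """
    cam_point = np.linalg.inv(K) @ (d_to_del * p_to_del)  # back-project to camera frame
    return np.linalg.inv(R) @ (cam_point - t)             # transform to world frame


def project_to_frame(P_world: np.ndarray, K: np.ndarray, R: np.ndarray,
                     t: np.ndarray) -> np.ndarray:
    """Project the reconstructed 3D point into another frame's image plane."""
    cam_point = R @ P_world + t
    uv = K @ cam_point
    return uv[:2] / uv[2]
```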
After obtaining the three-dimensional space information, the embodiment may generate a second image to be updated for the other frame based on the three-dimensional space information, the position information and the category of the target object of the other frame, and tag the image pair composed of the second image to be updated and the other frame.
According to an embodiment of the present disclosure, the method of generating the second image to be updated is similar to the method of generating the first image to be updated described above. For example, before the second image to be updated is generated, a second raster image for the other frame may be generated based on the second map information and the position information of the target objects included in the other frame. The second raster image indicates the positions of the target objects included in the other frame. The way the second raster image is generated is similar to the way the first raster image is generated, and this generation may be performed before or after operation S430. After the second raster image is obtained, it can be adjusted based on the three-dimensional spatial information and the categories of the target objects included in the other frame to obtain the second image to be updated. This is similar to the method of adjusting the first raster image described above, except that, before adjusting the second raster image, this embodiment needs to project the reconstructed three-dimensional deletion object into two dimensions (operation S440) to obtain the position information of the deletion object in the other frame. The resulting second image to be updated may indicate the position of the deletion object in the other frame and the positions of the target objects of the no-change category in the other frame.
Similarly, an image pair composed of the second image to be updated and the other frames may be tagged based on the obtained position information of the deletion object, the position information of the target object included in the other frames, and the category, in a similar manner to the aforementioned method of tagging an image pair composed of the first image to be updated and the target live-action image.
When the video data includes a plurality of video frames in addition to the target frame, a sample image pair can be obtained for each video frame by adopting the method described above. In this way, based on one video data, a sample image pair set composed of a plurality of sample image pairs can be obtained.
According to the embodiment of the disclosure, the deleted object is subjected to three-dimensional reconstruction based on the deleted object of the target frame, and the sample image pair is generated based on the obtained three-dimensional space information, so that the consistency of the position of the deleted object in each video frame in the live-action video can be ensured, the efficiency of generating the label of the deleted object is improved to a certain extent, and the cost of generating the sample image pair based on the live-action video is reduced.
According to an embodiment of the present disclosure, based on the foregoing methods, two sample sets may be generated: one (SICD) containing a plurality of sample image pairs constructed from single-frame live-action images, and one (VSCD) containing a plurality of sample image pair sets constructed from video data. The live-action images and the video data can be acquired under conditions of little environmental change, so as to comprehensively verify the differences in detection results obtained with different target detection models. In order to test the target detection model, a test set can be formed based on the images to be updated and the live-action images used in an existing high-precision map updating process. The labels of each image pair in the test set may be manually annotated. To further increase sample diversity, this embodiment may collect live-action images and video data in multiple cities.
According to the embodiment of the disclosure, when the image pairs are marked based on the position information and the category of the target object and the deleted object, for example, bounding boxes with different colors can be added in the image pairs for different types of objects, so that the method of the embodiment of the disclosure is popularized to the change detection of multiple types of objects. Experiments show that when the image is subjected to rasterization in other detection tasks, the selection of bounding boxes with different colors does not have obvious influence on the detection result.
After the sample set is constructed, the target detection model can be trained based on the sample image pair in the sample set.
The training method of the object detection model provided by the present disclosure will be described in detail below with reference to fig. 5 to 9.
Fig. 5 is a flow chart of a training method of a target detection model according to an embodiment of the present disclosure.
As shown in fig. 5, the training method 500 of the object detection model of this embodiment may include operations S510 to S540. The object detection model includes a first feature extraction network, a second feature extraction network, and an object detection network.
In operation S510, a first live-action image in a sample image pair is input into a first feature extraction network to obtain first feature data.
Wherein the sample image pair is generated by the method described above, the sample image comprising a live image and an image to be updated. And the sample image pair has a label indicating the actual update information of the live image relative to the image to be updated. The actual update information includes update category and update location information. The update category includes an add category, a delete category, and a no change category, and the update location information is represented by the location of the bounding box of the target object.
According to an embodiment of the present disclosure, the first feature extraction network may comprise a convolutional neural network or a deep residual network, or the like. For example, the convolutional neural network may be the DarkNet-53 network from the YOLO v3 (You Only Look Once v3) detector, which offers a good balance between accuracy and inference speed. It will be appreciated that the types of first feature extraction networks described above are merely examples to facilitate an understanding of the present disclosure, and the present disclosure is not limited thereto.
In operation S520, a first image to be updated in the sample image pair is input to the second feature extraction network to obtain second feature data.
According to an embodiment of the present disclosure, the second feature extraction network may comprise a convolutional neural network. Considering that the first image to be updated is a raster image, the second feature extraction network may have a simpler network structure than the first feature extraction network. For example, the second feature extraction network may be a shallow convolutional neural network with a stride of 2 and 11 layers, where the convolution kernel size may be 3×3 to increase the receptive field on the raster image.
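A minimal PyTorch sketch of such a shallow extractor is shown below; the 3×3 kernels and the 11 convolutional layers follow the description above, while the channel widths and the choice of placing the stride-2 layers first (so that the output stride roughly matches a 32x-downsampled backbone feature map) are assumptions of this sketch.

```python
import torch
from torch import nn


class RasterFeatureExtractor(nn.Module):
    """Shallow convolutional network for the rasterized map image.

    11 convolutional layers with 3x3 kernels; the first five layers use
    stride 2 to enlarge the receptive field (the exact interleaving and the
    channel widths are assumptions of this sketch).
    """

    def __init__(self, in_channels: int = 1, base_channels: int = 16):
        super().__init__()
        layers = []
        channels = in_channels
        for i in range(11):
            out_channels = min(base_channels * (2 ** (i // 2)), 256)
            stride = 2 if i < 5 else 1
            layers += [
                nn.Conv2d(channels, out_channels, kernel_size=3,
                          stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.LeakyReLU(0.1, inplace=True),
            ]
            channels = out_channels
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)
```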
In operation S530, the first feature data and the second feature data are input into the target detection network, and prediction update information of the first live-action image with respect to the first image to be updated is obtained.
According to embodiments of the present disclosure, the target detection network may, for example, employ an anchor-based bounding-box detection method to obtain the prediction update information. For example, the target detection network may first concatenate the first feature data and the second feature data, and then obtain the prediction update information based on the concatenated feature vector. The target detection network may be the portion of a one-stage or two-stage detection network other than its feature extraction structure.
The prediction update information may include predicted location information of the target object (which may be represented by a location of a bounding box of the target object) and a category of the target object. Wherein the class of the target object may be represented by the probability that the target object belongs to the added class, the deleted class, and the unchanged class.
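To illustrate the splicing step and the form of the prediction, a hypothetical anchor-based detection head is sketched below; it assumes the two feature maps share the same spatial size and predicts, per anchor, four box offsets, one confidence and three change-category scores.

```python
import torch
from torch import nn


class ChangeDetectionHead(nn.Module):
    """Hypothetical anchor-based head: concatenates live-action and raster
    features and predicts (4 box offsets + 1 confidence + 3 change classes)
    per anchor at each grid cell."""

    def __init__(self, live_channels: int, raster_channels: int,
                 num_anchors: int = 3, num_classes: int = 3):
        super().__init__()
        in_channels = live_channels + raster_channels
        self.pred = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(256, num_anchors * (4 + 1 + num_classes), kernel_size=1),
        )

    def forward(self, live_feat: torch.Tensor,
                raster_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([live_feat, raster_feat], dim=1)  # splice the features
        return self.pred(fused)  # shape (N, A*(4+1+3), H, W)
```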
In operation S540, the target detection model is trained based on the actual update information and the predicted update information.
According to embodiments of the present disclosure, a gradient descent algorithm or a back propagation algorithm or the like may be employed to train the target detection model based on the difference between the actual update information and the predicted update information. For example, the value of the predetermined loss function may be determined based on the actual update information and the predicted update information, followed by training the target detection model based on the value of the predetermined loss function. Specifically, the value of a parameter in the target detection model when the predetermined loss function is minimum can be determined, and the target detection model is adjusted based on the value of the parameter.
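A generic training-step sketch under these assumptions is shown below; the model interface, the loss function and the label format are hypothetical and stand in for the predetermined loss described next.

```python
def train_step(model, optimizer, live_image, image_to_update, labels, loss_fn):
    """One gradient-descent update of the target detection model.

    model is assumed to take the live-action image and the image to be
    updated and return prediction update information; loss_fn compares it
    with the actual update information carried by the labels.
    """
    optimizer.zero_grad()
    predictions = model(live_image, image_to_update)
    loss = loss_fn(predictions, labels)   # predetermined loss: actual vs. predicted info
    loss.backward()                       # back-propagate the difference
    optimizer.step()                      # gradient-descent parameter update
    return float(loss.detach())
```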
The predetermined loss function may be constituted by a class loss function and a positioning loss function, for example. The class Loss function may be a cross entropy Loss function or a focus Loss (Focal Loss) modified cross entropy Loss function, etc., and the positioning Loss function may be an average absolute error function, a mean square error Loss function, etc.
In an embodiment, the object detection network may employ a YOLO v3 network, and the obtained prediction update information may further include a confidence level, which is used to represent the probability that the bounding box has an object. The predetermined penalty function may include not only the category penalty function and the location penalty function, but also a confidence penalty function in an effort to improve the performance of the complex classification. The confidence loss function is divided into two parts, one part is targeted confidence loss and the other part is non-targeted confidence loss. In this way, when determining the value of the predetermined loss function, the value of the positioning loss function may be determined based on the actual position information and the predicted position information indicated by the label of the sample image pair, so as to obtain the first value. And determining the value of the confidence coefficient loss function based on the actual position information and the confidence coefficient to obtain a second value. And determining the value of the class loss function based on the actual class and the predicted class to obtain a third value. And finally taking the weighted sum of the first value, the second value and the third value as the value of the preset loss function.
Illustratively, the predetermined loss function L may be expressed using the following formula:
L = λ_1·L_GIoU + λ_2·L_conf + λ_3·L_prob

wherein L_GIoU is the positioning loss function, L_conf is the confidence loss function, and L_prob is the class loss function. λ_1, λ_2 and λ_3 are the weights assigned to L_GIoU, L_conf and L_prob, respectively. The weight values may be set according to actual requirements, for example, each may be set to 1, which is not limited in the present disclosure.
In an embodiment, the positioning loss function may, for example, use the generalized intersection-over-union (GIoU) loss as the localization measure, so as to improve the positioning accuracy of the predicted bounding boxes and, in particular, to pay more attention to non-overlapping bounding boxes. For example, the positioning loss function may be expressed using the following formula:
L_GIoU = (1/Q) · Σ_{i=1}^{Q} [ 1 − IoU(D_i, D̂_i) + |C_i \ (D_i ∪ D̂_i)| / |C_i| ]

wherein Q represents the number of bounding boxes, D_i represents the i-th bounding box in the prediction update information, D̂_i represents the i-th bounding box in the actual update information, and C_i represents the minimum enclosing convex hull of D_i and D̂_i. IoU(D_i, D̂_i) is the ratio of the intersection area of D_i and D̂_i to the union area of D_i and D̂_i.
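As an illustration only, the GIoU positioning loss above might be computed as in the following sketch, which assumes axis-aligned boxes given as (x1, y1, x2, y2) tensors; the function and variable names are hypothetical and do not come from the patent. For axis-aligned boxes, the minimum enclosing convex hull reduces to the smallest enclosing rectangle, which is what the sketch uses.

```python
import torch

def giou_loss(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """Mean GIoU loss over Q predicted/ground-truth box pairs, each of shape (Q, 4) as (x1, y1, x2, y2)."""
    # Intersection area of D_i and D_hat_i
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_gt = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = area_pred + area_gt - inter
    iou = inter / union.clamp(min=1e-7)

    # Smallest enclosing box C_i (stands in for the minimum enclosing convex hull of two boxes)
    cx1 = torch.min(pred_boxes[:, 0], gt_boxes[:, 0])
    cy1 = torch.min(pred_boxes[:, 1], gt_boxes[:, 1])
    cx2 = torch.max(pred_boxes[:, 2], gt_boxes[:, 2])
    cy2 = torch.max(pred_boxes[:, 3], gt_boxes[:, 3])
    area_c = ((cx2 - cx1) * (cy2 - cy1)).clamp(min=1e-7)

    giou = iou - (area_c - union) / area_c
    return (1.0 - giou).mean()
```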
In one embodiment, the confidence loss function may be expressed using the following formula:
L_conf = Σ_{i=0}^{S²−1} Σ_{j=0}^{B−1} ( 1_ij^obj + 1_ij^noobj ) · α · |Ĉ_i^j − C_i^j|^γ · f_ce(C_i^j, Ĉ_i^j)

wherein S² is the number of grid cells in the live-action image and B is the number of anchor boxes within each grid cell. 1_ij^obj indicates whether the j-th bounding box in the i-th cell predicts a target object: its value is 1 if so and 0 otherwise. 1_ij^noobj is the opposite of 1_ij^obj: its value is 1 if no target object is predicted and 0 otherwise. f_ce() denotes the cross entropy, C_i^j denotes the confidence of the j-th bounding box in the i-th grid cell, and Ĉ_i^j denotes the true confidence of the j-th bounding box in the i-th grid cell. The true confidence may be determined based on the actual position information of the target object: if it is determined from the actual position information that a target object exists in the i-th grid cell, the true confidence is 1, otherwise it is 0. α and γ are focal loss parameters whose values may be set according to actual requirements, for example 0.5 and 2 respectively, which is not limited in the present disclosure.
In one embodiment, the class loss function may be expressed, for example, using the following formula:
the class may include a correct indicating no change class, a to_add indicating an add class, and a to_del indicating a delete class.Indicating whether the object is present in the ith grid cell, if so, the value is 1, otherwise, the value is 0./>Representing the predicted probability that the target object in the ith grid cell belongs to the b-th class. />Representing the actual probability that the target object in the ith grid cell belongs to the b-th class. If the ith grid cell has a target object of the b category, the value of the actual probability is 1, otherwise, the value is 0.
In summary, by training the target detection model based on the sample image pairs, embodiments of the present disclosure enable the target detection model to detect image changes. The inputs of the trained target detection model are a live-action image and the raster image obtained by converting the high-precision map to be updated, and the target detection model can directly output the predicted update information, so that end-to-end detection of image changes can be achieved. Compared with the related art, the method can jointly optimize the whole detection task, thereby improving the detection accuracy. Furthermore, because the input includes the raster image obtained by converting the high-precision map, the prior information of the high-precision map can be effectively taken into account, which further improves the accuracy of the trained target detection model.
Fig. 6 is a schematic structural diagram of an object detection model according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the object detection network may make predictions of the update information based on differences between the first feature data and the second feature data. Compared with the technical scheme of predicting updated information according to the spliced data of the two characteristic data, the prediction accuracy can be improved to a certain extent.
Accordingly, the aforementioned object detection network may comprise a parallel cross difference unit and a feature detection unit. The parallel cross difference unit is used for calculating the difference between the first characteristic data and the second characteristic data. The feature detection unit is used for predicting the change of the target object of the live-action image compared with the target object indicated by the raster image according to the difference between the two feature data.
Illustratively, as shown in FIG. 6, the target detection model in this embodiment 600 includes a first feature extraction network 610, a second feature extraction network 620, and a target detection network 630. The first feature extraction network 610 may, for example, be composed of three first convolution layers having 32, 64 and 128 channels respectively, which sequentially process the first live-action image 601 to obtain the first feature data. The second feature extraction network 620 may, for example, be composed of three second convolution layers having 32, 64 and 128 channels respectively, which sequentially process the first image to be updated 602 to obtain the second feature data. The convolution kernel of the first convolution layers may be larger than the convolution kernel of the second convolution layers, because the first image to be updated is a raster image and the first live-action image 601 contains more information than the first image to be updated; this arrangement helps ensure the accuracy of feature extraction for the first live-action image 601. The target detection network 630 may include a Parallel Cross Difference (PCD) unit 631 and a feature detection unit 632, and the feature detection unit 632 may employ, for example, a Feature Decoder (FD) of a target detection model in the related art.
Illustratively, the PCD unit 631 may include, for example, an inversion layer and a fusion layer. The inversion layer is used to invert one of the first feature data and the second feature data. The fusion layer may be used to add the other of the first feature data and the second feature data to the inverted feature data, thereby obtaining the parallel cross difference feature. The FD unit 632 may process the fused feature using an anchor-based detection method to obtain the prediction update information. It will be appreciated that this structure of the PCD unit 631 is merely an example to facilitate an understanding of the present disclosure. The PCD unit 631 of the present disclosure may also, for example, compute the data obtained by subtracting the second feature data from the first feature data as a first feature difference, compute the data obtained by subtracting the first feature data from the second feature data as a second feature difference, and then splice the first feature difference and the second feature difference to obtain the parallel cross difference data. A concat() function may, for example, be employed to splice the features.
Based on this, in this embodiment, after the first feature data and the second feature data are obtained, the parallel cross difference data may be obtained using the PCD unit 631 based on the first feature data and the second feature data. The obtained parallel cross difference data is then input into the FD unit 632 to obtain the prediction update information.
For example, the first characteristic data and the second characteristic data may be input to the PCD unit 631, and parallel cross difference data may be output from the PCD unit 631. After the parallel cross difference data is input to the FD unit 632, the FD unit 632 outputs the prediction update information 603.
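The following is a minimal sketch of the second PCD variant described above (two directed feature differences spliced along the channel dimension). It is one interpretation of the text, not the patent's reference implementation, and the module name is hypothetical.

```python
import torch
import torch.nn as nn

class SimplePCD(nn.Module):
    """Parallel cross difference: concatenate (A - B) and (B - A) along the channel dimension."""

    def forward(self, feat_live: torch.Tensor, feat_map: torch.Tensor) -> torch.Tensor:
        diff_1 = feat_live - feat_map   # first feature difference
        diff_2 = feat_map - feat_live   # second feature difference
        return torch.cat([diff_1, diff_2], dim=1)  # splice, as with a concat() call

# Usage with (batch, channels, height, width) feature maps of equal shape:
# pcd = SimplePCD()
# cross_diff = pcd(first_feature_data, second_feature_data)  # channel count is doubled
```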
Fig. 7 is a schematic structural diagram of an object detection model according to another embodiment of the present disclosure.
According to the embodiment of the disclosure, not only the feature extraction units for extracting features may be provided in the first feature extraction network and the second feature extraction network, but also N feature projection layers sequentially connected after the feature extraction units may be provided to project the extracted features to N different dimensions, and prediction of update information may be performed based on the features of the N different dimensions. In this way, the object detection model is facilitated to learn features of different resolutions of the image. Meanwhile, the updated information is predicted based on the features with different resolutions, so that the precision of a prediction result can be improved. Wherein N is an integer greater than 1.
Accordingly, the process of inputting the first live-action image into the first feature extraction network to obtain the first feature data may include an operation of inputting the first live-action image into a feature extraction unit included in the first feature extraction network to obtain the feature data as the first initial data. The method also comprises the following operations: inputting the first initial data into a first projection layer of N feature projection layers included in the first feature extraction network, inputting data output by the first projection layer into a second projection layer, and the like until data output by the (N-1) th feature projection layer is input into the N th feature projection layer, wherein each projection layer in the N feature projection layers can output feature data of a first live-action image, and N data are obtained. For example, the i-th projection layer outputs i-th data of the first live-action image. The N data constitute first characteristic data.
Similarly, inputting the first image to be updated into the second feature extraction network, the process of obtaining the second feature data may include an operation of inputting the first image to be updated into a feature extraction unit included in the second feature extraction network, and taking the obtained feature data as second initial data. The following operations may also be included: and inputting the second initial data into a first projection layer in N characteristic projection layers included in the second characteristic extraction network, inputting data output by the first projection layer into the second projection layer, and the like until data output by the (N-1) th characteristic projection layer is input into the N th projection layer, wherein each projection layer in the N characteristic projection layers can output characteristic data of a first image to be updated, so as to obtain N data. For example, the jth projection layer outputs the jth data of the first image to be updated. The N data constitute second characteristic data.
Illustratively, as shown in FIG. 7, taking N as 3 as an example, in this embodiment 700 the first feature extraction network included in the target detection model includes a feature extraction unit 711, a feature projection layer 712, a feature projection layer 713, and a feature projection layer 714. The structure of the feature extraction unit 711 is similar to that of the first feature extraction network described in FIG. 6 and includes three convolution layers with 32, 64 and 128 channels respectively, which extract the feature data of the first live-action image 701. After processing by the feature projection layer 712, feature data with a size of, for example, 76×76×256 can be obtained; after processing by the feature projection layer 713, feature data with a size of, for example, 38×38×512 can be obtained; and after processing by the feature projection layer 714, feature data with a size of, for example, 19×19×1024 can be obtained. These feature data together constitute the first feature data. It is to be understood that the above feature data are merely examples to facilitate an understanding of the present disclosure, which is not limited thereto.
Similarly, the second feature extraction network included in the target detection model includes a feature extraction unit 721, a feature projection layer 722, a feature projection layer 723, and a feature projection layer 724. The structure of the feature extraction unit 721 is similar to that of the second feature extraction network described in FIG. 6 and includes three convolution layers with 32, 64 and 128 channels respectively, which extract the feature data of the first image to be updated 702. After sequential processing by the feature projection layer 722, the feature projection layer 723 and the feature projection layer 724, the corresponding feature data can be obtained, and these feature data together constitute the second feature data. It will be appreciated that the dimension of each feature projection layer in the second feature extraction network may be equal to the dimension of the corresponding feature projection layer in the first feature extraction network, so that each item of the second feature data is equal in size to the corresponding item of the first feature data.
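A hedged sketch of the two feature-extraction branches with three projection layers each might look as follows. Strides, normalization, activation functions and anything not stated above (such as the exact downsampling schedule) are illustrative assumptions; the spatial sizes noted in the comments hold for a 608×608 input.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k, stride):
    # One convolution block; the kernel size k differs between the two branches.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1))

class FeatureBranch(nn.Module):
    """Feature extraction unit (32/64/128 channels) followed by three feature projection layers."""

    def __init__(self, kernel_size):
        super().__init__()
        self.extract = nn.Sequential(conv_block(3, 32, kernel_size, stride=2),
                                     conv_block(32, 64, kernel_size, stride=2),
                                     conv_block(64, 128, kernel_size, stride=1))
        self.proj1 = conv_block(128, 256, 3, stride=2)   # e.g. 76x76x256
        self.proj2 = conv_block(256, 512, 3, stride=2)   # e.g. 38x38x512
        self.proj3 = conv_block(512, 1024, 3, stride=2)  # e.g. 19x19x1024

    def forward(self, image):
        base = self.extract(image)
        f1 = self.proj1(base)
        f2 = self.proj2(f1)
        f3 = self.proj3(f2)
        return f1, f2, f3  # multi-scale feature data of one branch

# The live-action branch may use a larger kernel than the raster branch, for example:
# live_branch, raster_branch = FeatureBranch(kernel_size=5), FeatureBranch(kernel_size=3)
```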
According to the embodiment of the disclosure, after obtaining the feature data of N different dimensions, the target detection network in the target detection model may calculate the difference value for the data of the same dimension in the first feature data and the second feature data, to obtain N difference values. And finally, predicting updated information based on the data obtained after the N difference values are spliced. Therefore, the predicted update information fully considers the characteristics of a plurality of different dimensions, and the problems of missing elements, more redundant elements, localized noise influence and the like existing in the prediction update information based on a pair of characteristic data are avoided.
According to the embodiment of the disclosure, after obtaining the feature data of N different dimensions, the target detection network in the target detection model may perform, for example, one target detection based on the data of the same dimension in the first feature data and the second feature data, and perform N target detections in total. And finally, screening the N target detection results as candidate prediction information to obtain final prediction update information. Accordingly, the target detection network may include N parallel cross difference units, N feature detection units, and an information screening sub-network, and one parallel cross difference unit and one feature detection unit may constitute one detection sub-network, resulting in N detection sub-networks in total. And each detection sub-network performs target detection once to obtain candidate prediction information. The information screening sub-network screens out final prediction information from the obtained N candidate prediction information.
Specifically, based on the i-th data of the first live-action image and the i-th data of the first image to be updated, the parallel cross difference unit in the (N-i+1)-th detection sub-network may be employed to obtain the (N-i+1)-th parallel cross difference data. The (N-i+1)-th parallel cross difference data is input into the feature detection unit in the (N-i+1)-th detection sub-network to obtain the (N-i+1)-th candidate prediction information (namely, the detection result). The obtained N pieces of candidate prediction information are input into the information screening sub-network to obtain the prediction update information of the first live-action image relative to the first image to be updated.
As shown in fig. 7, taking N as 3 as an example, 3 PCD units and 3 FD units may constitute a first detection sub-network 731, a second detection sub-network 732, and a third detection sub-network 733. The data output by the feature projection layer 724 and the data output by the feature projection layer 714 are input into the PCD unit 7311 in the first detection sub-network 731 to obtain first parallel cross difference data. The first parallel cross difference data is input into the FD unit 7312 in the first detection sub-network 731 to obtain first candidate prediction information. Similarly, the data output by the feature projection layer 723 and the data output by the feature projection layer 713 are processed via the PCD unit 7321 and the FD unit 7322 in the second detection sub-network 732 to obtain second candidate prediction information. The data output by the feature projection layer 722 and the data output by the feature projection layer 712 are processed via the PCD unit 7331 and the FD unit 7332 in the third detection sub-network 733 to obtain third candidate prediction information. The information screening sub-network 734 may, for example, employ the non-maximum suppression (NMS) method to screen out the final prediction update information 703 from the three pieces of candidate prediction information.
Fig. 8 is a schematic structural view of an object detection model according to another embodiment of the present disclosure.
According to the embodiments of the present disclosure, when the prediction of the update information is performed based on the features of different dimensions, for example, the parallel cross difference data obtained based on the features of low dimensions may be transferred to the parallel cross difference unit that obtains the parallel cross difference data based on the features of high dimensions. Therefore, the powerful functions of the deep learning network are fully utilized, the complex problem is well solved, and the model precision is improved.
Illustratively, each of the 1 st to (N-1) th detection subnetworks in the aforementioned N detection subnetworks may further include a feature propagation unit to propagate the respective resultant cross-difference data to a parallel cross-difference unit in a next detection subnetwork.
In an embodiment, when i is smaller than N, to obtain the (N-i+1)-th parallel cross difference data, the (N-i)-th parallel cross difference data and the i-th data of the first live-action image may first be input into the feature propagation unit of the (N-i)-th detection sub-network to obtain fused data. The fused data and the i-th data of the first image to be updated are then input into the parallel cross difference unit of the (N-i+1)-th detection sub-network, which outputs the (N-i+1)-th parallel cross difference data. For the case where i is equal to N, the i-th data of the first live-action image and the i-th data of the first image to be updated may be input into the parallel cross difference unit in the 1st detection sub-network to obtain the 1st parallel cross difference data, because the 1st detection sub-network has no preceding detection sub-network. The feature data of the live-action image is fused with the parallel cross difference data because it can reflect fine-grained features of the target object in the real scene; however, the disclosure is not limited to this. For example, in another embodiment, the input data of the feature propagation unit may instead be the feature data of the image to be updated and the parallel cross difference data, so that the feature data of the image to be updated is fused with the parallel cross difference data first.
For example, as shown in fig. 8, in this embodiment 800, the first live-action image 801 with a size of 608×608 is processed by the feature extraction unit 821 and the three feature projection layers 822 to 824 to obtain its 3rd data, and the first image to be updated 802, also with a size of 608×608, is processed by the feature extraction unit 811 and the three feature projection layers 812 to 814 to obtain its 3rd data. The two items of 3rd data are input into the PCD unit 8311 in the first detection sub-network 831. The output of the PCD unit 8311 is input into the FD unit 8312 and the first feature propagation (FP) unit 8313, and the candidate update information output after processing by the FD unit 8312 is input into the information screening sub-network 834. Meanwhile, with the corresponding 2nd feature data as another input of the FP unit 8313, the output obtained after the FP unit 8313 processes the data output by the PCD unit 8311 and that feature data is used as an input of the PCD unit 8321 in the second detection sub-network 832. Likewise, with the feature data output by the feature projection layer 823 as another input of the PCD unit 8321, the outputs obtained after the PCD unit 8321 processes the data output by the FP unit 8313 and that feature data are input into the FD unit 8322 and the FP unit 8323. Similarly, the candidate update information output by the FD unit 8322 and the candidate update information output by the FD unit 8332 in the third detection sub-network 833 (obtained by processing the data output by the PCD unit 8331) can be obtained. The information screening sub-network 834 may output the prediction update information 803 after processing the candidate update information based on the NMS method.
Through the embodiment, the features output by the PCD unit in a coarser granularity can be transmitted through the FP unit to amplify the feature scale to a finer granularity, and then the features are spliced with the live-action image features according to the scale proportion, so that fusion of the features in different dimensions can be realized, and the accuracy of the obtained prediction update information is improved.
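As a sketch under the assumption that features are (batch, channel, height, width) tensors, the coarse-to-fine propagation described above could be written as follows. The upsampling factor and channel arithmetic mirror the description of FIG. 10 later in this document, but the module itself and its names are illustrative.

```python
import torch
import torch.nn as nn

class FeaturePropagation(nn.Module):
    """FP unit sketch: shrink channels, upsample to the finer scale, splice with the
    finer-scale live-action feature, then project back for the next PCD unit."""

    def __init__(self, c):
        super().__init__()
        self.reduce = nn.Conv2d(c // 2, c // 4, kernel_size=1)        # c/2 -> c/4 channels
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")   # coarse -> finer resolution
        self.project = nn.Conv2d(3 * c // 4, c // 2, kernel_size=1)   # 3c/4 -> c/2 channels

    def forward(self, pcd_out, finer_live_feat):
        x = self.upsample(self.reduce(pcd_out))          # (B, c/4, 2H, 2W)
        x = torch.cat([x, finer_live_feat], dim=1)       # finer_live_feat assumed to be (B, c/2, 2H, 2W)
        return self.project(x)                           # fed to the PCD unit of the next sub-network
```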
According to an embodiment of the present disclosure, the VSCD dataset is used to study the impact of the PCD unit and the FP unit on the detection results, based on the target detection model depicted in FIG. 6. As shown in Table 1 below, Diff-Net refers to the model of FIG. 6 after removal of the PCD unit; in it, the features extracted from the different branches (the live-action image branch and the image-to-be-updated branch) are simply spliced together as the features for the downstream units. The downstream units may include neither the PCD unit nor the FP unit, may include only the PCD unit (which may constitute the model of FIG. 7), or may include both the PCD unit and the FP unit (which may constitute the model of FIG. 8), whereby the differences between the features extracted by the different branches are computed by the downstream units. The performance of these downstream-unit structures was evaluated using the mean average precision (MAP) as a metric, and the evaluation results are shown in Table 1. From Table 1 it can be seen that introducing the PCD unit can increase MAP by 8%, and further introducing the FP unit on top of the PCD unit, to propagate features from coarser to finer levels, can further increase MAP by 7.6%.
TABLE 1
Fig. 9 is a schematic structural view of an object detection model according to another embodiment of the present disclosure.
According to the embodiment of the disclosure, when the live-action image is a video frame in the live-action video, the cyclic neural network unit can be further arranged in the target detection model, so that the correlation between the prediction update information and time can be captured, and the accuracy of the target detection model is improved. This is because in the traffic field of object detection, the collected data is typically a video stream, not a sparse image.
Illustratively, the recurrent neural network unit may form the 1st detection sub-network together with a parallel cross difference unit, a feature detection unit, and a feature propagation unit. Thus, when the 1st parallel cross difference data is to be obtained, the N-th data of the first live-action image and the N-th data of the first image to be updated may be input into the parallel cross difference unit in the 1st detection sub-network, and the output of that parallel cross difference unit is taken as initial cross difference data. The initial cross difference data and the current state data of the recurrent neural network unit are then input into the recurrent neural network unit, and the 1st parallel cross difference data is output by the recurrent neural network unit. The current state data of the recurrent neural network unit is obtained by the recurrent neural network unit based on the video frame preceding the current video frame.
The recurrent neural network element may be a long-short-term memory network element. The state data of the long-short term memory network element includes hidden state data and cell state data. Wherein the hidden state data may be regarded as the 1 st parallel cross difference data. The long and short term memory network element may be implemented using an ELU activation function and a layer normalization function, which is not limited by the present disclosure.
Illustratively, as shown in fig. 9, the object detection model in this embodiment 900 is similar to the object detection model described in fig. 8, except that in the object detection model of this embodiment, the PCD unit 9311, the FD unit 9312, the FP unit 9313, and the convolutional long-short term memory network unit (Conv LSTM) 9314 together constitute a first detection sub-network. It will be appreciated that like reference numerals in fig. 9 and 8 refer to like elements.
In this embodiment, the set of sample image pairs includes at least two sample image pairs, each constituted by two video frames in the live-action video. The two video frames are the (k-1)-th video frame and the k-th video frame in the live-action video, respectively. The (k-1)-th video frame, as live-action image p_{k-1} (902), forms a first sample image pair together with the image to be updated p'_{k-1} (901). The k-th video frame, as live-action image p_k (902'), forms a second sample image pair together with the image to be updated p'_k (901'). In the process of training the target detection model, the two images in the first sample image pair may be input into the target detection model in parallel; the current state data C_{p_{k-2}} (903) and H_{p_{k-2}} (904) of the convolutional long short-term memory network unit 9314, together with the data output by the PCD unit 9311, are taken as the inputs of the convolutional long short-term memory network unit 9314, and the updated state data C_{p_{k-1}} (903') and H_{p_{k-1}} (904') can be obtained. The state data H_{p_{k-1}} (904') may be input as parallel cross difference data into the FD unit 9312 and the FP unit 9313, and the prediction update information 905 is obtained after the subsequent processing. Subsequently, the two images in the second sample image pair may be input into the target detection model in parallel; the current state data C_{p_{k-1}} (903') and H_{p_{k-1}} (904') of the convolutional long short-term memory network unit 9314, together with the data output by the PCD unit 9311, are taken as the inputs of the convolutional long short-term memory network unit 9314, and the updated state data C_{p_k} (903'') and H_{p_k} (904'') can be obtained. The state data H_{p_k} (904'') may be input as parallel cross difference data into the FD unit 9312 and the FP unit 9313, and the prediction update information 905' is obtained after the subsequent processing. During training, the sample image pairs formed by the video frames in the live-action video can be sequentially input into the target detection model to obtain a plurality of pieces of prediction update information, and the target detection model is optimized once based on the plurality of pieces of prediction update information.
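A hedged sketch of how the convolutional LSTM state might be carried from frame to frame during training is shown below. The ConvLSTM cell itself is kept abstract, and the model interface and all names are illustrative rather than taken from the patent.

```python
import torch

def run_video_pairs(model, conv_lstm, sample_pairs, detector_head):
    """sample_pairs: list of (live_frame, raster_image) tensors ordered by time (frame k-1, k, ...)."""
    h, c = None, None                          # hidden/cell state H, C of the ConvLSTM unit;
                                               # a real cell would initialize zero states when these are None
    predictions = []
    for live_frame, raster_image in sample_pairs:
        pcd_out = model(live_frame, raster_image)   # initial cross difference data at the coarsest scale
        h, c = conv_lstm(pcd_out, (h, c))           # update the state using the previous frame's state
        predictions.append(detector_head(h))        # hidden state H serves as the 1st parallel cross
                                                    # difference data passed to the FD/FP processing
    return predictions                              # one optimization step uses all of these predictions
```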
Fig. 10 is a schematic diagram of a structure of a parallel cross difference unit according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 10, in this embodiment 1000 the parallel cross difference unit 1010 may include a first inversion layer 1011, a second inversion layer 1012, a first splicing layer 1013, a second splicing layer 1014, a third splicing layer 1015, and a data fusion layer 1016. Accordingly, when the (N-i+1)-th parallel cross difference data is obtained using the PCD unit in the (N-i+1)-th detection sub-network, the first data input into the PCD unit may be taken as the input of the first inversion layer 1011 and processed by the first inversion layer 1011 to obtain first inverted data. For example, the first data may be the i-th data of the image to be updated extracted by the feature extraction structure. Similarly, the second data input into the PCD unit may be taken as the input of the second inversion layer 1012 and processed by the second inversion layer 1012 to obtain second inverted data. For example, if i is equal to N, the second data may be the N-th data of the live-action image; if i is smaller than N, the second data may be the fused data obtained by fusing the i-th data of the live-action image with the (N-i)-th parallel cross difference data. Taking the first data and the second data as the inputs of the first splicing layer 1013, first spliced data can be obtained. Taking the second data and the first inverted data as the inputs of the second splicing layer 1014, second spliced data can be obtained. Taking the first data and the second inverted data as the inputs of the third splicing layer 1015, third spliced data can be obtained. Finally, the first spliced data, the second spliced data and the third spliced data are input into the data fusion layer 1016, thereby obtaining the (N-i+1)-th parallel cross difference data.
Illustratively, when the PCD unit forms the first detection sub-network, the data output by the data fusion layer 1016 may be input into the recurrent neural network unit, thereby taking the data output by the recurrent neural network unit as the 1 st parallel cross difference data.
Illustratively, as shown in fig. 10, the first feature extraction network and the second feature extraction network may each further be provided with a conversion layer after each feature projection layer, so as to convert the feature data obtained by the first feature extraction network and the feature data obtained by the second feature extraction network from c channel dimensions to c/2 channel dimensions. Where N is 3, the index d of the converted feature data may take the values 4, 5 and 6. The first data is the converted feature data of the image to be updated, and the second data is the converted feature data of the live-action image, or the fused data obtained based on that feature data and the (N-i)-th parallel cross difference data. The size of the convolution kernel in the two conversion layers may be, for example, 3×3.
Illustratively, as shown in FIG. 10, the PCD unit 1010 may also include a plurality of sequentially connected convolution layers 1017 for converting the channel number of the data output by the data fusion layer 1016, so that the obtained feature data can better express the difference between the features of the live-action image and the features of the image to be updated. For example, the number of convolution layers 1017 is 4, and the convolution kernel sizes of the 4 convolution layers are 3×3, 1×1, 3×3, and 1×1, respectively. The 4 convolution layers are respectively used to convert the output of the data fusion layer 1016 from 3c/2 channel dimensions to c channel dimensions, from c channel dimensions to c/2 channel dimensions, from c/2 channel dimensions to c channel dimensions, and from c channel dimensions to c/2 channel dimensions, finally obtaining the feature data output by the PCD unit 1010.
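The detailed PCD structure above could be sketched as follows. It is one reading of the description: so that the channel counts work out to the stated 3c/2 input of the first refinement convolution, the three pairwise combinations are treated as element-wise sums and the data fusion layer as a channel-wise concatenation, which is an assumption; all names are illustrative.

```python
import torch
import torch.nn as nn

class DetailedPCD(nn.Module):
    """PCD sketch for inputs with c/2 channels each: invert, form three pairwise combinations,
    fuse to 3c/2 channels, then refine with 3x3/1x1/3x3/1x1 convolutions (3c/2 -> c -> c/2 -> c -> c/2)."""

    def __init__(self, c):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3 * c // 2, c, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(c, c // 2, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(c // 2, c, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(c, c // 2, 1),
        )

    def forward(self, first_data, second_data):
        first_neg, second_neg = -first_data, -second_data      # first / second inversion layers
        comb_1 = first_data + second_data                       # first pairwise combination
        comb_2 = second_data + first_neg                        # second: second minus first
        comb_3 = first_data + second_neg                        # third: first minus second
        fused = torch.cat([comb_1, comb_2, comb_3], dim=1)      # data fusion -> 3c/2 channels
        return self.refine(fused)                               # parallel cross difference data (c/2 channels)
```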
This output feature data may be used as the input of the FD unit 1020 and the FP unit 1030 in the detection sub-network. It will be appreciated that, when the detection sub-network includes a recurrent neural network unit, the feature data is taken as the input of the recurrent neural network unit, and the output data of the recurrent neural network unit is taken as the input of the FD unit 1020 and the FP unit 1030 in the detection sub-network.
As shown in fig. 10, the FD unit 1020 may be provided with a convolution layer 1021 having a convolution kernel size of 3×3 for increasing the channel dimension of the input data from c/2 to c. A convolution layer 1022, whose convolution kernel size may be 1×1, is then employed to generate the proposed regions, thereby obtaining an output tensor of size S×S×[3×(num_class+5)]. Here num_class represents the number of target object categories (for example 3, the categories being the added, deleted and unchanged classes described above), 5 represents the 4 channels of the bounding box position plus 1 channel of the confidence, and 3 represents the number of anchor boxes in each of the S² grid cells. The value of S may, for example, be 7.
Similar to the YOLO v3 model, the FD unit 1020 has two branches for performing change detection of the target object: one branch outputs the change class of the target object using a softmax operation, and the other branch estimates the geometric position (u, v, w, h) of the target object based on the width prior w and the height prior h. Finally, the NMS method may be employed to eliminate redundant detections.
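For illustration, an FD head with the output layout described above might look like the following sketch. The decoding of box offsets against the anchor priors is simplified, and the class and module names are hypothetical.

```python
import torch
import torch.nn as nn

class FDHead(nn.Module):
    """FD unit sketch: 3x3 conv (c/2 -> c), then 1x1 conv producing, per grid cell,
    3 anchors x (4 box channels + 1 confidence + num_class class scores)."""

    def __init__(self, c, num_class=3, num_anchors=3):
        super().__init__()
        self.num_class, self.num_anchors = num_class, num_anchors
        self.expand = nn.Conv2d(c // 2, c, 3, padding=1)
        self.predict = nn.Conv2d(c, num_anchors * (num_class + 5), 1)

    def forward(self, x):
        raw = self.predict(self.expand(x))                       # (B, 3*(num_class+5), S, S)
        b, _, s, _ = raw.shape
        raw = raw.view(b, self.num_anchors, self.num_class + 5, s, s)
        boxes = raw[:, :, 0:4]                                   # (u, v, w, h) geometry channels
        conf = torch.sigmoid(raw[:, :, 4])                       # objectness confidence
        cls = torch.softmax(raw[:, :, 5:], dim=2)                # correct / to_add / to_del scores
        return boxes, conf, cls
```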
As shown in fig. 10, the FP unit 1030 may include a convolution layer 1031 and an upsampling layer 1032. The channel dimension of the input data may be reduced from c/2 to c/4 via the convolution layer 1031 and the upsampling layer 1032. When the FP unit 1030 belongs to one of the 1st to (N-1)-th detection sub-networks, the feature data with channel dimension c/4 may be fused with the feature data of the live-action image at the corresponding scale. When the first feature extraction network and the second feature extraction network are provided with a conversion layer after each feature projection layer, the feature data with channel dimension c/4 may be fused with the converted feature data via a concat() function. As shown in fig. 10, the FP unit 1030 may also be provided with another convolution layer 1033 to obtain finer-scale feature data to be transferred to the PCD unit in the next detection sub-network. For example, the convolution kernel size of the convolution layer 1033 is 1×1, and the channel dimension of the data can be converted from 3c/4 to c/2 by processing through the convolution layer 1033.
It will be appreciated that the above structural arrangements of PCD units, FD units and PF units are merely examples to facilitate an understanding of the present disclosure, which is not limited thereto. For example, the convolution kernel size of the convolution layers in each unit, the number of the convolution layers, and the like can be set according to actual requirements.
The present disclosure uses the collected SICD dataset and VSCD dataset to test a target detection model trained based on the model structures in fig. 6, 8 and 10, and performance of the several models is evaluated by taking MAP as a metric, and the evaluation results are shown in table 2 below. According to the test result, compared with the traditional method, the performance of the three models is greatly improved, the network model learned from end to end realizes the joint optimization of the change detection task, and the overall performance is improved. The models in fig. 8 and 10 have significantly improved performance compared to the model in fig. 6. In terms of video data, the model in fig. 10 performs better, reaching 76.1% MAP.
TABLE 2
The present disclosure also introduces a data set R-VSCD acquired in an actual scene, the data in the data set indicating changes in the target object in the actual scene. Based on the data set R-VSCD, the present disclosure evaluates the performance of the target detection model trained from the model structures in the foregoing fig. 6, 8, and 10 in a high-precision map change detection scenario. Because of the limited amount of data collected by the data set R-VSCD, no meaningful MAPs can be generated. The present disclosure employs top-1 precision to evaluate the performance of each model in the high-precision map change detection scenario. The top-1 precision refers to the accuracy rate of the category with the highest probability conforming to the actual category, if the category with the highest probability conforms to the actual category in the prediction result, the prediction is accurate, otherwise, the prediction is incorrect. The evaluation results are shown in table 3 below. Based on the evaluation result, the accuracy of the model in fig. 10 can be 81%. Wherein the data set R-VSCD may include data collected from a plurality of different cities. Because the settings of the target objects such as traffic lights in different cities are obviously different, the method can provide higher challenges for the target detection model to a certain extent, and can indicate that the generalization capability of the end-to-end learning network model provided by the disclosure is stronger to a certain extent.
Method                 Top-1 precision
YOLOv3+D               0.558
Diff_Net               0.725
Diff_Net+ConvLSTM      0.810

TABLE 3
Through testing, the present disclosure finds that features at the coarse scale are more focused on larger objects, while features at the fine scale are more focused on smaller objects, consistent with the design objectives of the present disclosure. Through extraction and comparison of a plurality of scale features, accurate identification of objects with different sizes can be realized, and the accuracy of the determined prediction update information is improved.
According to embodiments of the present disclosure, the target detection model of the present disclosure is trained and implemented using the PaddlePaddle platform and the TensorFlow framework. When training the target detection model, 8 NVIDIA Tesla P40 graphics processors may be used, running on a workstation. During training, the Adam optimization algorithm may, for example, be employed, with the learning rate set to 1e-4. The batch size during training is 48, and training is stopped when the epoch count reaches 40. For training of the model shown in fig. 10, the batch size is set to 8.
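As an illustration of the reported training setup (Adam, learning rate 1e-4, batch size 48, 40 epochs), a generic training loop might look like the sketch below. It is a PyTorch-style sketch rather than the PaddlePaddle/TensorFlow code actually used, and the dataset object, label format and model interface are assumptions.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, loss_fn, epochs=40, batch_size=48, lr=1e-4, device="cuda"):
    """Minimal loop mirroring the reported settings; 'dataset' is assumed to yield
    (live_image, raster_image, label) sample pairs with tensor labels."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):                       # training stops after 40 epochs
        for live_image, raster_image, label in loader:
            live_image = live_image.to(device)
            raster_image = raster_image.to(device)
            label = label.to(device)
            prediction = model(live_image, raster_image)
            loss = loss_fn(prediction, label)         # the predetermined loss L described above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```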
Based on the training method of the target detection model, the invention also provides a method for determining the updated information of the image by adopting the target detection model. This method will be described below in connection with fig. 11.
Fig. 11 is a flowchart of a method of determining update information of an image using an object detection model according to an embodiment of the present disclosure.
As shown in fig. 11, the method 1100 of determining update information of an image using an object detection model of this embodiment may include operations S1110 to S1140. The target detection model is trained by the training method.
In operation S1110, a second image to be updated corresponding to the second live-action image is determined.
According to an embodiment of the present disclosure, an image corresponding to a second live-action image in an offline map may be first determined, and the image may be taken as an initial image. The initial image is then converted to a raster image based on the target object comprised by the initial image, resulting in a second image to be updated. It is to be understood that the method for determining the second image to be updated is similar to the method for determining the raster image described above, and will not be described herein.
It will be appreciated that the image in the offline map corresponding to the second live-action image may be determined based on the pose of the image capture device for the second live-action image. The gesture of the image acquisition device aiming at the second live-action image is the gesture when the image acquisition device acquires the second live-action image. The method for obtaining the corresponding image based on the posture of the image capturing device is similar to the method described above, and will not be repeated here.
In operation S1120, the second live-action image is input into the first feature extraction network of the target detection model to obtain third feature data. This operation S1120 is similar to the operation S510 described above, and will not be described again.
In operation S1130, the second image to be updated is input to the second feature extraction network of the object detection model to obtain fourth feature data. This operation S1130 is similar to the operation S520 described above, and will not be described again.
In operation S1140, the third feature data and the fourth feature data are input into the target detection network of the target detection model, so as to obtain update information of the second live-action image relative to the second image to be updated. This operation S1140 is similar to the operation S530 described above, and will not be described again.
According to an embodiment of the present disclosure, the aforementioned offline map may be a high-precision map. The method for determining the update information of the image by adopting the target detection model can be applied to a scene of updating a high-precision map.
Correspondingly, the disclosure also provides a method for updating the high-precision map, which can firstly determine an image corresponding to an acquired live-action image (such as a third live-action image) in the high-precision map to obtain a third image to be updated. And then the method for determining the update information of the image by adopting the target detection model is adopted to obtain the update information of the third live-action image relative to the third image to be updated. So that the high-definition map can be updated based on the update information. For example, if the determined update information includes a target object of a deletion category and position information of the target object, the target object in the high-precision map may be located based on the position information, and the located target object may be deleted from the high-precision map. The method for obtaining the third image to be updated is similar to the method for obtaining the second image to be updated described above, and will not be described herein.
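The overall map-update flow described above can be summarized by the following sketch. The rendering helper, the model interface and the handling of the added and unchanged classes are assumptions for illustration, not methods defined by the present disclosure.

```python
def update_high_precision_map(hd_map, live_image, camera_pose, model, render_raster):
    """Sketch of the update loop: render the image to be updated from the HD map at the
    camera pose, run the trained target detection model, then apply the predicted changes."""
    raster_image = render_raster(hd_map, camera_pose)          # third image to be updated
    update_info = model(live_image, raster_image)              # assumed list of (category, position) predictions
    for category, position in update_info:
        if category == "to_del":
            hd_map.remove_element_at(position)                 # locate and delete the target object
        elif category == "to_add":
            hd_map.add_element_at(position)                    # insert a newly detected target object
        # "correct" (unchanged) predictions require no modification
    return hd_map
```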
Based on the method for generating the sample image pair, the disclosure also provides a device for generating the sample image pair. The device will be described in detail below in connection with fig. 12.
Fig. 12 is a block diagram of an apparatus for generating a sample image pair according to an embodiment of the present disclosure.
As shown in fig. 12, the apparatus 1200 of generating a sample image pair of this embodiment may include a first location determination module 1210, a category combination determination module 1220, a second location determination module 1230, a first image generation module 1240, and a first label addition module 1250.
The first position determining module 1210 is configured to determine position information of a target object included in a live-action image based on first map information corresponding to the live-action image. In an embodiment, the first location determining module 1210 may be configured to perform the operation S110 described above, which is not described herein.
The category combination determination module 1220 is configured to determine a random category combination of the target objects included in the live-action image based on the predetermined categories. Wherein the predetermined categories may include an add category and an unchanged category. In an embodiment, the category combination determining module 1220 may be used to perform the operation S120 described above, which is not described herein.
The second location determining module 1230 is configured to determine location information of a deletion object for the live-action image based on location information of a target object included in the live-action image. In an embodiment, the second location determining module 1230 may be used to perform the operation S130 described above, which is not described herein.
The first image generating module 1240 is configured to generate a first image to be updated for the live-action image based on the position information of the deleted object, the position information of the target object included in the live-action image, and the random class combination. In an embodiment, the first image generating module 1240 may be configured to perform the operation S140 described above, which is not described herein.
The first tagging module 1250 may be used to tag an image pair consisting of a first image to be updated and a live image. In an embodiment, the first label adding module 1250 may be used to perform the operation S150 described above, and will not be described herein.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair may further include a second image generating module for generating a first raster image for the live-action image, based on the first map information and the position information of the target object included in the live-action image, the first raster image indicating the position of the target object included in the live-action image. The first image generating module 1240 is specifically configured to: and adjusting the first raster image based on the position information of the deleted object and the random class combination to obtain a first image to be updated. The first image to be updated indicates the position of the deleted object in the live-action image and the position of the target object of the unchanged class in the live-action image.
According to an embodiment of the present disclosure, the second position determining module includes: a candidate region determination submodule for determining a first region in the first raster image based on the predetermined position distribution information, the first region including, as a candidate region, a region whose distribution density determined based on the position distribution information is greater than a predetermined density; and a position determination sub-module for determining position information of the deletion object based on other areas than a second area among the candidate areas, the second area including an area indicating a position of the target object included in the live-action image. Wherein the predetermined position distribution information is determined based on position information of a target object included in the plurality of live-action images, the plurality of live-action images being equal in size to each other.
According to an embodiment of the present disclosure, the position determining sub-module includes a size determining unit and a position determining unit. The size determining unit is used for determining the target size based on the size of the second area and the number of target objects included in the live-action image. The position determining unit is used for determining any area with the size equal to the target size in other areas to obtain the position information of the deleted object.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair may further include an image obtaining module, configured to obtain, from a live-action image library, an image that meets a predetermined position constraint condition, and obtain the live-action image.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair may further include a third location determining module, a category determining module, a spatial information determining module, a third image generating module, and a second tag adding module. The third position determining module is used for determining position information of a target object included in other frames, except for the target frame, of the live-action video based on second map information corresponding to the other frames. The category determination module is used for determining the categories of the target objects included in other frames based on the categories of the target objects included in the target frames. The spatial information determining module is used for determining three-dimensional spatial information of the deleted object based on the depth of the target object included in the target frame and the position information of the deleted object. The third image generation module is used for generating a second image to be updated for other frames based on the three-dimensional space information and the position information and the category of the target object included in the other frames. The second label adding module is used for adding labels to the image pairs formed by the second image to be updated and other frames.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair may further include a fourth image generating module configured to generate a second raster image for another frame based on the second map information and the position information of the target object included in the other frame, where the second raster image indicates the position of the target object included in the other frame. The third image generation module is used for adjusting the second grid image based on the three-dimensional space information and the category of the target object included in other frames to obtain a second image to be updated. The second image to be updated indicates the position of the deletion object in the other frame and the position of the target object of the unchanged class in the other frame.
According to an embodiment of the present disclosure, the apparatus 1200 for generating a sample image pair may further include a map information determining module configured to determine first map information corresponding to the live-action image in the offline map based on a pose of the image capturing apparatus for the live-action image.
Based on the training method of the target detection model provided by the disclosure, the disclosure also provides a training device of the target detection model. The device will be described in detail below in connection with fig. 13.
Fig. 13 is a block diagram of a training apparatus of an object detection model according to an embodiment of the present disclosure.
As shown in fig. 13, the training apparatus 1300 of the object detection model of this embodiment may include a first data obtaining module 1310, a second data obtaining module 1320, an update information predicting module 1330, and a model training module 1340. The target detection model comprises a first feature extraction network, a second feature extraction network and a target detection network.
The first data obtaining module 1310 is configured to input a first live-action image in the sample image pair into the first feature extraction network, to obtain first feature data. In an embodiment, the first data obtaining module 1310 may be configured to perform the operation S510 described above, which is not described herein.
The second data obtaining module 1320 is configured to input the first image to be updated in the sample image pair into the second feature extraction network to obtain second feature data. Wherein the sample image pair has a label indicating actual update information of the first live-action image relative to the first image to be updated. In an embodiment, the second data obtaining module 1320 may be used to perform the operation S520 described above, which is not described herein.
The update information prediction module 1330 is configured to input the first feature data and the second feature data into the target detection network, so as to obtain predicted update information of the first live-action image relative to the first image to be updated. In an embodiment, the update information prediction module 1330 may be used to perform the operation S530 described above, which is not described herein.
Model training module 1340 is configured to train the target detection model based on the actual update information and the predicted update information. In an embodiment, the model training module 1340 may be used to perform the operation S540 described above, which is not described herein.
According to an embodiment of the present disclosure, an object detection network includes a parallel cross difference unit and a feature detection unit. The update information prediction module 1330 may include a cross difference acquisition sub-module and an update prediction sub-module. The cross difference obtaining sub-module is used for obtaining parallel cross difference data by adopting a parallel cross difference unit based on the first characteristic data and the second characteristic data. The updating prediction sub-module is used for inputting the obtained parallel cross difference data into the feature detection unit to obtain prediction update information.
According to an embodiment of the disclosure, the first feature extraction network and the second feature extraction network each comprise a feature extraction unit and N feature projection layers connected in sequence, N being an integer greater than 1. The first data obtaining module may include a first data obtaining sub-module and a second data obtaining sub-module. The first data obtaining submodule is used for inputting the first live-action image into a feature extraction unit included in the first feature extraction network to obtain first initial data. The second data obtaining submodule is used for inputting the first initial data into a first projection layer in N feature projection layers included in the first feature extraction network to obtain the ith data of the first live-action image output by the ith projection layer. The second data obtaining module comprises a third data obtaining sub-module and a fourth data obtaining sub-module. The third data obtaining sub-module is used for inputting the first image to be updated into a feature extraction unit included in the second feature extraction network to obtain second initial data. The fourth data obtaining sub-module is used for inputting the second initial data into the first projection layer of the N feature projection layers included in the second feature extraction network to obtain the j-th data of the first image to be updated output by the j-th projection layer.
According to an embodiment of the present disclosure, a target detection network includes an information screening sub-network, N parallel cross difference units, and N feature detection units. Wherein a parallel cross difference unit and a feature detection unit form a detection sub-network. The cross difference obtaining sub-module is used for obtaining (N-i+1) th parallel cross difference data by adopting a (N-i+1) th parallel cross difference unit in the detection sub-network based on the i-th data of the first live-action image and the i-th data of the first image to be updated. The update prediction submodule comprises a candidate information obtaining unit and an information screening unit. The candidate information obtaining unit is used for inputting the (N-i+1) th parallel cross difference data into the feature detection unit in the (N-i+1) th detection sub-network to obtain the (N-i+1) th candidate prediction information. The information screening unit is used for inputting the obtained N candidate prediction information into the information screening sub-network to obtain the prediction update information of the first live-action image relative to the first image to be updated.
According to an embodiment of the present disclosure, the 1 st to (N-1) th detection subnetworks each further comprise a feature propagation unit. The above-described cross-difference obtaining sub-module may include a data fusion unit and a cross-difference obtaining unit. The data fusion unit is used for inputting the (N-i) th parallel cross difference data and the (i) th data of the first live-action image into the characteristic propagation unit of the (N-i) th detection sub-network under the condition that i is smaller than N, so as to obtain fusion data. The cross difference obtaining unit is used for inputting the fused data and the ith data of the first image to be updated into the parallel cross difference unit of the (N-i+1) th detection sub-network to obtain the (N-i+1) th parallel cross difference data. The cross difference obtaining unit is further used for inputting the ith data of the first live-action image and the ith data of the first image to be updated into the parallel cross difference unit in the 1 st detection sub-network under the condition that i is equal to N, so as to obtain the 1 st parallel cross difference data.
According to an embodiment of the present disclosure, the first live-action image is a video frame, and the 1st detection sub-network further includes a recurrent neural network unit. The cross difference obtaining unit obtains the 1st parallel cross difference data as follows: the N-th data of the first live-action image and the N-th data of the first image to be updated are input into the parallel cross difference unit of the 1st detection sub-network to obtain initial cross difference data; the initial cross difference data and the state data of the recurrent neural network unit are then input into the recurrent neural network unit to obtain the 1st parallel cross difference data. The state data is produced by the recurrent neural network unit from the video frame preceding the current frame.
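A minimal stand-in for such a recurrent unit is sketched below: the initial cross difference data is mixed with state carried over from the previous frame, and the result serves both as the 1st parallel cross difference data and as the state for the next frame. The specific cell design is an assumption.

```python
# Minimal stand-in (assumed cell design) for the recurrent unit in the 1st detection
# sub-network: the initial cross difference data is mixed with state carried over from
# the previous video frame; the mixed result is both the 1st parallel cross difference
# data and the state used for the next frame.
import torch
from torch import nn

class RecurrentDifferenceUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.state = None                       # hidden state from the previous frame

    def forward(self, initial_diff: torch.Tensor) -> torch.Tensor:
        if self.state is None:                  # first frame of the video: no history yet
            self.state = torch.zeros_like(initial_diff)
        out = torch.tanh(self.mix(torch.cat([initial_diff, self.state], dim=1)))
        self.state = out.detach()               # carried over to the next video frame
        return out                              # 1st parallel cross difference data
```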
According to an embodiment of the present disclosure, the parallel cross difference unit may include a first inversion layer, a second inversion layer, a first splice layer, a second splice layer, a third splice layer and a data fusion layer. The cross difference obtaining sub-module obtains the (N-i+1)-th parallel cross difference data as follows: the first data input into the parallel cross difference unit is fed to the first inversion layer to obtain first inverse data; the second data input into the parallel cross difference unit is fed to the second inversion layer to obtain second inverse data; the first data and the second data are fed to the first splice layer to obtain first spliced data; the second data and the first inverse data are fed to the second splice layer to obtain second spliced data; the first data and the second inverse data are fed to the third splice layer to obtain third spliced data; and the first spliced data, the second spliced data and the third spliced data are input into the data fusion layer to obtain the (N-i+1)-th parallel cross difference data.
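Read as tensor operations, the unit can be sketched as follows; treating "inversion" as elementwise negation and the splice layers as channel-wise concatenation are assumptions about the wording above.

```python
# Sketch of the parallel cross difference unit, reading "inversion" as elementwise
# negation and the splice layers as channel-wise concatenation (both assumptions).
import torch
from torch import nn

class ParallelCrossDifferenceUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Data fusion layer: reduces the three concatenated pairs back to `channels`.
        self.fusion = nn.Conv2d(3 * 2 * channels, channels, kernel_size=1)

    def forward(self, first: torch.Tensor, second: torch.Tensor) -> torch.Tensor:
        first_inv, second_inv = -first, -second              # first / second inversion layers
        splice_1 = torch.cat([first, second], dim=1)         # first splice layer
        splice_2 = torch.cat([second, first_inv], dim=1)     # second splice layer
        splice_3 = torch.cat([first, second_inv], dim=1)     # third splice layer
        return self.fusion(torch.cat([splice_1, splice_2, splice_3], dim=1))
```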
According to embodiments of the present disclosure, the model training module 1340 may include a loss determination sub-module and a model training sub-module. The loss determination sub-module is used for determining the value of a predetermined loss function based on the actual update information and the prediction update information. The model training sub-module is used for training the target detection model based on the value of the predetermined loss function.
According to an embodiment of the present disclosure, the actual update information includes actual position information and an actual category of the target object, and the prediction update information includes predicted position information, a predicted category and a confidence of the target object. The loss determination sub-module may include a first determining unit, a second determining unit, a third determining unit and a loss determining unit. The first determining unit determines the value of the localization loss function in the predetermined loss function based on the actual position information and the predicted position information, obtaining a first value. The second determining unit determines the value of the confidence loss function in the predetermined loss function based on the actual position information and the confidence, obtaining a second value. The third determining unit determines the value of the category loss function in the predetermined loss function based on the actual category and the predicted category, obtaining a third value. The loss determining unit determines the weighted sum of the first value, the second value and the third value as the value of the predetermined loss function.
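An illustrative form of such a weighted-sum loss is sketched below. The particular localization, confidence and category losses (smooth L1, binary cross-entropy, cross-entropy), the weights, and the way confidence targets are built from the actual position information are assumptions; the embodiment only specifies a weighted sum of the three terms.

```python
# Illustrative weighted-sum loss. The specific localization, confidence and category
# losses, the weights, and the construction of confidence targets are assumptions.
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, true_boxes, pred_conf_logits, true_conf,
                   pred_class_logits, true_labels,
                   w_loc: float = 1.0, w_conf: float = 1.0, w_cls: float = 1.0) -> torch.Tensor:
    loc = F.smooth_l1_loss(pred_boxes, true_boxes)                          # first value
    conf = F.binary_cross_entropy_with_logits(pred_conf_logits, true_conf)  # second value
    cls = F.cross_entropy(pred_class_logits, true_labels)                   # third value
    return w_loc * loc + w_conf * conf + w_cls * cls                        # weighted sum
```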
Based on the method of determining update information of an image using a target detection model provided by the present disclosure, the present disclosure further provides an apparatus for determining update information of an image using a target detection model. This apparatus is described in detail below with reference to fig. 14.
Fig. 14 is a block diagram of an apparatus for determining update information of an image using a target detection model according to an embodiment of the present disclosure.
As shown in fig. 14, the apparatus 1400 for determining update information of an image using a target detection model of this embodiment may include an image determining module 1410, a third data obtaining module 1420, a fourth data obtaining module 1430 and an update information determining module 1440. The target detection model is trained using the training apparatus of the target detection model described above.
The image determining module 1410 is configured to determine a second image to be updated corresponding to the second live-action image. In an embodiment, the image determining module 1410 may be configured to perform the operation S1110 described above, which will not be repeated here.
The third data obtaining module 1420 is configured to input the second live-action image into the first feature extraction network of the target detection model to obtain third feature data. In an embodiment, the third data obtaining module 1420 may be configured to perform the operation S1120 described above, which will not be repeated here.
The fourth data obtaining module 1430 is configured to input the second image to be updated into the second feature extraction network of the target detection model to obtain fourth feature data. In an embodiment, the fourth data obtaining module 1430 may be configured to perform the operation S1130 described above, which will not be repeated here.
The update information determining module 1440 is configured to input the third feature data and the fourth feature data into the target detection network of the target detection model to obtain update information of the second live-action image relative to the second image to be updated. In an embodiment, the update information determining module 1440 may be configured to perform the operation S1140 described above, which will not be repeated here.
According to embodiments of the present disclosure, the image determining module 1410 may include an image determination sub-module and an image conversion sub-module. The image determination sub-module is used for determining the image corresponding to the second live-action image in the offline map as an initial image. The image conversion sub-module is used for converting the initial image into a raster image based on the target objects included in the initial image, so as to obtain the second image to be updated.
According to an embodiment of the present disclosure, the image determination sub-module is configured to determine an image in the offline map corresponding to the second live-action image based on a pose of the image capture device for the second live-action image.
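For illustration, the sketch below shows one simple way to rasterize the map objects selected around the camera pose into such a raster image, with each object drawn as a category-coded box. The selection of objects and their projection to pixel coordinates are assumed to happen upstream, and all names are illustrative.

```python
# One simple way (illustrative assumptions only) to rasterize the map objects selected
# around the camera pose into a raster image: each object, already projected to pixel
# coordinates upstream, is drawn as a box whose value encodes its category.
import numpy as np

def rasterize_objects(objects, height: int, width: int) -> np.ndarray:
    """objects: iterable of dicts like {"bbox": (x1, y1, x2, y2), "category_id": int}."""
    raster = np.zeros((height, width), dtype=np.uint8)
    for obj in objects:
        x1, y1, x2, y2 = (int(v) for v in obj["bbox"])
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, width), min(y2, height)
        if x1 < x2 and y1 < y2:
            raster[y1:y2, x1:x2] = obj["category_id"]    # category-coded occupancy
    return raster
```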
It should be noted that, in the technical solutions of the present disclosure, the acquisition, storage and application of any personal information of users involved comply with the relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 15 shows a schematic block diagram of an example electronic device 1500 that may be used to implement the methods of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 15, the apparatus 1500 includes a computing unit 1501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1502 or a computer program loaded from a storage unit 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data required for the operation of the device 1500 may also be stored. The computing unit 1501, the ROM 1502, and the RAM 1503 are connected to each other through a bus 1504. An input/output (I/O) interface 1505 is also connected to bus 1504.
Various components in device 1500 are connected to I/O interface 1505, including: an input unit 1506 such as a keyboard, mouse, etc.; an output unit 1507 such as various types of displays, speakers, and the like; a storage unit 1508 such as a magnetic disk, an optical disk, or the like; and a communication unit 1509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1509 allows the device 1500 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 1501 performs the various methods and processes described above, for example, at least one of the following: the method of generating a sample image pair, the training method of the target detection model, and the method of determining update information of an image using the target detection model. For example, in some embodiments, at least one of these methods may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1500 via the ROM 1502 and/or the communication unit 1509. When the computer program is loaded into the RAM 1503 and executed by the computing unit 1501, one or more steps of at least one of the above methods may be performed. Alternatively, in other embodiments, the computing unit 1501 may be configured to perform at least one of these methods by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in the cloud computing service system that remedies the drawbacks of traditional physical hosts and virtual private server (VPS) services, namely high management difficulty and weak service scalability. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A method of generating a sample image pair, comprising:
determining the position information of a target object included in a target live-action image based on first map information corresponding to the target live-action image;
generating a first raster image for the target live-action image based on the first map information and the position information of the target object included in the target live-action image, the first raster image indicating the position of the target object included in the target live-action image;
determining a random class combination of target objects included in the target live-action image based on predetermined classes;
determining the position information of a deleted object aiming at the target live-action image based on the position information of the target object included in the target live-action image; and
adjusting the first raster image based on the position information of the deleted object, the position information of the target object included in the target live-action image and the random class combination to obtain a first image to be updated for the target live-action image, and adding a label to an image pair formed by the first image to be updated and the target live-action image, wherein the first image to be updated indicates the position of the deleted object in the target live-action image and the position of the target object of the unchanged class in the target live-action image,
wherein the tag indicates actual update information of the target live-action image relative to the first image to be updated; the predetermined categories include an add category and an unchanged category.
2. The method of claim 1, wherein determining the position information of the deleted object comprises:
determining a first region in the first raster image based on predetermined position distribution information as a candidate region, the first region including a region having a distribution density greater than a predetermined density determined based on the position distribution information; and
determining position information of the deletion object based on other areas than a second area among the candidate areas, the second area including an area indicating a position of a target object included in the target live-action image,
wherein the predetermined position distribution information is determined according to position information of a target object included in the plurality of target live-action images; the sizes of the plurality of target live-action images are equal to each other.
3. The method of claim 2, wherein the determining the position information of the deletion object based on the other areas than the second area among the candidate areas comprises:
determining a target size based on the size of the second region and the number of target objects included in the target live-action image; and
and determining any area with the size equal to the target size in the other areas to obtain the position information of the deleted object.
4. The method of claim 1, further comprising:
and obtaining an image meeting the constraint condition of the preset position from a live-action image library to obtain the target live-action image.
5. The method of claim 1, wherein the target live-action image comprises a target frame in a live-action video; the method further includes, for other frames in the live-action video than the target frame:
determining position information of a target object included in the other frame based on second map information corresponding to the other frame;
determining the category of the target object included in the other frames based on the category of the target object included in the target frame;
determining three-dimensional space information of the deleted object based on the depth of the target object included in the target frame and the position information of the deleted object; and
generating a second image to be updated for the other frames based on the three-dimensional space information and the position information and the category of the target object included in the other frames, and adding tags to image pairs formed by the second image to be updated and the other frames.
6. The method of claim 5, further comprising:
generating a second raster image for the other frame based on the second map information and the position information of the target object included in the other frame, the second raster image indicating the position of the target object included in the other frame;
wherein generating a second image to be updated corresponding to the other frame comprises: based on the three-dimensional space information and the category of the target object included in the other frames, the second raster image is adjusted to obtain the second image to be updated,
wherein the second image to be updated indicates a position of the deletion object in the other frame and a position of the target object of the unchanged class in the other frame.
7. The method of claim 1, further comprising:
and determining first map information corresponding to the target live-action image in an offline map based on the pose of the image acquisition device for the target live-action image.
8. An apparatus for generating a sample image pair, comprising:
the first position determining module is used for determining the position information of a target object included in the target live-action image based on first map information corresponding to the target live-action image;
a second image generation module configured to generate a first raster image for the target live-action image based on the first map information and position information of a target object included in the target live-action image, the first raster image indicating a position of the target object included in the target live-action image;
a category combination determining module, configured to determine a random category combination of the target objects included in the target live-action image based on a predetermined category;
the second position determining module is used for determining the position information of the deleted object aiming at the target live-action image based on the position information of the target object included in the target live-action image;
the first image generation module is used for adjusting the first raster image based on the position information of the deleted object, the position information of the target object included in the target live-action image and the random class combination to obtain a first image to be updated for the target live-action image, wherein the first image to be updated indicates the position of the deleted object in the target live-action image and the position of the target object of the unchanged class in the target live-action image; and
a first label adding module for adding labels to image pairs formed by the first image to be updated and the target live-action image,
wherein the tag indicates actual update information of the target live-action image relative to the first image to be updated; the predetermined categories include an add category and an unchanged category.
9. The apparatus of claim 8, wherein the second location determination module comprises:
a candidate region determination submodule configured to determine, based on predetermined position distribution information, a first region in the first raster image as a candidate region, the first region including a region whose distribution density determined based on the position distribution information is greater than a predetermined density; and
a position determination sub-module for determining position information of the deletion object based on other areas than a second area among the candidate areas, the second area including an area indicating a position of a target object included in the target live-action image,
wherein the predetermined position distribution information is determined according to position information of a target object included in the plurality of target live-action images; the sizes of the plurality of target live-action images are equal to each other.
10. The apparatus of claim 9, wherein the location determination submodule comprises:
a size determining unit configured to determine a target size based on a size of the second area and the number of target objects included in the target live-action image; and
and a position determining unit, configured to determine any area with a size equal to the target size in the other areas, and obtain position information of the deletion object.
11. The apparatus of claim 8, further comprising:
the image obtaining module is used for obtaining images meeting the constraint conditions of the preset positions from the live-action image library to obtain the target live-action image.
12. The apparatus of claim 8, wherein the target live-action image comprises a target frame in a live-action video; the apparatus further comprises:
a third position determining module, configured to determine, for other frames than the target frame in the live-action video, position information of a target object included in the other frames based on second map information corresponding to the other frames;
the category determining module is used for determining the category of the target object included in the other frames based on the category of the target object included in the target frame;
a spatial information determining module, configured to determine three-dimensional spatial information of the deletion object based on a depth of a target object included in the target frame and position information of the deletion object;
a third image generating module, configured to generate a second image to be updated for the other frame based on the three-dimensional spatial information and the position information and the category of the target object included in the other frame; and
and a second label adding module, configured to add labels to image pairs composed of the second image to be updated and the other frames.
13. The apparatus of claim 12, further comprising:
a fourth image generation module configured to generate a second raster image for the other frame based on the second map information and position information of the target object included in the other frame, the second raster image indicating a position of the target object included in the other frame;
The third image generation module is specifically configured to: based on the three-dimensional space information and the category of the target object included in the other frames, the second raster image is adjusted to obtain the second image to be updated,
wherein the second image to be updated indicates a position of the deletion object in the other frame and a position of the target object of the unchanged class in the other frame.
14. The apparatus of claim 8, further comprising:
and the map information determining module is used for determining first map information corresponding to the target live-action image in an offline map based on the pose of the image acquisition device for the target live-action image.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
18. A method of updating a high-precision map, comprising:
determining an image corresponding to the acquired live-action image in the high-precision map to obtain a third image to be updated;
inputting the acquired live-action image into a first feature extraction network of a target detection model to obtain first feature data;
inputting the third image to be updated into a second feature extraction network of the target detection model to obtain second feature data;
inputting the first characteristic data and the second characteristic data into a target detection network of the target detection model to obtain update information of the acquired live-action image relative to the third image to be updated; and
updating the high-precision map based on the update information,
wherein the object detection model is trained based on pairs of sample images generated by the method of any one of claims 1 to 7.
CN202110793044.4A 2021-07-13 2021-07-13 Method and device for generating sample image pair and method for updating high-precision map Active CN113514053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110793044.4A CN113514053B (en) 2021-07-13 2021-07-13 Method and device for generating sample image pair and method for updating high-precision map

Publications (2)

Publication Number Publication Date
CN113514053A (en) 2021-10-19
CN113514053B (en) 2024-03-26

Family

ID=78066711

Country Status (1)

Country Link
CN (1) CN113514053B (en)


Also Published As

Publication number Publication date
CN113514053A (en) 2021-10-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant