CN113392861A - Model training method, map drawing method, device, computer device and medium - Google Patents

Model training method, map drawing method, device, computer device and medium

Info

Publication number: CN113392861A
Authority: CN (China)
Prior art keywords: network, mask, training, picture, prediction result
Legal status: Pending
Application number: CN202010169738.6A
Other languages: Chinese (zh)
Inventor: 杨恒
Current Assignee: Beijing Jingdong Qianshi Technology Co Ltd
Original Assignee: Beijing Jingdong Qianshi Technology Co Ltd
Application filed by Beijing Jingdong Qianshi Technology Co Ltd
Priority to CN202010169738.6A
Publication of CN113392861A

Classifications

    • G06F 18/214: Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Pattern recognition; Analysing; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06T 11/203: 2D [Two Dimensional] image generation; Drawing from basic elements, e.g. lines or circles; Drawing of straight lines or curves

Abstract

The present disclosure provides a model training method and device. The method comprises: constructing an initial network model comprising a backbone network, a region proposal network, a mask network, and an attention-based recognition network. The backbone network performs feature extraction on an input picture to obtain a feature map. The region proposal network generates candidate target boxes. The mask network obtains a mask prediction result based on the feature map and the candidate target boxes. The recognition network obtains a classification prediction result and a bounding box prediction result based on the feature map, the candidate target boxes, and the mask prediction result. A plurality of training pictures and their labels are obtained, where the label of any training picture comprises the category, the bounding box, and the mask of the target object in that picture. The initial network model is trained using the training pictures and their labels to obtain a target network model. The disclosure also provides a map drawing method and apparatus, a computer device, and a medium.

Description

Model training method, map drawing method, device, computer device and medium
Technical Field
Embodiments of the invention relate to the field of computer technology, and in particular to a model training method and device, a map drawing method and device, a computer device, and a medium.
Background
In the production process of high-precision maps, various target objects in roads need to be drawn.
In the process of implementing the invention, the inventor found that the prior art has at least the following problems: manually identifying, extracting, and drawing the various target objects from road pictures requires substantial labor cost and is extremely inefficient, so an automated alternative is urgently needed. However, in existing automatic target-object recognition schemes, the positioning of the target object's bounding box often deviates under the influence of the background area, which in turn causes recognition deviation of the target object.
Disclosure of Invention
In view of this, embodiments of the present invention provide a model training method and apparatus, a map drawing method and apparatus, a computer device, and a medium, so as to identify and segment target objects in road pictures more accurately, thereby enabling high-precision map drawing that covers various target objects.
One aspect of the embodiments of the present invention provides a model training method, including the following. In one aspect, an initial network model is constructed. The initial network model includes: a backbone network, a region proposal network, a mask network, and an attention-based recognition network. The backbone network is used for extracting features from an input picture to obtain a feature map. The region proposal network is used to generate candidate target boxes. The mask network is used for obtaining a mask prediction result based on the feature map and the candidate target boxes. The recognition network is used for obtaining a classification prediction result and a bounding box prediction result based on the feature map, the candidate target boxes, and the mask prediction result. In another aspect, a plurality of training pictures and their labels are acquired, where the label of any one of the training pictures includes: the category, the bounding box, and the mask of the target object in that training picture. Then, the initial network model is trained using the training pictures and their labels to obtain a target network model.
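The following is a minimal PyTorch sketch of one possible structure for these four components. All layer choices, channel counts, and module names are illustrative assumptions (the patent does not fix concrete layers), and the region proposal network is omitted for brevity:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Feature extraction: convolution + ReLU + pooling (illustrative depth)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
    def forward(self, x):
        return self.body(x)  # feature map

class MaskHead(nn.Module):
    """Fully convolutional mask branch: one mask channel per category."""
    def __init__(self, num_classes, in_ch=256):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1),
        )
    def forward(self, roi_feats):
        return self.fcn(roi_feats)  # (N, C, 28, 28) mask logits

class RecognitionHead(nn.Module):
    """Attention-based recognition: consumes RoI features concatenated with
    the C-channel mask confidence map (hence in_ch + num_classes channels)."""
    def __init__(self, num_classes, in_ch=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear((in_ch + num_classes) * 7 * 7, 1024), nn.ReLU(),
        )
        self.cls = nn.Linear(1024, num_classes)  # classification prediction
        self.box = nn.Linear(1024, 4)            # bounding box prediction
    def forward(self, feats_with_confidence):
        h = self.fc(feats_with_confidence)
        return self.cls(h), self.box(h)
```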
According to an embodiment of the present invention, training the initial network model using the plurality of training pictures and their labels includes: for any one of the training pictures, inputting the training picture into the backbone network so that the backbone network outputs a first feature map for the picture, and inputting the training picture into the region proposal network so that the region proposal network outputs a first candidate target box for the picture. Then, the first candidate target box is applied to the first feature map to obtain a first region-of-interest feature map. The first region-of-interest feature map is input into the mask network so that the mask network outputs a mask prediction result for the training picture. Then, a loss value of a first function is calculated based on the mask prediction result for the training picture and the label of the training picture. The parameters of at least one of the backbone network, the region proposal network, and the mask network can thus be adjusted based on the loss value of the first function until the first function converges.
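A hedged sketch of one such mask-branch training step is shown below. The `rpn` callable, the RoI and mask sizes, and the use of a per-pixel sigmoid cross-entropy as the first function are all assumptions for illustration; `torchvision.ops.roi_align` stands in for applying the candidate box to the feature map:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def mask_branch_step(backbone, rpn, mask_head, optimizer, picture, gt_mask, gt_class):
    feature_map = backbone(picture)          # first feature map for the picture
    boxes = rpn(picture)                     # first candidate target boxes, Tensor[K, 4]
    roi_feats = roi_align(feature_map, [boxes], output_size=(14, 14))
    mask_logits = mask_head(roi_feats)       # mask prediction, (N, C, 28, 28)
    # first function (assumed): per-pixel sigmoid cross-entropy on the
    # channel matching the labelled class; gt_mask has shape (N, 28, 28)
    loss = F.binary_cross_entropy_with_logits(mask_logits[:, gt_class], gt_mask)
    optimizer.zero_grad()
    loss.backward()                          # adjusts backbone/RPN/mask parameters
    optimizer.step()
    return loss.item()
```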
According to an embodiment of the present invention, the mask prediction result for any training picture comprises a plurality of mask prediction results. Calculating the loss value of the first function based on the mask prediction result and the label of the training picture includes: determining the category and the mask of the training picture according to its label, and then calculating, based on the mask of the training picture and the first function, the loss value of the mask prediction result corresponding to the category of the training picture.
According to an embodiment of the present invention, training the initial network model using the plurality of training pictures and their labels further includes: calculating the respective loss values of the plurality of mask prediction results based on the mask of the training picture, the plurality of mask prediction results for the picture, and the first function. A confidence map is then obtained by conversion from these loss values. The confidence map and the first region-of-interest feature map are input to the recognition network, so that the recognition network outputs a classification prediction result and a bounding box prediction result for the training picture. Next, a loss value of a second function is calculated based on the classification prediction result, the bounding box prediction result, and the label of the training picture, and the parameters of the recognition network are adjusted based on the loss value of the second function until the second function converges.
According to an embodiment of the present invention, converting the respective loss values of the plurality of mask prediction results into a confidence map includes: for any one of the mask prediction results, negating the loss value of each pixel and adding 1, to obtain the confidence value of each pixel. A confidence map for that mask prediction result is then determined based on the confidence values of its pixels.
According to an embodiment of the present invention, the method further includes: performing scale normalization on the first region-of-interest feature map before inputting it into the mask network, and/or performing scale normalization on the first region-of-interest feature map before inputting it into the recognition network.
According to an embodiment of the present invention, the method further includes: performing dimension reduction on the confidence map before inputting the confidence map and the first region-of-interest feature map into the recognition network. Inputting the confidence map and the first region-of-interest feature map into the recognition network includes: concatenating the dimension-reduced confidence map and the first region-of-interest feature map along the channel dimension to obtain a first feature map to be recognized, and inputting the first feature map to be recognized into the recognition network.
Another aspect of the embodiments of the present invention provides a map drawing method, including the following. On the one hand, a road picture is acquired. On the other hand, a target network model trained according to the model training method of any of the above embodiments is obtained. Then, the road picture is processed using the target network model, so that the mask network in the target network model outputs a mask of the target object in the road picture, and the attention-based recognition network in the target network model outputs the category and the bounding box of the target object in the road picture. The target object is then drawn in the map based on the mask, the category, and the bounding box of the target object in the road picture.
According to an embodiment of the present invention, processing the road picture using the target network model includes: inputting the road picture into the backbone network of the target network model so that the backbone network outputs a second feature map for the road picture, and inputting the road picture into the region proposal network of the target network model so that the region proposal network outputs a second candidate target box for the road picture. Then, the second feature map of the road picture is input into the mask network, so that the mask network outputs a mask prediction result for the road picture. The second candidate target box is applied to the second feature map and to the mask prediction result for the road picture, respectively, to obtain a second region-of-interest feature map and a region-of-interest mask prediction result. Then, the second region-of-interest feature map and the region-of-interest mask prediction result are input into the recognition network, so that the recognition network outputs the category and the bounding box of the target object in the road picture.
According to an embodiment of the present invention, the method further includes: performing scale normalization on the second region-of-interest feature map and the region-of-interest mask prediction result, respectively, before inputting them into the recognition network. Inputting the second region-of-interest feature map and the region-of-interest mask prediction result into the recognition network includes: concatenating the scale-normalized second region-of-interest feature map and the region-of-interest mask prediction result along the channel dimension to obtain a second feature map to be recognized for the road picture, and then inputting the second feature map to be recognized into the recognition network, so that the recognition network, combined with a non-maximum suppression algorithm, outputs the category and the bounding box of the target object.
According to an embodiment of the present invention, the method further includes: based on the category of the target object in the road picture, determining the mask prediction result corresponding to that category from the mask prediction results for the road picture, to serve as the mask of the target object in the road picture.
According to an embodiment of the present invention, the target object includes: a lane marking line.
Another aspect of the embodiments of the present invention provides a model training apparatus, including: a construction module, a sample acquisition module, and a training module. The construction module is used for constructing an initial network model. The initial network model includes: a backbone network, a region proposal network, a mask network, and an attention-based recognition network. The backbone network is used for extracting features from an input picture to obtain a feature map. The region proposal network is used to generate candidate target boxes. The mask network is used for obtaining a mask prediction result based on the feature map and the candidate target boxes. The recognition network is used for obtaining a classification prediction result and a bounding box prediction result based on the feature map, the candidate target boxes, and the mask prediction result. The sample acquisition module is used for acquiring a plurality of training pictures and their labels, where the label of any training picture includes: the category, the bounding box, and the mask of the target object in that training picture. The training module is used for training the initial network model using the training pictures and their labels to obtain a target network model.
Another aspect of an embodiment of the present invention provides a map drawing apparatus, including: a first acquisition module, a second acquisition module, a processing module, and a drawing module. The first acquisition module is used for acquiring a road picture. The second acquisition module is configured to obtain a target network model trained according to the model training method described in any of the above embodiments. The processing module is used for processing the road picture using the target network model, so that: the mask network in the target network model outputs a mask of the target object in the road picture, and the attention-based recognition network in the target network model outputs the category and the bounding box of the target object in the road picture. The drawing module is used for drawing the target object in the map based on the mask, the category, and the bounding box of the target object in the road picture.
Another aspect of the embodiments of the present invention provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
Another aspect of embodiments of the present invention provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of embodiments of the present invention provides a computer program comprising computer executable instructions for implementing a method as described above when executed.
According to embodiments of the invention, the target network model is obtained by supervised training of the initial network model. The initial model is composed of a backbone network, a region proposal network, a mask network, and an attention-based recognition network, and includes three output branches: the mask network's output branch for the mask prediction result, and the recognition network's output branches for the classification prediction result and the bounding box prediction result. An attention mechanism is introduced into the recognition network, and the output of the mask network is taken as part of the input of the recognition network, so that the recognition network focuses more on the spatial position distribution of the target object when positioning it. A target network model with better performance can thus be obtained through training, positioning deviation of the target object is avoided, and the recognition deviation of the target object is reduced.
Drawings
The above and other objects, features and advantages of the embodiments of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an exemplary system architecture to which the model training method and apparatus and the map drawing method and apparatus according to embodiments of the invention may be applied;
FIG. 2 schematically shows a flow diagram of a model training method according to an embodiment of the invention;
FIG. 3 schematically illustrates an example flowchart of operation S230 of FIG. 2, in accordance with an embodiment of the present invention;
FIG. 4 schematically illustrates an example architectural diagram of an initial network model in accordance with an embodiment of this invention;
FIG. 5 schematically illustrates a flow chart of a map drawing method according to an embodiment of the invention;
FIG. 6 schematically illustrates an example flowchart of operation S530 in FIG. 5, in accordance with an embodiment of the present invention;
FIG. 7 schematically illustrates an example structure of a road picture according to an embodiment of this invention;
FIG. 8 schematically shows a block diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 9 schematically shows a block diagram of a map drawing apparatus according to an embodiment of the present invention; and
FIG. 10 schematically shows a block diagram of a computer device according to an embodiment of the invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of embodiments of the present invention. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the embodiments of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of embodiments of the invention. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Embodiments of the invention provide a model training method and device, a map drawing method and device, a computer device, and a medium. The model training method may comprise a construction process, a sample acquisition process, and a training process. In the construction process, an initial network model is constructed. The initial network model may include: a backbone network, a region proposal network, a mask network, and an attention-based recognition network. The backbone network is used for extracting features from an input picture to obtain a feature map. The region proposal network is used to generate candidate target boxes. The mask network is used for obtaining a mask prediction result based on the feature map and the candidate target boxes. The recognition network is used for obtaining a classification prediction result and a bounding box prediction result based on the feature map, the candidate target boxes, and the mask prediction result. In the sample acquisition process, a plurality of training pictures and their respective labels are acquired, where the label of any training picture includes: the category, the bounding box, and the mask of the target object in that training picture. Then, the constructed initial network model is trained using the acquired training pictures and their labels to obtain a target network model.
In the production process of high-precision maps, various target objects in roads need to be drawn. Manually identifying, extracting, and drawing the various target objects from road pictures requires substantial labor cost and is extremely inefficient, so an automated alternative is urgently needed. In one processing mode, the target object is identified and extracted based on object detection techniques, such as R-CNN (Region-based Convolutional Neural Network), Fast R-CNN, Faster R-CNN, and the like. In another processing mode, the target object is identified and extracted based on semantic segmentation techniques, such as SCNN (Spatial CNN). In yet another processing mode, the target object is identified and extracted based on instance segmentation techniques, such as Mask R-CNN.
However, object detection techniques can only extract the position of the target object's bounding box, which does not meet higher-precision positioning requirements; for example, the specific contour and position of the target object cannot be located finely. Semantic segmentation techniques can generally meet the requirement of pixel-level positioning and segmentation of the target object, but suffer from discontinuity and, owing to scale problems, insufficiently fine segmentation of thin target objects. Instance segmentation techniques can determine the position of the target object's bounding box and then extract and segment the target object inside it. However, when determining the bounding box of an elongated target object, much of the content inside the box is background area; under the influence of the background area, the positioning of the bounding box deviates, which in turn causes recognition deviation of the target object. Owing to these problems in the related art, the various types of target objects in road pictures cannot be accurately identified and extracted, which affects the drawing of high-precision maps.
Embodiments of the invention therefore provide a model training method and device and a map drawing method and device. With the model training method of the embodiment of the invention, a target network model that can accurately identify and extract various target objects can be obtained through training. With the map drawing method of the embodiment of the invention, various target objects can be identified and extracted more accurately from road pictures using the target network model and drawn in the map, so that a higher-precision map can be produced.
FIG. 1 schematically illustrates an exemplary system architecture 100 to which the model training method and apparatus and the mapping method and apparatus may be applied, according to embodiments of the invention. It should be noted that fig. 1 is only an example of a system architecture to which the embodiment of the present invention may be applied, so as to help those skilled in the art understand the technical content of the embodiment of the present invention, and it does not mean that the embodiment of the present invention may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, a system architecture 100 according to an embodiment of the present invention may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal apparatuses 101, 102, 103 communicate with the server 105 through the network 104 to receive or transmit messages and the like. The terminal devices 101, 102, 103 may have installed thereon client applications having various functions, such as navigation-type applications, music-type applications, shopping-type applications, web browser applications, search-type applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices including, but not limited to, car navigation, smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server providing support for various client applications in the terminal devices 101, 102, 103. The background management server may receive the request message sent by the terminal device 101, 102, 103, perform a response such as analysis processing on the received request message, and feed back a response result (for example, a web page, information, or data generated according to the request message or the like) to the terminal device 101, 102, 103, where the terminal device 101, 102, 103 outputs the response result to the user.
It should be noted that the model training method according to the embodiment of the present invention may be implemented in the terminal devices 101, 102, and 103, and accordingly, the model training apparatus according to the embodiment of the present invention may be disposed in the terminal devices 101, 102, and 103. Alternatively, the model training method according to the embodiment of the present invention may be implemented in the server 105, and accordingly, the model training apparatus according to the embodiment of the present invention may be provided in the server 105. Alternatively, the model training method according to the embodiment of the present invention may be implemented in other computer devices capable of communicating with the terminal devices 101, 102, 103 and/or the server 105, and accordingly, the model training apparatus according to the embodiment of the present invention may be provided in other computer devices capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
The mapping method according to the embodiment of the present invention may be implemented in the terminal devices 101, 102, 103, and accordingly, the mapping apparatus according to the embodiment of the present invention may be provided in the terminal devices 101, 102, 103. Alternatively, the map drawing method according to the embodiment of the present invention may be implemented in the server 105, and accordingly, the map drawing apparatus according to the embodiment of the present invention may be provided in the server 105. Alternatively, the mapping method according to the embodiment of the present invention may be implemented in other computer devices that can communicate with the terminal devices 101, 102, 103 and/or the server 105, and accordingly, the mapping apparatus according to the embodiment of the present invention may be provided in other computer devices that can communicate with the terminal devices 101, 102, 103 and/or the server 105.
The model training method according to the embodiment of the present invention and the mapping method according to the embodiment of the present invention may be implemented in the same computer device, or may be implemented in different computer devices, which is not limited herein.
It should be understood that the number and types of terminal devices, networks, and servers in fig. 1 are merely illustrative. There may be any number and any type of terminal devices, networks, and servers, depending on the actual needs.
According to an embodiment of the present invention, a model training method is provided. The method is illustrated by the figure below. It should be noted that the sequence numbers of the respective operations in the following methods are merely used as representations of the operations for description, and should not be construed as representing the execution order of the respective operations. The method need not be performed in the exact order shown, unless explicitly stated.
FIG. 2 schematically shows a flow diagram of a model training method according to an embodiment of the invention.
As shown in fig. 2, the method may include operations S210 to S230.
In operation S210, an initial network model is constructed.
Illustratively, the constructed initial network model may include: a backbone network, a region proposal network (RPN), a mask network, and an attention-mechanism-based recognition network. The backbone network is used for performing feature extraction on an input picture to obtain a feature map. The region proposal network is used to generate candidate target boxes (proposals), which characterize recommendations for the bounding box of the target object. The mask network is used for obtaining a mask prediction result based on the obtained feature map and the candidate target boxes. The recognition network is used for obtaining a classification prediction result and a bounding box prediction result based on the obtained feature map, the candidate target boxes, and the mask prediction result. In this example, the classification prediction result represents a prediction of the category of the target object, the bounding box prediction result represents a prediction of the position of the target object's bounding box, and the mask prediction result represents a prediction of the target object's mask; the mask reflects the spatial position distribution of the target object more finely, for example through binary coding.
Then, in operation S220, a plurality of training pictures and the labels of the plurality of training pictures are acquired.
Wherein the label of any one of the acquired training pictures includes: the class, the bounding box, and the mask of the target object in that training picture.
Next, in operation S230, the initial network model is trained using the training pictures and the labels of the training pictures, so as to obtain a target network model.
Those skilled in the art will appreciate that the model training method according to the embodiment of the present invention obtains the target network model through supervised training of the initial network model. The initial model is composed of a backbone network, a region proposal network, a mask network, and an attention-based recognition network, and includes three output branches: the mask network's output branch for the mask prediction result, and the recognition network's output branches for the classification prediction result and the bounding box prediction result. An attention mechanism is introduced into the recognition network, and the output of the mask network is taken as part of the input of the recognition network, so that the recognition network focuses more on the spatial position distribution of the target object when positioning it. A target network model with better performance can thus be obtained through training, positioning deviation of the target object is avoided, and the recognition deviation of the target object is reduced.
Fig. 3 schematically illustrates an example flowchart of operation S230 of fig. 2, according to an embodiment of the present invention.
As shown in fig. 3, the training of the initial network model by using the plurality of training pictures and the labels of the plurality of training pictures in operation S230 may include the following operations.
In operation S2301, for any one of the plurality of training pictures, the training picture is input to the backbone network, so that the backbone network outputs a first feature map for that picture.
Illustratively, the backbone network may be a convolutional neural network, e.g., consisting of at least one convolution layer, at least one activation (ReLU) layer, and at least one pooling layer. The backbone network extracts the first feature map of the training picture, which is used by the subsequent region proposal network.
Then, in operation S2302, the training picture is input to the region proposal network, so that the region proposal network outputs a first candidate target box for the picture.
Then, in operation S2303, the first candidate target box is applied to the first feature map to obtain a first region-of-interest (RoI) feature map.
In operation S2304, the first region-of-interest feature map is input to the mask network, so that the mask network outputs a mask prediction result for the training picture.
Illustratively, the mask network may be a fully convolutional network (FCN), which acts on the region-of-interest feature map to predict a pixel-level mask of the corresponding target object.
Next, in operation S2305, a loss value of the first function is calculated based on the mask prediction result for the training picture and the label of that picture.
Illustratively, the first function may be a sigmoid cross-entropy loss function or the like, which measures the difference between the mask prediction result output by the mask network and the actual mask of the target object in the training picture.
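A minimal sketch of this loss, assuming a per-pixel sigmoid cross-entropy on 28 × 28 mask logits (shapes and names are illustrative; note that the confidence conversion described later assumes the per-pixel loss values lie in [0, 1], so in practice they would need to be bounded, e.g. clamped):

```python
import torch
import torch.nn.functional as F

mask_logits = torch.randn(28, 28)                # predicted mask logits for one RoI
gt_mask = torch.randint(0, 2, (28, 28)).float()  # binary ground-truth mask from the label

per_pixel = F.binary_cross_entropy_with_logits(
    mask_logits, gt_mask, reduction="none")      # keep per-pixel loss values
loss_value = per_pixel.mean()                    # scalar value of the first function
```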
Next, in operation S2306, based on the loss value of the first function, the parameters of at least one of the backbone network, the region proposal network, and the mask network are adjusted until the first function converges.
It can be understood that, through iterative optimization of the network parameters, when the first function converges, the difference between the mask prediction result output by the mask network and the actual mask of the target object in the training picture tends to a minimum, so the target network model can predict the spatial position distribution of the target object more accurately.
According to an embodiment of the present invention, the mask prediction result for any training picture comprises a plurality of mask prediction results for that picture. On this basis, calculating the loss value of the first function based on the mask prediction result and the label of the training picture includes: determining the actual category and the actual mask of the training picture according to its label, and selecting, from the plurality of mask prediction results for the picture, the mask prediction result corresponding to the category of the picture. Then, based on the actual mask of the training picture and the first function, the loss value of the mask prediction result corresponding to that category is calculated, to measure the difference between the actual mask and the mask prediction result of the corresponding category. In this process, only the loss of the mask prediction result of the matching category needs to be considered, which can improve model training efficiency while maintaining training quality.
For example, the mask network outputs C mask prediction results M1 to MC for a training picture, where C is an integer greater than 1 and represents the total number of target-object categories across all training pictures. If the category of the target object in the training picture is the i-th category, where i is an integer greater than or equal to 1 and less than or equal to C, the mask prediction result Mi is selected and compared with the actual mask of the target object annotated in the label to determine the loss value of the first function.
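A sketch of this class-indexed selection (sizes are illustrative; the 1-based category index follows the text above):

```python
import torch

def select_class_mask(mask_preds: torch.Tensor, i: int) -> torch.Tensor:
    """Pick Mi, the prediction for the labelled i-th category (1-based i).

    mask_preds: (C, n, n) stack of per-class mask predictions M1..MC.
    """
    return mask_preds[i - 1]

C, n = 5, 28                       # illustrative sizes
masks = torch.rand(C, n, n)        # M1..MC from the mask network
m_i = select_class_mask(masks, 2)  # only M2 enters the first-function loss
```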
With continued reference to fig. 3, according to the embodiment of the present invention, the training of the initial network model in operation S230 may include not only the above-mentioned operations S2301 to S2306, but also the following operations, by using the plurality of training pictures and the labels of the plurality of training pictures.
In operation S2307, the respective loss values of the plurality of mask prediction results are calculated based on the mask of the training picture, the plurality of mask prediction results for the picture, and the first function.
In operation S2308, a confidence map is obtained by conversion from the respective loss values of the plurality of mask prediction results.
Then, in operation S2309, the obtained confidence map and the first region-of-interest feature map are input to the recognition network, so that the recognition network outputs the classification prediction result and the bounding box prediction result for the training picture.
Next, in operation S2310, a loss value of the second function is calculated based on the classification prediction result for the training picture, the bounding box prediction result, and the label of the training picture.
For example, the second function may measure the difference between the classification prediction result output by the recognition network and the actual category of the target object in the training picture, and the difference between the bounding box prediction result output by the recognition network and the actual bounding box of the target object.
Next, in operation S2311, the parameters of the recognition network are adjusted based on the loss value of the second function until the second function converges.
It can be understood that the input of the recognition network includes the confidence map converted from the mask prediction results and the region-of-interest feature map of the training picture, so the recognition network's processing focuses more on the spatial position distribution of the target object in the training picture. Through iterative optimization of the network parameters, when the second function converges, the difference between the classification prediction result output by the recognition network and the actual category of the target object, and the difference between the bounding box prediction result and the actual bounding box, both tend to a minimum. The target network model can therefore predict the position and category of the target object's bounding box more accurately, avoiding positioning deviation.
According to an embodiment of the present invention, converting the respective loss values of the plurality of mask prediction results into a confidence map includes: for any one of the mask prediction results, negating the loss value of each pixel and adding 1, to obtain the confidence value of each pixel. A confidence map for that mask prediction result is then determined based on the confidence values of its pixels.
For example, the mask prediction results are binary-coded, and the mask network outputs C mask prediction results. Combined with the first function, an error loss value of dimension n × n × C can be obtained, with values in the interval [0, 1], where n is a positive integer and n × n characterizes the pixel resolution. Each loss value is negated and 1 is added, converting the loss values into confidence values (confidence = 1 - loss); the higher the confidence value, the higher the prediction accuracy. A confidence map of dimension n × n × C is thus obtained.
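A sketch of this conversion, assuming per-pixel loss values already lie in [0, 1] (sizes are illustrative):

```python
import torch

n, C = 28, 5                     # illustrative resolution and class count
loss_map = torch.rand(n, n, C)   # per-pixel loss values in [0, 1]
confidence_map = 1.0 - loss_map  # negate and add 1: higher value = more reliable pixel
assert confidence_map.shape == (n, n, C)
```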
Further, according to an embodiment of the present invention, the model training method may further include: performing scale normalization on the first region-of-interest feature map before inputting it into the mask network, and/or performing scale normalization on the first region-of-interest feature map before inputting it into the recognition network.
In this embodiment, the scale normalization of the region-of-interest feature map may be referred to as region-of-interest alignment (RoIAlign). Note that region-of-interest alignment differs from the region-of-interest pooling (RoIPool) in Fast R-CNN. RoIPool maps the region of interest to the corresponding position on the feature map according to the originally input picture, and quantization (rounding to integer coordinates) is required during the mapping; the region is then divided into parts according to the set output size, quantizing again if fractional boundaries arise during the division, and finally a max pooling operation is performed on each part. It can be understood that the multiple quantizations in RoIPool lose a large amount of spatial position information, which introduces errors into the pixel-level mask prediction result. Therefore, this embodiment adopts region-of-interest alignment, which performs no quantization; instead, after the number of sampling points is set, bilinear interpolation is performed at the sampling points to obtain the value of the corresponding part. Pixel-to-pixel alignment can thus be achieved without loss of spatial position information.
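torchvision ships an implementation of this alignment scheme; the following usage sketch (feature sizes and box coordinates are assumptions) shows fractional box coordinates being handled by bilinear interpolation rather than rounding:

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)           # (batch, channels, H, W)
boxes = torch.tensor([[0, 4.3, 4.7, 20.9, 30.1]])   # (batch_idx, x1, y1, x2, y2), kept fractional
roi_feats = roi_align(feature_map, boxes, output_size=(7, 7),
                      spatial_scale=1.0, sampling_ratio=2, aligned=True)
print(roi_feats.shape)                              # torch.Size([1, 256, 7, 7])
```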
According to an embodiment of the present invention, the model training method may further include: performing dimension reduction on the confidence map before inputting the confidence map and the first region-of-interest feature map into the recognition network. Inputting the confidence map and the first region-of-interest feature map into the recognition network includes: concatenating the dimension-reduced confidence map and the first region-of-interest feature map along the channel dimension to obtain a first feature map to be recognized, and inputting the first feature map to be recognized into the recognition network.
Fig. 4 schematically shows an example structure diagram of an initial network model according to an embodiment of the invention.
As shown in fig. 4, the initial network model includes: a backbone network 410, a region proposal network (not shown), a mask network 420, and an attention-based recognition network 430. A training picture is input to the backbone network 410, which extracts the corresponding feature map. The candidate target box generated by the region proposal network is applied to the feature map to obtain a first region-of-interest feature map. The first region-of-interest feature map is scale-normalized by a region-of-interest alignment operation and then input to the mask network 420 and the recognition network 430, respectively. The mask network 420 is, for example, a fully convolutional network, and outputs C mask prediction results. In this example, combining the first function yields an error loss value of dimension 28 × 28 × C, with values in the interval [0, 1]. Each loss value is negated and 1 is added, converting the loss values into confidence values; the higher the confidence value, the higher the prediction accuracy. A confidence map of dimension 28 × 28 × C is thus obtained.
The confidence map is then used to train the attention-based recognition network 430. The 28 × 28 × C confidence map can be reduced by a max pooling operation to an updated confidence map of dimension 7 × 7 × C. The updated confidence map and the scale-normalized first region-of-interest feature map are concatenated along the channel dimension to obtain the first feature map to be recognized. If the size of the scale-normalized first region-of-interest feature map is 7 × 7 × 256, the size of the first feature map to be recognized is 7 × 7 × (256 + C). The first feature map to be recognized is input to the recognition network 430 for training, so that the recognition network outputs a classification prediction result and a bounding box prediction result for the training picture.
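A sketch of this attention step with the concrete sizes above, in NCHW layout (the class count is an assumption): max-pool the 28 × 28 × C confidence map down to 7 × 7 × C, then concatenate it with the 7 × 7 × 256 RoI feature map along the channel dimension.

```python
import torch
import torch.nn.functional as F

C = 5                                          # illustrative class count
confidence = torch.rand(1, C, 28, 28)          # confidence map from the mask branch
roi_feats = torch.rand(1, 256, 7, 7)           # scale-normalized RoI feature map

confidence_small = F.max_pool2d(confidence, kernel_size=4)      # -> (1, C, 7, 7)
to_recognize = torch.cat([confidence_small, roi_feats], dim=1)  # channel concat
print(to_recognize.shape)                      # torch.Size([1, 261, 7, 7]) = 256 + C
```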
According to an embodiment of the present invention, a map drawing method is provided. The method is illustrated by the figure below. It should be noted that the sequence numbers of the respective operations in the following methods are merely used as representations of the operations for description, and should not be construed as representing the execution order of the respective operations. The method need not be performed in the exact order shown, unless explicitly stated.
Fig. 5 schematically shows a flow chart of a mapping method according to an embodiment of the present invention.
As shown in fig. 5, the method may include operations S510 to S540.
In operation S510, a road picture is acquired.
In operation S520, a target network model is acquired.
The target network model is obtained by training according to the model training method of any one of the above-mentioned embodiments.
Then, in operation S530, the road picture is processed using the target network model, so that: the mask network in the target network model outputs a mask of the target object in the road picture, and the attention-based recognition network in the target network model outputs the category and the bounding box of the target object in the road picture.
Next, in operation S540, the target object is drawn in the map based on the mask, the category, and the bounding box of the target object in the road picture.
As described above, the model training method according to the embodiment of the present invention can train a target network model with better performance, avoiding positioning deviation of the target object and reducing its recognition deviation. With the map drawing method of the embodiment of the invention, the road picture is processed using the trained target network model, so the mask, the category, and the bounding box of the target object in the road picture are determined more accurately and positioning offset of the bounding box is avoided. The method can extract and segment target objects of various types and shapes more accurately, which facilitates drawing the various target objects in a high-precision map.
Fig. 6 schematically illustrates an example flowchart of operation S530 in fig. 5, according to an embodiment of the present invention.
As shown in fig. 6, the process of processing the road picture using the target network model in operation S530 may include the following operations.
In operation S5301, the road picture is input to the backbone network of the target network model, so that the backbone network outputs the second feature map for the road picture.
In operation S5302, the road picture is input to the region proposal network of the target network model, so that the region proposal network outputs a second candidate target box for the road picture.
Then, in operation S5303, the second feature map of the road picture is input to the mask network, so that the mask network outputs a mask prediction result for the road picture.
In operation S5304, the second candidate target box is applied to the second feature map and to the mask prediction result for the road picture, respectively, to obtain a second region-of-interest feature map and a region-of-interest mask prediction result.
Next, in operation S5305, the second region-of-interest feature map and the region-of-interest mask prediction result are input to the recognition network, so that the recognition network outputs the category and the bounding box of the target object in the road picture.
According to an embodiment of the present invention, the map drawing method further includes: performing scale normalization on the second region-of-interest feature map and the region-of-interest mask prediction result, respectively, before inputting them into the recognition network. For example, the scale normalization of the second region-of-interest feature map and the region-of-interest mask prediction result follows the same principle as the scale normalization of the first region of interest described above and is not repeated here. On this basis, inputting the second region-of-interest feature map and the region-of-interest mask prediction result into the recognition network includes: concatenating the scale-normalized second region-of-interest feature map and the region-of-interest mask prediction result along the channel dimension to obtain a second feature map to be recognized for the road picture. Then, the second feature map to be recognized is input into the recognition network, so that the recognition network, combined with a non-maximum suppression algorithm, outputs the category and the bounding box of the target object.
According to an embodiment of the present invention, the map drawing method further includes: based on the category of the target object in the road picture, determining the mask prediction result corresponding to that category from the mask prediction results for the road picture, to serve as the mask of the target object in the road picture.
For example, after the target network model is obtained through training, a road picture is input into the backbone network, which extracts the corresponding feature map. The region proposal network generates candidate target boxes. The extracted feature map is input directly to the mask network so that the mask network outputs a mask prediction result of dimension n × n × C for the road picture. The candidate target box is applied to the feature map to obtain a second region-of-interest feature map, which is scale-normalized by a region-of-interest alignment operation. The candidate target box is also applied to the mask prediction result to obtain a region-of-interest mask prediction result, which is likewise scale-normalized by a region-of-interest alignment operation.
The scale-normalized second region-of-interest feature map and the scale-normalized region-of-interest mask prediction result are concatenated along the channel dimension to obtain the second feature map to be recognized. This process is the same as concatenating the updated confidence map and the scale-normalized first region-of-interest feature map along the channel dimension, and is not repeated here. The second feature map to be recognized is processed by the recognition network, combined with a non-maximum suppression algorithm, to obtain the bounding box prediction result and the classification prediction result for the road picture; the bounding box prediction result is taken as the bounding box of the target object in the road picture, and the classification prediction result is taken as the category of the target object. Then, according to the category of the target object, the mask prediction result corresponding to that category is selected from the mask prediction results as the mask of the target object in the road picture.
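A hedged sketch of this inference-side post-processing: non-maximum suppression over the predicted bounding boxes via torchvision, then selecting the mask channel matching the predicted category. The box coordinates, scores, class indices, and IoU threshold are all assumptions for illustration:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 200.],     # (x1, y1, x2, y2) per RoI
                      [12., 11., 61., 198.],
                      [100., 40., 140., 220.]])
scores = torch.tensor([0.92, 0.85, 0.78])        # classification confidences
keep = nms(boxes, scores, iou_threshold=0.5)     # indices of surviving boxes

masks = torch.rand(3, 5, 28, 28)                 # (num_rois, C, n, n) mask predictions
pred_class = torch.tensor([1, 1, 3])             # predicted category index per RoI
for idx in keep:
    final_mask = masks[idx, pred_class[idx]]     # mask of the predicted category
```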
The target network model according to the embodiment of the invention is suited to identifying and extracting target objects of many types. Elongated target objects in road pictures are especially prone to localization deviation; for such objects, the target network model of the embodiment of the invention locates the bounding box and segments the mask more accurately.
Fig. 7 schematically shows an example of a road picture according to an embodiment of the present invention. The road picture 700 contains several types of lane marking lines; to identify and segment them, the lane marking lines are labeled as target objects when training the target network model. Processing the road picture 700 with the target network model according to the embodiment of the present invention identifies the bounding box 710, the mask 720, and the category (e.g., category 1, category 2, and category 3 in Fig. 7) of each lane marking line contained in the road picture 700, so that the lane marking lines are segmented accurately. The lane marking lines can therefore be located and drawn precisely in the map, avoiding drawing errors caused by recognition deviation.
FIG. 8 schematically shows a block diagram of a model training apparatus according to an embodiment of the present invention.
As shown in fig. 8, the model training apparatus 800 may include: a construction module 810, a sample acquisition module 820, and a training module 830.
The building module 810 is used to build an initial network model, which includes: a backbone network, a region suggestion network, a mask network, and an attention-based recognition network. The backbone network performs feature extraction on an input picture to obtain a feature map. The region suggestion network generates candidate target boxes. The mask network obtains a mask prediction result based on the feature map and the candidate target boxes. The recognition network obtains a classification prediction result and a bounding-box prediction result based on the feature map, the candidate target boxes, and the mask prediction result.
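A skeleton of such a four-part model, with every sub-network left as an injected placeholder (nothing here reflects the patent's actual layer choices), might be:

```python
import torch.nn as nn

class InitialNetworkModel(nn.Module):
    """Sketch only: each sub-network is assumed to be supplied elsewhere."""
    def __init__(self, backbone, region_suggestion_net, mask_net, recognition_net):
        super().__init__()
        self.backbone = backbone                            # picture -> feature map
        self.region_suggestion_net = region_suggestion_net  # picture -> candidate target boxes
        self.mask_net = mask_net                            # features + boxes -> mask prediction
        self.recognition_net = recognition_net              # features + boxes + masks -> class, box

    def forward(self, picture):
        feats = self.backbone(picture)
        boxes = self.region_suggestion_net(picture)
        masks = self.mask_net(feats, boxes)
        cls_pred, box_pred = self.recognition_net(feats, boxes, masks)
        return cls_pred, box_pred, masks
```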
The sample acquisition module 820 is used to acquire a plurality of training pictures and their labels. The label of any one of the acquired training pictures includes: the category, the bounding box, and the mask of the target object in that training picture.
The training module 830 is configured to train the initial network model using the plurality of training pictures and their labels to obtain a target network model.
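One hedged training iteration tying these modules together could look like this; mask_loss_fn and recog_loss_fn stand in for the first and second loss functions of the method, and the label dictionary keys are hypothetical:

```python
def training_step(model, picture, label, optimizer, mask_loss_fn, recog_loss_fn):
    # Forward pass through the four sub-networks of the sketch model above.
    feats = model.backbone(picture)
    boxes = model.region_suggestion_net(picture)          # candidate target boxes
    mask_pred = model.mask_net(feats, boxes)
    cls_pred, box_pred = model.recognition_net(feats, boxes, mask_pred)

    # Combined objective: mask loss ("first function") plus classification/box
    # loss ("second function"); both callables are assumptions for illustration.
    loss = (mask_loss_fn(mask_pred, label["mask"], label["category"])
            + recog_loss_fn(cls_pred, box_pred, label["category"], label["bounding_box"]))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```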
Fig. 9 schematically shows a block diagram of a mapping apparatus according to an embodiment of the present invention.
As shown in fig. 9, the mapping apparatus 900 may include: a first obtaining module 910, a second obtaining module 920, a processing module 930, and a rendering module 940.
The first obtaining module 910 is configured to obtain a road picture.
The second obtaining module 920 is configured to obtain a target network model obtained by training according to the model training method described in any of the above embodiments.
The processing module 930 is configured to process the road picture using the target network model, so that: the mask network in the target network model outputs the mask of the target object in the road picture, and the attention-based recognition network in the target network model outputs the category and the bounding box of the target object in the road picture.
The drawing module 940 is configured to draw the target object in the map based on the mask, the category, and the bounding box of the target object in the road picture.
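As an illustrative sketch of such a drawing step (assuming an OpenCV-based rasterization, that the mask has already been transformed from image coordinates into map coordinates, and a hypothetical per-category color palette):

```python
import cv2
import numpy as np

def draw_target_in_map(map_layer, mask, category, palette):
    # Vectorize the binary mask into contours and fill them on the map layer
    # in a per-category color. A real pipeline would first project the mask
    # from image space into map space before drawing.
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    cv2.drawContours(map_layer, contours, -1, palette[category], thickness=cv2.FILLED)
    return map_layer
```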
It should be noted that the implementations, the technical problems solved, the functions realized, and the technical effects achieved by the modules/units/sub-units in the apparatus embodiments are the same as, or similar to, those of the corresponding steps in the method embodiments, and are not repeated here.
Any number of the modules, sub-modules, units, or sub-units according to embodiments of the invention, or at least part of the functionality of any number of them, may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present invention may be split into a plurality of modules for implementation. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present invention may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or by any other reasonable means of integrating or packaging circuits in hardware or firmware, or by any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present invention may be at least partially implemented as computer program modules which, when executed, perform the corresponding functions.
For example, any of the building module 810, the sample acquisition module 820, and the training module 830 may be combined into one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to an embodiment of the present invention, at least one of the building module 810, the sample acquisition module 820, and the training module 830 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or by any other reasonable means of integrating or packaging circuits in hardware or firmware, or by any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the building module 810, the sample acquisition module 820, and the training module 830 may be at least partially implemented as a computer program module which, when executed, performs the corresponding functions.
FIG. 10 schematically shows a block diagram of a computer device suitable for implementing the model training method and/or the map drawing method described above, according to an embodiment of the present invention. The computer device shown in Fig. 10 is only an example and should not limit the functions or the scope of use of the embodiments of the present invention.
As shown in Fig. 10, a computer device 1000 according to an embodiment of the present invention includes a processor 1001, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The processor 1001 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset, and/or a special-purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)). The processor 1001 may also include onboard memory for caching purposes. The processor 1001 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present invention.
In the RAM 1003, various programs and data necessary for the operation of the device 1000 are stored. The processor 1001, the ROM 1002, and the RAM 1003 are connected to one another by a bus 1004. The processor 1001 performs various operations of the method flows according to embodiments of the present invention by executing programs in the ROM 1002 and/or the RAM 1003. Note that the programs may also be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of the method flows according to embodiments of the present invention by executing programs stored in the one or more memories.
According to an embodiment of the invention, the device 1000 may also include an input/output (I/O) interface 1005, which is likewise connected to the bus 1004. The device 1000 may also include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read therefrom is installed into the storage section 1008 as needed.
According to an embodiment of the invention, the method flow according to an embodiment of the invention may be implemented as a computer software program. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. The computer program performs the above-described functions defined in the system of the embodiment of the present invention when executed by the processor 1001. The above described systems, devices, apparatuses, modules, units, etc. may be implemented by computer program modules according to embodiments of the present invention.
An embodiment of the present invention further provides a computer-readable storage medium, which may be included in the apparatus/device/system described in the foregoing embodiment; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the model training method and/or the mapping method according to an embodiment of the present invention.
According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the invention, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present invention, a computer-readable storage medium may include the ROM 1002 and/or the RAM 1003 described above and/or one or more memories other than the ROM 1002 and the RAM 1003.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be appreciated by those skilled in the art that the various embodiments of the invention and/or the features recited in the claims may be combined in many ways, even if such combinations are not explicitly recited in the embodiments of the invention. In particular, various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims may be made without departing from the spirit and teachings of the embodiments. All such combinations and/or sub-combinations fall within the scope of embodiments of the present invention.
The embodiments of the present invention have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the embodiments of the present invention. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of embodiments of the invention is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the embodiments of the present invention, and these alternatives and modifications are intended to fall within the scope of the embodiments of the present invention.

Claims (16)

1. A model training method, comprising:
constructing an initial network model, wherein the initial network model comprises: a backbone network, a region suggestion network, a mask network, and an attention-based recognition network, wherein the backbone network is used for performing feature extraction on an input picture to obtain a feature map, the region suggestion network is used for generating a candidate target box, the mask network is used for obtaining a mask prediction result based on the feature map and the candidate target box, and the recognition network is used for obtaining a classification prediction result and a bounding-box prediction result based on the feature map, the candidate target box, and the mask prediction result;
obtaining a plurality of training pictures and labels of the plurality of training pictures, wherein the label of any one of the plurality of training pictures comprises: the category, the bounding box, and the mask of the target object in that training picture; and
training the initial network model by using the plurality of training pictures and the labels of the plurality of training pictures to obtain a target network model.
2. The method of claim 1, wherein training the initial network model using the plurality of training pictures and the labels of the plurality of training pictures comprises: for any one training picture of the plurality of training pictures,
inputting the training picture into the backbone network, so that the backbone network outputs a first feature map for the training picture;
inputting the training picture into the region suggestion network, so that the region suggestion network outputs a first candidate target box for the training picture;
applying the first candidate target box to the first feature map to obtain a first region-of-interest feature map;
inputting the first region-of-interest feature map into the mask network, so that the mask network outputs a mask prediction result for the training picture;
calculating a loss value of a first function based on the mask prediction result for the training picture and the label of the training picture; and
adjusting parameters of the backbone network, the region suggestion network, and/or the mask network based on the loss value of the first function until the first function converges.
3. The method of claim 2, wherein the mask prediction result for the training picture comprises a plurality of mask prediction results; and
wherein calculating the loss value of the first function based on the mask prediction result for the training picture and the label of the training picture comprises:
determining the category and the mask of the training picture according to the label of the training picture; and
calculating, based on the mask of the training picture and the first function, a loss value of the mask prediction result that corresponds to the category of the training picture among the plurality of mask prediction results.
4. The method of claim 3, wherein training the initial network model using the plurality of training pictures and the labels of the plurality of training pictures further comprises:
calculating a loss value of each of the plurality of mask prediction results based on the mask of the training picture, the plurality of mask prediction results, and the first function;
converting the loss values of the plurality of mask prediction results to obtain a confidence map;
inputting the confidence map and the first region-of-interest feature map into the recognition network, so that the recognition network outputs a classification prediction result and a bounding-box prediction result for the training picture;
calculating a loss value of a second function based on the classification prediction result for the training picture, the bounding-box prediction result, and the label of the training picture; and
adjusting parameters of the recognition network based on the loss value of the second function until the second function converges.
5. The method of claim 4, wherein converting the loss values of the plurality of mask prediction results to obtain the confidence map comprises:
for any one of the plurality of mask prediction results, negating the loss value of each pixel in the mask prediction result and adding 1 to obtain a confidence value for each pixel; and
determining the confidence map for the mask prediction result based on the confidence value of each pixel in the mask prediction result.
6. The method of claim 4, further comprising:
performing scale normalization on the first region-of-interest feature map before inputting it into the mask network; and/or
performing scale normalization on the first region-of-interest feature map before inputting it into the recognition network.
7. The method of claim 4, further comprising: performing dimension reduction on the confidence map before inputting the confidence map and the first region-of-interest feature map into the recognition network;
wherein inputting the confidence map and the first region-of-interest feature map into the recognition network comprises:
stacking the dimension-reduced confidence map and the first region-of-interest feature map along the channel dimension to obtain a first feature map to be recognized; and
inputting the first feature map to be recognized into the recognition network.
8. A map drawing method, comprising:
acquiring a road picture;
obtaining a target network model trained according to the model training method of any one of claims 1-7;
processing the road picture by using the target network model, so that: a mask network in the target network model outputs a mask of a target object in the road picture, and an attention-based recognition network in the target network model outputs a category and a bounding box of the target object in the road picture; and
drawing the target object in a map based on the mask, the category, and the bounding box of the target object in the road picture.
9. The method of claim 8, wherein processing the road picture by using the target network model comprises:
inputting the road picture into a backbone network of the target network model, so that the backbone network outputs a second feature map for the road picture;
inputting the road picture into a region suggestion network of the target network model, so that the region suggestion network outputs a second candidate target box for the road picture;
inputting the second feature map for the road picture into the mask network, so that the mask network outputs a mask prediction result for the road picture;
applying the second candidate target box to the second feature map and to the mask prediction result for the road picture, respectively, to obtain a second region-of-interest feature map and a region-of-interest mask prediction result; and
inputting the second region-of-interest feature map and the region-of-interest mask prediction result into the recognition network, so that the recognition network outputs the category and the bounding box of the target object in the road picture.
10. The method of claim 9, further comprising: performing scale normalization on the second region-of-interest feature map and on the region-of-interest mask prediction result, respectively, before inputting them into the recognition network;
wherein inputting the second region-of-interest feature map and the region-of-interest mask prediction result into the recognition network comprises:
stacking the scale-normalized second region-of-interest feature map and the scale-normalized region-of-interest mask prediction result along the channel dimension to obtain a second feature map to be recognized for the road picture; and
inputting the second feature map to be recognized into the recognition network, so that the recognition network outputs the category and the bounding box in combination with a non-maximum suppression algorithm.
11. The method of claim 9, further comprising:
determining, based on the category of the target object in the road picture, the mask prediction result corresponding to that category from the mask prediction results for the road picture, as the mask of the target object in the road picture.
12. The method of claim 1, wherein the target object comprises a lane marking line.
13. A model training apparatus, comprising:
a building module configured to build an initial network model, wherein the initial network model comprises: a backbone network, a region suggestion network, a mask network, and an attention-based recognition network, wherein the backbone network is used for performing feature extraction on an input picture to obtain a feature map, the region suggestion network is used for generating a candidate target box, the mask network is used for obtaining a mask prediction result based on the feature map and the candidate target box, and the recognition network is used for obtaining a classification prediction result and a bounding-box prediction result based on the feature map, the candidate target box, and the mask prediction result;
a sample acquisition module configured to acquire a plurality of training pictures and labels of the plurality of training pictures, wherein the label of any one of the plurality of training pictures comprises: the category, the bounding box, and the mask of the target object in that training picture; and
a training module configured to train the initial network model by using the plurality of training pictures and the labels of the plurality of training pictures to obtain a target network model.
14. A map drawing apparatus, comprising:
a first acquisition module configured to acquire a road picture;
a second acquisition module configured to obtain a target network model trained according to the model training method of any one of claims 1-7;
a processing module configured to process the road picture by using the target network model, so that: a mask network in the target network model outputs a mask of a target object in the road picture, and an attention-based recognition network in the target network model outputs a category and a bounding box of the target object in the road picture; and
a drawing module configured to draw the target object in a map based on the mask, the category, and the bounding box of the target object in the road picture.
15. A computer device, comprising:
a memory having computer instructions stored thereon; and
at least one processor;
wherein the at least one processor, when executing the computer instructions, implements the method of any one of claims 1-12.
16. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1-12.
CN202010169738.6A 2020-03-12 2020-03-12 Model training method, map drawing method, device, computer device and medium Pending CN113392861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010169738.6A CN113392861A (en) 2020-03-12 2020-03-12 Model training method, map drawing method, device, computer device and medium

Publications (1)

Publication Number Publication Date
CN113392861A 2021-09-14

Family

ID=77615718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010169738.6A Pending CN113392861A (en) 2020-03-12 2020-03-12 Model training method, map drawing method, device, computer device and medium

Country Status (1)

Country Link
CN (1) CN113392861A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808044A (en) * 2021-09-17 2021-12-17 北京百度网讯科技有限公司 Encryption mask determining method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination