CN115775367A - Road target detection method, detection device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115775367A
CN115775367A (application CN202211373265.7A)
Authority
CN
China
Prior art keywords
road
target
target detection
detection model
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211373265.7A
Other languages
Chinese (zh)
Inventor
杨哲
王亚军
李�瑞
王邓江
马冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wanji Iov Technology Co ltd
Original Assignee
Suzhou Wanji Iov Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wanji Iov Technology Co ltd filed Critical Suzhou Wanji Iov Technology Co ltd
Priority to CN202211373265.7A priority Critical patent/CN115775367A/en
Publication of CN115775367A publication Critical patent/CN115775367A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The application discloses a road target detection method, a detection device, an electronic device, and a storage medium. The road target detection method comprises: acquiring a current road video; and inputting the current road video into a target road target detection model to obtain corresponding road information, wherein the road information comprises at least a road target type and a road target position, and the target road target detection model is obtained based on a RepVGG model and a YOLO target detection network model. Because the target road target detection model is constructed from the RepVGG model and the YOLO target detection network model, the representation capability of the model is improved; using this model to detect road targets in the current road video therefore improves the detection accuracy.

Description

Road target detection method, detection device, electronic device and storage medium
Technical Field
The present application belongs to the field of road detection technology, and in particular, to a road target detection method, a detection apparatus, an electronic device, and a storage medium.
Background
Automobiles have become the main means of transportation in modern society and bring great convenience to people. However, as the number of automobiles increases, so do the traffic problems they cause, such as urban congestion and traffic accidents. It is therefore important to detect road targets (e.g., vehicles, pedestrians, etc.) and to perform vehicle diversion, speed limitation, and traffic-light setting according to the detection results.
At present, collected road images are mainly detected by a Faster R-CNN model (Faster Regions with CNN Features), a Mask R-CNN model, an SSD network, or a YOLO target detection model. The network structures of these models are relatively simple and cannot effectively learn feature information in complex traffic scenes, resulting in low detection accuracy when the Faster R-CNN model, the Mask R-CNN model, the SSD network, or the YOLO target detection model is used for road target detection.
Disclosure of Invention
In view of the above, embodiments of the present application provide a road target detection method, a detection apparatus, an electronic device, and a storage medium, so as to overcome or at least partially solve the above problems in the prior art.
In a first aspect, an embodiment of the present application provides a road target detection method, including: acquiring a current road video; inputting a current road video to a target road target detection model to obtain corresponding road information, wherein the road information at least comprises a road target type and a road target position, and the target road target detection model is obtained based on a RepVGG model and a YOLO target detection model.
In some optional embodiments, before inputting the current road video to the road target detection model and obtaining the corresponding road information, the road target detection method further includes: acquiring a corresponding sample set according to the historical road video, wherein the sample set at least comprises a training set; obtaining a target road target detection model according to the RepVGG model, the YOLO target detection model and the training set; determining whether the target road target detection model is converged; inputting a current road video to a target road target detection model to obtain corresponding road information, wherein the method comprises the following steps: and when the target road target detection model is determined to be converged, inputting the current road video to the target road target detection model to obtain corresponding road information.
In some optional embodiments, obtaining a target road target detection model according to the RepVGG model, the YOLO target detection model, and the training set includes: fusing the RepVGG structure in the RepVGG model, the CSP structure of the YOLO target detection model, and the SPP structure of the YOLO target detection model to obtain an initial road target detection model; and inputting the training set into the initial road target detection model for training to obtain the target road target detection model.
In some optional embodiments, before inputting the training set into the initial road target detection model for training to obtain the target road target detection model, the road target detection method further includes: fusing a Short-Term Dense Concatenate (STDC) structure with the initial road target detection model to obtain a current road target detection model; and deleting the CSP structure in the current road target detection model to obtain an updated road target detection model. Inputting the training set into the initial road target detection model for training to obtain the target road target detection model then includes: inputting the training set into the updated road target detection model for training to obtain a training result; inputting the training result into the RIOU loss function to obtain a corresponding loss value; and obtaining the target road target detection model through a back-propagation algorithm according to the loss value.
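The RIOU loss mentioned above is not fully specified in this excerpt. As an illustrative stand-in, a plain IoU-style regression loss (1 − IoU between a predicted box and a ground-truth box) can be sketched as follows; the function name and the (x1, y1, x2, y2) box format are assumptions, not from the patent.

```python
def iou_loss(pred, target):
    """Illustrative IoU-style loss: 1 - IoU of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2). The patent's RIOU loss is not detailed
    here, so this is only a hypothetical sketch of the general idea.
    """
    # Intersection rectangle
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter
    return 1.0 - inter / union if union > 0 else 1.0
```

The loss is 0 for a perfect match and 1 for disjoint boxes, so minimizing it by back-propagation pushes predicted boxes toward the labels.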
In some optional embodiments, the updated road target detection model includes a first CBL convolution module, a first Stage module, a second CBL convolution module, a second Stage module, a third CBL convolution module, a third Stage module, a fourth CBL convolution module, a fourth Stage module, and a Head detection Head module of the YOLO target detection model, and the updated road target detection model uses a shortcut structure; the first Stage module comprises 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers and 1 CBL convolutional layer which are connected in sequence, the second Stage module comprises 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers and 1 CBL convolutional layer which are connected in sequence, the third Stage module comprises 1 CBL convolutional layer, 1 STDC layer, 1 SPP layer, 2 RepVGG layers and 1 CBL convolutional layer which are connected in sequence, and the fourth Stage module comprises 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers and 1 CBL convolutional layer which are connected in sequence.
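The Stage layout described above can be written down as data for a quick sanity check; the dictionary keys and module-name strings are shorthand for the description, not actual class names.

```python
# Layer sequence of each Stage module as described in the text above:
# Stages 1, 2 and 4 use 1 CBL + 1 STDC + 3 RepVGG + 1 CBL layers,
# while Stage 3 replaces one RepVGG layer with an SPP layer.
STAGES = {
    "Stage1": ["CBL", "STDC", "RepVGG", "RepVGG", "RepVGG", "CBL"],
    "Stage2": ["CBL", "STDC", "RepVGG", "RepVGG", "RepVGG", "CBL"],
    "Stage3": ["CBL", "STDC", "SPP", "RepVGG", "RepVGG", "CBL"],
    "Stage4": ["CBL", "STDC", "RepVGG", "RepVGG", "RepVGG", "CBL"],
}
```

Each Stage contains six layers in sequence; only Stage 3 carries the SPP layer.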
In some optional embodiments, obtaining a corresponding sample set according to the historical road video includes: extracting frames from the historical road video to obtain frame images; labeling the frame image to obtain a corresponding labeled image; and dividing the marked image according to a preset division rule to obtain a corresponding sample set.
In some optional embodiments, the road target detection method further includes: inputting road information to a self-adaptive non-maximum suppression algorithm module, so that the self-adaptive non-maximum suppression algorithm module outputs corresponding target road information, wherein the target road information is maximum information in the road information; and receiving the target road information output by the self-adaptive non-maximum suppression algorithm module.
In a second aspect, an embodiment of the present application provides a road target detection device, which includes a current video acquisition module and a video input module. The current video acquisition module is used for acquiring a current road video; the video input module is used for inputting a current road video to the target road target detection model and obtaining corresponding road information, wherein the road information at least comprises a road target type and a road target position, and the target road target detection model is obtained based on a RepVGG model and a YOLO target detection model.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory; one or more processors coupled with the memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the road object detection method as provided in the first aspect above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which program codes are stored, and the program codes can be called by a processor to execute the road object detection method as provided in the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on a computer device, causes the computer device to execute the road object detection method as provided in the first aspect above.
According to the above scheme, a current road video is acquired and input into the target road target detection model to obtain corresponding road information, the road information comprising at least a road target type and a road target position. Because the target road target detection model is obtained based on the RepVGG model and the YOLO target detection model, the representation capability of the model is improved; using the model to perform road target detection on the current road video therefore improves the detection accuracy.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the embodiments or the prior-art descriptions are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a scene schematic diagram of a road object detection system according to an embodiment of the present application.
Fig. 2 is a schematic flowchart illustrating a road object detection method according to an embodiment of the present application.
Fig. 3 shows another schematic flow chart of a road object detection method provided in the embodiment of the present application.
Fig. 4 shows a schematic structural diagram of a RepVGG model in the road target detection method provided in the embodiment of the application.
Fig. 5 shows another structural schematic diagram of the RepVGG model in the road target detection method provided in the embodiment of the application.
Fig. 6 shows a schematic structural diagram of an STDC structure in a road object detection method according to an embodiment of the present application.
Fig. 7 shows a schematic structural diagram of a target RepVGG-STDC-YOLO model in the road target detection method provided by the embodiment of the application.
Fig. 8 shows a block diagram of a road object detection device according to an embodiment of the present application.
Fig. 9 shows a functional block diagram of an electronic device provided in an embodiment of the present application.
Fig. 10 illustrates a computer-readable storage medium provided in an embodiment of the present application for storing or carrying program codes for implementing a road object detection method provided in an embodiment of the present application.
Fig. 11 illustrates a computer program product provided in an embodiment of the present application for storing or carrying program codes for implementing a road object detection method provided in an embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the embodiments described below are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In addition, in the description of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Referring to fig. 1, a schematic view of an application scenario of a road object detection system provided in an embodiment of the present application is shown, which may include a road 100, a video capture device 200, and a processing device 300. The video capturing apparatus 200 is installed in the road 100 and is used to capture a road video of the road 100. The video capture device 200 is connected to the processing device 300 via a network and performs data interaction with the processing device 300 via the network.
The video capture device 200 may be a laser radar, a video camera, or the like, which is not limited herein.
The network may be any one of a ZigBee network, a Bluetooth (BT) network, a Wireless Fidelity (Wi-Fi) network, a Thread (home Internet of Things communication protocol) network, a Long Range Radio (LoRa) network, a Low-Power Wide-Area Network (LPWAN), an infrared network, a Narrow Band Internet of Things (NB-IoT) network, a Controller Area Network (CAN), a Digital Living Network Alliance (DLNA) network, a Wide Area Network (WAN), a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wireless Personal Area Network (WPAN), or the like.
The processing device 300 may be a server or a terminal device, and the like, which is not limited herein, and may be specifically configured according to actual requirements.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), big data, an artificial intelligence platform, and the like.
The terminal device may be a mobile terminal device (e.g., a vehicle-mounted terminal, a Personal Digital Assistant (PDA), a Tablet Personal Computer (Tablet pc), a notebook Computer, etc.), or a fixed terminal device (e.g., a desktop Computer, an intelligent panel, etc.), etc.
Referring to fig. 2, a flowchart of a road object detection method according to an embodiment of the present application is shown. In a specific embodiment, the road object detection method may be applied to the processing device 300 in the road object detection system shown in fig. 1, and the flow shown in fig. 2 is described in detail below by taking the processing device 300 as an example, and the road object detection method may include the following steps S110 to S130.
Step S110: and acquiring a current road video.
In the embodiment of the application, the video acquisition equipment is used for acquiring road videos of roads, the acquired road videos are sent to the processing equipment through the network, and the processing equipment receives the road videos sent by the video acquisition equipment through the network.
When receiving a detection instruction for detecting a road target, the processing device may send an acquisition instruction to the video acquisition device through the network. The video acquisition device receives and responds to the acquisition instruction, acquires the current road video of the road, and sends it to the processing device through the network; the processing device then receives the current road video returned by the video acquisition device.
In some embodiments, the processing device may detect a user operation, and receive a detection instruction to detect a road object when it is determined that a detection instruction to detect a road object is input by a user according to the detected user operation. For example, when a user needs to detect a road target, touch operation may be performed on an operation panel of the processing device, the processing device responds to the touch operation of the user to generate a corresponding touch signal and analyze the touch signal, and when it is determined that the touch signal is a preset detection signal, it is determined that a detection instruction for detecting the road target is received.
In some embodiments, the processing device may be provided with a voice recognition module. When a user needs to detect a road target, the user may utter voice information within the acquisition range of the voice recognition module. The voice recognition module acquires the voice information and performs voice recognition on it; when the recognition result contains a keyword indicating road target detection, for example "detect the road target" or "detect", it is determined that a detection instruction for detecting the road target has been received.
As an example, if the voice information uttered by the user is "detect the road target" and the recognition result of the voice recognition contains the keyword "road target detection", it is determined that a detection instruction for detecting the road target has been received.
Step S130: and inputting the current road video to the target road target detection model to obtain corresponding road information.
In the embodiment of the application, after the processing device obtains the current road video, it may input the current road video into the target road target detection model. The model receives the current road video, detects it, and outputs road information corresponding to the current road video to the processing device, which receives the road information. The road information comprises at least a road target type and a road target position. Because the target road target detection model is obtained based on the RepVGG model and the YOLO target detection model, the representation capability of the model is improved; using the model to detect road targets in the current road video therefore improves the detection accuracy.
In some embodiments, after the processing device inputs the current road video into the target road target detection model and obtains the corresponding road information, it may input the road information into an Adaptive Non-Maximum Suppression (ANMS) algorithm module. The ANMS module receives the road information and outputs the corresponding target road information to the processing device, where the target road information is the maximum information in the road information. This ensures the accuracy of the output target road information and improves the accuracy of road target detection. The ANMS algorithm module is used to suppress non-maximum road information output by the target road target detection model.
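The adaptive variant of NMS is not detailed in this excerpt, but the standard greedy non-maximum suppression it builds on can be sketched in a few lines. Boxes as (x1, y1, x2, y2) tuples and the function names are illustrative choices.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep each highest-scoring box, drop boxes that
    overlap an already-kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

An adaptive version would typically vary `iou_thresh` per detection (e.g. with local object density) rather than using one fixed threshold.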
According to the scheme provided by this embodiment, the current road video is acquired and input into the target road target detection model to obtain corresponding road information, the road information comprising at least a road target type and a road target position. Because the target road target detection model is obtained based on the RepVGG model and the YOLO target detection model, the representation capability of the model is improved; using the model to detect road targets in the current road video therefore improves the detection accuracy.
Referring to fig. 3, a flowchart of a road object detection method according to another embodiment of the present application is shown. In a specific embodiment, the road object detection method may be applied to the processing device 300 in the road object detection system shown in fig. 1, and the flow shown in fig. 3 is described in detail below by taking the processing device 300 as an example, and the road object detection method may include the following steps S210 to S290.
Step S210: and acquiring a current road video.
In this embodiment, the step S210 may refer to the content of the corresponding step in the foregoing embodiments, and is not described herein again.
Step S230: and acquiring a corresponding sample set according to the historical road video.
In this embodiment, the processing device may obtain a sample set corresponding to the historical road video according to the historical road video. The historical road video is the road video acquired by the video acquisition equipment at the historical moment, the sample set can comprise a training set and a test set, the training set can be used for training a network model for detecting a road target, and the test set can be used for testing the trained network model and determining whether to stop training the network model according to a test result.
The test result is either a convergence result or a non-convergence result: when the test result indicates convergence, training of the network model is stopped; when it indicates non-convergence, training of the network model continues.
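The train-then-test loop above can be sketched as a simple stop rule. The concrete convergence criterion (stop when the test metric no longer improves by more than a tolerance) is a hypothetical choice for illustration; the patent does not specify one.

```python
def train_until_converged(train_one_epoch, test_metric,
                          max_epochs=100, min_gain=1e-3):
    """Hypothetical stop rule for the train/test loop described above:
    keep training while the test metric still improves by min_gain."""
    best = float("-inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()               # train on the training set
        m = test_metric()               # evaluate on the test set
        if m - best <= min_gain:
            return epoch, best          # converged: stop training
        best = m
    return max_epochs, best             # not converged within max_epochs
```

`train_one_epoch` and `test_metric` stand in for the actual training and evaluation routines.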
Specifically, the processing device may perform frame extraction processing on the historical road video to obtain a frame image, label the frame image to obtain a labeled image corresponding to the frame image, and divide the labeled image according to a preset division rule to obtain a sample set corresponding to the labeled image. In addition, the processing device may label all frame images corresponding to the historical road video to obtain corresponding labeled images, and divide the labeled images according to preset division rules to obtain sample sets corresponding to the labeled images.
The preset division rule may be set manually. For example, the preset division rule may be training set : test set = 7:5, so that when the sample set contains 12000 annotated images, the training set contains 7000 images and the test set contains 5000; or the preset division rule may be training set : test set = 1:1, so that when the sample set contains 12000 annotated images, the training set and the test set each contain 6000 images. The type of the preset division rule is not limited here and may be set according to actual requirements.
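The division rule above can be sketched as a small helper that splits the annotated images by a ratio such as 7:5 or 1:1. The function name, the random shuffle, and the seed are illustrative choices, not from the patent.

```python
import random

def split_samples(annotated_images, train_ratio=7, test_ratio=5, seed=0):
    """Split annotated images into training/test sets by a ratio such as
    7:5 (e.g. 12000 images -> 7000 train / 5000 test)."""
    items = list(annotated_images)
    random.Random(seed).shuffle(items)  # shuffle to avoid ordering bias
    n_train = len(items) * train_ratio // (train_ratio + test_ratio)
    return items[:n_train], items[n_train:]
```

With 12000 images and the default 7:5 ratio this yields 7000 training and 5000 test images, matching the example in the text.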
In this embodiment, the historical road video may be a road video acquired under multiple acquisition conditions, where the multiple acquisition conditions are at least any one of multiple roads, multiple monitoring angles, multiple time periods, multiple weathers (e.g., sunny days, rainy days, or foggy days), and the like, and a corresponding sample set is acquired according to the road video acquired under the multiple acquisition conditions, so that it is ensured that a detection model trained according to the sample set has generalization, and detection accuracy of the detection model is ensured.
When the processing device labels the frame image, the processing device mainly labels the road target type and the road target position in the frame image, for example, labels the road target type in the frame image with a label frame, and labels the center point position of the label frame, where the center point position of the label frame is the road target position corresponding to the road target type.
For example, road target types may include pedestrians, bicycles, type-A take-away (delivery) vehicles, type-B take-away vehicles, type-C take-away vehicles, ordinary tricycles, express tricycles, cars, buses, pickup trucks, large trucks, and the like.
Step S250: and obtaining a target road target detection model according to the RepVGG model, the YOLO target detection model and the training set.
In this embodiment, after the processing device obtains the corresponding sample set according to the historical road video, it may obtain the target road target detection model according to the RepVGG model, the YOLO target detection model, and the training set.
The RepVGG model is a simple and powerful CNN structure. A high-performance multi-branch model is used during training, as shown in FIG. 4, while a fast, memory-saving plain model is used during inference, as shown in FIG. 5, so that the RepVGG model balances speed and accuracy.
When the RepVGG model is trained, an identity branch and a 1×1 convolution branch are added to each RepVGG Block; the rich branch combinations help improve the representation capability of the model, while the introduced residual connections prevent overfitting and make the RepVGG model easier to converge. In the model inference stage, all network layers are converted into 3×3 convolutions through a fusion strategy, which facilitates deployment and acceleration of the model. During fusion, the convolutional layer and the BN layer in each residual block are fused; the fused weight and bias are as follows:
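Before the BN fusion, the branch merging itself can be illustrated for a single channel: the 1×1 branch and the identity branch both act only on the centre of the receptive field, so they fold into the centre of the 3×3 kernel. A minimal single-channel sketch, ignoring BN:

```python
def merge_branches(k3, w1, has_identity):
    """Fold a 1x1 branch weight w1 and an optional identity branch into
    a single-channel 3x3 kernel k3 (list of 3 rows of 3 floats).

    Single-channel sketch only; real RepVGG merges per input/output
    channel pair, after fusing each branch's BN layer.
    """
    merged = [row[:] for row in k3]          # copy, leave k3 untouched
    # 1x1 conv and identity (weight 1) only touch the kernel centre.
    merged[1][1] += w1 + (1.0 if has_identity else 0.0)
    return merged
```

Convolving with the merged kernel gives the same output as summing the three branches, which is why inference needs only plain 3×3 convolutions.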
W' = (γ / √(σ + ε)) · W

where W is the weight of the original convolutional layer and W' is the weight of the fused convolutional layer;

b' = β − (γ · μ) / √(σ + ε)

where μ is the mean of the BN layer, σ is the variance of the BN layer, γ is the scale factor of the BN layer, β is the offset factor of the BN layer, ε is a small constant for numerical stability, and b' is the bias after fusion.
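The fused parameters can be checked numerically with a scalar (per-channel) sketch: applying the fused weight and bias must give the same output as convolution followed by BN. The function names and the ε constant are illustrative.

```python
import math

def fuse_conv_bn(w, mu, sigma, gamma, beta, eps=1e-5):
    """Scalar form of the fusion formulas:
    W' = gamma * W / sqrt(sigma + eps),
    b' = beta - gamma * mu / sqrt(sigma + eps)."""
    scale = gamma / math.sqrt(sigma + eps)
    return w * scale, beta - mu * scale

def bn(y, mu, sigma, gamma, beta, eps=1e-5):
    """Batch normalization applied to a conv output y."""
    return gamma * (y - mu) / math.sqrt(sigma + eps) + beta
```

For any input x, `bn(w * x, ...)` equals `w2 * x + b2` with `w2, b2 = fuse_conv_bn(w, ...)`, which is what lets the BN layer be absorbed into the convolution at inference time.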
In some embodiments, after obtaining the corresponding sample set according to the historical road video, the processing device may fuse the RepVGG structure of the RepVGG model, the CSP structure of the YOLO target detection model, and the SPP structure of the YOLO target detection model to obtain an initial road target detection model (RepVGG-YOLO model), and input the training set into the initial road target detection model for training to obtain the target road target detection model.
In some embodiments, after acquiring the corresponding sample set according to the historical road video, the processing device may fuse the RepVGG model, the CSP structure of the YOLO target detection model, and the SPP structure of the YOLO target detection model to obtain an initial road target detection model; fuse the Short-Term Dense Concatenate (STDC) structure of the BiSeNet semantic segmentation model with the initial road target detection model to obtain a current road target detection model (initial RepVGG-STDC-YOLO model); delete the CSP structure in the current road target detection model to obtain an updated road target detection model (target RepVGG-STDC-YOLO model); and input the training set into the updated road target detection model for training to obtain the target road target detection model. Adding the STDC structure combines low-level features with high-level features and enriches the feature information of the network, while deleting the CSP structure simplifies the network structure of the updated road target detection model without degrading its expression performance.
As shown in fig. 6, a schematic diagram of the STDC network model is shown. The STDC network model comprises a first feature module, Block1, a second feature module, Block2, a third feature module, Block3, a fourth feature module, Block4, a fifth feature module, an average pooling module (AVG Pool module), a splicing module (Concat module), and a sixth feature module. The first feature module is the input layer and comprises N feature map channels.
Block1, Block2, Block3, and Block4 each apply, in sequence, a convolution operation, a BN batch normalization operation, and a ReLU activation function to their input. Each Block keeps the spatial size of the feature map unchanged; Block1, Block2, and Block3 each halve the number of feature map channels, while Block4 outputs the same number of feature map channels as the fourth feature module (i.e., the channel count of the fifth feature module is unchanged).
Block1 comprises a Conv 1×1 layer; the first feature module passes through Block1 to obtain the second feature module, which comprises N/2 feature map channels. Block2 comprises a Conv 3×3 layer; the second feature module passes through Block2 to obtain the third feature module, which comprises N/4 feature map channels. Block3 comprises a Conv 3×3 layer; the third feature module passes through Block3 to obtain the fourth feature module, which comprises N/8 feature map channels. Block4 comprises a Conv 3×3 layer; the fourth feature module passes through Block4 to obtain the fifth feature module, which comprises N/8 feature map channels.
The second feature module is input to the Concat module after passing through the AVG Pool module, which is a 3×3 pooling layer. The Concat module fuses the output of the AVG Pool module with the third, fourth, and fifth feature modules to obtain the sixth feature module, which comprises N feature map channels.
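A quick arithmetic check confirms that the concatenation described above recovers the original channel count. The function below simply sums the channel counts of the four concatenated paths (the AVG Pool path keeps Block1's N/2 channels); it is an illustration, not part of the patent:

```python
def stdc_concat_channels(N):
    """Channel count of the sixth feature module for an N-channel input."""
    block1 = N // 2   # second feature module (passed through AVG Pool)
    block2 = N // 4   # third feature module
    block3 = N // 8   # fourth feature module
    block4 = N // 8   # fifth feature module (same count as the fourth)
    # Concat fuses the four paths along the channel dimension.
    return block1 + block2 + block3 + block4
```

For any N divisible by 8, N/2 + N/4 + N/8 + N/8 = N, so the fused sixth feature module again has N channels while the deeper blocks individually stayed narrow.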
The STDC network model has few deep feature map channels (those of the fourth and fifth feature modules) and many shallow feature map channels (those of the first, second, and third feature modules). The shallow layers of the STDC network model focus on encoding detail information, while the deep layers focus on high-level information; an excessive number of feature map channels in the deep layers would cause information redundancy. In the STDC network model, the number of feature map channels is gradually reduced as the network deepens, which reduces the computational cost of the STDC network model.
The SPP structure of the YOLO target detection model can output feature maps with fixed sizes for images with different sizes, and the receptive field of the network can be increased through spatial pyramid pooling.
In an application scenario, as shown in fig. 7, a schematic structural diagram of the target RepVGG-STDC-YOLO model is shown. The target RepVGG-STDC-YOLO model comprises, connected in sequence, a first CBL convolution module, a first Stage module (Stage 1), a second CBL convolution module, a second Stage module (Stage 2), a third CBL convolution module, a third Stage module (Stage 3), a fourth CBL convolution module, a fourth Stage module (Stage 4), and the Head detection module of the YOLO target detection model.
The first CBL convolution module comprises, in sequence, 2 CBL convolutional layers, 1 STDC layer, and 1 CBL convolutional layer; the second, third, and fourth CBL convolution modules each comprise 2 CBL convolutional layers. A CBL convolutional layer comprises a convolutional layer, a BN layer, and a Leaky ReLU activation layer.
The Stage1 module comprises, connected in sequence, 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers, and 1 CBL convolutional layer; the Stage2 module comprises, connected in sequence, 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers, and 1 CBL convolutional layer; the Stage3 module comprises, connected in sequence, 1 CBL convolutional layer, 1 STDC layer, 1 SPP layer, 2 RepVGG layers, and 1 CBL convolutional layer; and the Stage4 module comprises, connected in sequence, 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers, and 1 CBL convolutional layer.
The target RepVGG-STDC-YOLO model uses a shortcut structure, which combines low-level features with high-level features and enriches the feature information of the network. At the same time, the STDC structure connects the feature maps of several consecutive layers; by reducing the dimension of the feature maps, the computational cost of the target RepVGG-STDC-YOLO model is greatly reduced, while the combination of features from multiple network layers preserves the model's target detection performance.
In some embodiments, after obtaining the corresponding sample set according to the historical road video, the processing device may fuse the RepVGG model, the CSP structure of the YOLO target detection model, and the SPP structure of the YOLO target detection model to obtain a RepVGG-YOLO model; fuse the short-term dense cascade STDC structure with the RepVGG-YOLO model to obtain an initial RepVGG-STDC-YOLO model; delete the CSP structure in the initial RepVGG-STDC-YOLO model to obtain a target RepVGG-STDC-YOLO model; input the training set into the target RepVGG-STDC-YOLO model for training to obtain a training result; input the training result into the RIOU loss function to obtain a corresponding loss value; and iteratively train the target RepVGG-STDC-YOLO model through a back propagation algorithm according to the loss value to obtain the target road target detection model. When training with RIOU, the gradients of the large number of simple samples (large IOU) are increased so that the network pays more attention to them, while the gradients of the small number of difficult samples (small IOU) are suppressed. This balances the contribution of each type of sample and makes the training process more efficient and stable.
The RIOU loss function is a target-box loss function and can be calculated from the intersection-over-union IOU of the prediction box and the real box according to the following formula.
RIOU = 0.5 × (IOU + U/C), where IOU = I/U, I is the intersection area of the prediction box and the real box, U is the union area of the prediction box and the real box, and C is the area of the smallest rectangle that can enclose both the prediction box and the real box.
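The formula above can be sketched directly for axis-aligned boxes in (x1, y1, x2, y2) form; the function name is an assumption for illustration:

```python
def riou(pred, gt):
    """RIOU = 0.5 * (IOU + U / C) for axis-aligned boxes (x1, y1, x2, y2)."""
    # I: intersection area of the two boxes.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # U: union area.
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    # C: area of the smallest enclosing rectangle.
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return 0.5 * (inter / union + union / c_area)
```

For identical boxes both IOU and U/C equal 1, so RIOU reaches its maximum of 1; as boxes drift apart, U/C shrinks toward 0 even when IOU is already 0, which is what lets the loss still provide a signal for non-overlapping boxes.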
Step S270: and determining whether the target road target detection model converges.
In this embodiment, the sample set may further include a test set, and in order to obtain a stable target road target detection model, the processing device may determine whether the target road target detection model converges according to the test set. Specifically, the processing device may input the test set to the target road target detection model; the target road target detection model receives and responds to the test set, tests it to obtain prediction information, and outputs the prediction information to the processing device; the processing device receives the prediction information, determines the difference values between the real information of the test set and the prediction information, and determines whether the target road target detection model converges according to these difference values.
When all of the difference values fall within a preset range, the difference values are stable, and the target road target detection model is determined to have converged; when any difference value falls outside the preset range, the difference values are not stable, and the target road target detection model is determined not to have converged.
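The convergence rule above amounts to a simple range check over the collected difference values; a minimal sketch (function name and range representation are assumptions):

```python
def has_converged(diff_values, lower, upper):
    """Converged iff every difference value lies in the preset range [lower, upper]."""
    return all(lower <= d <= upper for d in diff_values)
```

In practice the preset range would come from the user or be derived from earlier training runs, as the surrounding text notes.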
The preset range may be a difference value range preset by a user, or may be a difference value range automatically generated by a processing device according to a training process of training an initial road target detection model for multiple times, and the like, which is not limited herein.
In some embodiments, in order to improve the efficiency of training the target RepVGG-STDC-YOLO model, the processing device may obtain the target road target detection model after training the target RepVGG-STDC-YOLO model a preset number of times, and input the test set into the target road target detection model; the target road target detection model receives and responds to the test set, tests it to obtain prediction information, and outputs the prediction information to the processing device; the processing device receives the prediction information, determines the difference values between the real information of the test set and the prediction information, and determines whether the target road target detection model converges according to these difference values.
The preset times may be times preset by a user, or times generated automatically in a training process of training the target RepVGG-STDC-YOLO model by the processing device for multiple times, and the like, and the preset times are not limited herein and may be specifically set according to actual requirements.
As an example, the preset number may be 300: the processing device may obtain the target road target detection model after 300 rounds of training of the target RepVGG-STDC-YOLO model, and input the test set into the target road target detection model; the target road target detection model receives and responds to the test set, tests it to obtain prediction information, and outputs the prediction information to the processing device; the processing device receives the prediction information, determines the difference values between the real information of the test set and the prediction information, and determines whether the target road target detection model converges according to these difference values.
Step S290: and when the target detection model of the target road is determined to be converged, inputting the current road video to the target detection model of the target road to obtain corresponding road information.
In this embodiment, step S290 may refer to the content of the corresponding step in the foregoing embodiments, and is not described herein again.
According to the scheme provided by this embodiment, the current road video is obtained, the corresponding sample set is obtained according to the historical road video, the target road target detection model is obtained according to the RepVGG model, the YOLO target detection model, and the training set, whether the target road target detection model has converged is determined, and when it is determined to have converged, the current road video is input to the target road target detection model to obtain the corresponding road information.
Further, whether the target road target detection model is converged is determined according to the test set, and when the target road target detection model is determined to be converged according to the test set, road target detection is performed on the current road video, so that the stability of the target road target detection model for detecting the road video is ensured, and the detection reliability of the road target is improved.
Referring to fig. 8, a road object detection device 300 according to an embodiment of the present application is illustrated. The road object detection device 300 may be applied to the processing device in the road object detection system shown in fig. 1; the road object detection device 300 shown in fig. 8 is described in detail below and may include a current video obtaining module 310 and a video input module 330.
The current video obtaining module 310 may be configured to obtain a current road video; the video input module 330 may be configured to input a current road video to a target road target detection model, to obtain corresponding road information, where the road information at least includes a road target type and a road target position, and the target road target detection model is obtained based on a RepVGG model and a YOLO target detection model.
In some embodiments, the road target detection device 300 may further include a sample acquisition module, an obtaining module, and a determining module.
The sample acquisition module may be configured to acquire a corresponding sample set according to the historical road video before the video input module 330 inputs the current road video to the target road target detection model and obtains the corresponding road information, where the sample set at least includes a training set; the obtaining module may be used to obtain the target road target detection model according to the RepVGG model, the YOLO target detection model, and the training set; the determination module may be configured to determine whether the target road target detection model converges.
In some implementations, the video input module 330 can include an input unit.
The input unit may be configured to input the current road video to the target road target detection model when it is determined that the target road target detection model converges, and obtain corresponding road information.
In some embodiments, the obtaining module may include a fusion unit and a training unit.
The fusion unit may be configured to fuse the RepVGG model, the CSP structure of the YOLO target detection model, and the SPP structure of the YOLO target detection model to obtain an initial road target detection model; the training unit may be configured to input the training set to the initial road target detection model for training to obtain the target road target detection model.
In some embodiments, the road object detection device 300 may further include a fusion module and a deletion module.
The fusion module may be configured to fuse the short-term dense cascade STDC structure with the initial road target detection model to obtain the current road target detection model before the training unit inputs the training set into the initial road target detection model for training to obtain the target road target detection model; the deleting module may be configured to delete the CSP structure in the current road target detection model to obtain the updated road target detection model.
In some embodiments, a training unit may include a training subunit, an input subunit, and an obtaining subunit.
The training subunit may be configured to input the training set to the updated road target detection model for training, so as to obtain a training result; the input subunit may be configured to input the training result to the RIOU loss function, to obtain a corresponding loss value; the obtaining subunit may be configured to obtain, according to the loss value, a target road target detection model through a back propagation algorithm.
In some embodiments, the updated road target detection model may include a first CBL convolution module, a first Stage module, a second CBL convolution module, a second Stage module, a third CBL convolution module, a third Stage module, a fourth CBL convolution module, a fourth Stage module, and the Head detection module of the YOLO target detection model, the updated road target detection model using a shortcut structure. The first Stage module comprises, connected in sequence, 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers, and 1 CBL convolutional layer; the second Stage module comprises, connected in sequence, 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers, and 1 CBL convolutional layer; the third Stage module comprises, connected in sequence, 1 CBL convolutional layer, 1 STDC layer, 1 SPP layer, 2 RepVGG layers, and 1 CBL convolutional layer; and the fourth Stage module comprises, connected in sequence, 1 CBL convolutional layer, 1 STDC layer, 3 RepVGG layers, and 1 CBL convolutional layer.
In some embodiments, the sample acquisition module may include a framing unit, a labeling unit, and a dividing unit.
The frame extracting unit can be used for extracting frames from the historical road video to obtain a frame image; the labeling unit can be used for labeling the frame image to obtain a corresponding labeled image; the dividing unit may be configured to divide the labeled image according to a preset dividing rule, so as to obtain a corresponding sample set.
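The dividing step can be sketched as follows. The patent leaves the preset dividing rule unspecified, so the 8:1:1 train/val/test split, the function name, and the fixed shuffle seed are all assumptions for illustration:

```python
import random

def split_samples(labeled_images, ratios=(0.8, 0.1, 0.1), seed=0):
    """Divide labeled images into train/val/test sets by a preset rule
    (here an assumed 8:1:1 random split with a fixed seed)."""
    items = list(labeled_images)
    random.Random(seed).shuffle(items)  # deterministic shuffle
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }
```

Every labeled image ends up in exactly one subset, so the training set used by the obtaining module and the test set used by the determining module never overlap.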
In some embodiments, the road object detecting device 300 may further include an information input module and a receiving module.
The information input module can be used for inputting road information to the self-adaptive non-maximum suppression algorithm module, so that the self-adaptive non-maximum suppression algorithm module outputs corresponding target road information, and the target road information is maximum information in the road information; the receiving module may be configured to receive the target road information output by the adaptive non-maxima suppression algorithm module.
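A minimal sketch of an adaptive non-maximum suppression pass is shown below. In the published Adaptive NMS technique the per-box density would come from a learned density-prediction head; here the densities are supplied directly, and the function name, the base threshold nt, and the box format are assumptions for illustration:

```python
def adaptive_nms(boxes, scores, densities, nt=0.5):
    """Greedy NMS whose IoU suppression threshold adapts per box: max(nt, density).

    boxes: list of (x1, y1, x2, y2); scores, densities: parallel lists of floats.
    Returns indices of the kept boxes, highest score first.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        ua = ((a[2] - a[0]) * (a[3] - a[1])
              + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / ua if ua > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        thr = max(nt, densities[i])  # crowded regions get a looser threshold
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thr]
    return keep
```

In a crowded scene a high density raises the threshold, so heavily overlapping boxes around distinct nearby targets survive; in sparse regions the base threshold applies and duplicates are suppressed as in standard NMS.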
According to the scheme provided by this embodiment, the current road video is obtained and input to the target road target detection model to obtain the corresponding road information, where the road information at least comprises a road target type and a road target position, and the target road target detection model is obtained based on the RepVGG model and the YOLO target detection model. Constructing the target road target detection model from the RepVGG model and the YOLO target detection model improves the representation capability of the model, and using it for road target detection on the current road video improves the precision with which road targets are detected.
It should be noted that, in this specification, each embodiment is described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same as and similar to each other in each embodiment may be referred to. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. For any processing manner described in the method embodiment, all the processing manners may be implemented by corresponding processing modules in the apparatus embodiment, and details in the apparatus embodiment are not described again.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 9, which shows a functional block diagram of an electronic device 400 provided by an embodiment of the present application, the electronic device 400 may include one or more of the following components: memory 410, processor 420, and one or more applications, wherein the one or more applications may be stored in memory 410 and configured to be executed by the one or more processors 420, the one or more applications configured to perform a method as described in the aforementioned method embodiments.
The Memory 410 may include a Random Access Memory (RAM) or a Read-Only Memory. The memory 410 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 410 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as obtaining a current road video, inputting the current road video, obtaining road information, obtaining a sample set, obtaining a target road target detection model, determining whether the target road target detection model converges, determining that the target road target detection model converges, fusing network models, obtaining an initial road target detection model, inputting a training set, training the initial road target detection model, fusing the STDC network model and the initial road target detection model, obtaining a current road target detection model, deleting a CSP network model, obtaining a target road target detection model, performing frame extraction on the historical road video, obtaining frame images, labeling frame images, obtaining labeled images, dividing labeled images, obtaining a sample set, and obtaining target road information, etc.), instructions for implementing the various method embodiments described above, and the like.
The storage data area may further store data (such as a current road video, a target road target detection model, road information, a road target type, a road target position, a RepVGG model, a YOLO target detection model, a historical road video, a sample set, a training set, a CSP structure, an SPP structure, an initial road target detection model, an STDC structure, a current road target detection model, an updated road target detection model, a first CBL convolution module, a Stage1 module, a second CBL convolution module, a Stage2 module, a third CBL convolution module, a Stage3 module, a fourth CBL convolution module, a Stage4 module, a Head detection module of the YOLO target detection model, a shortcut structure, a CBL convolutional layer, an STDC layer, a RepVGG layer, an SPP layer, a frame image, a labeled image, a preset dividing rule, an Adaptive NMS algorithm, target road information, and maximum value information) created by the electronic device 400 in use, and the like.
Processor 420 may include one or more processing cores. The processor 420 connects various parts throughout the electronic device 400 using various interfaces and lines, and performs various functions of the electronic device 400 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 410 and calling data stored in the memory 410. Optionally, the processor 420 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 420 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and the like; the GPU renders and draws display content; the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 420 but be implemented by a separate communication chip.
Referring to fig. 10, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer readable storage medium 500 has stored therein a program code 510, and the program code 510 can be called by a processor to execute the method described in the above method embodiments.
The computer-readable storage medium 500 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 500 includes a non-volatile computer-readable medium. The computer-readable storage medium 500 has storage space for program code 510 for performing any of the method steps of the methods described above. The program code can be read from or written to one or more computer program products. The program code 510 may, for example, be compressed in a suitable form.
Referring to fig. 11, a block diagram of a computer program product 600 according to an embodiment of the present application is shown. The computer program product 600 includes a computer program/instructions 610, the computer program/instructions 610 being stored in a computer readable storage medium of a computer device. When the computer program product 600 runs on a computer device, the processor of the computer device reads the computer program/instructions 610 from the computer-readable storage medium, and executes the computer program/instructions 610, so that the computer device performs the method described in the above method embodiments.
According to the scheme provided by this embodiment, the current road video is obtained and input to the target road target detection model to obtain the corresponding road information, where the road information at least comprises a road target type and a road target position, and the target road target detection model is obtained based on the RepVGG model and the YOLO target detection model. Constructing the target road target detection model from the RepVGG model and the YOLO target detection model improves the representation capability of the model, and using it for road target detection on the current road video improves the precision with which road targets are detected.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method of detecting a road target, comprising:
acquiring a current road video;
and inputting the current road video to a target road target detection model to obtain corresponding road information, wherein the road information at least comprises a road target type and a road target position, and the target road target detection model is obtained based on a RepVGG model and a YOLO target detection model.
2. The method of claim 1, wherein before inputting the current road video to a target road target detection model and obtaining corresponding road information, the method further comprises:
acquiring a corresponding sample set according to the historical road video, wherein the sample set at least comprises a training set;
obtaining a target road target detection model according to the RepVGG model, the YOLO target detection model and the training set;
determining whether the target road target detection model converges;
the step of inputting the current road video to a target road target detection model to obtain corresponding road information comprises the following steps:
and when the target road target detection model is determined to be converged, inputting the current road video to the target road target detection model to obtain corresponding road information.
3. The road target detection method according to claim 2, wherein obtaining a target road target detection model according to the RepVGG model, the YOLO target detection model and the training set comprises:
fusing a RepVGG structure in the RepVGG model, a CSP structure of a YOLO target detection model and an SPP structure of the YOLO target detection model to obtain an initial road target detection model;
and inputting the training set into the initial road target detection model for training to obtain a target road target detection model.
4. The road target detection method of claim 3, wherein before inputting the training set into the initial road target detection model for training and obtaining a target road target detection model, the method further comprises:
fusing a short-term dense cascade STDC structure and the initial road target detection model to obtain a current road target detection model;
deleting the CSP structure in the current road target detection model to obtain an updated road target detection model;
inputting the training set into the initial road target detection model for training to obtain a target road target detection model, wherein the training comprises:
inputting the training set into the updated road target detection model for training to obtain a training result;
inputting the training result into the RIOU loss function to obtain a corresponding loss value;
and obtaining a target road target detection model through a back propagation algorithm according to the loss value.
5. The road target detection method according to claim 4, wherein the updated road target detection model includes a first CBL convolution module, a first Stage module, a second CBL convolution module, a second Stage module, a third CBL convolution module, a third Stage module, a fourth CBL convolution module, a fourth Stage module, and a Head detection Head module of a YOLO target detection model, and the updated road target detection model uses a shortcut structure;
first Stage module comprises 1 layer CBL convolutional layer, 1 layer STDC layer, 3 layers of RepVGG layer and 1 layer CBL convolutional layer that connect gradually, second Stage module comprises 1 layer CBL convolutional layer, 1 layer STDC layer, 3 layers of RepVGG layer and 1 layer CBL convolutional layer that connect gradually, third Stage module comprises 1 layer CBL convolutional layer, 1 layer STDC layer, 1 layer SPP layer, 2 layers of RepVGG layer and 1 layer CBL convolutional layer that connect gradually, fourth Stage module comprises 1 layer CBL convolutional layer, 1 layer STDC layer, 3 layers of RepVGG layer and 1 layer CBL convolutional layer that connect gradually.
6. The method for detecting road target according to claim 2, wherein the obtaining a corresponding sample set according to the historical road video comprises:
performing frame extraction on the historical road video to obtain a frame image;
labeling the frame image to obtain a corresponding labeled image;
and dividing the marked image according to a preset division rule to obtain a corresponding sample set.
7. The road target detection method according to any one of claims 1 to 6, characterized by further comprising:
inputting the road information to a self-adaptive non-maximum suppression algorithm module, so that the self-adaptive non-maximum suppression algorithm module outputs corresponding target road information, wherein the target road information is maximum information in the road information;
and receiving the target road information output by the self-adaptive non-maximum suppression algorithm module.
8. A road target detection device, comprising:
a current video acquisition module configured to acquire a current road video;
and a video input module configured to input the current road video into a target road target detection model to obtain corresponding road information, wherein the road information at least comprises a road target type and a road target position, and the target road target detection model is obtained based on a RepVGG model and a YOLO target detection model.
9. An electronic device, comprising:
a memory;
one or more processors coupled with the memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the road target detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein program code is stored therein, the program code being callable by a processor to perform the road target detection method according to any one of claims 1 to 7.
CN202211373265.7A 2022-11-03 2022-11-03 Road target detection method, detection device, electronic equipment and storage medium Pending CN115775367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211373265.7A CN115775367A (en) 2022-11-03 2022-11-03 Road target detection method, detection device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211373265.7A CN115775367A (en) 2022-11-03 2022-11-03 Road target detection method, detection device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115775367A true CN115775367A (en) 2023-03-10

Family

ID=85388745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211373265.7A Pending CN115775367A (en) 2022-11-03 2022-11-03 Road target detection method, detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115775367A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523902A (en) * 2023-06-21 2023-08-01 湖南盛鼎科技发展有限责任公司 Electronic powder coating uniformity detection method and device based on improved YOLOV5
CN116523902B (en) * 2023-06-21 2023-09-26 湖南盛鼎科技发展有限责任公司 Electronic powder coating uniformity detection method and device based on improved YOLOV5

Similar Documents

Publication Publication Date Title
CN111626208B (en) Method and device for detecting small objects
CN112560999B (en) Target detection model training method and device, electronic equipment and storage medium
CN112307978B (en) Target detection method and device, electronic equipment and readable storage medium
CN113155173B (en) Perception performance evaluation method and device, electronic device and storage medium
CN112330664A (en) Pavement disease detection method and device, electronic equipment and storage medium
CN112037142B (en) Image denoising method, device, computer and readable storage medium
CN114495128B (en) Subtitle information detection method, device, equipment and storage medium
CN111522838A (en) Address similarity calculation method and related device
CN115775367A (en) Road target detection method, detection device, electronic equipment and storage medium
CN114708426A (en) Target detection method, model training method, device, equipment and storage medium
CN115565044A (en) Target detection method and system
CN112712036A (en) Traffic sign recognition method and device, electronic equipment and computer storage medium
CN113139110B (en) Regional characteristic processing method, regional characteristic processing device, regional characteristic processing equipment, storage medium and program product
CN115775366A (en) Road target detection method, detection device, electronic equipment and storage medium
CN115689946B (en) Image restoration method, electronic device and computer program product
CN112215188A (en) Traffic police gesture recognition method, device, equipment and storage medium
CN113450794B (en) Navigation broadcasting detection method and device, electronic equipment and medium
US20200380262A1 (en) Sampling streaming signals at elastic sampling rates
US20210056345A1 (en) Creating signal sequences
CN114648712A (en) Video classification method and device, electronic equipment and computer-readable storage medium
CN114419357B (en) Data processing method, data processing device, computer and readable storage medium
CN115375987B (en) Data labeling method and device, electronic equipment and storage medium
CN115982306B (en) Method and device for identifying retrograde behavior of target object
CN114677570B (en) Road information updating method, device, electronic equipment and storage medium
CN117746418A (en) Target detection model construction method, target detection method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination