CN116958915A - Target detection method, target detection device, electronic equipment and storage medium - Google Patents

Target detection method, target detection device, electronic equipment and storage medium

Info

Publication number
CN116958915A
CN116958915A (application CN202311224701.9A)
Authority
CN
China
Prior art keywords
image
region
target
target object
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311224701.9A
Other languages
Chinese (zh)
Inventor
Yan Xudong (燕旭东)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311224701.9A
Publication of CN116958915A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application provides a target detection method, a target detection device, electronic equipment and a storage medium, relating to the technical fields of artificial intelligence, maps, internet of vehicles, intelligent traffic and the like. A region to be searched in a second image is preliminarily delineated based on a first object region of a first image, and the offset position of a target object from the first image to the second image is obtained by using the first image feature of the first object region together with the second image feature and the position feature of the region to be searched; the offset position and the first positioning information of the first object region are then used to locate the target object in the second image, achieving accurate positioning of the target object within the region to be searched. By exploiting the continuity between successive images in the image sequence, continuous positioning detection of the target object across multiple frames is achieved, greatly improving the accuracy of target detection. Restrictions on sample images and element object categories are also removed, further improving the practicability and robustness of target detection.

Description

Target detection method, target detection device, electronic equipment and storage medium
Technical Field
The application relates to the technical fields of artificial intelligence, maps, internet of vehicles, intelligent traffic and the like, and provides a target detection method, a target detection device, electronic equipment and a storage medium.
Background
With the development of computer technology and artificial intelligence, target detection is widely applied in various fields and scenes. For example, traffic elements in road images are detected in a traffic scene.
In the related art, a trained classification neural network is first used to identify a plurality of traffic elements in a road image, and the identified traffic elements are then detected according to gradient information of the image pixels.
However, the above method places high requirements on image quality and is prone to false detections in actual operation, resulting in low detection accuracy.
Disclosure of Invention
The application provides a target detection method, a target detection device, electronic equipment and a storage medium, which can solve the problem of low detection accuracy in the related technology. The technical scheme is as follows:
in one aspect, a method for detecting a target is provided, the method comprising:
determining a region to be searched of a second image in an image sequence based on a first object region of a first image in the image sequence, wherein the second image follows the first image, and the first object region is a region where a target object is located in the first image;
respectively determining a first image characteristic of the first object area, a second image characteristic and a position characteristic of the area to be searched;
Determining an offset position of the target object corresponding to the first image to the second image based on the first image feature, the second image feature and the position feature;
and obtaining a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image.
In one possible implementation manner, the determining the first image feature of the first object area, and the second image feature and the position feature of the area to be searched respectively includes:
respectively extracting a first image feature of the first object region and a second image feature of the region to be searched through a trained first extraction network;
and extracting the position characteristics of the region to be searched through a trained second extraction network, wherein the position characteristics represent the image positions of the pixel points in the region to be searched and the distances between those pixel points and the image acquisition equipment.
In one possible implementation manner, the determining, based on the first object area of the first image in the image sequence, the area to be searched of the second image in the image sequence includes:
Determining, based on a first image position corresponding to the first object area, an image mapping area of the first object area in the second image;
and scaling the region range of the image mapping region corresponding to the first object region based on a preconfigured scaling coefficient to obtain the region to be searched.
In another aspect, there is provided an object detection apparatus, the apparatus comprising:
the image processing device comprises a region determining module, wherein the region determining module is used for determining a region to be searched of a second image in an image sequence based on a first object region of a first image in the image sequence, the second image is located after the first image in the image sequence, and the first object region is a region where the target object is located in the first image;
the feature determining module is used for determining a first image feature of the first object area, a second image feature and a position feature of the area to be searched respectively;
the offset position determining module is used for determining the offset position of the target object corresponding to the first image to the second image based on the first image feature, the second image feature and the position feature;
and the positioning module is used for obtaining a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image.
In one possible implementation, the offset position determining module includes:
a similarity determining unit configured to determine a similarity between the first object region and at least one candidate region in the region to be searched based on the first image feature and the second image feature;
an offset position determining unit configured to determine, based on the first image feature and the position feature, a region offset position corresponding to each candidate region, the region offset position representing an offset between the position, in the first image, of the image mapping region corresponding to the candidate region and the position of the candidate region in the second image;
the offset position determining unit is further configured to obtain an area offset position corresponding to a second object area based on the similarity and the area offset position corresponding to each candidate area, and take the area offset position corresponding to the second object area as the offset position corresponding to the target object; the second object area is the area, among the candidate areas, that includes the target object.
In one possible implementation manner, the similarity determining unit is configured to:
performing convolution operation on the second image features by taking the first image features as convolution kernels to obtain the similarity corresponding to each candidate region;
Wherein, one candidate area is an area corresponding to one sliding of the convolution kernel in the second image feature in the convolution operation process, and the size of each candidate area is the same as that of the first object area;
the offset position determining unit is used for:
and carrying out convolution operation on the position features by taking the first image features as convolution kernels to obtain the region offset positions corresponding to the candidate regions.
In one possible implementation, the offset position determining unit is further configured to:
determining a second object region meeting the similarity condition in each candidate region based on the similarity corresponding to each candidate region;
and screening the region offset positions corresponding to the second object region from the region offset positions corresponding to the candidate regions.
In one possible implementation, the positioning module is configured to:
obtaining an initial positioning result corresponding to the target object based on the offset position corresponding to the target object and first positioning information of a first object area corresponding to the first image;
and filtering the initial positioning result based on the initial positioning result and a positioning result of the target object corresponding to at least one frame of third image, so as to obtain a positioning result of the target object corresponding to the second image, wherein the at least one frame of third image is an image positioned before the second image in the image sequence.
In one possible implementation, the positioning module is configured to:
performing offset processing on first positioning information of the first object region corresponding to the first image based on the offset position corresponding to the target object to obtain second positioning information of the second object region corresponding to the second image;
and obtaining the initial positioning result based on the second positioning information.
In one possible implementation, the positioning module is configured to:
taking the second positioning information as the initial positioning result;
and adjusting the second object area into a third object area based on the target adjustment parameter and the second positioning information, and taking third positioning information of the third object area corresponding to the second image as the initial positioning result.
In one possible implementation manner, the offset position corresponding to the target object includes a first offset position and a second offset position;
the first offset position represents the offset of the target object at the image positions corresponding to the first image and the second image respectively, and the second offset position represents the offset of the distance between the target object and the image acquisition equipment;
The positioning module is used for:
performing offset processing on a first image position in the first positioning information based on the first offset position to obtain a second image position, and performing offset processing on a first distance in the first positioning information based on the second offset position to obtain a second distance, the second positioning information comprising the second image position and the second distance;
the positioning module is further used for:
and based on the target adjustment parameters, respectively adjusting the first positioning point position and the first area size in the second image position into a second positioning point position and a second area size to obtain the third positioning information, wherein the second positioning point position and the second area size are the positioning point position and the area size corresponding to the third object area.
In one possible implementation, the positioning module is further configured to:
determining weights corresponding to the third images of the frames and the second images respectively based on the image intervals between the third images of the frames and the second images;
and weighting the initial positioning result and the positioning result corresponding to the third image of each frame based on the weights corresponding to the third image of each frame and the second image respectively to obtain the positioning result of the target object corresponding to the second image.
In one possible implementation manner, the feature determining module is configured to:
respectively extracting a first image feature of the first object region and a second image feature of the region to be searched through a trained first extraction network;
and extracting the position characteristics of the region to be searched through a trained second extraction network, wherein the position characteristics represent the image positions of the pixel points in the region to be searched and the distances between those pixel points and the image acquisition equipment.
In one possible implementation manner, the first object area where the target object in the first image is located is obtained by any one of the following:
performing object detection on the first image to obtain a first object region where a target object in the first image is located;
and obtaining a first object region in which the target object is located in the first image based on a region in which the target object is located in a fourth image and an offset position of the target object corresponding to the fourth image to the first image, wherein the fourth image is an image before the first image in the image sequence.
In one possible implementation manner, the apparatus further includes a first acquisition module, where the first acquisition module is specifically configured to, before determining, based on a first object region of a first image in an image sequence, a region to be searched for in a second image in the image sequence:
Responding to a starting instruction of a vehicle, and periodically acquiring images of a front road in the running process of the vehicle through image acquisition equipment of the vehicle to obtain an image sequence;
detecting traffic elements of each frame of image in the image sequence to obtain a first object area in the first image, wherein the target object is any one of at least one traffic element, and the first image is a first frame of image containing the any one traffic element;
the device further comprises a first display module, wherein the first display module is used for, after the positioning result of the target object corresponding to the second image is obtained based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image, performing at least one of the following:
displaying, by a display screen of the vehicle, a second image including an image position marker of the any one of the traffic elements, the positioning result including an image position of the any one of the traffic elements;
and broadcasting the distance between any traffic element and the vehicle through the sound equipment of the vehicle, wherein the positioning result comprises the distance between any traffic element and the vehicle.
In one possible implementation manner, the apparatus further includes a second acquisition module, where the second acquisition module is specifically configured to, before the determining, based on the first object region of the first image in the image sequence, a region to be searched for in the second image in the image sequence:
responding to a detection instruction of a target road section, and periodically carrying out image acquisition on the target road section through image acquisition equipment associated with the target road section to obtain an image sequence;
detecting the driving behavior of a vehicle on each frame of image in the image sequence to obtain a first object area in the first image, wherein the object is a target vehicle with a preset driving behavior, and the first image is a first frame image containing the target vehicle;
the device further comprises a second display module, wherein the second display module is specifically used for at least one of the following after obtaining a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image:
displaying a second image comprising a location marker of the target vehicle, the location marker comprising at least one of an image location of the target vehicle in the second image or a distance between the target vehicle and an image acquisition device; the positioning result comprises a position mark of the target vehicle;
And displaying target prompt information, wherein the target prompt information comprises vehicle identification information of the target vehicle and preconfigured driving behaviors of the target vehicle in the target road section.
In one possible implementation manner, the area determining module is configured to:
determining, based on a first image position corresponding to the first object area, an image mapping area of the first object area in the second image;
and scaling the region range of the image mapping region corresponding to the first object region based on a preconfigured scaling coefficient to obtain the region to be searched.
In another aspect, an electronic device is provided that includes a memory, a processor, and a computer program stored on the memory, the processor executing the computer program to implement the above-described target detection method.
In another aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the above-described object detection method.
In another aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described object detection method.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
according to the target detection method provided by the embodiment of the application, the to-be-searched area of the second image is initially defined based on the first object area of the first image, and the offset position of the target object corresponding to the first image to the second image is obtained by utilizing the first image characteristic of the first object area, the second image characteristic of the to-be-searched area and the position characteristic; the offset position and the first positioning information of the first object region can then be utilized to position the target object in the second image; thereby realizing accurate positioning of the target object in the area to be searched. The application realizes the continuous positioning detection of the target object in the multi-frame images by utilizing the continuity between the front image and the rear image in the image sequence without depending on the image quality, thereby greatly improving the accuracy of target detection. In addition, the limitation on the conditions such as the balance of the sample image and the category of the element object is eliminated, and the practicability and the robustness of target detection are further improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of an implementation environment of a target detection method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a target detection method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a front road image according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a candidate frame according to an embodiment of the present application;
fig. 5 is a schematic diagram of a detection network structure according to an embodiment of the present application;
fig. 6 is a schematic view of a scene of detecting traffic elements on a road image according to an embodiment of the present application;
FIG. 7 is a schematic diagram of road images in front of a first frame and a second frame according to an embodiment of the present application;
FIG. 8 is a schematic diagram of road images in front of a second frame and a subsequent frame according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a target detection flow provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of a flow characteristic prediction step in a target detection flow according to an embodiment of the present application;
FIG. 11 is a schematic view of a scene flow of a target detection method according to an embodiment of the present application;
fig. 12 is a schematic view of a scene flow of a target detection method according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an object detection device according to an embodiment of the present application;
Fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. The terms "comprises" and "comprising" as used in the embodiments of the present application mean that the corresponding features may be implemented as the presented features, information, data, steps and operations, but do not exclude implementation as other features, information, data, steps, operations, etc. supported by the state of the art.
In specific embodiments of the present application, any data related to a subject, such as a road image in front of the vehicle in which the subject is located during driving, an image of a target road section, or data involved in the detection of the target object in an image sequence, requires the permission or consent of the subject when the embodiments of the present application are applied to a specific product or technology, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions. That is, in the embodiments of the present application, wherever any of the above data related to a subject is involved, the data is obtained with the subject's authorization and in compliance with the relevant laws, regulations and standards of the relevant country and region.
Fig. 1 is a schematic diagram of an implementation environment of a target detection method according to the present application. As shown in fig. 1, the implementation environment includes: a server 101 and a terminal 102.
In one possible scenario, the terminal 102 is installed with an application, and the server 101 may be a background server of the application. The terminal 102 and the server 101 may interact with data based on the application. For example, the terminal 102 may send a detection request to the server 101 and transmit in real time a real-time acquired image sequence, such as an image sequence of a road ahead acquired by the vehicle-mounted terminal through a camera; the server 101 determines the positioning result of the target object in each frame image of the image sequence based on the detection request by adopting the target detection method of the present application, and feeds back the positioning result of the target object corresponding to each frame image to the terminal 102.
In another possible scenario, the target detection method of the present application may also be executed by the terminal 102, that is, the terminal 102 performs the target detection method of the present application on its own based on the acquired image sequence, so as to obtain a positioning result of the target object corresponding to each frame of image.
It should be noted that, the server 101 may be an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or a server cluster that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, and basic cloud computing services such as big data and an artificial intelligence platform.
The terminal 102 may be a vehicle-mounted terminal (e.g., a vehicle navigation terminal, a vehicle-mounted computer, etc.), a traffic monitoring device, an intelligent transportation system, a smart phone, a tablet computer, a notebook computer, a digital broadcast receiver, a desktop computer, a smart speaker, a smart watch, etc.
The terminal 102 and the server 101 may be directly or indirectly connected through wired or wireless communication, or may be determined based on actual application requirements, which is not limited herein.
Fig. 2 is a schematic flow chart of a target detection method according to an embodiment of the present application. The method may be performed by an electronic device, which may be the server 101 or the terminal 102 in the implementation environment shown in fig. 1. As shown in fig. 2, the method includes the following steps.
Step 201, the electronic device determines a region to be searched of a second image in an image sequence based on a first object region of a first image in the image sequence.
The second image follows the first image, and the first object region is a region where the target object is located. The first object area and the area to be searched are corresponding partial image areas in the first image and the second image respectively. The image sequence comprises at least two frames of images acquired periodically by an image acquisition device for the target scene area. The second image may be a next frame image of the first image, or the second image may be a 2 nd frame, a 3 rd frame image, or the like after the first image. For example, the image capturing apparatus captures 10 frames, 100 frames, etc. every second, that is, one frame every 0.1 seconds or 0.01 seconds. The target object may be an element object to be identified in the target scene area, for example.
In a scene example, the image sequence may be a road image of a road ahead during driving of the vehicle, and the target object may be a traffic element in the road image; for example, the traffic element includes, but is not limited to: traffic signs, pavement marking lines, vehicles or obstacles around the vehicle, pedestrians, buildings, etc.
In yet another example scenario, the image sequence may be a road image that monitors a target road segment, and the target object may be a vehicle in the road image; for example, the target object may be a vehicle with a driving violation, a malfunctioning vehicle, etc.
In the application, the images in the image sequence are acquired periodically, so that the acquisition time of each frame of image is continuous, and the image picture content of each frame of image is continuous. The electronic device may initially delineate the region to be searched for a mapped location of the first object region in the second image.
In one possible implementation, the implementation of step 201 may include: the electronic equipment determines an image mapping area corresponding to the first object area in the second image based on a first image position corresponding to the first object area; and the electronic equipment performs scaling processing on the area range of the image mapping area corresponding to the first object area based on a preset scaling coefficient to obtain the area to be searched. The image position of the image mapping region in the second image is the same as the image position of the first object region in the first image. For example, an area of the same position coordinates in the second image may be taken as an image mapping area corresponding to the first object area based on the image position coordinates of the first object area in the first image.
In a possible example, as shown in fig. 3, the first image and the second image are front road images during driving of the vehicle, and the target object may be a traffic speed limit board in the road, such as speed limit 40km/h. The electronic equipment can expand an image mapping area corresponding to the position in the second frame image according to the position of the area occupied by the traffic speed limit plate in the first frame image, and the expanded area is used as an area to be searched. For example, if the speed limit plate size in the first frame image is w×h and the scaling factor is 3, the area with the size of 3w×3h corresponding to the position in the second frame image is taken as the area of the speed limit plate to be searched.
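As an illustration of this step, the following Python sketch expands a first object region given as an (x, y, w, h) box about its centre by an assumed scaling coefficient of 3 and clips the result to the image; the function name, box convention and image size are illustrative assumptions rather than values taken from the embodiment.

```python
def expand_search_region(box, scale=3.0, img_w=1920, img_h=1080):
    """Map a first-image object box (x, y, w, h) to a search region in the
    second image by keeping the same centre and scaling width/height.
    Assumed convention: (x, y) is the top-left corner in pixels."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0          # box centre stays fixed
    sw, sh = w * scale, h * scale              # e.g. w x h -> 3w x 3h
    sx, sy = cx - sw / 2.0, cy - sh / 2.0
    # Clip to the image so the search region stays a valid crop.
    sx, sy = max(0.0, sx), max(0.0, sy)
    sw = min(sw, img_w - sx)
    sh = min(sh, img_h - sy)
    return (sx, sy, sw, sh)

# Example: a 40 km/h speed-limit sign occupying a 32x32 box.
print(expand_search_region((600, 300, 32, 32)))
```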
In yet another possible example, if the first image and the second image are images obtained by monitoring the target road section, the target object may be a target vehicle with a driving violation in the road; if the target vehicle is driving away from the image acquisition device, the electronic device can shrink the image mapping area corresponding to that position in the second frame image according to the position of the area occupied by the target vehicle in the first frame image, and use the reduced area as the area to be searched.
In the present application, the first object area where the target object in the first image is located may be obtained by any one of the following modes:
The first mode is that the electronic equipment performs object detection on the first image to obtain a first object area where a target object in the first image is located;
the first image may be a first frame image of the image sequence in which the target object is detected. In the application, the first object region of the first frame image of the target object can be utilized to detect, and the positioning result of the target object in each frame image after the first frame image is obtained through the target detection method of the application, so as to realize continuous positioning detection of the target object in the image sequence.
The electronic device can perform object detection on the first image through a detection network to obtain the first object region.
In one possible example, the process of performing object detection on the first image through the detection network to obtain the first object region may include the following steps A1-A3:
a1, the electronic equipment performs feature extraction on a first image through a detection network to obtain image features corresponding to the first image;
by way of example, a feature extraction network may be included in the detection network, which may include a Convolution layer (Convolition), a normalization layer (Batch Normalization, BN) and an activation layer (Rectified Linear Unit, relu). The convolution layer is used for extracting basic features such as edge textures. The normalization layer is used for carrying out normalization processing on the features extracted by the convolution layer according to normal distribution, filtering noise features in the features, and enabling training convergence of the model to be faster. The activation layer is used for carrying out nonlinear mapping on the features extracted by the convolution layer, so that the generalization capability of the network is enhanced.
The electronic equipment can carry out convolution processing on the first image through the convolution layer to obtain convolution characteristics corresponding to the first image; and carrying out normalization processing on the convolution characteristics through a normalization layer, and carrying out nonlinear mapping processing on the characteristics obtained through the normalization processing through an activation layer to obtain the image characteristics corresponding to the first image.
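The Conv-BN-ReLU stage described above can be sketched in PyTorch as follows; the channel counts and number of stages are illustrative assumptions rather than values taken from the embodiment.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """One feature-extraction stage: convolution for edge/texture features,
    batch normalization to suppress noise and speed up convergence,
    ReLU for the nonlinear mapping."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Example: extract a feature map from a 3-channel road image.
net = nn.Sequential(ConvBNReLU(3, 32), ConvBNReLU(32, 64))
feat = net(torch.randn(1, 3, 224, 224))   # -> (1, 64, 224, 224)
```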
A2, aiming at each feature point of the image feature, the electronic equipment generates a plurality of candidate frames corresponding to each feature point;
the image features may be in the form of feature maps. The feature map includes a plurality of feature points, each feature point characterizing a feature of a corresponding pixel region in the first image. In the step, for each feature point in the feature map, the electronic device constructs at least one candidate frame corresponding to the feature point; each candidate frame contains the corresponding feature points.
In an example, the electronic device may randomly generate a plurality of candidate boxes including each feature point with each feature point as a center in a manner of random size and random aspect ratio. In yet another example, the electronic device generates a plurality of candidate boxes corresponding to each feature point based on a pre-configured generation rule. For example, the preconfigured generation rule may include: and generating a certain number of candidate frames according to the pre-configured size or aspect ratio by taking the feature points as centers.
FIG. 4 is a schematic diagram of a plurality of candidate boxes generated according to a pre-configured generation rule provided by the present application. As shown in fig. 4, the preconfigured generation rule may be: the aspect ratios are 1:1, 2:1 and 1:2, and the scales are 1, 2 and 3; the electronic device takes each feature point as a center point and, for each scale, generates candidate frames with aspect ratios of 1:1, 2:1 and 1:2, obtaining 9 candidate frames for the feature point from the 3 aspect ratios at each of the 3 scales.
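The generation rule of FIG. 4 (3 scales × 3 aspect ratios, 9 candidate frames per feature point) can be sketched as follows; the base anchor size and the (x1, y1, x2, y2) box convention are assumptions.

```python
def candidate_boxes(cx, cy, base=16, scales=(1, 2, 3),
                    ratios=(1.0, 2.0, 0.5)):
    """Return 9 (x1, y1, x2, y2) boxes centred on feature point (cx, cy):
    3 scales x 3 aspect ratios (1:1, 2:1, 1:2)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = base * s * (r ** 0.5)     # width grows with sqrt(ratio)
            h = base * s / (r ** 0.5)     # height shrinks accordingly
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(candidate_boxes(100, 100)))  # 9 candidate frames per feature point
```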
A3, the electronic equipment performs object detection on the image features of the image areas corresponding to the candidate frames to obtain the probability that the image areas corresponding to the candidate frames contain the objects to be identified; and obtaining a first object region where the target object is based on the probability that the image region corresponding to each candidate frame contains the object to be identified.
In this step, the electronic device may determine, based on image features of image areas corresponding to each candidate frame, a probability that the image area corresponding to each candidate frame includes an object to be identified; and obtaining a first object area where the target object is based on the probability corresponding to each candidate frame. For example, the maximum probability of the probabilities corresponding to the candidate frames may be determined, and the image region corresponding to the candidate frame corresponding to the maximum probability may be used as the region containing the object to be identified, based on which the first object region where the target object in the first image is located may be obtained.
Fig. 5 is a schematic diagram of a detection network structure according to an embodiment of the present application. As shown in fig. 5, the detection network mainly includes a feature extraction network (Conv filters) and a candidate box selection network (Region Proposal Network, RPN). The electronic device may input a first image, such as a road image, into the detection network; image features, such as a feature map, are first extracted through the feature extraction network; a plurality of candidate frames corresponding to each feature point in the feature map are obtained using the RPN; then, the feature map of the first image and each candidate box are input into a region-of-interest pooling layer for pooling. The objects to be identified may belong to a plurality of object types, so a classification network may be used for classification detection; that is, the pooled features obtained by the pooling process are input into the classification network for classification, and the detected region occupied by the target object in the first image is obtained based on the classification result.
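Because the detection network of fig. 5 follows the familiar backbone + RPN + RoI pooling + classification layout, a stand-in using torchvision's Faster R-CNN implementation can illustrate how a first object region might be obtained from the first frame; the model, score threshold and image size here are assumptions, not the embodiment's trained network.

```python
import torch
import torchvision

# Stand-in for the detection network of fig. 5 (feature extractor + RPN +
# RoI pooling + classifier); requires torchvision >= 0.13 for weights="DEFAULT".
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 720, 1280)            # placeholder road image in [0, 1]
with torch.no_grad():
    pred = model([image])[0]                # dict with 'boxes', 'labels', 'scores'

keep = pred["scores"] > 0.5                 # assumed confidence threshold
first_object_regions = pred["boxes"][keep]  # candidate object regions (x1, y1, x2, y2)
```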
Fig. 6 is a schematic view of a scene of traffic element detection for a road image according to an embodiment of the present application. As shown in fig. 6, the left side is a road image of the front road acquired during the running of the vehicle. The road image can be subjected to traffic element detection through the detection network, the right road image marked with the detected traffic sign is output, and the rectangular frame corresponding to the traffic sign can be calibrated in the road image as shown in the right road image.
In a second mode, the electronic device obtains a first object area where the target object in the first image is located based on the area where the target object in the fourth image is located and an offset position corresponding to the target object from the fourth image to the first image.
The fourth image is the image of the sequence of images preceding the first image.
For example, the first image may also be an image of the image sequence that comes after the first frame image in which the target object is detected. The region of the target object in that first frame image can be obtained by detection through the detection network; for an image after the first frame, such as the first image, the first object region can be obtained based on the region where the target object is located in the previous image, in the same manner in which the positioning result of the target object in the second image is determined based on the first object region of the first image in the present application. The process of determining the positioning result of the target object in the second image based on the first object region of the first image is specifically described in steps 201-204.
Step 202, the electronic device determines a first image feature of the first object area, and a second image feature and a position feature of the area to be searched, respectively.
The electronic equipment can intercept a first object area and an area to be searched from a first image and a second image respectively, input the intercepted first object area and the intercepted area to be searched into a trained target extraction network, extract the first image characteristic of the first object area, and extract the second image characteristic and the position characteristic of the area to be searched. The first image feature and the second image feature represent features of multiple dimensions such as color, brightness, optical flow, semantics and the like of the corresponding image region.
In one possible example, the location feature may characterize an image location of the region to be searched in the second image; for example, the position coordinates of the region to be searched in the image coordinate system of the second image. In yet another possible example, the location feature may characterize an image location of the region to be searched at the second image, and a distance between the region to be searched and the image acquisition device.
In one possible implementation, the same convolutional neural network may be used to extract image features of the first object region and the region to be searched, and an additional network layer may be used to extract location features of the region to be searched. The target extraction network may include a first extraction network for performing image feature extraction, and a second extraction network for performing location feature extraction. Accordingly, this step 202 may include: the electronic equipment respectively extracts a first image feature of the first object region and a second image feature of the region to be searched through a trained first extraction network; the electronic equipment extracts the position characteristics of the region to be searched through a trained second extraction network, wherein the position characteristics represent the image positions of the pixel points in the region to be searched and the distances between those pixel points and the image acquisition equipment.
For example, the size of the first object region may be expressed as w×h, and the extracted first image feature may be expressed as w×h×d; d represents feature depth, w and h represent feature map dimensions.
For example, the size of the region to be searched may be expressed as 3w×3h, and the extracted second image feature and position feature together may be expressed as 3w×3h×3d, where 3d represents the feature depth; the 1st d, i.e. 3w×3h×d, is the second image feature, and the 2nd and 3rd d, i.e. 3w×3h×2d, are the position features. For example, the 2nd and 3rd d of the 3w×3h×2d may represent the position features along the x-axis and y-axis of the image coordinate system of the second image.
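One way the 2d position feature of the 3w×3h region to be searched could be realised is as two coordinate channels holding the x- and y-coordinate of every pixel in the second image's coordinate system; the encoding below (unnormalised pixel coordinates, assumed region origin and size) is only a sketch, since the embodiment states only that the second extraction network produces these features.

```python
import torch

def coordinate_channels(x0, y0, region_w, region_h):
    """Build a (2, region_h, region_w) tensor whose two channels hold the
    x- and y-coordinates of every pixel of the search region in the
    second image's coordinate system (top-left of region at (x0, y0))."""
    ys = torch.arange(y0, y0 + region_h, dtype=torch.float32)
    xs = torch.arange(x0, x0 + region_w, dtype=torch.float32)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack((xx, yy), dim=0)    # channel 0: x, channel 1: y

pos = coordinate_channels(584, 284, 96, 96)   # a 3w x 3h = 96 x 96 search region
print(pos.shape)                              # torch.Size([2, 96, 96])
```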
Step 203, the electronic device determines an offset position of the target object corresponding to the first image to the second image based on the first image feature, the second image feature and the position feature.
The electronic device may define a second object region corresponding to the first object region from the region to be searched based on the first image feature and the second image feature, determine offset positions between image positions corresponding to the second object region and the first object region, respectively, based on the first image feature and the position feature, and use the determined offset positions as offset positions corresponding to the target object from the first image to the second image.
The offset position of the second object region corresponding to the first object region refers to an offset of the image position of the second object region in the second image relative to the image position of the first object region in the first image.
In one possible implementation, step 203 is implemented by the following steps 2031-2033:
step 2031, the electronic device determines a similarity between the first object area and at least one candidate area in the area to be searched based on the first image feature and the second image feature.
Step 2032, the electronic device determines, based on the first image feature and the position feature, a region offset position corresponding to each candidate region, where the region offset position characterizes the offset between the position, in the first image, of the image mapping region corresponding to the candidate region and the position of the candidate region in the second image.
Step 2033, the electronic device obtains a region offset position corresponding to the second object region based on the similarity and the region offset position corresponding to each candidate region, and uses the region offset position corresponding to the second object region as the offset position corresponding to the target object; the second object region is the region, among the candidate regions, that includes the target object.
In step 2031, the electronic device performs a convolution operation on the second image feature with the first image feature as a convolution kernel, to obtain a similarity corresponding to each candidate region. Wherein, one candidate region is a region corresponding to one sliding of the convolution kernel on the second image feature in the convolution operation process, and the size of each candidate region is the same as that of the first object region.
Illustratively, in step 2032, the electronic device uses the first image feature as a convolution kernel to perform a convolution operation on the position feature to obtain a region offset position corresponding to each candidate region.
The offset position of the region corresponding to one candidate region can represent the offset of the image position corresponding to the candidate region; alternatively, the offset of the image position corresponding to the candidate region, and the offset of the distance of the candidate region from the image acquisition device are characterized.
In one possible manner, the offset of the image position corresponding to the candidate region is exemplified by the region offset position.
For example, after determining the feature matrix θ corresponding to the second image feature of the region to be searched and the feature matrix Φ corresponding to the first image feature of the first object region, the convolution operation may be performed on θ with Φ as a convolution kernel. That is, the region to be searched is convolved with the first image feature of the first object region as a convolution kernel.
For example, the feature matrix Φ may be expressed as w×h×d, the feature matrix θ may be expressed as 3w×3h×3d, and the feature matrix θ of size 3w×3h×3d is convolved using the feature matrix Φ of size w×h×d as the convolution kernel. In the convolution calculation, the w×h×d feature matrix Φ is convolved both with the 3w×3h×d slice corresponding to the 1st d in the 3w×3h×3d feature matrix θ and with the 3w×3h×2d slice corresponding to the 2nd and 3rd d. Assuming the stride of the convolution is w, the convolution result is a feature map of spatial size 3×3.
In this 3×3 feature map, the feature depth is 3, i.e., it comprises three 3×3 feature matrices.
The 1st 3×3 matrix, i.e., 3×3×1, is the result of convolving the w×h×d kernel with the 3w×3h×d slice corresponding to the 1st d in the 3w×3h×3d feature, and it represents the similarity between the region to be searched and the first object region, that is, the 9 similarities between the first object region and the 9 candidate regions in the region to be searched. The larger the similarity between the first object region and a candidate region, the more similar the corresponding image features.
The 2nd and 3rd 3×3 matrices, i.e., 3×3×2, are the result of convolving the w×h×d kernel with the 3w×3h×2d slice corresponding to the 2nd and 3rd d in the 3w×3h×3d feature; they carry information of two dimensions and represent the offset positions of the region to be searched relative to the first object region in the x-axis and y-axis directions, i.e., Δx and Δy. Since the image contents of the first image and the second image are continuous, Δx and Δy can represent the flow offset information of the corresponding image content in the region to be searched. The 3×3×2 result may comprise 9 pairs of offset values corresponding to the 9 candidate regions respectively, where each pair of offset values contains one Δx and one Δy.
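The convolution described above can be sketched as a grouped convolution in which the w×h×d feature of the first object region is slid over the 3w×3h×3d feature of the region to be searched with stride (h, w), producing a 3×3 response of depth 3 whose first channel holds the similarities and whose remaining two channels hold Δx and Δy; the toy sizes and the grouped-convolution wiring are assumptions consistent with, but not mandated by, the description.

```python
import torch
import torch.nn.functional as F

# Toy sizes: template (first object region) feature w x h x d,
# search-region feature 3w x 3h x 3d (image feature + x/y position features).
w, h, d = 8, 8, 16
template = torch.randn(1, d, h, w)               # phi: feature of the first object region
search   = torch.randn(1, 3 * d, 3 * h, 3 * w)   # theta: [image feature | x-pos | y-pos]

# Slide the template over each d-channel group with stride (h, w):
# group 0 -> similarity, group 1 -> delta-x, group 2 -> delta-y.
kernel = template.repeat(3, 1, 1, 1)             # (3, d, h, w): same template per group
response = F.conv2d(search, kernel, stride=(h, w), groups=3)
print(response.shape)                            # torch.Size([1, 3, 3, 3])

similarity = response[0, 0]                      # 3 x 3: one score per candidate region
dx, dy     = response[0, 1], response[0, 2]      # 3 x 3 each: per-candidate offsets
```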
It should be noted that, in the target extraction network, the first extraction network is used to extract image features of an image, including but not limited to features of color, brightness, optical flow, and the like. The second extraction network is used for extracting the position features of the image, including the features of the image position, the distance from the image acquisition equipment and the like. In the application, the first extraction network and the second extraction network can be trained in advance through a large number of samples.
By way of example, the training process may include the following steps C1-C7:
step C1, acquiring a sample set;
wherein the sample set comprises a plurality of sample image pairs, each sample image pair comprising a first sample image and a second sample image. For example, each sample image in each sample image pair may be an image containing a sample object, for example, a sample road image containing a traffic sign, building, pedestrian, or the like to be detected. The first sample image and the second sample image of each sample image pair may be images comprising the same scene area; for example, including speed limit signs in the same road segment. The sample label of each sample image comprises a truth value area where the sample object in the sample image is located and the position marking information of the truth value area in the sample image. The truth value area is an image area where a sample object in the sample image is actually located; the position marking information of the truth area is used for marking the position of the truth area corresponding to the sample image.
Step C2, determining a first sample area where a sample object is located in the first sample image for each sample image pair, and determining a sample area to be detected of the second sample image based on the first sample area;
step C3, extracting first sample image features of a first sample area and second sample image features of a sample area to be detected from each sample image pair through a first initial network; extracting sample position characteristics of a sample area to be detected through a second initial network;
step C4, for each sample image pair, taking the first sample image feature as a convolution kernel, and respectively carrying out convolution operation on the second sample image feature and the sample position feature to respectively obtain the similarity of the first sample region and each candidate sample region in the sample region to be detected and obtain the offset position between the first sample region and each candidate sample region;
step C5, predicting and obtaining a predicted sample area in each candidate sample area based on the similarity between the first sample area and each candidate sample area for each sample image pair, and obtaining an offset position between the first sample area and the predicted sample area;
Step C6, predicting the predicted position of the area where the sample object is located in the second sample image based on the offset position between the first sample area and the predicted sample area and the position marking information of the first sample area in the first sample image for each sample image pair;
step C7, for each sample image pair, performing iterative training on the first initial network and the second initial network based on the difference between the true value region in the sample label of the second sample image and the predicted sample region and the difference between the position labeling information of the true value region in the sample label of the second sample image and the predicted position of the predicted sample region; and obtaining the first extraction network and the second extraction network until the training stopping condition is met.
For example, the training stop conditions may include, but are not limited to: the iteration number reaches a specified number threshold, the training time reaches a specified time threshold, etc.
In the present application, the training methods of the first extraction network and the second extraction network are illustrated by taking the steps C1 to C7 as an example, but the training methods of the first extraction network and the second extraction network are not particularly limited.
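Step C7 specifies which differences are to be minimised but not the concrete loss functions; one possible, hedged choice is sketched below, supervising the similarities with a cross-entropy target at the true candidate and the offsets with a smooth-L1 regression. Both loss choices and the tensor shapes are assumptions rather than requirements of the embodiment.

```python
import torch
import torch.nn.functional as F

def tracking_loss(similarity, offsets, gt_index, gt_offset):
    """similarity: (9,) scores for the 9 candidate regions (flattened 3x3).
    offsets:    (9, 2) predicted (dx, dy) per candidate region.
    gt_index:   index of the candidate region that actually contains the object.
    gt_offset:  (2,) true offset derived from the labelled truth regions."""
    cls_loss = F.cross_entropy(similarity.unsqueeze(0),
                               torch.tensor([gt_index]))
    reg_loss = F.smooth_l1_loss(offsets[gt_index], gt_offset)
    return cls_loss + reg_loss

loss = tracking_loss(torch.randn(9, requires_grad=True),
                     torch.randn(9, 2, requires_grad=True),
                     4, torch.tensor([1.5, -0.5]))
loss.backward()  # gradients would flow back into the two extraction networks
```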
Illustratively, in step 2033, the electronic device determines, based on the similarity corresponding to each candidate region, a second object region in each candidate region that meets a similarity condition; and screening the region offset positions corresponding to the second object region from the region offset positions corresponding to the candidate regions. Wherein the similarity condition may be that the similarity is the largest; that is, the region offset position of the candidate region of the maximum similarity is set as the offset position corresponding to the target object from the first image to the second image.
In one possible manner, the offset position corresponding to the target object may include a first offset position. The first offset position may represent an offset of the target object in the image position corresponding to the first image and the second image, respectively, i.e. an offset between the image position of the target object in the region of the first image and the image position of the target object in the region of the second image.
For example, after the electronic device obtains the convolution calculation result, the candidate region corresponding to the maximum similarity may be determined in the region to be searched based on the 9 similarities in the 1st 3×3 channel, that is, 3×3×1, as the region most similar to the first object region, namely the second object region containing the target object in the region to be searched. Then, based on the 9 pairs of Δx and Δy corresponding to the 9 candidate regions in the 2nd and 3rd 3×3 channels of the convolution calculation result, that is, 3×3×2, the Δx and Δy corresponding to the second object region are obtained and can be used as the offset position of the target object from the first image to the second image.
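As a concrete reading of this selection step, the following is a minimal NumPy sketch. It assumes the convolution calculation result is stored as a 3×3×3 array with the channels first (channel 0 the similarities, channels 1 and 2 the per-candidate Δx and Δy); the function name pick_offset and this layout are illustrative only.

```python
import numpy as np

def pick_offset(conv_result):
    # conv_result: assumed (3, 3, 3) array, channels first --
    # conv_result[0] holds the 9 similarities,
    # conv_result[1] and conv_result[2] hold the per-candidate dx and dy.
    sim = conv_result[0]
    row, col = np.unravel_index(np.argmax(sim), sim.shape)  # most similar candidate
    dx, dy = conv_result[1, row, col], conv_result[2, row, col]
    return (row, col), (dx, dy)  # second object region index and its offset
```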
The above description takes the case where the position feature characterizes only the image position as an example. If the position feature also characterizes the distance between the area to be searched and the image acquisition device, for example the distance between a traffic speed limit sign and the vehicle, the offset corresponding to that distance may be further obtained.
In yet another possible manner, the region offset position corresponding to a candidate region may characterize both an offset of the image position corresponding to the candidate region and an offset of its distance from the image acquisition device. Accordingly, the offset position corresponding to the target object may include a first offset position and a second offset position; the first offset position may represent the offset between the image positions of the target object in the first image and in the second image, and the second offset position may represent the offset of the distance between the target object and the image acquisition device.
Illustratively, the feature matrix Φ corresponding to the first object region is denoted as w×h×d, and the feature matrix θ corresponding to the region to be searched may be expressed as 3w×3h×4d, where the 4th group of d channels characterizes the depth direction of the area to be searched, namely the distance between the area to be searched and the image acquisition equipment. For example, the depth direction may be expressed as the z-axis direction.
The feature matrix Φ of size w×h×d is taken as a convolution kernel and a convolution calculation is performed on the feature matrix θ of size 3w×3h×4d; the obtained convolution calculation result is a feature map with dimensions of 3×3×4.
Wherein the 1 st 3×3, that is, 3×3×1, still represents 9 similarities of the first object region corresponding to 9 candidate regions in the region to be searched.
Wherein the 2nd, 3rd and 4th 3×3 channels, that is, 3×3×3, indicate the offset positions of the candidate regions in the region to be searched with respect to the first object region in the x-axis, y-axis and z-axis directions, that is, the 9 triplets Δx, Δy and Δz corresponding to the 9 candidate regions respectively, where Δz represents the offset value of the distance between the candidate region and the image capturing apparatus. Based on this, the offset positions Δx, Δy and Δz corresponding to the second object region can be obtained.
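When the depth channel is present, the same selection can be read from the 3×3×4 result; the sketch below extends pick_offset above under the same illustrative channel-first layout.

```python
import numpy as np

def pick_offset_3d(conv_result):
    # conv_result: assumed (4, 3, 3) array -- channel 0 similarities, channels 1-3
    # the per-candidate dx, dy and dz (dz being the change of the distance between
    # the candidate region and the image acquisition device).
    sim = conv_result[0]
    row, col = np.unravel_index(np.argmax(sim), sim.shape)
    return (row, col), tuple(conv_result[1:4, row, col])
```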
Step 204, the electronic device obtains a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image.
The positioning result of the target object corresponding to the second image may include: the image position of the region where the target object is located in the second image; alternatively, the distance between the region of the target object in the second image and the image acquisition device may be further included. In this step, the electronic device may obtain an initial positioning result of the target object corresponding to the second image based on the offset position and the first positioning information corresponding to the target object, and combine the positioning result corresponding to the multi-frame third image before the second image to filter the initial positioning result, so as to obtain a more accurate positioning result. By way of example, an implementation of this step 204 may include the following steps 2041-2042:
Step 2041, the electronic device obtains an initial positioning result corresponding to the target object based on the offset position corresponding to the target object and the first positioning information corresponding to the first object region in the first image;
step 2042, the electronic device filters the initial positioning result based on the initial positioning result and a positioning result of the target object corresponding to at least one frame of third image, where the at least one frame of third image is an image located before the second image in the image sequence, to obtain a positioning result of the target object corresponding to the second image.
In one possible implementation, the electronic device may perform an offset process on the positioning information of the first object area based on the offset position, and obtain the initial positioning result based on the offset processing result. For example, this step 2041 may include the following steps 2041a and 2041b:
step 2041a, the electronic device may perform offset processing on the first positioning information of the first object region corresponding to the first image based on the offset position corresponding to the target object, to obtain second positioning information of the second object region corresponding to the second image;
the first positioning information may include a first image position of the first object region, and the offset position of the target object may include a first offset position. Alternatively, the first positioning information includes a first image position, and a first distance between the first object region and the image acquisition device; the offset positions of the target object may include a first offset position and a second offset position. Accordingly, the implementation of step 2041a may include the following two approaches.
In one form, the offset position includes a first offset position, and the first positioning information includes a first image position; accordingly, step 2041a includes: based on the first offset position, the electronic device may perform offset processing on the first image position in the first positioning information to obtain second positioning information including a second image position.
For example, the first image position may include an anchor point position of the first object region and the size of the first object region. In this step, the electronic device may perform offset processing on the anchor point position in the first image position.
The first positioning information may be expressed as (x, y, w, h); wherein, (x, y) represents the location of the anchor point corresponding to the first object region; (w, h) represents the region size corresponding to the first object region. For example, x and y are x-axis coordinates and y-axis coordinates, respectively, of the vertex of the first object region at the upper left corner in the image coordinate system of the first image; w and h represent the width and height of the first object region. The x and y may be offset by Δx and Δy corresponding to the first offset position, and the obtained second positioning information may be expressed as (x+Δx, y+Δy, w, h).
In yet another form, the offset positions include a first offset position and a second offset position, and the first positioning information includes a first image position and a first distance; accordingly, step 2041a includes: the electronic device may offset the first image position in the first positioning information to a second image position based on the first offset position, and offset the first distance in the first positioning information to a second distance based on the second offset position, to obtain second positioning information including the second image position and the second distance.
For example, the first positioning information may be expressed as (x, y, z, w, h), with (x, y, z) representing the anchor point position corresponding to the first object region. For example, x and y are the x-axis and y-axis coordinates of the upper-left vertex of the first object region in the image coordinate system of the first image, z is the distance between the first object region and the image capturing device, and w and h represent the width and height of the first object region. The x, y and z may be offset by the Δx and Δy corresponding to the first offset position and the Δz corresponding to the second offset position, and the obtained second positioning information may be expressed as (x+Δx, y+Δy, z+Δz, w, h).
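For illustration, the offset processing of step 2041a can be sketched as below, assuming the positioning information is held as the tuples described above; shift_box is an illustrative name, not part of the application.

```python
def shift_box(positioning, offset):
    # positioning: (x, y, w, h) or (x, y, z, w, h) -- anchor point, optional
    #              distance to the image acquisition device, and region size.
    # offset:      (dx, dy) or (dx, dy, dz) from the stream prediction.
    if len(positioning) == 4:
        x, y, w, h = positioning
        dx, dy = offset
        return (x + dx, y + dy, w, h)          # second positioning information
    x, y, z, w, h = positioning
    dx, dy, dz = offset
    return (x + dx, y + dy, z + dz, w, h)      # includes the shifted distance
```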
Step 2041b, the electronic device obtains the initial positioning result based on the second positioning information.
In one possible manner, step 2041b includes: the electronic device can use the second positioning information as the initial positioning result. For example, the second object region corresponding to (x+Δx, y+Δy, w, h) or (x+Δx, y+Δy, z+Δz, w, h) may be used as the region where the target object is located in the second image.
In another possible manner, the electronic device may further adjust the second object area, and step 2041b includes: the electronic equipment adjusts the second object area into a third object area based on the target adjustment parameter and the second positioning information, and takes third positioning information corresponding to the third object area in the second image as the initial positioning result.
Illustratively, the electronic device adjusts the image position corresponding to the second object region based on the target adjustment parameter; the process may include: the electronic device adjusts the first positioning point position and the first region size in the second image position into a second positioning point position and a second region size respectively based on the target adjustment parameter to obtain the third positioning information, wherein the second positioning point position and the second region size are the positioning point position and the region size corresponding to the third object region.
For example, if the second positioning information is represented as (x2, y2, w2, h2), the target adjustment parameter may be a 1×4 parameter matrix; after adjustment by the 1×4 parameter matrix, the third positioning information may be expressed as (x3, y3, w3, h3).
For another example, if the second positioning information is represented as (x2, y2, z2, w2, h2), the target adjustment parameter may be a 1×5 parameter matrix; after adjustment by the 1×5 parameter matrix, the third positioning information may be expressed as (x3, y3, z3, w3, h3).
For example, the target adjustment parameter may be obtained through synchronous training with the first extraction network and the second extraction network. For example, in the process of training the first extraction network and the second extraction network through steps C1-C7, the predicted position obtained in step C6 may be adjusted by using an initial adjustment parameter, so as to obtain an adjusted predicted position. Accordingly, step C7 may be replaced with: for each sample image pair, performing iterative training on the first initial network, the second initial network and the initial adjustment parameter based on the difference between the truth-value region in the sample label of the second sample image and the predicted sample region and the difference between the position labeling information of the truth-value region in the sample label of the second sample image and the adjusted predicted position; and obtaining the first extraction network, the second extraction network and the target adjustment parameter when the training stop condition is met. The initial adjustment parameter may be a 1×4 or 1×5 initial parameter matrix.
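The application only gives the shapes of the adjustment parameter (a 1×4 or 1×5 matrix) and of its input and output; how it acts on the second positioning information is not spelled out. The sketch below shows one plausible element-wise reading, purely for illustration.

```python
import numpy as np

def refine_box(second_positioning, target_adjustment):
    # second_positioning: (x2, y2, w2, h2) or (x2, y2, z2, w2, h2).
    # target_adjustment:  learned 1x4 or 1x5 parameter matrix.
    # The element-wise product is an assumption; only the input and output shapes
    # are given in the description.
    adjusted = np.asarray(second_positioning) * np.asarray(target_adjustment).ravel()
    return tuple(adjusted)                      # third positioning information
```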
As the distance between the target object and the image capturing apparatus changes from the first image to the second image, the size of the region occupied by the target object in the first image and the size of the region it occupies in the second image also change. For example, the area of a traffic sign on the road ahead may become progressively larger in the road image sequence as the vehicle travels. After adjustment with the target adjustment parameter, the final third object area better matches the actual image position of the target object in the second image, further improving the positioning accuracy.
In one possible implementation, the electronic device may filter the initial positioning result in combination with the separation distance between the third image and the second image of each frame. Illustratively, this step 2042 may include the following steps B1-B2:
step B1, the electronic equipment determines weights respectively corresponding to the third image of each frame and the second image based on the image interval between the third image of each frame and the second image;
and B2, the electronic equipment performs weighting processing on the initial positioning result and the positioning result corresponding to the third image of each frame based on the weights corresponding to the third image of each frame and the second image to obtain the positioning result of the target object corresponding to the second image.
Illustratively, the smaller the interval between a third image and the second image, the greater the weight corresponding to that third image, that is, the higher the confidence of the positioning result corresponding to that third image. The second image itself has the greatest weight.
In one possible example, the electronic device may calculate the final positioning information of the target object in the second image by the following formula, based on the frame numbers of each frame of third image and of the second image in the image sequence, the positioning results corresponding to the third images, and the initial positioning result corresponding to the second image:
P = Σ_{i=1}^{n} ( i / Σ_{j=1}^{n} j ) · p_i
wherein P represents the final positioning information corresponding to the target object in the second image; n represents the total number of frames formed by each frame of third image together with the second image; i represents the frame number of each frame of third image and of the second image in the image sequence; and p_i represents the image position coordinates of the region where the target object is located in the i-th frame image.
Wherein i / Σ_{j=1}^{n} j represents the weight of the i-th frame, and the denominator Σ_{j=1}^{n} j is the sum of the frame numbers of all the frames. The frame number i takes values such as i = 1, 2, 3, …, representing the 1st frame, 2nd frame, 3rd frame, … in the image sequence; the larger the frame number, the later the frame. Thus, the closer a frame is to the second image, the larger the numerator, and the greater the weight.
For example, if the second image is the 5th frame in the image sequence and n is set to 5, the positioning information corresponding to each frame of image is weighted using the weights corresponding to the 1st, 2nd, 3rd, 4th and 5th frame images, so as to obtain the final positioning information of the target object in the fifth frame image.
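A minimal sketch of this weighting follows, assuming each positioning result is reduced to an (x, y) image coordinate and that the list is ordered by frame number with the initial result of the second image last; fuse_positions is an illustrative name.

```python
def fuse_positions(positions):
    # positions: [(x_1, y_1), ..., (x_n, y_n)] for frames 1..n, where the last
    # entry is the initial positioning result of the second image.
    n = len(positions)
    denom = n * (n + 1) / 2                     # sum of the frame numbers 1..n
    x = sum(i * px for i, (px, _) in enumerate(positions, start=1)) / denom
    y = sum(i * py for i, (_, py) in enumerate(positions, start=1)) / denom
    return x, y                                 # final positioning information
```

For the five-frame example above, the 5th frame receives weight 5/15 = 1/3 while the 1st frame receives only 1/15.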
In the application, after the region where the target object is located in the second image is detected based on the first object region, positioning detection of the target object is further carried out on each frame of image after the second image based on the positioning result of the target object in the second image, so as to obtain the positioning results of the target object over multiple frames of the image sequence, namely a positioning information sequence, which may be expressed as: [(x_n, y_n), (x_{n-1}, y_{n-1}), (x_{n-2}, y_{n-2}), ……].
For example, as shown in fig. 7, after the traffic sign in the road image in front of the first frame is detected by the detection network, the area w×h may be enlarged in the second frame image based on the area w×h where the traffic sign is located in the first frame image, so as to initially define the area to be searched for in the second frame. For example, the area to be searched may be 3w×3h; of course, it may be 4w×4h. And searching in the area to be searched by using the method of the application to further outline the area where the traffic sign is located in the area to be searched, and combining the offset positions corresponding to the traffic sign to obtain the final positioning result of the traffic sign in the second frame image.
Of course, as shown in fig. 8, the positioning result in the second frame image may be further utilized, and the target detection method of the present application may be executed iteratively frame by frame, so as to obtain the positioning results of the traffic sign in the third frame image, the fourth frame image and the fifth frame image.
In the application, the positioning information of the target object in the continuous multi-frame images is continuously positioned and detected through iteration frame by frame, and multi-frame fusion can be carried out by using the positioning results of the multi-frame images, so that the target object is efficiently and accurately identified, the real-time positioning and detection are realized, and the accuracy and the real-time performance of the target detection are improved.
The following describes the target detection method according to the present application with reference to the flowcharts shown in fig. 9 and 10, respectively.
As shown in fig. 9, one possible target detection flow step may include:
(1) collecting an image sequence; for example, a vehicle-mounted photographing device photographs a road in front of a running state to obtain an image sequence of a road image;
(2) detecting elements in the first frame image; for example, features are extracted from the collected images using deep learning and convolutional neural networks to detect the traffic signs in the images, including the image locations of the traffic signs, their bounding boxes in the images, and the like.
(3) Continuous positioning detection of a streaming prediction target; the process may include: searching an image area, predicting streaming characteristics and fusing streaming iteration;
for example, for the image region searching step, after the target object is detected in the first frame image, the region to be searched in the second frame image may be initially defined based on the region of the target object in the first frame image. For example, when the region where the target object is located in the first frame image is determined to be w×h, because of the continuity of the image sequence, the corresponding mapping region is enlarged in the second frame image, and the region corresponding to 3w×3h in the second frame image is obtained as the region to be searched.
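For illustration, this initial delineation can be sketched as follows, assuming the enlarged region is centered on the previous frame's box and fits inside the image after clipping; the function name and these details are not prescribed by the application.

```python
def region_to_search(prev_box, image_w, image_h, scale=3):
    # prev_box: (x, y, w, h) of the target object in the previous frame.
    # Returns an enlarged (scale*w) x (scale*h) box centered on prev_box,
    # clipped to the image bounds.
    x, y, w, h = prev_box
    cx, cy = x + w / 2.0, y + h / 2.0
    sw, sh = scale * w, scale * h
    sx = max(0.0, min(cx - sw / 2.0, image_w - sw))
    sy = max(0.0, min(cy - sh / 2.0, image_h - sh))
    return (sx, sy, sw, sh)
```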
As shown in fig. 10, for the stream feature prediction step, after obtaining the first object region in the first frame image and the corresponding region to be searched in the second frame image, the first object region and the region to be searched may be cut out from the images and input into the convolutional neural network, to obtain the feature matrix w×h×d corresponding to the region w×h and the feature matrix 3w×3h×3d corresponding to the region 3w×3h. Then, the feature matrix w×h×d corresponding to the region w×h is taken as a convolution kernel, and a convolution operation is performed on the feature matrix 3w×3h×3d corresponding to the region 3w×3h to obtain a feature map with dimensions of 3×3×3. Based on this 3×3×3 feature map, the target stream offset of the region where the target object is located relative to the region w×h in the second frame image, i.e. the offset position information (Δx, Δy), can be obtained; and the target coordinate position (x, y) of the region where the target object is located in the second frame image is obtained, thereby positioning the target object in the second frame image.
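The cross-correlation itself can be sketched with PyTorch as below, under the illustrative assumption that the 3w×3h×3d search feature is split into three d-channel groups (appearance, x-position and y-position cues), each correlated against the w×h×d kernel with a stride of (h, w) so that the output is the 3×3×3 map described above.

```python
import torch
import torch.nn.functional as F

def stream_prediction(phi, theta):
    # phi:   (d, h, w)       feature of the first object region, used as the kernel.
    # theta: (3*d, 3*h, 3*w) feature of the region to be searched.
    # Returns a (3, 3, 3) map: channel 0 similarity, channels 1-2 dx/dy offsets.
    d, h, w = phi.shape
    kernel = phi.unsqueeze(0)                                  # (1, d, h, w)
    maps = []
    for g in range(3):                                         # one d-channel group per output channel
        group = theta[g * d:(g + 1) * d].unsqueeze(0)          # (1, d, 3h, 3w)
        maps.append(F.conv2d(group, kernel, stride=(h, w)))    # (1, 1, 3, 3)
    return torch.cat(maps, dim=1).squeeze(0)                   # (3, 3, 3)
```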
For example, for the stream iterative fusion step, after the position of the region where the target object is located in the second frame image is obtained from the first object region where the target object is located in the first frame image, the positioning result of the target object in each subsequent frame image of the image sequence can be obtained in the same way.
In addition, the initial positioning result corresponding to the current frame of image to be detected can be filtered by combining the positioning results of the multiple frames of images, so that a more accurate positioning result is obtained. The steps are iterated through continuous loops, so that continuous positioning detection of the target object can be accurately performed.
(4) Outputting an image; the output second image may carry position marker information of the region where the target object is located.
According to the target detection method provided by the embodiment of the application, the first object region of the first image is used to initially delineate the region to be searched of the second image, and the offset position of the target object from the first image to the second image is obtained by utilizing the first image feature of the first object region and the second image feature and position feature of the region to be searched; the offset position and the first positioning information of the first object region can then be utilized to position the target object in the second image, thereby realizing accurate positioning of the target object in the region to be searched. The application realizes continuous positioning detection of the target object over multiple frames by utilizing the continuity between preceding and subsequent images in the image sequence, without depending on image quality, thereby greatly improving the accuracy of target detection. In addition, the limitation on conditions such as the balance of sample images and the categories of element objects is removed, further improving the practicability and robustness of target detection.
And the initial positioning result of the target object corresponding to the second image can be filtered by combining the positioning result of the multi-frame third image, so that the influence of error factors is reduced, a more accurate positioning result is obtained, and the accuracy and the robustness of target detection are further improved.
Two possible application scenarios of the target detection method of the present application are illustrated in the flowcharts shown in fig. 11 and 12, respectively.
Fig. 11 is a schematic view of a scene flow of a target detection method according to an embodiment of the present application, where an execution subject of the method is a vehicle-mounted terminal. As shown in fig. 11, the process may include the steps of:
step 1101, responding to a starting instruction of a vehicle, and periodically acquiring images of a front road in the running process of the vehicle by the vehicle-mounted terminal through the image acquisition equipment of the vehicle to obtain an image sequence;
for example, during the running process of the vehicle, image acquisition can be performed on the front road in real time, so as to acquire an image sequence of the front road.
In one possible scenario example, the vehicle-mounted terminal may have a map class application installed therein, where the map class application may include, but is not limited to: vehicle navigator, map APP, etc. The map-like application may be provided with road traffic element detection services. For example, if the start command is received, the vehicle-mounted terminal may start the map-based application, for example, use the map-based application to perform route navigation, and may use the road traffic element detection service of the map-based application to detect and remind traffic elements in the road ahead during navigation.
The road traffic element detection service may be used to detect at least one traffic element in a road.
In one possible scenario application, the at least one traffic element may include a generic element and a personalized element. The road traffic element detection service may be provided with a general element detection service and a personalized element detection service for general element detection and personalized element detection, respectively.
In one scenario example, the generic element detection service may be turned on by default in a map class application. For example, a map-type application is started, that is, a general element such as a traffic sign and a road sign line in a road is detected by default.
In yet another scenario example, the personalized element detection service is a service that supports user on-demand triggering of detection for certain traffic elements. For example, the in-vehicle terminal may display element information of a plurality of traffic elements in a page of the map-like application for selection, and when a selection operation of a traffic element corresponding to any one of the element information is detected, the in-vehicle terminal starts a personalized element detection service for the selected traffic element. For example, the user may trigger detection of elements such as tourist area signboards for tourist attraction a or vehicle energy replenishment stations for road section B as required.
It should be noted that the at least one traffic element may include, but is not limited to: traffic signs, vehicle energy replenishment stations, pavement marking lines, vehicles or obstacles around the vehicle, temporary construction areas, pedestrians, buildings, etc.
For example, the traffic sign may be a road traffic sign for prompting driving behavior of a vehicle on a road. The traffic sign may include, but is not limited to, at least one of:
a warning sign board for prompting vehicles and pedestrians to pay attention to dangerous places in front of the road;
signboards for prohibiting or limiting certain traffic behaviors of vehicles and pedestrians;
a road indicating sign board for indicating the direction, the place and the distance information of the road in front;
tourist area signboards for indicating the direction and distance of the tourist spots in front;
the road construction safety signpost is used for indicating that a road in front is a construction area;
speed limit signboards for indicating the speed-reducing running of the vehicle.
For example, the vehicle energy replenishment station may be a gas station that provides energy replenishment for the vehicle, a charging stake, or a service station that provides service for the driver.
For example, the road marking may be an identification located in the road surface that directs the driving behavior of the road segment. Such as fire passage identification, speed reduction sign line, etc.
Step 1102, a vehicle-mounted terminal detects traffic elements of each frame image in the image sequence to obtain a first object area in the first image, wherein the target object is any one of at least one traffic element, and the first image is a first frame image containing the any one traffic element;
the vehicle-mounted terminal can detect traffic elements of each frame of image in the image sequence through a detection network so as to obtain the first object area.
For example, frame-by-frame detection can be performed in the acquisition order of the frame images in the image sequence; for any one traffic element, the first frame image containing that traffic element is taken as the first image.
Step 1103, the vehicle-mounted terminal determines a region to be searched of a second image in the image sequence based on a first object region of a first image in the image sequence, wherein the second image is located after the first image, and the first object region is the region where the target object is located;
for example, the second image may be a next frame image of the first image in the sequence of images. The first object area is the area where the detected traffic element is located.
Step 1104, the vehicle-mounted terminal respectively determines a first image feature of the first object area, a second image feature and a position feature of the area to be searched;
Step 1105, the vehicle-mounted terminal determines an offset position of the target object corresponding to the first image to the second image based on the first image feature, the second image feature and the position feature;
step 1106, the vehicle-mounted terminal obtains a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image;
it should be noted that, the implementation manner of the steps 1103-1106 is the same as the above-mentioned steps 201-204, and will not be described in detail here.
Two examples of scenarios regarding traffic element detection are provided below.
Scene one example one, map data in a map-like application may be updated with the positioning results of traffic elements. For example, the steps 1101-1106 are executed by a plurality of vehicle-mounted terminals in advance, so as to obtain positioning results corresponding to the traffic elements of each road section in the image sequence; and obtaining the geographic position of each traffic element by utilizing the positioning result corresponding to each traffic element, and adding each traffic element into the map data based on the geographic position of each traffic element.
For example, the process of detecting road traffic elements and updating map data may include steps E1-E6:
E1, the first equipment sends road traffic element detection instructions to a plurality of preconfigured vehicle-mounted terminals;
the first device may be a device for updating map data; such as a background server of a map-like application or a corresponding computer center device, etc.
The plurality of vehicle-mounted terminals are used for detecting traffic elements of a plurality of roads. For example, each vehicle-mounted terminal corresponds to a respective road traffic element detection instruction that instructs the vehicle-mounted terminal to perform traffic element detection for the corresponding road.
E2, each vehicle-mounted terminal receives a corresponding road traffic element detection instruction;
and E3, for each vehicle-mounted terminal, controlling the vehicle to run on the corresponding road based on the received road traffic element detection instruction, collecting an image sequence of the road in the running process, and executing steps 1101-1106 based on the collected image sequence to obtain a positioning result corresponding to each traffic element in the road corresponding to the vehicle.
In performing steps 1101-1106, at least one traffic element detection may be performed on each frame of image in the image sequence based on steps 1101-1102; whenever the first frame image of any traffic element is detected, the positioning result of the any traffic element in the subsequent frame images is obtained based on steps 1103-1106, so as to realize continuous positioning of the detected traffic element. Based on this, each traffic element existing in the road can be detected, and the continuous positioning result of each traffic element can be obtained.
Of course, during the vehicle driving, the vehicle terminal may also display the positioning result corresponding to the detected traffic element through step 1107.
E4, returning positioning results corresponding to the traffic elements to the first equipment by the vehicle-mounted terminals;
and E5, the first equipment calculates and obtains the geographic position corresponding to each traffic element based on the positioning result corresponding to each traffic element returned by each vehicle-mounted terminal.
For example, for each traffic element, the geographic coordinates of the traffic element in the geodetic coordinate system may be calculated in combination with the positioning information of the traffic element in the continuous multi-frame images.
It should be noted that, the geographic position corresponding to the detected traffic element may be calculated by each vehicle-mounted terminal based on the positioning result corresponding to the detected traffic element. Accordingly, steps E4-E5 may be replaced with: each vehicle-mounted terminal calculates and obtains the geographic position corresponding to the detected traffic element based on the positioning result corresponding to the detected traffic element; each vehicle-mounted terminal transmits the geographic position corresponding to the detected traffic element to the first equipment; and the first equipment receives the geographic positions corresponding to the traffic elements returned by the vehicle-mounted terminals.
For example, each vehicle-mounted terminal may also directly send the collected image sequence to the first device in real time, where the first device executes steps 1102-1106 to obtain a continuous positioning result of traffic elements existing in multiple roads, and obtain a geographic location of each traffic element based on the continuous positioning result. The present application does not limit the execution subject of the process of continuously locating traffic elements using image sequences.
And E6, the first equipment updates the road data in the map data based on the geographic positions corresponding to the traffic elements.
For example, a currently obtained traffic element in a road and the geographic coordinates of the traffic element are added to traffic element information associated with the road. The background server of the map class application may store updated map data.
In the updated map data obtained in the above steps E1 to E6, the traffic elements corresponding to the roads may be updated in each road data. Based on this, the map-type application may support various services related to the traffic element, such as a road traffic element detection service, a traffic element inquiry service, a service of reminding of a traffic element during navigation, and the like.
In a second scenario example, the map class application may be provided with road traffic element detection services. The vehicle-mounted terminal realizes continuous positioning of road elements through the service in real time. The service mode of the road traffic element detection service can comprise the following two conditions:
In case 1, the road traffic element detection service may be that the vehicle-mounted terminal acquires an image sequence in real time, executes steps 1102-1106 based on the image sequence acquired in real time to realize continuous positioning detection of traffic elements in the road ahead of the vehicle, and displays the continuous positioning results of the traffic elements in real time through step 1107.
In case 2, the road traffic element detection service may be classified into a general element detection service and a personalized element detection service.
For the general element detection service, the steps E1 to E6 may be adopted in advance to update the general elements among the at least one traffic element into the map data. That is, in the map data provided by the map-like application, the general elements corresponding to the roads may be updated in each road data; for example, the road data may be associated with data such as the road identifier, the name of the general element of the road, its geographical position, its element icon, and the road segment image of the road segment on which the element is located.
Based on the above, the vehicle-mounted terminal can obtain the positioning result of the general element in the road through the general element detection service in the map application. For example, in the process that the vehicle runs on the current road, the vehicle-mounted terminal does not need to collect an image sequence in real time, but displays and continuously locates the general elements through map data provided by map applications, and can particularly acquire general element data associated with the road from road data of the map data, and display the locating result of the general elements existing in the road ahead based on the acquired data; for example, the name, geographical position, element icon, and road segment image of the road segment where the element is located of the general element in the road ahead are displayed, and the distance between the general element and the vehicle is voice-broadcast in real time.
For the personalized element detection service, the vehicle-mounted terminal can acquire an image sequence of a front road in real time in the running process of the vehicle, and execute steps 1102-1107 based on the acquired image sequence so as to continuously locate and detect the personalized element in the front road and display the personalized element in real time.
In one possible example, in case 2, the general element may be an element that is unchanged over a target period, such as a traffic sign, a vehicle energy replenishment station, a pavement marking line, or a building. The personalized element may be an element that varies within the target period, such as a temporary construction area that exists for only a day, a pedestrian, an obstacle, or the like. The target period may be one month, half a year, or one year from the current time, to which the present application is not limited.
It should be noted that the above examples only exemplify one division manner of the general element and the personalized element, and of course, the general element and the personalized element may be configured based on the need, and the present application does not limit the types of the traffic elements contained in the general element and the personalized element.
For another example, the elements that are unchanged over the target period may also be used as the selection range of the personalized elements; that is, when the personalized element detection service is performed, the vehicle-mounted terminal may display element information of a plurality of preconfigured traffic elements in the page of the map application, for example vehicle energy replenishment stations, traffic signs, buildings and the like that are unchanged over the target period, so as to allow the user to select according to need.
Step 1107, the vehicle-mounted terminal displays the positioning result through at least one of the following: displaying a second image including an image position marker of the any one of the traffic elements through a display screen of the vehicle; the distance between any traffic element and the vehicle is broadcast through the sound of the vehicle.
Accordingly, the positioning result includes at least one of: the image position of any one of the traffic elements; or, the distance between the arbitrary traffic element and the vehicle is included.
For example, during the running process of the vehicle, the vehicle-mounted terminal can acquire an image sequence in real time, and position the traffic element in each frame image of the image sequence by the target detection method of the application for the second image and each frame image after the second image, so as to realize continuous positioning detection of the traffic element. According to the positioning result, the specific condition of the traffic sign in front can be prompted to the driver through a mode of marking the position of the sign through images and a mode of broadcasting the change of the distance between the sign and the vehicle through voice.
For example, each frame of image marked with the traffic element region frame may be displayed on the screen of the vehicle-mounted terminal, and voice broadcasts such as "100 meters from the XXX traffic sign ahead" and "50 meters from the XXX traffic sign ahead" may be continuously performed according to the positioning result corresponding to each frame of image.
Fig. 12 is a schematic flow chart of another scenario of a target detection method according to an embodiment of the present application, where an execution subject of the method is a monitoring device. As shown in fig. 12, the process may include the steps of:
step 1201, responding to a detection instruction of a target road section, and periodically performing image acquisition on the target road section by using the monitoring device through the image acquisition device associated with the target road section to obtain an image sequence.
Step 1202, the monitoring device detects a driving behavior of the vehicle for each frame of image in the image sequence, and obtains a first object area in the first image.
The monitoring device may detect driving behavior of the vehicle on each frame of the image in the image sequence through a detection network to obtain the first object region.
Wherein the target object is a target vehicle having a preconfigured driving behavior, and the first image is a first frame image containing the target vehicle.
The monitoring device may include, but is not limited to: road condition monitors, monitoring devices in traffic monitoring systems, etc. The monitoring device can be preconfigured with the association relation between a plurality of road sections and the image acquisition device, and can monitor traffic conditions, vehicle driving conditions and the like of the corresponding road sections through the image acquisition device associated with each road section.
For example, the preconfigured driving behavior may be a driving behavior that does not comply with traffic regulations.
Step 1203, the monitoring device determines, based on a first object region of a first image in the image sequence, a region to be searched for of a second image in the image sequence.
The second image is subsequent to the first image, and the first object area is an area where the target object is located.
Step 1204, the monitoring device determines a first image feature of the first object region, and a second image feature and a position feature of the region to be searched, respectively;
step 1205, the monitoring device determines an offset position of the target object corresponding to the first image to the second image based on the first image feature, the second image feature and the position feature;
step 1206, the monitoring device obtains a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image;
it should be noted that, the implementation manner of the steps 1203-1206 is the same as the above-mentioned steps 201-204, and will not be described in detail here.
Step 1207, the monitoring device displays at least one of the following: displaying a second image including a location marker of the target vehicle; and displaying target prompt information.
Wherein the position mark in the second image comprises: at least one of an image position of the target vehicle in the second image, or a distance between the target vehicle and the image capturing device; the positioning result includes a position marker of the target vehicle.
The target prompt information includes vehicle identification information of the target vehicle and preconfigured driving behavior of the target vehicle in the target road section.
For example, an image sequence of a certain intersection is obtained in real time by using an electronic eye, a detector, or the like located at the intersection. In combination with the target detection method of the application, positioning of offending vehicles and continuous detection of their driving tracks can be realized for vehicles driving illegally or parked illegally at the intersection.
It should be noted that, in the embodiment of the present application, only the scene flows shown in the steps 1101 to 1107 or the scene flows shown in the steps 1201 to 1207 are described by way of example. The target detection method can be applied to other possible scenes in the traffic field and the map field.
For example, the target detection method can be applied to an intelligent transportation system or an intelligent vehicle-road cooperative system to support technical services of the intelligent transportation system or the intelligent vehicle-road cooperative system.
The intelligent transportation system (Intelligent Traffic System, ITS), also called intelligent transportation system (Intelligent Transportation System), is a comprehensive transportation system which uses advanced scientific technology (information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operation study, artificial intelligence, etc.) effectively and comprehensively for transportation, service control and vehicle manufacturing, and enhances the connection among vehicles, roads and users, thereby forming a comprehensive transportation system for guaranteeing safety, improving efficiency, improving environment and saving energy.
The intelligent vehicle-road cooperative system (Intelligent Vehicle Infrastructure Cooperative Systems, IVICS), which is simply called a vehicle-road cooperative system, is one development direction of an Intelligent Transportation System (ITS). The vehicle-road cooperative system adopts advanced wireless communication, new generation internet and other technologies, carries out vehicle-vehicle and vehicle-road dynamic real-time information interaction in all directions, develops vehicle active safety control and road cooperative management on the basis of full-time idle dynamic traffic information acquisition and fusion, fully realizes effective cooperation of people and vehicles and roads, ensures traffic safety, improves traffic efficiency, and forms a safe, efficient and environment-friendly road traffic system.
In addition, the target detection method provided by the application can also relate to the technical fields of artificial intelligence technology, computer vision technology, automatic driving technology and the like.
With research and progress of artificial intelligence technology, artificial intelligence technology is being researched and applied in various fields, such as smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, robots, smart medical care, smart customer service, internet of vehicles, and smart transportation, and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and become increasingly important.
It can be appreciated that the automatic driving technology generally comprises high-precision map, environment awareness, behavior decision, path planning, motion control and other technologies, and the automatic driving technology has wide application prospect.
It will be appreciated that Computer Vision (CV) is a science of how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as identifying and measuring targets, and further performing graphic processing so that the processed image is more suitable for human eyes to observe or for transmission to an instrument for detection. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning and mapping, autopilot, intelligent transportation, etc., as well as common biometric technologies such as face recognition and fingerprint recognition.
It is understood that artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique, and application system that simulates, extends, and extends human intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, obtains knowledge, and uses the knowledge to obtain optimal results. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Fig. 13 is a schematic structural diagram of an object detection device according to an embodiment of the present application. As shown in fig. 13, the apparatus includes:
the region determining module 1301 is configured to determine, based on a first object region of a first image in an image sequence, a region to be searched of a second image in the image sequence, where the second image is located after the first image, and the first object region is the region where a target object is located;
a feature determining module 1302, configured to determine a first image feature of the first object region, and a second image feature and a position feature of the region to be searched, respectively;
an offset position determining module 1303, configured to determine an offset position of the target object corresponding to the first image to the second image based on the first image feature, the second image feature, and the position feature;
The positioning module 1304 is configured to obtain a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image.
In one possible implementation, the offset position determining module 1303 includes:
a similarity determining unit configured to determine a similarity between the first object region and at least one candidate region in the region to be searched based on the first image feature and the second image feature;
an offset position determining unit configured to determine, based on the first image feature and the position feature, a region offset position corresponding to each of the candidate regions, the region offset position representing an offset between a position of the candidate region in the image mapping region corresponding to the first image and a position in the image mapping region corresponding to the second image;
the offset position determining unit is further configured to obtain a region offset position corresponding to a second object region based on the similarity and the region offset position corresponding to each candidate region, and take the region offset position corresponding to the second object region as the offset position corresponding to the target object; the second object region is a region including the target object among the candidate regions.
In one possible implementation, the similarity determining unit is configured to:
taking the first image feature as a convolution kernel, and carrying out convolution operation on the second image feature to obtain the similarity corresponding to each candidate region;
wherein, one candidate area is an area corresponding to one sliding of the convolution kernel in the second image feature in the convolution operation process, and the size of each candidate area is the same as that of the first object area;
the offset position determining unit is used for:
and taking the first image feature as a convolution kernel, and carrying out convolution operation on the position feature to obtain the region offset position corresponding to each candidate region.
In one possible implementation, the offset position determining unit is further configured to:
determining a second object region which meets the similarity condition in each candidate region based on the similarity corresponding to each candidate region;
and screening the region offset positions corresponding to the second object region from the region offset positions corresponding to the candidate regions.
In one possible implementation, the positioning module 1304 is configured to:
obtaining an initial positioning result corresponding to the target object based on the offset position corresponding to the target object and first positioning information of the first object area corresponding to the first image;
And filtering the initial positioning result based on the initial positioning result and a positioning result of the target object corresponding to at least one frame of third image, so as to obtain a positioning result of the target object corresponding to the second image, wherein the at least one frame of third image is an image positioned in front of the second image in the image sequence.
In one possible implementation, the positioning module 1304 is configured to:
performing offset processing on the first positioning information of the first object region corresponding to the first image based on the offset position corresponding to the target object to obtain second positioning information of the second object region corresponding to the second image;
and obtaining the initial positioning result based on the second positioning information.
In one possible implementation, the positioning module 1304 is configured to either:
taking the second positioning information as the initial positioning result;
and adjusting the second object area into a third object area based on the target adjustment parameter and the second positioning information, and taking third positioning information corresponding to the third object area in the second image as the initial positioning result.
In one possible implementation, the offset positions corresponding to the target object include a first offset position and a second offset position;
The first offset position represents the offset of the target object at the image positions corresponding to the first image and the second image respectively, and the second offset position represents the offset of the distance between the target object and the image acquisition equipment;
the positioning module 1304 is configured to:
based on the first offset position, performing offset processing on a first image position in the first positioning information to obtain a second image position, and based on the second offset position, performing offset processing on a first distance in the first positioning information to obtain a second distance, thereby obtaining second positioning information comprising the second image position and the second distance;
the positioning module 1304 is further configured to:
and based on the target adjustment parameter, respectively adjusting the first positioning point position and the first region size in the second image position into a second positioning point position and a second region size to obtain the third positioning information, wherein the second positioning point position and the second region size are the positioning point position and the region size corresponding to the third object region.
In one possible implementation, the positioning module 1304 is further configured to:
determining weights respectively corresponding to the third image and the second image of each frame based on the image interval between the third image and the second image of each frame;
And weighting the initial positioning result and the positioning result corresponding to the third image of each frame based on the weights corresponding to the third image of each frame and the second image respectively to obtain the positioning result of the target object corresponding to the second image.
In one possible implementation, the feature determination module 1302 is configured to:
respectively extracting a first image feature of the first object region and a second image feature of the region to be searched through a trained first extraction network;
and extracting a position feature of the region to be searched through a trained second extraction network, wherein the position feature represents the image position of each pixel point in the region to be searched and the distance between each pixel point and the image acquisition device.
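A minimal sketch of the two extraction networks, with illustrative layer sizes that are not taken from the embodiment; the second network consumes a per-pixel map carrying each pixel's image coordinates and distance:

```python
import torch
import torch.nn as nn

# first extraction network: image features of the first object region and the region to be searched
first_net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
)

# second extraction network: position features of the region to be searched,
# fed a 3-channel map of (x coordinate, y coordinate, distance) per pixel
second_net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
)

template = torch.randn(1, 3, 31, 31)   # crop of the first object region
search = torch.randn(1, 3, 63, 63)     # crop of the region to be searched
pos_map = torch.randn(1, 3, 63, 63)    # per-pixel position / distance map

first_image_feature = first_net(template)
second_image_feature = first_net(search)
position_feature = second_net(pos_map)
```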
In one possible implementation manner, the first object area where the target object in the first image is located is obtained by any one of the following:
object detection is carried out on the first image, and a first object area where a target object in the first image is located is obtained;
and obtaining the first object region where the target object is located in the first image based on a region where the target object is located in a fourth image and an offset position of the target object corresponding to the fourth image to the first image, wherein the fourth image is an image located before the first image in the image sequence.
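In other words, the first object region is either detected directly on the first frame containing the target, or propagated from an earlier fourth image using the predicted offset; a schematic sketch (the detector and box format are placeholders, not the embodiment's implementation):

```python
def first_object_region(first_image, detector=None,
                        fourth_region=None, offset_fourth_to_first=None):
    """Return the first object region either by detection or by propagation."""
    if detector is not None:
        # option 1: run object detection directly on the first image
        return detector(first_image)
    # option 2: shift the region found in the fourth image by the predicted
    # offset of the target object from the fourth image to the first image
    cx, cy, w, h = fourth_region
    dx, dy = offset_fourth_to_first
    return (cx + dx, cy + dy, w, h)
```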
In one possible implementation manner, the apparatus further includes a first acquisition module, which is specifically configured to, before determining, based on the first object region of the first image in the image sequence, the region to be searched for in the second image in the image sequence:
responding to a starting instruction of a vehicle, and periodically acquiring images of a front road in the running process of the vehicle through image acquisition equipment of the vehicle to obtain an image sequence;
detecting traffic elements of each frame of image in the image sequence to obtain a first object area in the first image, wherein the target object is any one of at least one traffic element, and the first image is a first frame of image containing the any one traffic element;
the device further comprises a first display module, wherein the first display module is specifically used for at least one of the following after obtaining a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and first positioning information of the first object area corresponding to the first image:
displaying, via a display screen of the vehicle, a second image including an image position marker of the any one of the traffic elements, the positioning result including an image position of the any one of the traffic elements;
and broadcasting the distance between the any one traffic element and the vehicle through the sound equipment of the vehicle, wherein the positioning result includes the distance between the any one traffic element and the vehicle.
In one possible implementation manner, the apparatus further includes a second acquisition module, where the second acquisition module is specifically configured to, before the determining, based on the first object region of the first image in the image sequence, a region to be searched for in the second image in the image sequence:
responding to a detection instruction of a target road section, and periodically carrying out image acquisition on the target road section through image acquisition equipment associated with the target road section to obtain an image sequence;
detecting the driving behavior of a vehicle on each frame of image in the image sequence to obtain a first object area in the first image, wherein the target object is a target vehicle with preconfigured driving behavior, and the first image is a first frame image containing the target vehicle;
the device further comprises a second display module, wherein the second display module is specifically used for at least one of the following after obtaining a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object area corresponding to the first image:
Displaying a second image including a position marker of the target vehicle, the position marker including at least one of an image position of the target vehicle in the second image or a distance between the target vehicle and the image capture device; the positioning result comprises a position mark of the target vehicle;
and displaying target prompt information, wherein the target prompt information comprises vehicle identification information of the target vehicle and preconfigured driving behaviors of the target vehicle on the target road section.
In one possible implementation, the area determining module 1301 is configured to:
determining an image mapping area corresponding to the first object area in the second image based on a first image position corresponding to the first object area;
and scaling the region range of the image mapping region corresponding to the first object region based on a preset scaling coefficient to obtain the region to be searched.
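A sketch of how this mapping and scaling might be realized, assuming the mapped area keeps the same pixel coordinates across consecutive frames and using an illustrative scaling coefficient of 2.0 (the embodiment does not fix the coefficient):

```python
def region_to_be_searched(first_box, scale=2.0, image_size=(1920, 1080)):
    """Map the first object area into the second image and scale its range.

    first_box : (cx, cy, w, h) of the first object area in the first image
    scale     : preset scaling coefficient applied to the mapped area
    """
    cx, cy, w, h = first_box
    sw, sh = w * scale, h * scale
    # clip the enlarged area to the image boundaries
    x0 = max(0.0, cx - sw / 2)
    y0 = max(0.0, cy - sh / 2)
    x1 = min(image_size[0], cx + sw / 2)
    y1 = min(image_size[1], cy + sh / 2)
    return x0, y0, x1, y1
```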
According to the target detection device provided by the embodiment of the application, the region to be searched of the second image is initially delimited based on the first object region of the first image, and the offset position of the target object from the first image to the second image is obtained by using the first image feature of the first object region together with the second image feature and the position feature of the region to be searched; the offset position and the first positioning information of the first object region can then be used to position the target object in the second image, thereby achieving accurate positioning of the target object within the region to be searched. By exploiting the continuity between successive images in the image sequence, the application achieves continuous positioning and detection of the target object across multiple frames without depending on image quality, which greatly improves the accuracy of target detection. It also removes constraints such as the balance of sample images and the richness of element object categories, thereby improving the practicality and robustness of target detection.
Moreover, the initial positioning result of the target object corresponding to the second image can be filtered by combining the positioning results of multiple frames of third images, which reduces the influence of error factors, yields a more accurate positioning result, and further improves the accuracy and robustness of target detection.
The device of the embodiment of the present application may perform the method provided by the embodiment of the present application, and its implementation principle is similar; the actions performed by the modules in the device of the embodiment of the present application correspond to the steps in the method of the embodiment of the present application, and for a detailed functional description of each module of the device, reference may be made to the description of the corresponding method shown above, which is not repeated here.
Fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 14, the electronic device includes a memory, a processor, and a computer program stored on the memory, and the processor executes the computer program to implement the steps of the target detection method. Compared with the related art, the following can be achieved:
according to the target detection method provided by the embodiment of the application, the region to be searched of the second image is initially delimited based on the first object region of the first image, and the offset position of the target object from the first image to the second image is obtained by using the first image feature of the first object region together with the second image feature and the position feature of the region to be searched; the offset position and the first positioning information of the first object region can then be used to position the target object in the second image, thereby achieving accurate positioning of the target object within the region to be searched. By exploiting the continuity between successive images in the image sequence, the application achieves continuous positioning and detection of the target object across multiple frames without depending on image quality, which greatly improves the accuracy of target detection. It also removes constraints such as the balance of sample images and the richness of element object categories, thereby improving the practicality and robustness of target detection.
Moreover, the initial positioning result of the target object corresponding to the second image can be filtered by combining the positioning results of multiple frames of third images, which reduces the influence of error factors, yields a more accurate positioning result, and further improves the accuracy and robustness of target detection.
In an alternative embodiment, an electronic device is provided. As shown in fig. 14, the electronic device 1400 includes a processor 1401 and a memory 1403, where the processor 1401 is connected to the memory 1403, for example via a bus 1402. Optionally, the electronic device 1400 may further include a transceiver 1404, and the transceiver 1404 may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the number of transceivers 1404 is not limited to one, and the structure of the electronic device 1400 does not constitute a limitation on the embodiments of the present application.
The processor 1401 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 1401 may also be a combination that performs computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
Bus 1402 may include a path that conveys information between the above components. Bus 1402 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1402 may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 14, but this does not mean there is only one bus or one type of bus.
Memory 1403 may be a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer, without limitation.
The memory 1403 is used for storing a computer program for executing an embodiment of the present application, and is controlled to be executed by the processor 1401. The processor 1401 is arranged to execute a computer program stored in the memory 1403 to implement the steps shown in the foregoing method embodiments.
The electronic device includes, but is not limited to: a server, a terminal, a cloud computing center device, and the like.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program can realize the steps and corresponding contents of the embodiment of the method when being executed by a processor.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as will be understood by those skilled in the art. The terms "comprises" and "comprising" as used in the embodiments of the present application mean that the corresponding features may be implemented as the presented features, information, data, steps and operations, but do not exclude implementation as other features, information, data, steps, operations, and the like supported by the state of the art.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although various operation steps are indicated by arrows in the flowcharts of the embodiments of the present application, the order in which these steps are implemented is not limited to the order indicated by the arrows. In some implementations of embodiments of the application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages can be flexibly configured according to the requirement, which is not limited by the embodiment of the present application.
The foregoing is merely an optional implementation of some of the implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations adopted on the basis of the technical ideas of the present application, without departing from the technical ideas of the solution of the present application, also fall within the protection scope of the embodiments of the present application.

Claims (15)

1. A method of target detection, the method comprising:
determining a region to be searched of a second image in an image sequence based on a first object region of a first image in the image sequence, wherein the first object region is a region where a target object is located in the first image;
determining a first image feature of the first object region, and a second image feature and a position feature of the region to be searched, respectively;
determining an offset position of the target object corresponding to the first image to the second image based on the first image feature, the second image feature and the position feature;
and obtaining a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and first positioning information of the first object region corresponding to the first image.
2. The method of claim 1, wherein the determining an offset position of the target object from the first image to the second image based on the first image feature, the second image feature, and the position feature comprises:
determining a similarity between the first object region and at least one candidate region of the regions to be searched based on the first image feature and the second image feature;
determining a region offset position corresponding to each candidate region based on the first image feature and the position feature, wherein the region offset position represents the offset between the position of the candidate region in the image mapping region corresponding to the first image and the position corresponding to the second image;
obtaining a region offset position corresponding to a second object region based on the similarity and the region offset position corresponding to each candidate region, and taking the region offset position corresponding to the second object region as the offset position corresponding to the target object; the second object region is a region, among the candidate regions, that includes the target object.
3. The method of claim 2, wherein the determining a similarity between the first object region and at least one candidate region of the regions to be searched based on the first image feature and the second image feature comprises:
Performing convolution operation on the second image features by taking the first image features as convolution kernels to obtain the similarity corresponding to each candidate region;
wherein one candidate region is a region corresponding to one sliding position of the convolution kernel over the second image feature during the convolution operation, and the size of each candidate region is the same as that of the first object region;
the determining the region offset position corresponding to each candidate region based on the first image feature and the position feature includes:
and carrying out convolution operation on the position features by taking the first image features as convolution kernels to obtain the region offset positions corresponding to the candidate regions.
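Using the first image feature as a convolution kernel over the second image feature amounts to the cross-correlation used in Siamese-style trackers; below is a minimal numpy illustration of that similarity computation (shapes and the argmax read-out are assumptions, not part of the claim):

```python
import numpy as np

def correlation_similarity(template_feat, search_feat):
    """Slide the template feature (first image feature) over the search feature
    (second image feature); each sliding position is one candidate region."""
    c, th, tw = template_feat.shape
    _, sh, sw = search_feat.shape
    sim = np.zeros((sh - th + 1, sw - tw + 1))
    for i in range(sim.shape[0]):
        for j in range(sim.shape[1]):
            sim[i, j] = np.sum(template_feat * search_feat[:, i:i + th, j:j + tw])
    return sim  # similarity between the first object region and each candidate region

template_feat = np.random.rand(64, 7, 7)
search_feat = np.random.rand(64, 15, 15)
sim = correlation_similarity(template_feat, search_feat)
best = np.unravel_index(np.argmax(sim), sim.shape)  # candidate most likely to contain the target
```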
4. A method according to claim 2 or 3, wherein the obtaining the region offset position corresponding to the second object region based on the similarity and the region offset position corresponding to each candidate region includes:
determining a second object region meeting the similarity condition in each candidate region based on the similarity corresponding to each candidate region;
and screening the region offset positions corresponding to the second object region from the region offset positions corresponding to the candidate regions.
5. The method of claim 1, wherein the obtaining a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and first positioning information of the first object region corresponding to the first image includes:
obtaining an initial positioning result corresponding to the target object based on the offset position corresponding to the target object and first positioning information of a first object area corresponding to the first image;
and filtering the initial positioning result based on the initial positioning result and a positioning result of the target object corresponding to at least one frame of third image, so as to obtain a positioning result of the target object corresponding to the second image, wherein the at least one frame of third image is an image positioned before the second image in the image sequence.
6. The method of claim 5, wherein the obtaining the initial positioning result corresponding to the target object based on the offset position corresponding to the target object and the first positioning information corresponding to the first object region in the first image includes:
performing offset processing on first positioning information of the first object region corresponding to the first image based on the offset position corresponding to the target object to obtain second positioning information of the second object region corresponding to the second image;
And obtaining the initial positioning result based on the second positioning information.
7. The method of claim 6, wherein the obtaining the initial positioning result based on the second positioning information comprises any one of:
taking the second positioning information as the initial positioning result;
and adjusting the second object area into a third object area based on the target adjustment parameter and the second positioning information, and taking third positioning information of the third object area corresponding to the second image as the initial positioning result.
8. The method of claim 7, wherein the offset position corresponding to the target object comprises a first offset position and a second offset position;
the first offset position represents the offset of the target object at the image positions corresponding to the first image and the second image respectively, and the second offset position represents the offset of the distance between the target object and the image acquisition equipment;
performing offset processing on first positioning information of the first object region corresponding to the first image based on the offset position corresponding to the target object to obtain second positioning information of the second object region corresponding to the second image, where the offset processing includes:
performing offset processing on a first image position in the first positioning information based on the first offset position to obtain a second image position, and performing offset processing on a first distance in the first positioning information based on the second offset position to obtain a second distance, so as to obtain second positioning information comprising the second image position and the second distance;
the adjusting the second object area to a third object area based on the target adjustment parameter and the second positioning information includes:
and adjusting, based on the target adjustment parameters, the first positioning point position and the first area size in the second image position into a second positioning point position and a second area size respectively, to obtain the third positioning information, wherein the second positioning point position and the second area size are the positioning point position and the area size corresponding to the third object area.
9. The method of claim 5, wherein filtering the initial positioning result based on the initial positioning result and a positioning result of the target object corresponding to at least one third image to obtain a positioning result of the target object corresponding to the second image comprises:
determining weights respectively corresponding to each frame of third image and the second image based on an image interval between each frame of third image and the second image;
and weighting the initial positioning result and the positioning result corresponding to each frame of third image based on the weights respectively corresponding to each frame of third image and the second image, to obtain the positioning result of the target object corresponding to the second image.
10. The method of claim 1, wherein the first object region in the first image where the target object is located is obtained by any one of:
performing object detection on the first image to obtain a first object region where a target object in the first image is located;
and obtaining a first object region in which the target object is located in the first image based on a region in which the target object is located in a fourth image and an offset position of the target object corresponding to the fourth image to the first image, wherein the fourth image is an image before the first image in the image sequence.
11. The method of claim 1, wherein the determining the region to be searched for a second image in the sequence of images is preceded by determining the region to be searched for a second image in the sequence of images based on a first object region for the first image in the sequence of images, the method further comprising:
Responding to a starting instruction of a vehicle, and periodically acquiring images of a front road in the running process of the vehicle through image acquisition equipment of the vehicle to obtain an image sequence;
detecting traffic elements of each frame of image in the image sequence to obtain a first object area in the first image, wherein the target object is any one of at least one traffic element, and the first image is a first frame of image containing the any one traffic element;
after obtaining the positioning result of the target object corresponding to the second image, the method further comprises at least one of the following:
displaying, by a display screen of the vehicle, a second image including an image position marker of the any one of the traffic elements, the positioning result including an image position of the any one of the traffic elements;
and broadcasting the distance between any traffic element and the vehicle through the sound equipment of the vehicle, wherein the positioning result comprises the distance between any traffic element and the vehicle.
12. The method of claim 1, wherein the determining the region to be searched for a second image in the sequence of images is preceded by determining the region to be searched for a second image in the sequence of images based on a first object region for the first image in the sequence of images, the method further comprising:
responding to a detection instruction of a target road section, and periodically carrying out image acquisition on the target road section through image acquisition equipment associated with the target road section to obtain an image sequence;
detecting the driving behavior of a vehicle on each frame of image in the image sequence to obtain the first object region in the first image, wherein the target object is a target vehicle with a preconfigured driving behavior, and the first image is a first frame image containing the target vehicle;
and after obtaining a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and the first positioning information of the first object region corresponding to the first image, performing, via a display screen associated with the target road section, at least one of the following:
displaying a second image comprising a location marker of the target vehicle, the location marker comprising at least one of an image location of the target vehicle in the second image or a distance between the target vehicle and an image acquisition device; the positioning result comprises a position mark of the target vehicle;
And displaying target prompt information, wherein the target prompt information comprises vehicle identification information of the target vehicle and preconfigured driving behaviors of the target vehicle in the target road section.
13. An object detection device, the device comprising:
a region determining module, configured to determine a region to be searched of a second image in an image sequence based on a first object region of a first image in the image sequence, wherein the first object region is a region where a target object is located in the first image;
a feature determining module, configured to determine a first image feature of the first object region, and a second image feature and a position feature of the region to be searched, respectively;
an offset position determining module, configured to determine an offset position of the target object corresponding to the first image to the second image based on the first image feature, the second image feature and the position feature;
and a positioning module, configured to obtain a positioning result of the target object corresponding to the second image based on the offset position corresponding to the target object and first positioning information of the first object region corresponding to the first image.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the object detection method of any one of claims 1 to 12.
15. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the object detection method of any of claims 1 to 12.
CN202311224701.9A 2023-09-21 2023-09-21 Target detection method, target detection device, electronic equipment and storage medium Pending CN116958915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311224701.9A CN116958915A (en) 2023-09-21 2023-09-21 Target detection method, target detection device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311224701.9A CN116958915A (en) 2023-09-21 2023-09-21 Target detection method, target detection device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116958915A true CN116958915A (en) 2023-10-27

Family

ID=88460548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311224701.9A Pending CN116958915A (en) 2023-09-21 2023-09-21 Target detection method, target detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116958915A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161314A (en) * 2019-12-17 2020-05-15 中国科学院上海微系统与信息技术研究所 Target object position area determining method and device, electronic equipment and storage medium
CN113628250A (en) * 2021-08-27 2021-11-09 北京澎思科技有限公司 Target tracking method and device, electronic equipment and readable storage medium
CN113869163A (en) * 2021-09-18 2021-12-31 北京远度互联科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN115661444A (en) * 2022-10-19 2023-01-31 腾讯科技(深圳)有限公司 Image processing method, device, equipment, storage medium and product
CN115761655A (en) * 2022-11-17 2023-03-07 浙江大华技术股份有限公司 Target tracking method and device
CN116580054A (en) * 2022-01-29 2023-08-11 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination