CN112634368A - Method and device for generating space and OR graph model of scene target and electronic equipment - Google Patents

Method and device for generating space and OR graph model of scene target and electronic equipment Download PDF

Info

Publication number
CN112634368A
Authority
CN
China
Prior art keywords
target
image
frame
scene
actual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011570367.9A
Other languages
Chinese (zh)
Inventor
王颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Cresun Innovation Technology Co Ltd
Original Assignee
Xian Cresun Innovation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Cresun Innovation Technology Co Ltd filed Critical Xian Cresun Innovation Technology Co Ltd
Priority to CN202011570367.9A priority Critical patent/CN112634368A/en
Publication of CN112634368A publication Critical patent/CN112634368A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G06V10/507Summing image-intensity values; Histogram projection analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention discloses a method and a device for generating a spatial And-Or graph model of a scene target, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a scene video for a preset scene; detecting targets in the scene video by using a pre-trained YOLO_v3 network to obtain attribute information corresponding to each target in each frame image of the scene video, the attribute information including position information of a bounding box containing the target; matching the same target in each frame image of the scene video by using a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame image; determining the actual spatial distance between different targets in each frame image; and generating a spatial And-Or graph model of the preset scene by using the matched attribute information of the targets corresponding to each frame image and the actual spatial distances. By adopting the YOLO_v3 network for target detection, the invention improves the precision and efficiency of target detection, and thereby improves the accuracy and real-time performance of the spatial And-Or graph model of the scene.

Description

Method and device for generating space and OR graph model of scene target and electronic equipment
Technical Field
The invention belongs to the field of image representation, and particularly relates to a method and a device for generating a space and OR graph model of a scene target and electronic equipment.
Background
The And-Or Graph (AOG) is a hierarchical compositional model of a stochastic context-sensitive grammar (SCSG). It represents a hierarchical decomposition from the top level down to leaf nodes through a set of terminal and non-terminal nodes, and outlines the basic concepts of the image grammar. An And node represents the decomposition of a target into parts, while an Or node represents alternative sub-configurations.
In an And-Or graph, a dictionary of small parts is used to represent the objects in an image through the hierarchy of And nodes and Or nodes. Because such a model captures the spatial compositional structure of the objects in the image, it is also referred to as a Spatial And-Or Graph (S-AOG) model. The spatial And-Or graph model represents a target by hierarchically combining its components in different spatial configurations, based on the spatial positional relationships of the target. It can therefore be used to analyse the positional relationships between targets in image analysis, supporting applications such as target positioning and tracking, for example target identification and positioning in complex scenes such as traffic intersections and squares.
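As an illustration of the hierarchy described above, the following minimal Python sketch (not part of the invention; all names are chosen for illustration only) shows one possible way to represent And nodes, Or nodes and terminal leaf nodes, with an And node decomposing a target into parts and an Or node selecting among alternative sub-configurations:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AOGNode:
    name: str
    node_type: str                      # "AND", "OR" or "LEAF"
    children: List["AOGNode"] = field(default_factory=list)
    attributes: Optional[dict] = None   # e.g. bounding box and category for leaf nodes

# An Or node chooses one alternative configuration; an And node composes parts.
vehicle = AOGNode("vehicle", "OR", children=[
    AOGNode("car", "AND", children=[
        AOGNode("car_body", "LEAF"),
        AOGNode("wheels", "LEAF"),
    ]),
    AOGNode("bus", "AND", children=[
        AOGNode("bus_body", "LEAF"),
        AOGNode("wheels", "LEAF"),
    ]),
])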
However, existing spatial And-Or graph models do not locate the spatial positional relationships of targets accurately or efficiently enough, so the accuracy and real-time performance of such models are limited.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for generating a space and or graph model of a scene target, electronic equipment and a storage medium, so as to achieve the purpose of improving the accuracy and the real-time performance of the space and or graph model of the scene. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a method for generating a spatial and-or graph model of a scene object, where the method includes:
acquiring a scene video aiming at a preset scene;
detecting the targets in the scene video by using a pre-trained YOLO _ v3 network to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the target;
matching the same target in each frame of image of the scene video by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
determining the actual space distance between different targets in each frame of image;
and generating a space and OR graph model of the preset scene by using the attribute information of the target corresponding to each matched frame image and the actual space distance.
Optionally, the attribute information further includes:
category information of the object.
Optionally, the preset multi-target tracking algorithm includes:
the DeepSort algorithm.
Optionally, the determining the actual spatial distance between different targets in each frame of image includes:
in each frame image, determining the pixel coordinate of each target;
aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and aiming at each frame image, obtaining the actual space distance between every two targets in the frame image by using the actual coordinates of every two targets in the frame image.
Optionally, the determining the actual spatial distance between different targets in each frame of image includes:
and aiming at each frame of image, obtaining the actual spatial distance between the two targets in the frame of image by using a depth camera ranging method.
Optionally, the preset scenario includes:
and (6) traffic intersection.
Optionally, the pre-trained YOLO _ v3 network is trained according to a MARS dataset and a Car-Reid dataset.
In a second aspect, an embodiment of the present invention provides an apparatus for generating a spatial and/or map model of a scene object, where the apparatus includes:
the video acquisition module is used for acquiring a scene video aiming at a preset scene;
the target detection module is used for detecting the targets in the scene video by utilizing a pre-trained YOLO _ v3 network to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the target;
the target tracking module is used for matching the same target in each frame of image of the scene video by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
the distance calculation module is used for determining the actual spatial distance between different targets in each frame of image;
and the model generation module is used for generating a space and OR model of the preset scene by using the attribute information of the target corresponding to each matched frame image and the actual space distance.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, wherein,
the memory is used for storing a computer program;
the processor is configured to implement the steps of the method for generating the space and/or graph model of the scene object provided by the embodiment of the present invention when executing the program stored in the memory.
In a fourth aspect, the embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the method for generating a space and/or map model of a scene object provided by the embodiment of the present invention.
Compared with the prior art that the position information of the target is obtained by using a traditional target detection algorithm, the method provided by the embodiment of the invention adopts the pre-trained YOLO _ v3 network to detect the target, so that the precision and the efficiency of target detection can be improved, and the purpose of improving the accuracy and the real-time performance of the space and/or graph model of the scene is realized.
Drawings
Fig. 1 is a schematic flowchart of a method for generating a space and/or graph model of a scene object according to an embodiment of the present invention;
FIG. 2 is a schematic view of a space and/or map of a traffic intersection as an example of an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a spatial and-or graph model generating apparatus for a scene object according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to achieve the purpose of improving the accuracy and the real-time performance of a space and or graph model of a scene, the embodiment of the invention provides a space and or graph model generation method of a scene target.
It should be noted that an executing subject of the method for generating a space and or graph model of a scene object according to the embodiment of the present invention may be a device for generating a space and or graph model of a scene object, where the device may be run in an electronic device. The electronic device may be a server or a terminal device, but is not limited thereto.
In a first aspect, a method for generating a spatial and/or graph model of a scene object according to an embodiment of the present invention is described.
As shown in fig. 1, a method for generating a spatial and/or graph model of a scene object according to an embodiment of the present invention may include the following steps:
and S1, acquiring a scene video aiming at the preset scene.
In the embodiment of the present invention, the preset scene refers to a scene at least including a moving object, and the object may be a human, a vehicle, an animal, or the like. For example, the preset scene may include a traffic intersection, a school, a park, and the like.
The scene video can be obtained by a video shooting device arranged at a preset scene fixing position, and the video shooting device can comprise a camera, a video camera, a mobile phone and the like, for example, the scene video can be obtained by shooting by a camera arranged on a viaduct of a traffic intersection.
The embodiment of the invention can acquire the scene video aiming at the preset scene from the video shooting equipment in a communication mode. The communication method is not limited to wireless communication, optical fiber communication, and the like.
It can be understood that the acquired scene video contains a plurality of frames of images.
S2, detecting the targets in the scene video by using a pre-trained YOLO_v3 network to obtain attribute information corresponding to each target in each frame image of the scene video.
The spatial And-Or graph represents a target by hierarchically combining its parts in different spatial configurations based on the spatial positional relationships of the target. Previous spatial And-Or graph models generally determined the spatial positional relationships of targets with traditional target detection algorithms, such as foreground-background segmentation or target clustering. However, such detection algorithms do not locate the spatial positional relationships of targets accurately enough, so the resulting S-AOG models of the targets are not accurate enough. Meanwhile, the demand for rapid analysis and detection of targets keeps growing, and the detection efficiency of traditional target detection algorithms cannot meet the requirement of real-time detection.
Therefore, the embodiment of the invention uses a regression-based neural network target detection algorithm, the YOLO_v3 (You Only Look Once, version 3) network, for target detection.
The YOLO_v3 network comprises a backbone network and three prediction branches. The backbone is a Darknet-53 network, and YOLO_v3 is a fully convolutional network that makes heavy use of residual skip connections. To reduce the negative effect of pooling on gradients, pooling is abandoned and downsampling is implemented with strided convolutions; in this network structure, convolutions with a stride of 2 are used for downsampling. To improve the accuracy of the algorithm on small targets, YOLO_v3 adopts upsampling and feature fusion similar to an FPN (Feature Pyramid Network) and performs detection on feature maps of multiple scales. The three prediction branches adopt a fully convolutional structure.
For a specific detection process of the YOLO _ v3 network, please refer to the related description in the prior art, which is not described herein again.
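Purely for illustration, a hedged PyTorch-style sketch of the kind of building block described above (a stride-2 convolution for downsampling followed by residual units, as used in Darknet-53) might look as follows; the channel sizes and layer counts are assumptions and not the exact configuration of the network used in the invention:

import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """3x3 (or 1x1) convolution + batch norm + LeakyReLU, the basic Darknet-53 unit."""
    def __init__(self, in_ch, out_ch, stride=1, kernel=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, stride=stride,
                      padding=kernel // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )
    def forward(self, x):
        return self.block(x)

class ResidualBlock(nn.Module):
    """1x1 reduce + 3x3 expand with a skip connection; no pooling anywhere."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = ConvBNLeaky(channels, channels // 2, kernel=1)
        self.conv2 = ConvBNLeaky(channels // 2, channels)
    def forward(self, x):
        return x + self.conv2(self.conv1(x))

# Downsampling is done with a stride-2 convolution instead of pooling:
downsample = ConvBNLeaky(64, 128, stride=2)
stage = nn.Sequential(downsample, ResidualBlock(128), ResidualBlock(128))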
Through the pre-trained YOLO_v3 network, the attribute information corresponding to each target in each frame image of the scene video can be obtained. The attribute information includes the position information of the bounding box containing the target. The position information of the bounding box of a target is represented as (x, y, w, h), where (x, y) are the coordinates of the center of the bounding box and w and h are its width and height. As will be understood by those skilled in the art, besides the position information of the bounding box, the attribute information also includes the confidence of the bounding box, which reflects how confident the network is that the bounding box contains a target and how accurately the bounding box predicts that target. The confidence is defined as:

confidence = Pr(Object) × IOU(truth, pred)

If the bounding box contains no object, Pr(Object) = 0 and the confidence is 0; if it contains an object, Pr(Object) = 1 and the confidence equals IOU(truth, pred), the intersection-over-union of the real bounding box and the predicted bounding box.
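As a concrete illustration of the confidence definition above, the following sketch computes the intersection-over-union of a predicted box and a ground-truth box, assuming the (x, y, w, h) box format with (x, y) as the center, as in the text:

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def confidence(pr_object, pred_box, truth_box):
    """confidence = Pr(Object) x IOU(truth, pred); Pr(Object) is 0 or 1."""
    return pr_object * iou(pred_box, truth_box)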
As will be understood by those skilled in the art, the attribute information also includes category information of the object. The categories include people, cars, animals, etc. to distinguish the kind of object to which the object belongs.
It should be noted that a frame of a video image may often contain many objects, some of which are too far away or too small, or do not belong to the "objects of interest" of the preset scene; such objects are not targets for detection. For example, in a traffic intersection scene, moving vehicles and people are of interest, while a roadside fire hydrant is not. Therefore, in a preferred embodiment, the YOLO_v3 network can be configured in the pre-training stage so that a preset number of targets is detected per frame image, for example 30 or 40. At the same time, training the YOLO_v3 network with labelled training samples that carry the detection purpose gives the network an autonomous learning capability, so that the trained YOLO_v3 network, when applied to a scene video of unknown objects as a test sample, can output the attribute information corresponding to the preset number of purposeful targets in each frame image, improving both the detection efficiency and the detection purposefulness.
As noted above, before S2 the YOLO_v3 network needs to be pre-trained for the preset scene. As will be understood by those skilled in the art, the sample data used in pre-training are sample scene videos and the annotated attribute information of that scene, where the annotated attribute information includes the category information of each object and the position information of the bounding box containing the object in every frame image of the sample scene videos.
The pre-training process can be briefly described as the following steps:
1) Take the attribute information of the targets corresponding to each frame image of the sample scene video as the ground truth of that frame image, and train the YOLO_v3 network on each frame image with its corresponding ground truth to obtain a training result for each frame image.
2) Compare the training result of each frame image with the ground truth of that frame image to obtain the output result corresponding to that frame image.
3) Calculate the loss value of the network from the output results of the frame images.
4) Adjust the parameters of the network according to the loss value and repeat steps 1)-3) until the loss value of the network satisfies a convergence condition, i.e. reaches its minimum, which means the training result of each frame image agrees with its ground truth; the training of the network is then complete and the pre-trained YOLO_v3 network is obtained (a minimal sketch of this loop is given after this list).
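For illustration, a highly simplified sketch of the training loop in steps 1)-4) might look as follows; the loss function, optimizer, data loader and learning rate are placeholders and assumptions, not the specific configuration of the YOLO_v3 network used in the invention:

import torch

def pretrain(network, data_loader, loss_fn, num_epochs=100, lr=1e-3):
    """Steps 1)-4): predict per frame, compare with ground truth, update weights."""
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for frame, ground_truth in data_loader:       # frame image + labelled attributes
            prediction = network(frame)               # step 1): training result per frame
            loss = loss_fn(prediction, ground_truth)  # steps 2)-3): compare, compute loss
            optimizer.zero_grad()
            loss.backward()                           # step 4): adjust network parameters
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss < 1e-3:                         # crude stand-in for the convergence condition
            break
    return network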
For example, when the preset scene is a traffic intersection, the pre-trained YOLO_v3 network is trained on the MARS dataset and a vehicle re-identification dataset, namely the Vehicle Re-ID Datasets Collection. As will be appreciated by those skilled in the art, both the MARS dataset and the Vehicle Re-ID Datasets Collection are open-source datasets. The MARS dataset (Motion Analysis and Re-identification Set) covers pedestrians, and the Vehicle Re-ID Datasets Collection covers vehicles.
For other scenes, a large number of sample scene videos are often required to be obtained in advance, manual or machine labeling is performed, category information and position information of a target corresponding to each frame of image in each sample scene video are obtained, and the YOLO _ v3 network has the target detection performance in the scene through a pre-training process.
And S3, matching the same target in each frame of image of the scene video by using a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image.
Early target detection and tracking mainly focused on pedestrian detection. The detection idea was mainly to detect with traditional feature-point methods and then track by filtering and matching the feature points, for example pedestrian detection based on histograms of oriented gradients (HOG). Early pedestrian detection suffered from various problems such as missed detections, false alarms and duplicate detections. With the development of deep convolutional neural networks in recent years, various methods that use high-precision detection results for target detection and tracking have appeared.
Since multiple targets exist in the preset scene, target tracking needs to be realised with a multi-target tracking algorithm. The multi-target tracking problem can be regarded as a data-association problem whose aim is to associate detection results across frames in a sequence of video frames. By tracking the detected targets in the scene video with a preset multi-target tracking algorithm, the bounding box of each target in the different frame images of the scene video and the ID (identity) of that target can be obtained.
In an optional implementation manner, the preset multi-target tracking algorithm may include: SORT (simple Online and Realtime tracking) algorithm.
The SORT algorithm follows the TBD (tracking-by-detection) paradigm: Kalman filtering is used to estimate the motion state of each target, and the Hungarian assignment algorithm is used for position matching. SORT does not use any appearance features of the objects during tracking; only the position and size of the bounding boxes are used for motion estimation and data association. The complexity of the SORT algorithm is therefore low, the tracker can run at 260 Hz, and the target tracking and detection speed is high, which can meet the real-time requirement for the scene video in the embodiment of the invention.
The SORT algorithm does not handle occlusion and does not re-identify targets by their appearance features, so it is better suited to preset scenes in which the targets are not occluded.
In an optional implementation manner, the preset multi-target tracking algorithm may include: deepsort (simple online and real time tracking with a deep association metric) algorithm.
DeepSort is an improvement on SORT target tracking. A Kalman filtering algorithm is used for track preprocessing and state estimation, combined with the Hungarian algorithm for association. On top of SORT, the algorithm introduces a deep learning model trained offline on a re-identification dataset: to cope with occlusion of targets in the video and the frequent switching of target IDs when tracking on real-time video, deep appearance features of the targets are extracted and matched by nearest-neighbour matching. The core idea of DeepSort is to track with recursive Kalman filtering and frame-by-frame data association. DeepSort adds a Deep Association Metric to SORT in order to distinguish different pedestrians, and adds Appearance Information to achieve target tracking through longer occlusions. In real-time multi-target tracking the algorithm achieves high speed and is more accurate than SORT.
For the specific tracking procedure of the SORT algorithm and the DeepSort algorithm, please refer to the related prior art for understanding, and the detailed description is omitted here.
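As an illustration of the data-association step shared by SORT and DeepSort, the sketch below matches the detections of the current frame to existing tracks with the Hungarian algorithm on an IoU-based cost matrix (it reuses the iou function from the earlier sketch); DeepSort would additionally mix in an appearance-feature distance, which is omitted here. The function names and threshold are assumptions for illustration only:

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_boxes, detection_boxes, iou_threshold=0.3):
    """Match tracks to detections; boxes are (x_center, y_center, w, h)."""
    if not track_boxes or not detection_boxes:
        return [], list(range(len(track_boxes))), list(range(len(detection_boxes)))
    cost = np.zeros((len(track_boxes), len(detection_boxes)))
    for t, tb in enumerate(track_boxes):
        for d, db in enumerate(detection_boxes):
            cost[t, d] = 1.0 - iou(tb, db)          # low cost = high overlap
    rows, cols = linear_sum_assignment(cost)        # Hungarian assignment
    matches, unmatched_tracks, unmatched_detections = [], [], []
    for t in range(len(track_boxes)):
        if t not in rows:
            unmatched_tracks.append(t)
    for d in range(len(detection_boxes)):
        if d not in cols:
            unmatched_detections.append(d)
    for t, d in zip(rows, cols):
        if cost[t, d] > 1.0 - iou_threshold:        # reject weak overlaps
            unmatched_tracks.append(t)
            unmatched_detections.append(d)
        else:
            matches.append((t, d))
    return matches, unmatched_tracks, unmatched_detections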
And S4, determining the actual spatial distance between different objects in each frame of image.
In many scenes, the position information of each target of each frame of image in the scene video can be obtained by performing target detection and tracking through the previous steps, but the position information of each target is not enough to represent the relation of each target in the scene. Therefore, this step requires determining the actual spatial distance between different objects in each frame of image, and defining the spatial composition relationship of the objects by using the actual spatial distance between the two objects. Therefore, accurate results can be obtained when the constructed space and/or graph model is subsequently used for target recognition, analysis and prediction.
In an alternative embodiment, the actual distance between two objects in the image may be determined using the principle of equal scaling. Specifically, the actual spatial distance between the two test targets may be measured in a preset scene, a frame of image including the two test targets is captured, and then the pixel distance between the two test targets in the image is calculated, so as to obtain the actual number of pixels corresponding to a unit length, for example, the actual number of pixels corresponding to 1 meter. Then, for two new targets needing to detect the actual spatial distance, the pixel number corresponding to the unit length in the actual scene is used as a factor, and the pixel distance of the two targets in one frame of image shot in the scene can be scaled by using a formula, so as to obtain the actual spatial distance of the two targets.
It will be appreciated that this approach is simple to implement, but it is only well suited to situations where the image is not distorted. When the image is distorted, pixel coordinates and physical coordinates no longer correspond one to one, and the distortion first needs to be corrected, for example by rectifying the picture with cvInitUndistortMap and cvRemap. The implementation of such proportional scaling and the specific process of correcting image distortion can be understood with reference to the related art and are not described again here.
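A minimal sketch of this equal-scaling approach, assuming an undistorted image and a measured reference pair of test targets, could look like this (the coordinates and distances are made-up example values):

import math

def pixels_per_meter(ref_px_a, ref_px_b, ref_actual_distance_m):
    """Calibrate the scale from two test targets whose real spacing was measured."""
    pixel_dist = math.dist(ref_px_a, ref_px_b)
    return pixel_dist / ref_actual_distance_m

def actual_distance(px_a, px_b, px_per_meter):
    """Scale the pixel distance of two new targets to an actual spatial distance."""
    return math.dist(px_a, px_b) / px_per_meter

# Example: two test targets 1 m apart appear 40 px apart in the image.
scale = pixels_per_meter((100, 200), (140, 200), 1.0)   # 40 px per meter
print(actual_distance((320, 240), (400, 300), scale))   # ~2.5 m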
Alternatively, a monocular distance measurement may be used to determine the actual spatial distance between two targets in the image.
The monocular camera model may be considered approximately as a pinhole model. Namely, the distance measurement is realized by using the pinhole imaging principle. Optionally, a similar triangle may be constructed through a spatial position relationship between the camera and the actual object and a position relationship of the target in the image, and then an actual spatial distance between the targets is calculated.
Optionally, a related algorithm of the monocular ranging approach in the prior art may be used: from the pixel coordinates of a pixel point of the target, the horizontal distance d_x and the vertical distance d_y between the actual position of that pixel point and the video capture device (video camera/camera) are calculated, thereby realising monocular ranging. Then, from the known actual coordinates of the video capture device together with d_x and d_y, the actual coordinates of the pixel point are derived. Finally, for two targets in the image, the actual spatial distance between them can be calculated from their actual coordinates.
In an optional implementation, the actual spatial distance between two targets in the image may be determined by calculating the actual coordinate point corresponding to a pixel point of each target, that is, by converting the target's pixel coordinates into actual coordinates.
Optionally, a monocular visual positioning and ranging technique may be employed to obtain the actual coordinates of the pixels.
The monocular vision positioning distance measuring technology has the advantages of low cost and fast calculation. Specifically, two modes can be included:
1) Obtaining the actual coordinates of each pixel by positioning-measurement interpolation.
Taking advantage of the equal-proportion enlargement of the pinhole imaging model, the measurement can be performed simply by printing paper covered with an equidistant array of dots. Equidistant array points (such as a calibration board) are measured at a certain distance and interpolated, and then enlarged in equal proportion to obtain the actual ground coordinates corresponding to each pixel point. This avoids having to manually measure graphical marks on the ground. After the dot pitch on the paper has been measured, it is enlarged by the height ratio H/h to obtain the actual ground coordinates corresponding to each pixel. To prevent the keystone distortion at the upper edge of the image from becoming so severe that the marker points on the printed paper are hard to identify, this method requires equidistant dot-array maps prepared for different distances.
2) Calculating the actual coordinates of the pixel points by similar-triangle proportions.
The main idea of this approach is still the pinhole imaging model. It places higher demands on the calibration of the video capture device (video camera/still camera/camera) and requires the lens distortion to be small, but it is more portable and practical. The camera may be calibrated, for example, with MATLAB or OPENCV, after which the conversion of pixel coordinates in the image is computed.
In the following, an optional variant of this mode is described; S4 may include S41-S43:
S41, determining the pixel coordinate of each target in each frame image;
for example, a boundary box containing the target and pixel coordinates of all pixel points in the boundary box can be determined as pixel coordinates of the target; or a pixel point on or in the bounding box may be selected as the pixel coordinate of the target, that is, the pixel coordinate of the target is used to represent the target, for example, the center position coordinate of the bounding box may be selected as the pixel coordinate of the target, and so on.
S42, aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
image of a personThe pixel coordinates of any one of the pixel points are known. The imaging process of the camera involves four coordinate systems: a world coordinate system, a camera coordinate system, an image physical coordinate system (also called an imaging plane coordinate system), a pixel coordinate system, and a transformation of these four coordinate systems. The transformation relationships between these four coordinate systems are known and derivable in the prior art. Then, the actual coordinates of the pixel points in the image in the world coordinate system can be calculated by using a coordinate system transformation formula, for example, the actual coordinates in the world coordinate system can be obtained from the pixel coordinates by using many public algorithm programs in OPENCV language. Specifically, for example, the corresponding world coordinates are obtained by inputting the camera parameters, rotation vectors, translation vectors, pixel coordinates, and the like in some OPENCV programs, using a correlation function. The actual coordinates of the center position of the bounding box representing the target A in the world coordinate system are assumed to be (X)A,YA) The actual coordinate corresponding to the coordinate of the center position of the bounding box representing the target B in the world coordinate system is (X)B,YB). Further, if the object A has an actual height, the actual coordinates of the object A are
Figure BDA0002862310380000131
Where H is the actual height of the object a and H is the height of the video capture device.
And S43, for each frame image, obtaining the actual spatial distance between each two objects in the frame image by using the actual coordinates of the two objects in the frame image.
The method of obtaining the distance between two points from their actual coordinates belongs to the prior art. For the above example, the actual spatial distance D between targets A and B, without considering the actual height of the targets, is:

D = √((X_A − X_B)² + (Y_A − Y_B)²)
Of course, the case in which the actual height of the target is considered is handled similarly.
Optionally, if multiple pixel coordinates of targets A and B were obtained in S41, it is also reasonable to calculate multiple actual distances between targets A and B from those pixel coordinates and then select one of them as the actual spatial distance between A and B according to some selection criterion, for example the smallest actual distance.
Details of the above solutions can be found in computer vision and related concepts such as camera calibration, the world coordinate system, the camera coordinate system, the image physical coordinate system (also called the imaging-plane coordinate system), the pixel coordinate system, LABVIEW vision development, OPENCV related algorithms, LABVIEW examples and calibration examples, which are not described again here.
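For illustration, the following hedged sketch projects the bounding-box center of each target from pixel coordinates to ground-plane world coordinates with a homography obtained from camera calibration (the calibration itself, for example with OPENCV, is assumed to have been done beforehand) and then takes the Euclidean distance D between two targets as above. Using a ground-plane homography is one possible realisation, not necessarily the exact coordinate transformation of the embodiment:

import math
import numpy as np
import cv2

def pixel_to_world(pixel_xy, homography):
    """Map a pixel coordinate to ground-plane world coordinates (assumes a planar ground)."""
    px = np.array([[pixel_xy]], dtype=np.float32)      # shape (1, 1, 2)
    world = cv2.perspectiveTransform(px, homography)   # apply the calibrated homography
    return float(world[0, 0, 0]), float(world[0, 0, 1])

def actual_spatial_distance(pixel_a, pixel_b, homography):
    """D = sqrt((X_A - X_B)^2 + (Y_A - Y_B)^2) from the two bounding-box centers."""
    xa, ya = pixel_to_world(pixel_a, homography)
    xb, yb = pixel_to_world(pixel_b, homography)
    return math.hypot(xa - xb, ya - yb)

# The homography would be obtained beforehand, e.g. from four or more ground points
# with known world coordinates and their pixel positions:
# H, _ = cv2.findHomography(pixel_points, world_points)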
In an optional implementation, determining the actual spatial distance between different targets in each frame of image may also be implemented by using a binocular camera optical image ranging method.
A binocular camera works like a pair of human eyes: because the two cameras observe the same object from different angles and positions, the images they capture differ, and this difference is called parallax. The magnitude of the parallax is related to the distance between the object and the cameras, and the target can be located on this principle. Optical-image ranging with a binocular camera is realised by computing the parallax between the images captured by the left and right cameras. The specific method is similar to monocular optical-image ranging but yields more accurate ranging and positioning information than a monocular camera. For the specific ranging process of the binocular optical-image ranging method, refer to the related prior art; it is not repeated here.
In an alternative embodiment, determining the actual spatial distance between different objects in each frame of image may also include:
and aiming at each frame of image, obtaining the actual spatial distance between the two targets in the frame of image by using a depth camera ranging method.
The depth camera ranging method can directly obtain the depth information of the target from the image, and can accurately and quickly obtain the actual spatial distance between the target and the video shooting equipment without coordinate calculation, so that the actual spatial distance between the two targets is determined, and the accuracy and the timeliness are higher. For a specific distance measurement process of the depth camera distance measurement method, please refer to the related prior art, which is not described herein.
And S5, generating a space and OR graph model of the preset scene by using the attribute information and the actual space distance of the target corresponding to each matched frame image.
For each frame image, the detected targets and their attribute information are taken as the leaf nodes of a spatial And-Or graph, and the actual spatial distances between different targets are taken as the spatial constraints of the spatial And-Or graph, thereby generating the spatial And-Or graph of that frame image. The spatial And-Or graphs of all the frame images together form the spatial And-Or graph model of the preset scene.
Taking a scene of a traffic intersection as an example, please refer to fig. 2, and fig. 2 is a schematic space and/or map diagram of the traffic intersection as an example according to the embodiment of the present invention.
The top diagram in fig. 2 represents a frame of image of a traffic intersection, which is a preset scene and is a root node of the space and or graph. Three targets are detected by the method, which are respectively the left, middle and right three diagrams at the lower part of the figure 2. The left image is a pedestrian, the image is marked with category information 'person' to represent human, and a boundary frame of the pedestrian is marked; the middle graph is a car, the image is marked with category information 'car' to represent the car, and a boundary frame of the car is also marked; the right image is a bus, and the image is marked with category information "bus" indicating the bus and a bounding box of the bus. The above category information and the position information of the bounding box are the attribute information of the object. Meanwhile, if the same object, such as the car, in different frame images is also labeled with the ID, the car is distinguished from other objects in different frame images, for example, the ID of the car can be represented by numbers or symbols.
The three targets, namely the pedestrian, the car and the bus, together with their corresponding attribute information are the leaf nodes of the spatial And-Or graph, and the actual spatial distance between each pair of targets serves as a spatial constraint of the graph (not shown in Fig. 2).
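Purely as an illustration of this step (not the patent's exact data structure), a per-frame spatial And-Or graph could be assembled from the detected targets, their attribute information and the pairwise actual spatial distances roughly as follows; the field names and the distance_fn callback are assumptions:

from itertools import combinations

def build_frame_aog(scene_name, targets, distance_fn):
    """targets: list of dicts with 'id', 'category' and 'bbox' (x, y, w, h).
    distance_fn(a, b) returns the actual spatial distance between two targets."""
    root = {"name": scene_name, "type": "AND", "children": [], "constraints": []}
    for t in targets:
        leaf = {"name": f"{t['category']}_{t['id']}", "type": "LEAF",
                "attributes": {"category": t["category"], "bbox": t["bbox"]}}
        root["children"].append(leaf)                    # detected targets as leaf nodes
    for a, b in combinations(targets, 2):                # pairwise spatial constraints
        root["constraints"].append((a["id"], b["id"], distance_fn(a, b)))
    return root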
For the generation process of a space and/or diagram, reference may be made to the description of related prior art, which is not described herein again.
Further, after the space and or map model of the preset scene is generated, a new scene and a new spatial position relationship between the objects can be generated by using the space and or map model of the preset scene. For example, the space and/or map models of the two preset scenes may be integrated to obtain a new space and/or map model including the two preset scenes, thereby implementing scene expansion.
Optionally, after obtaining the space and/or map model of the preset scene, the space and/or map model may be output, for example, the space and/or map model may be output to a display device for displaying, or sent to the other devices, and so on. The display device or other devices may be a monitoring management device of a preset scene, such as a traffic supervision device, so as to intuitively know the position information of each target and the spatial position relationship information of the target from the spatial and/or map model.
Optionally, the spatial and/or graph model may also be used to search for a target in the scene video.
Optionally, the space and/or graph model may be used to track a certain target in the scene video, obtain trajectory information of the target, and perform activity analysis and the like by using the trajectory information.
Optionally, the spatial and/or graph model may be used to identify a position relationship between objects in the scene video, and corresponding operations may be performed using the position relationship; specifically, the actual spatial distance between two targets in the scene video can be determined, whether the actual spatial distance between the two targets is smaller than a preset distance or not is judged, and a reminding message, such as an alarm message, is generated when the actual spatial distance between the two targets is smaller than the preset distance. This is all reasonable.
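A minimal sketch of this distance check, reusing the per-frame graph structure from the earlier sketch and an assumed threshold, might be:

def proximity_alerts(frame_aog, min_distance_m=2.0):
    """Return a warning message for every pair of targets closer than the preset distance."""
    alerts = []
    for id_a, id_b, distance in frame_aog["constraints"]:
        if distance < min_distance_m:
            alerts.append(f"Warning: targets {id_a} and {id_b} are only {distance:.2f} m apart")
    return alerts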
Compared with the prior art that the position information of the target is obtained by using a traditional target detection algorithm, the method provided by the embodiment of the invention adopts the pre-trained YOLO _ v3 network to detect the target, so that the precision and the efficiency of target detection can be improved, and the purpose of improving the accuracy and the real-time performance of the space and/or graph model of the scene is realized.
In a second aspect, corresponding to the foregoing method embodiment, an embodiment of the present invention further provides a device for generating a spatial and or map model of a scene object, as shown in fig. 3, where the device includes:
a video acquiring module 301, configured to acquire a scene video for a preset scene;
the target detection module 302 is configured to detect a target in a scene video by using a pre-trained YOLO _ v3 network, and obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the target;
the target tracking module 303 is configured to match the same target in each frame of image of the scene video by using a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
a distance calculation module 304, configured to determine an actual spatial distance between different targets in each frame of image;
the model generating module 305 is configured to generate a spatial and/or map model of the preset scene by using the attribute information and the actual spatial distance of the target corresponding to each matched frame image.
Optionally, the attribute information further includes:
category information of the object.
Optionally, the preset multi-target tracking algorithm includes:
the DeepSort algorithm.
Optionally, the distance calculating module 304 is specifically configured to:
in each frame image, determining the pixel coordinate of each target;
aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and aiming at each frame image, obtaining the actual space distance between every two targets in the frame image by using the actual coordinates of every two targets in the frame image.
Optionally, the distance calculating module 304 is specifically configured to:
and aiming at each frame of image, obtaining the actual spatial distance between the two targets in the frame of image by using a depth camera ranging method.
Optionally, the preset scenario includes: and (6) traffic intersection.
Optionally, the pre-trained YOLO _ v3 network is trained from the MARS dataset and the Car-Reid dataset.
For the specific execution process of each module, please refer to the method steps of the first aspect, which are not described herein again.
Compared with the prior art that the position information of the target is obtained by using a traditional target detection algorithm, the method provided by the embodiment of the invention adopts the pre-trained YOLO _ v3 network to detect the target, so that the precision and the efficiency of target detection can be improved, and the purpose of improving the accuracy and the real-time performance of the space and/or graph model of the scene is realized.
In a third aspect, an embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, is configured to implement the steps of the method for generating a space and/or graph model of a scene object according to the first aspect.
The electronic device may be: desktop computers, laptop computers, intelligent mobile terminals, servers, and the like. Without limitation, any electronic device that can implement the present invention is within the scope of the present invention.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
Through above-mentioned electronic equipment, can realize: compared with the prior art that the position information of the target is obtained by using a traditional target detection algorithm, the target detection method adopts the pre-trained YOLO _ v3 network to detect the target, so that the precision and the efficiency of the target detection can be improved, and the aim of improving the accuracy and the real-time performance of the space and OR graph model of the scene is fulfilled.
In a fourth aspect, corresponding to the method for generating the space and or graph model of the scene object provided in the first aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for generating the space and or graph model of the scene object provided in the embodiment of the present invention are implemented.
The computer-readable storage medium stores an application program that executes the method for generating a space and/or graph model of a scene object according to the embodiment of the present invention when the application program runs, so that the method can implement: compared with the prior art that the position information of the target is obtained by using a traditional target detection algorithm, the target detection method adopts the pre-trained YOLO _ v3 network to detect the target, so that the precision and the efficiency of the target detection can be improved, and the aim of improving the accuracy and the real-time performance of the space and OR graph model of the scene is fulfilled.
For the apparatus/electronic device/storage medium embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
It should be noted that the apparatus, the electronic device, and the storage medium according to the embodiments of the present invention are respectively an apparatus, an electronic device, and a storage medium to which the space and or graph model generation method for a scene object is applied, and all embodiments of the space and or graph model generation method for a scene object are applicable to the apparatus, the electronic device, and the storage medium, and can achieve the same or similar beneficial effects.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for generating a spatial and OR graph model of a scene object is characterized by comprising the following steps:
acquiring a scene video aiming at a preset scene;
detecting the targets in the scene video by using a pre-trained YOLO _ v3 network to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the target;
matching the same target in each frame of image of the scene video by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
determining the actual space distance between different targets in each frame of image;
and generating a space and OR graph model of the preset scene by using the attribute information of the target corresponding to each matched frame image and the actual space distance.
2. The method of claim 1, wherein the attribute information further comprises:
category information of the object.
3. The method of claim 1, wherein the predetermined multi-target tracking algorithm comprises:
the DeepSort algorithm.
4. The method of claim 1, wherein determining the actual spatial distance between different objects in each frame of image comprises:
in each frame image, determining the pixel coordinate of each target;
aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and aiming at each frame image, obtaining the actual space distance between every two targets in the frame image by using the actual coordinates of every two targets in the frame image.
5. The method of claim 1, wherein determining the actual spatial distance between different objects in each frame of image comprises:
and aiming at each frame of image, obtaining the actual spatial distance between the two targets in the frame of image by using a depth camera ranging method.
6. The method according to any one of claims 1 to 5, wherein the presetting of the scene comprises:
and (6) traffic intersection.
7. The method of claim 6, wherein the pre-trained YOLO _ v3 network is trained from a MARS dataset and a Car-Reid dataset.
8. An apparatus for generating a spatial and OR model of a scene object, comprising:
the video acquisition module is used for acquiring a scene video aiming at a preset scene;
the target detection module is used for detecting the targets in the scene video by utilizing a pre-trained YOLO _ v3 network to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the target;
the target tracking module is used for matching the same target in each frame of image of the scene video by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
the distance calculation module is used for determining the actual spatial distance between different targets in each frame of image;
and the model generation module is used for generating a space and OR model of the preset scene by using the attribute information of the target corresponding to each matched frame image and the actual space distance.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor, when executing the program stored in the memory, implementing the method steps of any of claims 1-7.
10. A computer-readable storage medium, characterized in that,
the computer-readable storage medium has stored therein a computer program which, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN202011570367.9A 2020-12-26 2020-12-26 Method and device for generating space and OR graph model of scene target and electronic equipment Withdrawn CN112634368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011570367.9A CN112634368A (en) 2020-12-26 2020-12-26 Method and device for generating space and OR graph model of scene target and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011570367.9A CN112634368A (en) 2020-12-26 2020-12-26 Method and device for generating space and OR graph model of scene target and electronic equipment

Publications (1)

Publication Number Publication Date
CN112634368A true CN112634368A (en) 2021-04-09

Family

ID=75325668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011570367.9A Withdrawn CN112634368A (en) 2020-12-26 2020-12-26 Method and device for generating space and OR graph model of scene target and electronic equipment

Country Status (1)

Country Link
CN (1) CN112634368A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689487A (en) * 2021-08-09 2021-11-23 广东中星电子有限公司 Method, device and terminal equipment for carrying out plane measurement on video picture
CN113658274A (en) * 2021-08-23 2021-11-16 海南大学 Individual spacing automatic calculation method for primate species behavior analysis
CN113658274B (en) * 2021-08-23 2023-11-28 海南大学 Automatic individual spacing calculation method for primate population behavior analysis
CN115695944A (en) * 2022-12-30 2023-02-03 北京远特科技股份有限公司 Vehicle-mounted image processing method and device, electronic equipment and medium
CN115695944B (en) * 2022-12-30 2023-03-28 北京远特科技股份有限公司 Vehicle-mounted image processing method and device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN111199564B (en) Indoor positioning method and device of intelligent mobile terminal and electronic equipment
CN111539370A (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN109960742B (en) Local information searching method and device
CN109165540B (en) Pedestrian searching method and device based on prior candidate box selection strategy
CN111476827B (en) Target tracking method, system, electronic device and storage medium
CN112634368A (en) Method and device for generating space and OR graph model of scene target and electronic equipment
CN108986152B (en) Foreign matter detection method and device based on difference image
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
CN113313763B (en) Monocular camera pose optimization method and device based on neural network
WO2022134120A1 (en) Target motion prediction-based parking lot management and control method, apparatus, and electronic device
CN111008576B (en) Pedestrian detection and model training method, device and readable storage medium
CN112634329B (en) Scene target activity prediction method and device based on space-time and or graph
CN113447923A (en) Target detection method, device, system, electronic equipment and storage medium
CN111239684A (en) Binocular fast distance measurement method based on YoloV3 deep learning
CN114089329A (en) Target detection method based on fusion of long and short focus cameras and millimeter wave radar
CN111652035A (en) Pedestrian re-identification method and system based on ST-SSCA-Net
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN110636248B (en) Target tracking method and device
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN113450459B (en) Method and device for constructing three-dimensional model of target object
CN112613668A (en) Scenic spot dangerous area management and control method based on artificial intelligence
CN117132649A (en) Ship video positioning method and device for artificial intelligent Beidou satellite navigation fusion
CN115527050A (en) Image feature matching method, computer device and readable storage medium
CN116523957A (en) Multi-target tracking method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication
Application publication date: 20210409