CN112101303A - Image data processing method and device and computer readable storage medium - Google Patents


Info

Publication number
CN112101303A
Authority
CN
China
Prior art keywords: detection, target object, frame, target, image
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011222364.6A
Other languages
Chinese (zh)
Other versions
CN112101303B (en)
Inventor
王昌安
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011222364.6A
Publication of CN112101303A
Application granted
Publication of CN112101303B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image data processing method and device and a computer readable storage medium. The method includes: acquiring, based on an image pickup device, a detection image containing at least two target objects, the at least two target objects including a detection target object and an auxiliary target object of the detection target object; determining the pixel distance between the detection target object and each auxiliary target object, and the image pickup distance of each target object with respect to the image pickup device; acquiring a history image captured by the image pickup device, and obtaining, from the history image, a history object detection frame of the history object indicated by the detection target object; determining a target detection frame for the detection target object according to the pixel distances between the detection target object and the auxiliary target objects, the image pickup distances of the target objects, and the history object detection frame; and generating a target object density map for the detection target object according to the target detection frame. With this method and device, the accuracy of the acquired target object density map can be improved.

Description

Image data processing method and device and computer readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to an image data processing method and apparatus, and a computer-readable storage medium.
Background
With the continuous development of computer networks, machine learning has penetrated many aspects of daily life. An image recognition model for recognizing images can be trained through machine learning; such a model can be used to detect the crowd density in an image, where the crowd density can be understood as the number of people in the image.
The image recognition model needs to be trained with a density map, which may also be referred to as a heat map and is obtained by convolving an original sample image with a Gaussian kernel. The Gaussian kernel used to convolve the original sample image is related to the size of each person's head in the sample image: one human head corresponds to one Gaussian kernel, and the larger the head, the larger the standard deviation of the corresponding Gaussian kernel.
In the prior art, every human head in the sample image is usually assumed to have the same fixed head size, and the fixed Gaussian kernel corresponding to each head is then derived from this fixed size. However, since the actual head sizes usually differ, setting the same Gaussian kernel for every head makes the kernels inaccurate, so the density map obtained with these kernels is also inaccurate, which ultimately results in poor recognition accuracy of the trained image recognition model.
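To make the above concrete, the following minimal Python sketch (all numeric values are assumed for illustration and are not taken from the disclosure) contrasts the prior-art choice of one fixed Gaussian kernel standard deviation for every head with a standard deviation that scales with each head's size:

# Illustrative only: head sizes in pixels and two ways of choosing the
# standard deviation of the Gaussian kernel used to build the density map.
head_sizes = [60.0, 30.0, 12.0]                   # a near, a mid-range and a far head

fixed_sigma = 15.0                                # prior-art style: one sigma for every head
adaptive_sigmas = [0.3 * s for s in head_sizes]   # sigma grows with the head size

for size, sigma in zip(head_sizes, adaptive_sigmas):
    print(f"head {size:5.1f} px  fixed sigma {fixed_sigma}  size-dependent sigma {sigma:4.1f}")

With the fixed sigma, the same blur is applied to the 60-pixel head and to the 12-pixel head, which is exactly the mismatch the application sets out to avoid.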
Disclosure of Invention
The application provides an image data processing method, an image data processing device and a computer readable storage medium, which can improve the accuracy of an acquired target object density map.
One aspect of the present application provides an image data processing method, including:
acquiring, based on an image pickup apparatus, a detection image including at least two target objects; the at least two target objects are composed of a detection target object and an auxiliary target object; the object center position of each target object is labeled in the detection image;
determining a pixel distance between the detection target object and each auxiliary target object respectively and an image pickup distance of each target object for the image pickup device according to the object center position of each target object;
acquiring a history image obtained by the camera equipment, and acquiring a history object detection frame of a history object indicated by the object center position of the detection target object from the history image;
determining a target detection frame for the detection target object according to the pixel distance between the detection target object and each auxiliary target object, the image pickup distance of each target object for the image pickup apparatus, and the history object detection frame;
and generating a target object density map aiming at the detection target object according to the target detection frame.
An aspect of the present application provides an image data processing apparatus, including:
an image acquisition module configured to acquire, based on an image pickup apparatus, a detection image including at least two target objects; the at least two target objects are composed of a detection target object and an auxiliary target object; the object center position of each target object is labeled in the detection image;
a distance determination module configured to determine, according to the object center position of each target object, a pixel distance between the detection target object and each auxiliary target object, and an image pickup distance of each target object with respect to the image pickup apparatus;
a history frame acquisition module configured to acquire a history image obtained by the image pickup apparatus, and to acquire, from the history image, a history object detection frame of the history object indicated by the object center position of the detection target object;
a target frame generation module configured to generate a target detection frame for the detection target object according to the pixel distance between the detection target object and each auxiliary target object, the image pickup distance of each target object with respect to the image pickup apparatus, and the history object detection frame;
and the density map generation module is used for generating a target object density map aiming at the detected target object according to the target detection frame.
Wherein, the distance determination module includes:
the pixel distance determining unit is used for acquiring the number of interval pixels between the object center position of the detection target object and the object center position of each auxiliary target object, and determining the pixel distance between the detection target object and each auxiliary target object according to the number of interval pixels to which each auxiliary target object belongs;
and the image pickup distance determining unit is used for determining the vertical direction distance of each target object in the detection image according to the object center position of each target object, and determining the image pickup distance of each target object relative to the image pickup equipment according to the vertical direction distance to which each target object belongs.
Wherein, the history frame acquisition module includes:
the historical object sorting unit is used for sorting the at least one historical object according to the pixel distance between the object center position of the at least one historical object in the historical image and the object center position of the detection target object respectively to obtain at least one sorted historical object;
a history object obtaining unit, configured to select a reference history object from at least one sorted history object according to the first object obtaining number; the number of reference history objects is less than or equal to the first object acquisition number;
and a history frame determination unit configured to determine an object detection frame for labeling the reference history object in the history image as a history object detection frame.
Wherein, the target frame generation module comprises:
an initial frame generating unit configured to generate an initial detection frame for the detection target object according to pixel distances between the detection target object and each of the auxiliary target objects, respectively;
a transition frame generation unit configured to generate a transition detection frame for the detection target object according to the image pickup distance of each target object with respect to the image pickup apparatus and the initial detection frame;
and the target frame generating unit is used for generating a target detection frame aiming at the detection target object according to the historical object detection frame and the transition detection frame.
Wherein, the initial frame generating unit includes:
the first sequencing subunit is used for sequencing at least one auxiliary target object according to the pixel distance between the detection target object and each auxiliary target object respectively to obtain at least one sequenced auxiliary target object;
a first object acquisition subunit, configured to select, according to the second object acquisition number, a first reference target object from the sorted at least one auxiliary target object; the number of the first reference target objects is less than or equal to the second object acquisition number;
a distance average acquisition subunit configured to acquire a distance average of pixel distances between the first reference target object and the detection target object;
the initial size obtaining subunit is used for weighting the distance average value according to the frame size weighting coefficient to obtain an initial frame size;
and the initial frame generation subunit is used for generating an initial detection frame according to the initial frame size.
Wherein, the transition frame generating unit includes:
the second sequencing subunit is used for sequencing at least one auxiliary target object according to the shooting distance difference between the shooting distance to which the detection target object belongs and the shooting distance to which each auxiliary target object belongs to obtain at least one sequenced auxiliary target object;
a second object obtaining subunit, configured to select, according to the third object obtaining number, a second reference target object from the sorted at least one auxiliary target object; the number of the second reference target objects is less than or equal to the third object acquisition number;
the size mean value acquiring subunit is used for acquiring a detection frame size mean value of an object initial detection frame to which the second reference target object belongs;
and the transition frame generation subunit is used for generating a transition detection frame according to the detection frame size mean value corresponding to the second reference target object and the initial detection frame.
Wherein the transition frame generation subunit includes:
the first coefficient acquisition subunit is configured to acquire a first size weighting coefficient of a detection frame size mean value corresponding to the second reference target object, and acquire a second size weighting coefficient of the initial detection frame;
the first weighting subunit is configured to weight the detection frame size average value corresponding to the second reference target object according to the first size weighting coefficient to obtain a first frame size;
the second weighting subunit is configured to weight the frame size of the initial detection frame according to a second size weighting coefficient to obtain a second frame size;
and the first frame generation subunit is used for generating the transition detection frame according to the first frame size and the second frame size.
Wherein, the target frame generating unit includes:
the frame mean value acquiring subunit is used for acquiring a detection frame size mean value corresponding to the historical object detection frame;
the second coefficient acquisition subunit is used for acquiring a third size weighting coefficient of the detection frame size mean value corresponding to the historical object detection frame and acquiring a fourth size weighting coefficient of the transition detection frame;
the third weighting subunit is configured to weight, according to a third size weighting coefficient, the detection frame size average value corresponding to the historical object detection frame to obtain a third frame size;
the fourth weighting subunit is configured to weight the frame size of the transition detection frame according to a fourth size weighting coefficient to obtain a fourth frame size;
and the second frame generation subunit is used for generating the target detection frame according to the third frame size and the fourth frame size.
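A minimal sketch of this combination is given below, assuming equal third and fourth size weighting coefficients and assuming the target size is the sum of the two weighted sizes (i.e. a weighted average when the coefficients sum to one); detection frames are represented by the side length of a square, and all names and values are illustrative only.

def target_frame_size(history_box_sides, transition_size, w_history=0.5, w_transition=0.5):
    # Third frame size: weighted size mean of the reference history object detection frames.
    third_frame_size = w_history * (sum(history_box_sides) / len(history_box_sides))
    # Fourth frame size: weighted size of the transition detection frame.
    fourth_frame_size = w_transition * transition_size
    # Target detection frame size generated from the two (summed here; an assumption).
    return third_frame_size + fourth_frame_size

# Example: two reference history head frames with sides 42 and 40 pixels,
# blended with a transition frame of side 10.75 pixels.
print(target_frame_size([42.0, 40.0], transition_size=10.75))   # 25.875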
Wherein, the density map generation module includes:
an initial density map generating unit configured to generate an initial object density map for the detection image based on an object center position of each target object;
the Gaussian kernel determining unit is used for determining a Gaussian kernel standard deviation based on the target detection frame and determining a target Gaussian kernel for the detection target object according to the Gaussian kernel standard deviation;
and the convolution unit is used for performing a convolution operation on the initial object density map based on the target Gaussian kernel to generate a target object density map for the detection target object.
Wherein, the initial density map generation unit includes:
the traversal subunit is configured to traverse at least two pixel points in the detection image, set the pixel value of each traversed pixel point located at the object center position of a target object to a first pixel value, and set the pixel value of each traversed pixel point not located at the object center position of any target object to a second pixel value;
and the density map generating subunit is used for generating an initial object density map according to the first pixel value and the second pixel value set for the at least two pixel points.
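A minimal sketch of this density map construction for a single detection target object, assuming the first pixel value is 1, the second pixel value is 0, (x, y) centers with y measured from the top of the image, and a Gaussian kernel standard deviation proportional to the side length of the target detection frame (the proportionality factor 0.25 is an assumption):

import numpy as np
from scipy.ndimage import gaussian_filter

def target_object_density_map(image_shape, center, box_side, sigma_ratio=0.25):
    # Initial object density map: first pixel value (1) at the object center
    # position, second pixel value (0) everywhere else.
    initial = np.zeros(image_shape, dtype=np.float64)
    initial[center[1], center[0]] = 1.0
    # Gaussian kernel standard deviation derived from the target detection frame.
    sigma = sigma_ratio * box_side
    # Convolution of the initial map with the target Gaussian kernel.
    return gaussian_filter(initial, sigma=sigma)

# Example: one head centered at (120, 400) in a 1080 x 1920 detection image.
d = target_object_density_map((1080, 1920), center=(120, 400), box_side=26.0)
print(round(float(d.sum()), 3))   # about 1.0, i.e. one counted object

The per-object maps produced this way can then be superimposed to obtain the density map of the whole detection image, as described later for fig. 2.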
Wherein, the target frame generation module comprises:
a first frame generation unit, configured to generate, in an x-th iteration, a transition detection frame g_x for the detection target object according to the image pickup distance of each target object with respect to the image pickup apparatus, the history object detection frame, and an undetermined detection frame d_(x-1) for the detection target object generated in the (x-1)-th iteration; if the (x-1)-th iteration is the first iteration, the undetermined detection frame d_(x-1) is obtained based on the pixel distance between the detection target object and each auxiliary target object;
a second frame generation unit, configured to generate an undetermined detection frame d_x for the detection target object according to the history object detection frame and the transition detection frame g_x;
a first frame determination unit, configured to determine the undetermined detection frame d_x as the target detection frame when the size difference between the frame size of d_x and the frame size of d_(x-1) is less than or equal to a size difference threshold;
a third frame generation unit, configured to generate, in an (x+1)-th iteration, a transition detection frame g_(x+1) for the detection target object according to the image pickup distance of each target object with respect to the image pickup apparatus and the undetermined detection frame d_x, when the size difference between the frame size of d_x and the frame size of d_(x-1) is greater than the size difference threshold;
a fourth frame generation unit, configured to generate an undetermined detection frame d_(x+1) for the detection target object according to the history object detection frame and the transition detection frame g_(x+1);
and a second frame determination unit, configured to determine the undetermined detection frame d_(x+1) as the target detection frame when the size difference between the frame size of d_(x+1) and the frame size of d_x is less than or equal to the size difference threshold.
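The iteration above can be pictured with the following sketch, in which the transition-frame and pending-frame updates are stood in for by simple weighted blends and the frames are represented by their side lengths; the blend weights, the threshold, and the helper structure are all assumptions, not the claimed units themselves.

def iterate_target_frame(d0, horizontal_size, history_size,
                         size_threshold=0.5, max_iters=50):
    # d0: side length of the undetermined detection frame obtained from the
    # pixel distances to the auxiliary target objects.
    # horizontal_size: size suggested by objects at a similar image pickup distance.
    # history_size: size suggested by the history object detection frames.
    d_prev = d0
    for _ in range(max_iters):
        g = 0.5 * horizontal_size + 0.5 * d_prev   # transition detection frame g_x
        d = 0.5 * history_size + 0.5 * g           # undetermined detection frame d_x
        if abs(d - d_prev) <= size_threshold:      # size difference small enough:
            return d                               # d_x becomes the target detection frame
        d_prev = d                                 # otherwise start the (x+1)-th iteration
    return d_prev

# Example: the frame size settles after a few iterations.
print(round(iterate_target_frame(10.0, horizontal_size=12.0, history_size=41.0), 2))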
Wherein, the above-mentioned device further includes:
the model training module is used for training an object density detection model based on the target object density map and determining the trained object density detection model as a target detection model;
the input module is used for inputting an image to be detected comprising an object to be detected into the target detection model and generating an object density map aiming at the object to be detected in the target detection model;
and the integration module is used for performing integration operation on the object density map to obtain the number of the objects to be detected in the image to be detected.
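Since each object contributes unit mass to its density map when a normalized Gaussian kernel is used (an assumption consistent with the superposition described for fig. 2), the integration operation reduces to a sum over the map; a short sketch:

import numpy as np

def count_objects(density_map: np.ndarray) -> float:
    # Discrete integration of the object density map over all pixels.
    return float(density_map.sum())

# Example: three unit-mass objects integrate to a count of about 3.
demo = np.zeros((64, 64))
demo[10, 10] = demo[30, 40] = demo[50, 20] = 1.0
print(count_objects(demo))   # 3.0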
An aspect of the application provides a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of an aspect of the application.
An aspect of the application provides a computer-readable storage medium having stored thereon a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the above-mentioned aspect.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternatives of the above aspect and the like.
In the present application, a detection image including at least two target objects can be acquired based on an image pickup device; the at least two target objects are composed of a detection target object and an auxiliary target object, and the object center position of each target object is labeled in the detection image. According to the object center position of each target object, the pixel distance between the detection target object and each auxiliary target object, and the image pickup distance of each target object with respect to the image pickup device, are determined. A history image obtained by the image pickup device is acquired, and a history object detection frame of the history object indicated by the object center position of the detection target object is acquired from the history image. A target detection frame for the detection target object is then generated according to the pixel distances between the detection target object and the auxiliary target objects, the image pickup distances of the target objects, and the history object detection frame, and a target object density map for the detection target object is generated according to the target detection frame. Thus, the method provided by the application can obtain a more accurate target detection frame for the detection target object by combining the pixel distances between the detection target object and the auxiliary target objects, the image pickup distances of the target objects with respect to the image pickup device, and the history object detection frame, and can in turn generate a more accurate target object density map for the detection target object from this more accurate target detection frame.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic view of a scene of image data processing provided herein;
FIG. 3 is a flow chart illustrating an image data processing method provided herein;
FIG. 4 is a schematic diagram illustrating a scenario for acquiring an initial detection frame according to the present application;
FIG. 5 is a schematic diagram illustrating a scenario for acquiring a transition detection box according to the present application;
FIG. 6 is a schematic diagram illustrating a scenario for acquiring a target detection frame according to the present application;
FIG. 7 is a schematic diagram of a scene for obtaining a density map according to the present application;
FIG. 8 is a schematic flow chart diagram illustrating an estimation block acquisition method provided herein;
FIG. 9 is a schematic diagram of an image data processing apparatus according to the present application;
fig. 10 is a schematic structural diagram of a computer device provided in the present application.
Detailed Description
The technical solutions in the present application will be described clearly and completely with reference to the accompanying drawings in the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application relates to artificial-intelligence-related technology. Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The present application mainly relates to machine learning in artificial intelligence. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
The machine learning involved in the present application mainly concerns how to generate a more accurate target detection frame for a detection target object in a detection image, so that an object density detection model can be better trained using the density map obtained from this more accurate detection frame; for details, refer to the description of the embodiment corresponding to fig. 3 below.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present disclosure. As shown in fig. 1, the network architecture may include a server 200 and an image capturing apparatus cluster, and the cluster may include one or more image capturing apparatuses; the number of image capturing apparatuses is not limited here. As shown in fig. 1, the plurality of image capturing apparatuses may specifically include an image capturing apparatus 100a, an image capturing apparatus 101a, an image capturing apparatus 102a, …, and an image capturing apparatus 103a. Each of the image capturing apparatus 100a, the image capturing apparatus 101a, the image capturing apparatus 102a, …, and the image capturing apparatus 103a may be connected to the server 200 through a network, so that each image capturing apparatus can exchange data with the server 200 through the network connection. Alternatively, each image capturing apparatus may also communicate with the server 200 through a wired connection.
The server 200 shown in fig. 1 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. A terminal device, where involved, may be an intelligent terminal such as a smart phone, a tablet computer, a notebook computer, a desktop computer, or a smart television. An embodiment of the present application is described below by taking the communication between the image capturing apparatus 100a and the server 200 as an example.
Referring to fig. 2, fig. 2 is a schematic view of a scene of image data processing according to the present application. As shown in fig. 2, a captured image 100b may first be captured by the image capturing apparatus 100a; after capturing the image 100b, the image capturing apparatus 100a may synchronize the captured image 100b to the server 200, so that the server 200 can detect the heads of the users in the captured image 100b. A more accurate density map of the captured image 100b can then be obtained from the detected human head frames of the users, and this density map can be used for training the object density detection model. Here, the server 200 may detect the human head frame of each user in the captured image 100b through the following steps s1, s2, and s3, where a human head frame may also be referred to as a detection frame.
It should be noted that, when the captured image 100b includes a plurality of captured users, the principle of detecting the head frame of each user in the captured image 100b is the same, and therefore, the following description is given by taking the detection of the head frame of the user 103b in the captured image 100b as an example.
First, as shown in fig. 2, the captured image 100b includes a user 101b, a user 103b, a user 104b, and a user 105b. When the captured image 100b is captured by the image capturing apparatus 100a, generally, the closer a user is to the lens of the image capturing apparatus 100a, the larger that user's head appears in the captured image 100b; conversely, the farther a user is from the lens, the smaller that user's head appears in the captured image 100b.
Therefore, in the captured image 100b, the distance of the user 101b from the lens of the image capturing apparatus 100a (which may be referred to as a shooting distance) is shorter than that of the user 103b, so the head of the user 101b in the captured image 100b is generally larger than the head of the user 103b. The distance of the user 103b from the lens of the image capturing apparatus 100a is substantially equal to that of the user 104b, so the head of the user 103b in the captured image 100b is substantially equal in size to the head of the user 104b. The distances of the user 103b and the user 104b from the lens of the image capturing apparatus 100a are shorter than that of the user 105b, so the heads of the user 103b and the user 104b in the captured image 100b are generally larger than the head of the user 105b.
A user's human head frame can be detected in a sample image in which the center point of each head is labeled (such a sample image may be the detection image described below). The captured image 100b may therefore be used as a sample image for training the object density detection model; the head center points in the captured image 100b may be labeled manually, or may be labeled in advance through machine learning (for example, by a head center detection model).
S1: pre-estimating a human head frame: since the camera image 100b may include the heads of a plurality of users, the server 200 may acquire, from the camera image 100b, a plurality of users whose head center positions are closest to the head center position of the user 103b (the specific number may be set by themselves), and may pre-estimate a head frame 102b of the user 103b according to distances between the head center positions of the plurality of users and the head center position of the user 103b, where the head frame 102b is a head frame of the user 103b obtained by rough estimation in advance. The specific process of how the server 200 obtains the head frame 102b of the user 103b through several users whose head center positions are closest to the head center position of the user 103b may be referred to the following description in the embodiment corresponding to fig. 3.
S2: horizontal prior constraint: since, users who are on the same horizontal line in the captured image 100b can be considered to be equidistant from the lens of the imaging apparatus 100a, while users who are generally equidistant from the lens of the imaging apparatus 100a can be considered to be substantially equal in the size of their heads in the captured image 100 b. Therefore, the initial human head frame 102b of the user 103b obtained as described above can be corrected by the initial human head frame 108b of the user 104b on the same horizontal line as the user 103b in the captured image 100 b. It will be appreciated that the principle of obtaining the initial human head box 108b of the user 104b is the same as the principle of obtaining the initial human head box 102b of the user 103 b. The specific process of how to modify the obtained initial human head box 102b of the user 103b through the initial human head box 108b of the user 104b on the same horizontal line as the user 103b may also be referred to the following description in the embodiment corresponding to fig. 3. The initial head frame 102b is corrected by the initial head frame 108b, and the head frame 107b of the user 103b can be obtained. The human head box 107b is the result of the correction of the initial human head box 102 b.
S3: since the human heads of the users at the same spatial position in the plurality of captured images captured by the image capturing apparatus 100a can be considered to be substantially equal in size when the position of the image capturing apparatus 100a in the real world is unchanged. Therefore, the server 200 can also acquire the history image 106b captured by the imaging apparatus 100a before capturing the above-described captured image 100 b. Therefore, the server 200 may detect several users (the specific number may be set by itself) closest to the spatial position of the user 103b in the captured image 100b from the history image 106b, and may correct the human head frame 107b of the user 103b by using the human head frames marked by the several users in the history image 106 b. For example, assuming that the spatial position of the user 103b in the captured image 100b is the same as the spatial position indicated by the area 111b in the history image 106b, the server 200 may correct the person's head frame 107b of the user 103b by the person's head frame 110b annotated by the user 109b closest to the position between the areas 111b in the history image 106b and the person's head frame 112b annotated by the user 118 b. By correcting the human head frame 107b of the user 103b, the final human head frame 113b of the user 103b can be obtained (the human head frame 113b may be a target detection frame for detecting a target object described below). A specific process of how to modify the human head box 107b of the user 103b through the human head box 110b labeled by the user 109b closest to the position between the areas 111b and the human head box 112b labeled by the user 118b may also be referred to the following description in the embodiment corresponding to fig. 3.
After obtaining the head frame 113b of the user 103b, the server 200 may generate a density map (which may also be referred to as a heat map) 114b for the user 103b from the head frame 113b. In practice, the server 200 may acquire the density map corresponding to each of the other users in the captured image 100b in the same manner as the density map 114b for the user 103b (the density map 114b may be the target object density map of the detection target object described below), and may obtain the final density map of the captured image 100b by superimposing the density maps corresponding to all users in the captured image 100b. The trained object density detection model 115b can be obtained by training the object density detection model with this final density map. The object density detection model 115b can be used to detect the crowd density in an image; in other words, it can be used to detect the total number of people in an image, by which the crowd density can be represented.
Therefore, when there is an image 116b whose crowd density needs to be detected, the server 200 may input the image 116b into the object density detection model 115b, detect the crowd density of the crowd in the image 116b through the object density detection model 115b, and output the crowd density 117b of the crowd in the image 116b, where the crowd density 117b may be obtained as the ratio of the total number of people in the image 116b detected by the object density detection model 115b to the floor area covered by the image 116b.
With the method provided by the application, the human head frame of the user 103b can be corrected both by users on the same or a nearly identical horizontal line as the user 103b in the captured image and by users at the same or a nearly identical spatial position as the user 103b in the history image, so as to obtain the final human head frame 113b of the user 103b. This improves the accuracy of the acquired final human head frame 113b of the user 103b, which in turn improves the accuracy of the acquired density map of the captured image 100b. Therefore, the crowd density of a crowd in an image can be detected more accurately by an object density detection model trained with the high-precision density map of the captured image 100b.
Referring to fig. 3, fig. 3 is a schematic flowchart of an image data processing method provided in the present application, and as shown in fig. 3, the method may include:
Step S101, acquiring, based on an image pickup device, a detection image including at least two target objects; the at least two target objects are composed of a detection target object and an auxiliary target object; the object center position of each target object is labeled in the detection image;
specifically, the execution main body in this embodiment may be any one computer device or a computer device cluster formed by a plurality of computer devices, and the computer device may be a server or a terminal device. Here, the description will be given taking an execution subject in the embodiment of the present application as an example of a server.
The detection image may be captured by an image capturing device, which may be any device that can be used for capturing images, for example, any camera or any terminal that can be used for capturing images.
The detection image may be a sample image used to train an object density detection model. The detection image may include a plurality of (at least two) target objects, which may refer to the head of the user who is photographed. In other words, a plurality of persons' heads captured may be included in the detection image. The center position of the head of each person included in the detection image is also labeled in advance, and the center position of the head of each person may be referred to as an object center position.
It can thus be seen that the method provided in the embodiment of the present application mainly describes how to generate an accurate detection frame for each target object in a detection image from a detection image in which the center position of each target object is labeled, as described below.
The final detection frame of the generated target object may be referred to as a target detection frame. Since the principle of generating a target detection frame for detecting each target object in an image is the same, a process of generating a target detection frame for detecting one target object in an image will be described as an example. The target object that the server currently needs to generate the target detection frame may be a detection target object, and the detection target object may be any one target object in the detection image, in other words, any one target object in the detection image may be a detection target object, so it can be understood that there may be a plurality of detection target objects, which are a plurality of target objects in the detection image. One detection target object has an auxiliary target object corresponding to the detection target object, and target objects other than a certain detection target object in the detection image can be called auxiliary target objects of the detection target object, and the number of the auxiliary target objects can be one or more. Therefore, it can be said that the plurality of target objects in the detection image are composed of the detection target object and the auxiliary target object.
The following specifically describes a process of generating an object detection frame for detecting an object.
Step S102, determining, according to the object center position of each target object, the pixel distance between the detection target object and each auxiliary target object, and the image pickup distance of each target object with respect to the image pickup device;
specifically, the server may obtain, according to the object center position of each target object marked in the detection image, the pixel distance between each detection target object and each auxiliary target object by detection: the server may detect the number of interval pixels between the object center position of the detection target object and the object center position of each auxiliary target object, where the number of interval pixels is the number of pixels existing between the object center position of the detection target object and the object center position of the auxiliary target object.
The server may directly use the number of interval pixels between the object center position of the detection target object and the object center position of each auxiliary target object, respectively, as the pixel distance between the detection target object and each auxiliary target object, respectively.
Alternatively, the server may multiply the number of interval pixels to which each auxiliary target object belongs by a predetermined coefficient (the value of the coefficient may be set according to the actual application scenario, and is equal to 0.1, for example), and then use the number of interval pixels multiplied by the predetermined coefficient as the pixel distance between the detection target object and the corresponding auxiliary target object. The pixel distance represents the distance between the detection target object and the auxiliary target object in the detection image, and the larger the pixel distance between the auxiliary target object and the detection target object is, the farther the auxiliary target object is from the detection target object in the detection image is. Conversely, a smaller pixel distance between the auxiliary target object and the detection target object indicates a closer distance between the auxiliary target object and the detection target object in the detection image.
The server can also detect and obtain the image pickup distance of each target object to the image pickup device according to the object center position of each target object marked in the detection image: the server may acquire the vertical direction distance of each target object in the detection image as the distance of each target object to the lens of the image capturing apparatus, that is, as the image capturing distance to the image capturing apparatus.
The vertical distance of each target object in the detection image may refer to a distance between an object center position of each target object and a lower edge of the detection image. It is to be noted that, when a detection image is captured by an image capturing apparatus, target objects having the same or nearly similar vertical direction distances may be regarded as target objects on the same horizontal line, and target objects on the same horizontal line in the detection image may be regarded as image capturing distances for the image capturing apparatus to be equal. And the object size of the target object (i.e., the size of the head of the person) can be considered to be nearly the same between target objects whose imaging distances for the imaging apparatus are equal.
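As a concrete illustration of the two distances, the sketch below computes a pixel distance from the number of interval pixels (taken here as the Euclidean pixel separation, optionally scaled by the preset coefficient mentioned above) and an image pickup distance as the vertical distance to the lower edge of the detection image; coordinates are assumed to be (x, y) with y measured from the top row, and all names and values are illustrative.

import math

def pixel_distance(center_a, center_b, coeff=1.0):
    # Number of interval pixels between two object center positions, optionally
    # multiplied by a preset coefficient (e.g. 0.1). Euclidean separation is an
    # illustrative reading of "interval pixels".
    return coeff * math.hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])

def imaging_distance(center, image_height):
    # Vertical-direction distance from the object center to the lower edge of the
    # detection image, used as the image pickup distance for the image pickup device.
    return image_height - center[1]

# Example: two head centers in a detection image with 1080 rows.
print(round(pixel_distance((120, 400), (150, 420)), 2))   # 36.06 interval pixels
print(imaging_distance((120, 400), 1080))                 # 680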
A step S103 of acquiring a history image obtained by the camera equipment, and acquiring a history object detection frame of a history object indicated by the object center position of the detection target object from the history image;
specifically, the server may further acquire a history image captured by the image capturing apparatus, where the history image may also include a plurality of target objects, and the target objects in the history image may be referred to as history objects, and the history image may include a plurality of history objects, and the history objects may also be the heads of the captured users. The history image may be an image captured by the image capturing apparatus before starting generation of the target detection frame of the detection target object in the detection image. It should be noted that, when the image capturing apparatus captures the detected image and the history image, the position of the image capturing apparatus in the real world may be the same, and the camera parameters of the image capturing apparatus may also be the same, so that it may be ensured that when the detected image and the history image are captured, the real scenes in the detected image and the history image may be the same, and the capturing distance of the image capturing apparatus for the real scenes may also be the same.
It is to be noted that, since the above-described detection image and history image are captured at different times by the same image capturing apparatus, the object size of a history object whose spatial position in the history image is close to that of the detection target object in the detection image can be considered approximately equal to the object size of the detection target object. In other words, every position (which may be referred to as a spatial position) in the real scene captured in the detection image corresponds to the same position in the history image.
Therefore, the object center position of the detection target object in the detection image can be found in the history image, and the position in the history image corresponding to the object center position of the detection target object in the detection image can be referred to as a corresponding position, and the corresponding position represents the object center position of the detection target object in the history image. It is understood that the detection target object does not exist in the history image. The object center position of each of the history objects may be marked in the history image, and the sizes of the history objects having the same or similar object center positions in the history image may be considered to be approximately equal.
Therefore, it can be considered that the object size of the history object to which the object center position belonging to the closer the pixel distance between the corresponding position of the detection target object in the history image belongs is approximately equal to the object size of the detection target object. The pixel distance between the history object and the detection target object may be the number of interval pixels between the object center position of the history object and the corresponding position of the detection target object in the history image, or a value obtained by multiplying the number of interval pixels by a certain coefficient (which may be set by itself, and is equal to 0.3, for example). The pixel distance between the history object and the detection target object may be referred to as a pixel distance between the object center position of the history object and the object center position of the detection target object.
One or more history objects can be provided. Each history object may be sorted according to a pixel distance between the detection target object and each history object, so as to obtain each sorted history object. Each of the sorted history objects may be arranged according to the corresponding pixel distance from small to large.
The server may select a reference history object from each of the sorted history objects according to the first object acquisition number. The first object acquisition number is the maximum number of reference history objects to be acquired, that is, the number of acquired reference history objects is less than or equal to the first object acquisition number; the value of the first object acquisition number can be set according to the actual application scenario. The reference history objects are the history objects ranked foremost among the sorted history objects. For example, if the first object acquisition number is 3, the top 3 history objects among the sorted history objects may be acquired as reference history objects. If the number of history objects is less than 3, all the history objects may be used as reference history objects, in which case the number of reference history objects is less than the first object acquisition number.
In other words, the several history objects whose object center positions are closest in pixel distance to the object center position of the detection target object may be acquired from the history image as reference history objects, and their number is less than or equal to the above-described first object acquisition number. That is, the several history objects whose spatial positions are closest to that of the detection target object may be acquired from the history image as reference history objects.
It should be noted that each historical object in the historical image has been labeled with a corresponding detection frame (i.e., a human head frame), and the labeled detection frame in the historical image may be labeled by the method for generating the target detection frame for detecting the target object described in this application. The detection frame labeled for each history object in the history image may be referred to as an object detection frame for labeling each history object.
Therefore, the reference history object described above may be used as the history object indicated by the object center position of the detection target object, and the object detection frame used to label the reference history object in the history image may be used as the history object detection frame of the history object indicated by the object center position of the detection target object.
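A minimal sketch of this selection, assuming each history object is stored as a pair of its object center position and the side length of its labeled object detection frame, and reusing a Euclidean pixel distance; the data layout and names are illustrative only.

import math

def _pixel_distance(a, b):
    # Euclidean pixel distance between two center positions (illustrative choice).
    return math.hypot(a[0] - b[0], a[1] - b[1])

def reference_history_boxes(corresponding_position, history_objects, first_object_number=3):
    # history_objects: list of (center, box_side) pairs for the labeled history objects.
    ranked = sorted(history_objects,
                    key=lambda item: _pixel_distance(corresponding_position, item[0]))
    # Keep at most first_object_number reference history objects, nearest first.
    return [box_side for _, box_side in ranked[:first_object_number]]

# Example: of three labeled history heads, the two nearest frames are returned.
history = [((118, 398), 42.0), ((500, 300), 30.0), ((130, 410), 40.0)]
print(reference_history_boxes((120, 400), history, first_object_number=2))   # [42.0, 40.0]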
Step S104, determining a target detection frame aiming at the detection target object according to the pixel distance between the detection target object and each auxiliary target object, the image pickup distance of each target object aiming at the image pickup device and the historical object detection frame;
specifically, the server may generate an initial detection frame for the detection target object according to the pixel distance between the detection target object and each auxiliary target object; generating a transition detection frame aiming at the detection target object according to the shooting distance of each target object aiming at the shooting equipment and the initial detection frame; generating a target detection frame aiming at the detection target object according to the historical object detection frame and the transition detection frame:
first, it is described how the server generates an initial detection frame for a detection target object according to the pixel distance between the detection target object and each auxiliary target object, respectively:
the server may sort each auxiliary target object according to the size of the pixel distance between the detection target object and each auxiliary target object, respectively, to obtain the sorted auxiliary target objects. Each of the sorted auxiliary target objects may be arranged in descending order of the pixel distance from the detection target object.
The server may obtain the first reference target object from the sorted auxiliary target objects by the second object obtaining number. The second object acquisition number is the maximum number of the first reference target objects that can be acquired, and the second object acquisition number can be set according to the actual application scenario, so the number of the first reference target objects can be less than or equal to the second object acquisition number.
The first reference target object may be a plurality of auxiliary target objects positioned most forward among the sorted auxiliary target objects. For example, when the second object acquisition number is equal to 5, the first reference target object may be the first 5 ranked auxiliary target objects among the ranked auxiliary target objects. If the number of auxiliary target objects is less than 5, the first reference target object may be all auxiliary target objects, and the number of the first reference target objects is less than the number of the acquired second objects.
In other words, the first reference target object may be a plurality of auxiliary target objects having a minimum pixel distance from the detection target object, and the number of the first reference target objects may be less than or equal to the second object acquisition number.
The server may obtain a distance average of pixel distances between the first reference target object and the detection target object. For example, there are 3 first reference target objects, the 3 first reference target objects including target object 1, target object 2, and target object 3. Wherein the pixel distance between the target object 1 and the detection target object is 2, the pixel distance between the target object 2 and the detection target object is 4, and the pixel distance between the target object 3 and the detection target object is 3, then the average distance value of the pixel distances between the first reference target object and the detection target object is equal to (2 +4+ 3)/3, that is, equal to 3.
The server may also weight the distance average of the pixel distances between the first reference target object and the detection target object by a size weighting coefficient to obtain a value, which may be referred to as the initial frame size. The initial frame size may be a side length of an initial detection frame of the detection target object to be generated, and the initial detection frame may be a square frame. The server may generate an initial detection frame for the detection target object by using the initial frame size, and a frame center position of the initial detection frame may be an object center position of the detection target object in the detection image. The size weighting coefficient is a constant term, and may be set according to an actual application scenario, for example, the size weighting coefficient may be equal to 0.3.
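Putting the above together, an initial frame size can be sketched as the weighted mean of the pixel distances to the nearest auxiliary target objects; the Euclidean pixel distance, the neighbor count of 3, and the coefficient 0.3 are assumptions or example values taken from the text, not a fixed implementation.

import math

def initial_frame_size(target_center, aux_centers, second_object_number=3, coeff=0.3):
    # Pixel distances from the detection target object to every auxiliary target object.
    dists = sorted(math.hypot(target_center[0] - c[0], target_center[1] - c[1])
                   for c in aux_centers)
    # First reference target objects: at most second_object_number nearest neighbors.
    nearest = dists[:second_object_number]
    # Initial frame size = frame size weighting coefficient * distance average.
    return coeff * (sum(nearest) / len(nearest))

# Example: three auxiliary head centers around a detection target at (120, 400).
aux = [(140, 400), (120, 430), (160, 430)]
print(round(initial_frame_size((120, 400), aux), 2))   # 10.0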
Referring to fig. 4, fig. 4 is a schematic view of a scene for acquiring an initial detection frame according to the present application. As shown in fig. 4, a target object 101c, a target object 104c, a target object 106c, a target object 107c, and a target object 110c are included in the detection image 100c. The object center position of the target object 101c is position 102c, the object center position of the target object 104c is position 103c, the object center position of the target object 106c is position 105c, the object center position of the target object 107c is position 108c, and the object center position of the target object 110c is position 109c.
Here, the target object 106c may be a detection target object, and therefore, the target object 101c, the target object 104c, the target object 107c, and the target object 110c are auxiliary target objects of the detection target object 106 c.
The above-described second object acquisition number may be equal to 3. Therefore, assuming that, among the object center positions of all the auxiliary target objects, the object center position 102c of the auxiliary target object 101c, the object center position 103c of the auxiliary target object 104c, and the object center position 108c of the auxiliary target object 107c are the 3 positions having the smallest pixel distance from the object center position 105c of the detection target object 106c, the auxiliary target object 101c, the auxiliary target object 104c, and the auxiliary target object 107c may be taken as the above-described first reference target object.
The pixel distance between the object center position 102c of the auxiliary target object 101c and the object center position 105c of the detection target object 106c is the pixel distance L2, the pixel distance between the object center position 103c of the auxiliary target object 104c and the object center position 105c of the detection target object 106c is the pixel distance L1, and the pixel distance between the object center position 108c of the auxiliary target object 107c and the object center position 105c of the detection target object 106c is the pixel distance L3.
Thus, as shown by region 111 c: an average value of the above-described pixel distance L1, pixel distance L2, and pixel distance L3 may be taken as a distance average value of the pixel distances between the above-described first reference target object and the detection target object, that is, the distance average value is equal to (L1 + L2+ L3)/3.
The server may use the product of the size weighting coefficient and the distance average as the initial frame size, where the initial frame size is the side length of the initial detection frame of the detection target object. The initial detection frame of the detection target object can then be generated according to the initial frame size.
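To make this computation concrete, the following is a minimal Python sketch of the N-neighbour initial-frame estimation described above; the function and parameter names (estimate_initial_size, k, size_coef) are illustrative assumptions, not taken from the patent text.

```python
# A minimal sketch of the N-neighbour initial-frame estimation described above.
import numpy as np

def estimate_initial_size(center, auxiliary_centers, k=3, size_coef=0.3):
    """Estimate the side length of the square initial detection frame of one
    detection target object from the k auxiliary objects nearest to it."""
    center = np.asarray(center, dtype=float)
    aux = np.asarray(auxiliary_centers, dtype=float)
    # Pixel distance between the detection target object and every auxiliary object.
    dists = np.linalg.norm(aux - center, axis=1)
    # Keep at most k nearest auxiliary objects (the first reference target objects).
    nearest = np.sort(dists)[:k]
    # Initial frame size = size weighting coefficient * mean of the nearest pixel distances.
    return size_coef * float(nearest.mean())

# Example matching the text: nearest pixel distances 2, 4 and 3 give a mean of 3,
# so with a size weighting coefficient of 0.3 the initial side length is 0.9.
```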
Next, a description is given of how the server corrects the initial detection frame of the detection target object obtained above by using the imaging distance of each target object with respect to the image capturing apparatus, so as to obtain a transition detection frame of the detection target object. This may be referred to as applying a horizontal prior constraint to the initial detection frame of the detection target object, since the process relies on the principle that target objects on the same horizontal line have approximately equal object sizes:
the server may sort the auxiliary target objects according to the difference (which may be referred to as the imaging distance difference) between the imaging distance of the detection target object and the imaging distance of each auxiliary target object, to obtain the sorted auxiliary target objects. The sorted auxiliary target objects may be arranged in ascending order of the corresponding imaging distance difference.
The server may obtain the second reference target object from the sorted auxiliary target objects by the third object obtaining number. The third object acquisition number is the maximum number of the second reference target objects that can be acquired, and the third object acquisition number may be set according to an actual application scenario, so the number of the second reference target objects may be less than or equal to the third object acquisition number.
The second reference target object may be the several auxiliary target objects ranked foremost among the sorted auxiliary target objects. For example, when the third object acquisition number is equal to 3, the second reference target object may be the first 3 ranked auxiliary target objects among the sorted auxiliary target objects. If the number of auxiliary target objects is less than 3, the second reference target object may be all of the auxiliary target objects, in which case the number of second reference target objects is less than the third object acquisition number.
In other words, the second reference target object may be the several auxiliary target objects having the smallest imaging distance difference from the detection target object, and the number of second reference target objects may be less than or equal to the third object acquisition number. An auxiliary target object whose imaging distance differs little from that of the detection target object can be considered to lie close to the same horizontal line as the detection target object in the detection image.
The server may obtain a detection frame size average of the object initial detection frame to which the second reference target object belongs, and an obtaining manner of the object initial detection frame to which the second reference target object belongs may be consistent with an obtaining manner of the object initial detection frame to which the detection target object belongs. Since the detection frame of the target object may be a square detection frame, the detection frame size average of the object initial detection frame to which the second reference target object belongs may be an average of the side lengths of the object initial detection frames to which the second reference target object belongs.
For example, if there are 3 second reference target objects, the 3 second reference target objects include target object 1, target object 2, and target object 3. Wherein, the side length of the object initial detection frame of the target object 1 may be 2, the side length of the object initial detection frame of the target object 2 may be 3, and the side length of the object initial detection frame of the target object 3 may be 4, and then the average detection frame size of the object initial detection frame to which the second reference target object belongs may be equal to (2 +3+ 4)/3, that is, equal to 3.
The server may generate a transition detection frame according to the detection frame size mean corresponding to the second reference target object and the initial detection frame of the detection target object:
the server may obtain a weight value of the detection frame size mean value corresponding to the second reference target object, and the weight value may be referred to as a first size weighting coefficient. In addition, the server may further obtain a weight value of the initial detection box for the detection target object, and the weight value may be referred to as a second size weighting coefficient.
The first size weighting coefficient and the second size weighting coefficient may both be a numerical value between 0 and 1, a sum of the first size weighting coefficient and the second size weighting coefficient may be equal to 1, and specific values of the first size weighting coefficient and the second size weighting coefficient may be set according to an actual application scenario, which is not limited thereto. For example, the first size weighting factor may be equal to 0.3 and the second size weighting factor may be equal to 0.7.
The server may weight the average of the sizes of the detection frames corresponding to the second reference target object by using the first size weighting coefficient to obtain a value, which may be referred to as a first frame size. That is, the first frame size may be equal to a product between the first size weighting factor and a mean value of the detection frame sizes corresponding to the second reference target object.
The server may further weight a frame size of the initial detection frame of the detection target object (which may be a side length of the initial detection frame of the detection target object) by using the second size weighting coefficient, to obtain a value, which may be referred to as a second frame size. That is, the second frame size may be equal to a product between the second size weighting coefficient and the side length of the initial detection frame for detecting the target object.
The server may generate a transition detection frame for detecting the target object according to the first frame size and the second frame size. The server may use the sum of the first frame size and the second frame size as the side length of the transition detection frame of the detection target object, and thus the transition detection frame of the detection target object may be generated by the sum of the first frame size and the second frame size. The center position of the transition detection frame of the detection target object may be the object center position of the detection target object in the detection image.
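A hedged sketch of this weighted combination is given below; the helper name apply_horizontal_prior and the default weights (0.3 and 0.7, the example values above) are assumptions rather than the patent's own code.

```python
def apply_horizontal_prior(own_initial_size, same_row_initial_sizes,
                           w_ref=0.3, w_own=0.7):
    """Transition frame side length for the detection target object: a weighted
    sum of (a) the mean initial-frame side length of the second reference target
    objects (auxiliary objects on roughly the same horizontal line) and (b) the
    object's own initial-frame side length, with the two weights summing to 1."""
    ref_mean = sum(same_row_initial_sizes) / len(same_row_initial_sizes)
    return w_ref * ref_mean + w_own * own_initial_size

# e.g. same-row initial sizes (2, 3, 4) have mean 3; with an own initial size of 2.8,
# the transition side length is 0.3 * 3 + 0.7 * 2.8 = 2.86.
```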
Referring to fig. 5, fig. 5 is a schematic view of a scene for acquiring a transition detection frame according to the present application. As shown in fig. 5, the detection image 100d includes therein a target object 101d, a target object 102d, a target object 103d, a target object 104d, a target object 105d, a target object 106d, a target object 107d, a target object 108d, a target object 109d, and a target object 110 d.
As shown in fig. 5, since the target object 101d and the target object 102d are on an approximate horizontal line, the object sizes of the target object 101d and the target object 102d are approximately equal. Since the target object 103d, the target object 104d, and the target object 105d are on an approximate horizontal line, the object sizes of the target object 103d, the target object 104d, and the target object 105d are approximately equal.
Similarly, the imaging distance L4 of the target object 106d, the imaging distance L5 of the target object 107d, the imaging distance L6 of the target object 108d, and the imaging distance L7 of the target object 109d are approximately equal, so that the target object 106d, the target object 107d, the target object 108d, and the target object 109d are on an approximate horizontal line, and the object sizes of the target object 106d, the target object 107d, the target object 108d, and the target object 109d are approximately equal. Since the target object 110d, the target object 111d, and the target object 112d are on an approximate horizontal line, the object sizes of the target object 110d, the target object 111d, and the target object 112d are approximately equal.
Therefore, when the target object 108d is taken as the detection target object, the target objects in the detection image 100d other than the detection target object 108d, including the target object 101d, the target object 102d, the target object 103d, the target object 104d, the target object 105d, the target object 106d, the target object 107d, the target object 109d, and the target object 110d, are auxiliary target objects of the detection target object 108 d.
The above-described third object acquisition number may be equal to 3. Therefore, the 3 auxiliary target objects in the detection image 100d whose imaging distances differ least from the imaging distance L6 of the detection target object 108d may be taken as the above-described second reference target object. Here, these 3 auxiliary target objects may include the auxiliary target object 106d, the auxiliary target object 107d, and the auxiliary target object 109d, which lie on approximately the same horizontal line as the detection target object (as indicated by the region 113d).
Further, the server may modify the obtained initial detection frame of the detection target object according to the second reference target object to generate a transition detection frame of the detection target object (as shown in the area 114 d). The specific process of generating the transition detection frame for detecting the target object can be referred to the above description.
Next, a process of how the server modifies the transition detection frame of the detection target object through the above history object detection frame to obtain the target detection frame of the detection target object is described, which may be referred to as a process of applying a time-sequence prior constraint to the transition detection frame of the detection target object (because the process is implemented based on the principle that object sizes of target objects at the same spatial position in images captured at different times are approximately equal):
the number of the historical object detection boxes can be multiple, and the server can obtain the size mean values of the detection boxes corresponding to the multiple historical object detection boxes. Since the history object detection frame may be a square detection frame, the server may use an average of side lengths of the plurality of history object detection frames as a detection frame size average corresponding to the plurality of history object detection frames.
For example, if there are 3 history object detection boxes, the 3 history object detection boxes include a history object detection box 1, a history object detection box 2, and a history object detection box 3. The side length of the history object detection box 1 may be 2, the side length of the history object detection box 2 may be 3, and the side length of the history object detection box 3 may be 4, so that the average value of the sizes of the detection boxes corresponding to the history object detection box may be equal to (2 +3+ 4)/3, that is, equal to 3.
The server may obtain a weight value of the detection frame size mean value corresponding to the historical object detection frame, and the weight value may be referred to as a third size weighting coefficient. In addition, the server may further obtain a weight value of the transition detection box for the detection target object, and the weight value may be referred to as a fourth size weighting coefficient.
The third size weighting coefficient and the fourth size weighting coefficient may both be a numerical value between 0 and 1, a sum of the third size weighting coefficient and the fourth size weighting coefficient may be equal to 1, and specific values of the third size weighting coefficient and the fourth size weighting coefficient may be set according to an actual application scenario, which is not limited thereto. For example, the third size weighting factor may be equal to 0.2 and the fourth size weighting factor may be equal to 0.8.
The server may weight the average of the sizes of the detection frames corresponding to the historical object detection frames by using the third size weighting coefficient to obtain a value, and the value may be referred to as a third frame size. That is, the third frame size may be equal to the product of the third size weighting factor and the mean value of the detection frame sizes corresponding to the historical object detection frames.
The server may further weight, by using the fourth size weighting coefficient, a frame size of the transition detection frame of the detection target object (which may be a side length of the transition detection frame of the detection target object) to obtain a value, which may be referred to as a fourth frame size. That is, the fourth frame size may be equal to the product between the fourth size weighting coefficient and the side length of the transition detection frame of the detection target object.
The server may generate a target detection frame for detecting the target object based on the third frame size and the fourth frame size. The server may use the sum of the third frame size and the fourth frame size as the side length of the target detection frame of the detection target object, and thus the target detection frame of the detection target object may be generated by the sum of the third frame size and the fourth frame size. The center position of the target detection frame of the detection target object may be the object center position of the detection target object in the detection image.
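The corresponding sketch for the time-sequence prior constraint follows the same pattern; the weights 0.2 and 0.8 are the example values given above, and the names are illustrative.

```python
def apply_temporal_prior(transition_size, history_frame_sizes,
                         w_hist=0.2, w_trans=0.8):
    """Target frame side length for the detection target object: a weighted sum
    of the mean side length of the history object detection frames and the
    transition frame side length, with the two weights summing to 1."""
    hist_mean = sum(history_frame_sizes) / len(history_frame_sizes)
    return w_hist * hist_mean + w_trans * transition_size

# e.g. history side lengths (2, 3, 4) have mean 3; with a transition side length of 2.86,
# the target side length is 0.2 * 3 + 0.8 * 2.86 = 2.888.
```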
Referring to fig. 6, fig. 6 is a schematic view of a scene for acquiring a target detection frame according to the present application. As shown in fig. 6, the history image 100e includes therein a history object 102e, a history object 104e, a history object 105e, and a history object 108 e. Here, the image position 107e in the history image 100e may be a corresponding position of the detection target object in the history image.
When the above-mentioned first object acquisition number is equal to 3, the server may set, as the above-mentioned reference history object, 3 history objects whose spatial distance from the corresponding position 107e of the detection target object is smallest. As shown in fig. 6, the 3 history objects in the history image 100e, whose spatial distance from the corresponding position 107e of the detection target object is smallest, include the history object 102e, the history object 104e, and the history object 105 e. I.e., the reference history objects include history object 102e, history object 104e, and history object 105e (as shown by area 109 e).
The server may generate a target detection frame for detecting the target object based on the reference history object and the transition detection frame for detecting the target object (as shown in the area 110 e). The specific process of generating the target detection frame for detecting the target object may be as described above.
Optionally, the server may further perform iterative correction on the obtained target detection frame for detecting the target object to obtain a final more accurate target detection frame for detecting the target object, where the iterative correction on the target detection frame mainly includes iterating the horizontal prior constraint process and the time sequence prior constraint process:
the server may take the target detection frame of the detection target object, obtained by correcting the transition detection frame, as an undetermined (pending) detection frame of the detection target object. This pending detection frame plays the same role as the initial detection frame: the target detection frame obtained at this point may again be treated as the initial detection frame of the detection target object and corrected again, on the same principle, through the first reference target object and the second reference target object, to obtain the final target detection frame of the detection target object. Please refer to the following description:
It can be understood that the process of obtaining a transition detection frame from the initial detection frame of the detection target object, and then obtaining a target detection frame from that transition detection frame, constitutes the first iteration for the target detection frame of the detection target object. The target detection frame of the detection target object at this point may be denoted as the pending detection frame d1 of the detection target object, and the transition detection frame of the detection target object in the first iteration may be denoted as the transition detection frame g1 of the detection target object:
in the second iteration process, the server may further obtain a transition detection frame of the detection target object generated in the second iteration process through the pending detection frame d1 of the detection target object and the second reference target object, and may mark the transition detection frame of the detection target object in the second iteration process as the transition detection frame g2 of the detection target object. This is the second iteration of the process of the level prior constraint described above.
The process in which the server obtains the transition detection frame g2 of the detection target object through the pending detection frame d1 of the detection target object and the second reference target object is the same as the process of obtaining the transition detection frame g1 through the initial detection frame of the detection target object and the second reference target object, except that the initial detection frame of the detection target object used in the calculation is replaced with the pending detection frame d1 of the detection target object, and the object initial detection frame of the second reference target object used in the calculation is replaced with the pending detection frame d1 of the second reference target object. The pending detection frame d1 of the second reference target object is obtained in the same manner as the pending detection frame d1 of the detection target object.
Next, the server may obtain, through the transition detection frame g2 of the detection target object and the history object detection frame, the pending detection frame of the detection target object generated in the second iteration; this may be understood as the target detection frame of the detection target object generated in the second iteration, and may be denoted as the pending detection frame d2 of the detection target object. The process of obtaining the pending detection frame d2 through the transition detection frame g2 and the history object detection frame is the same as the process of obtaining the pending detection frame d1 through the transition detection frame g1 and the history object detection frame, except that the transition detection frame g1 of the detection target object is replaced with the transition detection frame g2 of the detection target object during the calculation. This is the second iteration of the time-sequence prior constraint process described above.
An iteration number threshold may be set, indicating the number of iterations to be performed on the target detection frame of the detection target object. For example, if the threshold is 2, the pending detection frame d2 of the detection target object may be directly used as the final target detection frame of the detection target object.
Alternatively, the pending detection frame of the detection target object obtained in the current iteration (for example, the pending detection frame d2) may be compared in size with the pending detection frame obtained in the previous iteration (for example, the pending detection frame d1) to determine whether the pending detection frame obtained in the current iteration is stable. If it is stable, it is used as the final target detection frame of the detection target object; if it is not, the iteration continues.
If the difference between the side length of the undetermined detection frame of the detected target object obtained in the current iteration process and the side length of the undetermined detection frame of the detected target object obtained in the previous iteration process is less than or equal to the size difference threshold (the numerical value can be set by itself), the undetermined detection frame of the detected target object obtained in the current iteration process can be considered to be stable. Otherwise, if the difference between the side length of the undetermined detection frame of the detected target object obtained in the current iteration process and the side length of the undetermined detection frame of the detected target object obtained in the previous iteration process is greater than the size difference threshold, the undetermined detection frame of the detected target object obtained in the current iteration process is considered to be unstable, the next iteration can be continued until the undetermined detection frame with the stable detected target object is obtained, and the stable undetermined detection frame can be used as the final target detection frame for detecting the target object.
For better understanding of the solution, the process of performing the third iteration on the target detection frame for detecting the target object is described as follows: the server may obtain, through the obtained pending detection frame d2 of the detection target object and the second reference target object, a transition detection frame of the detection target object generated in the third iteration process, and may mark the transition detection frame of the detection target object in the third iteration process as the transition detection frame g3 of the detection target object. This is the third iteration of the process of the level prior constraint described above.
The process in which the server obtains the transition detection frame g3 of the detection target object through the pending detection frame d2 of the detection target object and the second reference target object is the same as the process of obtaining the transition detection frame g1 through the initial detection frame of the detection target object and the second reference target object, except that the initial detection frame of the detection target object used in the calculation is replaced with the pending detection frame d2 of the detection target object, and the object initial detection frame of the second reference target object used in the calculation is replaced with the pending detection frame d2 of the second reference target object. The pending detection frame d2 of the second reference target object is obtained in the same manner as the pending detection frame d2 of the detection target object.
Next, the server may obtain, through the transition detection frame g3 of the detection target object and the history object detection frame, the pending detection frame of the detection target object generated in the third iteration; this may be understood as the target detection frame of the detection target object generated in the third iteration, and may be denoted as the pending detection frame d3 of the detection target object. The process of obtaining the pending detection frame d3 through the transition detection frame g3 and the history object detection frame is the same as the process of obtaining the pending detection frame d1 through the transition detection frame g1 and the history object detection frame, except that the transition detection frame g1 of the detection target object is replaced with the transition detection frame g3 of the detection target object during the calculation. This is the third iteration of the time-sequence prior constraint process described above.
The above process of iterating the target detection frame for detecting the target object may be summarized as the following description:
in the x-th (x is a positive integer) iteration, the server may generate a transition detection frame gx for the detection target object (the number following g indicates the iteration number, here the x-th iteration) according to the imaging distance of each target object with respect to the image capturing apparatus (the second reference target object may be obtained from the imaging distances) and the pending detection frame dx-1 of the detection target object generated in the (x-1)-th iteration (the number following d indicates the iteration number, here the (x-1)-th iteration). If the (x-1)-th iteration is the first iteration, the pending detection frame dx-1 of the detection target object is obtained from the pixel distances between the detection target object and the auxiliary target objects; that is, the pending detection frame d1 in the 1st iteration is derived from the initial detection frame of the detection target object.
Next, the server may generate a pending detection frame dx for the detection target object according to the history object detection frame and the transition detection frame gx.
When the size difference between the frame size of the frame dx to be detected and the frame size of the frame dx-1 to be detected is smaller than or equal to the size difference threshold, the frame dx to be detected can be considered to be stable, and the frame dx to be detected can be used as the final target detection frame for detecting the target object.
When the size difference between the frame size of the to-be-detected frame dx and the frame size of the to-be-detected frame dx-1 is greater than the size difference threshold, in the (x + 1) th iteration process, the server may generate a transition detection frame gx +1 for the detection target object according to the imaging distance of each target object for the imaging device (i.e., according to the second reference target object) and the to-be-detected frame dx.
The server can generate an undetermined detection frame dx +1 aiming at the detection target object according to the historical object detection frame and the transition detection frame gx + 1;
when the size difference between the frame size of the frame dx +1 to be detected and the frame size of the frame dx to be detected is smaller than or equal to the size difference threshold, the frame dx +1 to be detected is considered to be stable, and the frame dx +1 to be detected can be used as the final target detection frame for detecting the target object.
If the size difference between the frame size of the frame dx +1 to be detected and the frame size of the frame dx to be detected is greater than the size difference threshold, the frame dx +1 to be detected is considered to be unstable, and the next iteration process needs to be continued on the frame dx +1 to be detected.
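The iteration summarized above can be sketched as the loop below. For brevity the sketch holds the reference side lengths (same-row auxiliary objects and history frames) fixed across iterations, whereas the text also recomputes the pending frames of the second reference target objects each round; the stopping rule is the size-difference threshold, and the threshold and weight values are illustrative.

```python
def refine_frame_size(initial_size, same_row_sizes, history_sizes,
                      w1=0.3, w2=0.7, w3=0.2, w4=0.8,
                      size_diff_threshold=0.1, max_iters=10):
    """Iteratively apply the horizontal and time-sequence prior constraints to a
    square frame's side length until it changes by no more than the threshold."""
    row_mean = sum(same_row_sizes) / len(same_row_sizes)
    hist_mean = sum(history_sizes) / len(history_sizes)
    pending = initial_size                               # d_{x-1}
    for _ in range(max_iters):
        transition = w1 * row_mean + w2 * pending        # g_x: horizontal prior constraint
        new_pending = w3 * hist_mean + w4 * transition   # d_x: time-sequence prior constraint
        if abs(new_pending - pending) <= size_diff_threshold:
            return new_pending                           # stable: final target frame side length
        pending = new_pending                            # unstable: run another iteration
    return pending
```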
Step S105, generating a target object density map aiming at the detection target object according to the target detection frame;
specifically, after obtaining the final target detection frame of the detected target object, the server may determine a gaussian kernel standard deviation through the target detection frame. Wherein, the smaller the target detection frame of the detection target object, the less the influence of the detection target object on the density values of the pixels around the center position of the object in the detection image is, and a gaussian kernel with the smaller standard deviation is required. Conversely, the larger the target detection frame for detecting the target object, the greater the influence of the detected target object on the density values of the pixels around the center position of the object in the detection image, and a gaussian kernel with a larger standard deviation is required.
Therefore, it can be understood that the size of the target detection frame of the detection target object is positively related to the standard deviation of the Gaussian kernel corresponding to the detection target object: the larger the target detection frame, the larger the standard deviation of the corresponding Gaussian kernel; the smaller the target detection frame, the smaller the standard deviation. The Gaussian kernel whose standard deviation is determined by the above-described target detection frame may be referred to as the target Gaussian kernel of the detection target object. The target Gaussian kernel is a normalized Gaussian kernel.
In other words, the server may set the size of the standard deviation of the target gaussian kernel corresponding to the size of the target detection frame of the detection target object according to the size of the target detection frame of the detection target object, and the larger the size of the target detection frame of the detection target object is, the larger the standard deviation of the target gaussian kernel may be set, so as to more accurately cover the range of the target detection frame of the detection target object by the target gaussian kernel, and convolve the pixels in the target detection frame of the detection target object to obtain a more accurate density map, as described below.
The target detection frame may be a square, and the size of the target detection frame may be the side length of the target detection frame, so that a value obtained by multiplying the side length of the target detection frame by a certain coefficient (which is a constant, and a specific numerical value may be determined according to an actual application scenario, for example, 0.3) may be used as the value of the standard deviation of the target gaussian kernel.
The server may further generate an initial object density map for the detection image according to the object center position of each target object in the detection image: the server may traverse each pixel point in the detection image, set the pixel value of a traversed pixel point located at the object center position of a target object to 1 (this "1" may be referred to as the first pixel value), and set the pixel value of a traversed pixel point not located at the object center position of any target object to 0 (this "0" may be referred to as the second pixel value). The server may then generate the initial object density map from the first or second pixel value set for each traversed pixel point in the detection image, where the initial object density map has the same size as the detection image.
Then, the server may perform a convolution operation on the initial object density map through the normalized target gaussian kernel of the detection target object to generate a density map for the detection target object, which may be referred to as a target object density map of the detection target object. The density map may also be referred to as a thermodynamic map, which is a density heat map visualized by a density function, and the distribution density of the detection target object in the detection image can be visually checked by the thermodynamic map of the detection target object.
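As an illustrative sketch (not the patent's implementation), the steps of this paragraph can be written as follows; scipy's gaussian_filter stands in for convolution with the normalized target Gaussian kernel, and the coefficient 0.3 relating frame side length to standard deviation is the example value mentioned above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def target_object_density_map(image_shape, center_rc, frame_side, sigma_coef=0.3):
    """Build the initial object density map (a single 1 at the object center,
    0 elsewhere) and convolve it with a normalized Gaussian whose standard
    deviation is proportional to the target detection frame's side length."""
    density = np.zeros(image_shape, dtype=np.float64)
    density[center_rc] = 1.0                    # first pixel value at the object center
    sigma = sigma_coef * frame_side             # standard deviation from the frame size
    # gaussian_filter uses a normalized kernel, so the result still integrates to 1,
    # i.e. the map still represents exactly one head.
    return gaussian_filter(density, sigma=sigma)

# e.g. one head at the center pixel of a 3x3 detection image, frame side length 2:
dm = target_object_density_map((3, 3), (1, 1), frame_side=2.0)
assert abs(dm.sum() - 1.0) < 1e-6
```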
Referring to fig. 7, fig. 7 is a schematic view of a scene for obtaining a density map according to the present application. As shown in fig. 7, the server may obtain a target gaussian kernel 101f by detecting a target detection box 100f of the target object.
The server may traverse each pixel point in the detection image 102f, and set a pixel value of a pixel point traversed to be located at an object center position of the target object to 1, and set a pixel value of a pixel point traversed to be not located at an object center position of the target object to 0, to generate an initial object density map 103f of the detection image.
The detection image 102f includes 9 pixel points, numbered pixel point 1 through pixel point 9 in row-major order. Pixel point 3, pixel point 5, and pixel point 9 are located at the object center positions of target objects; therefore, in the initial object density map 103f, the pixel value at row 1, column 3 (corresponding to pixel point 3), the pixel value at row 2, column 2 (corresponding to pixel point 5), and the pixel value at row 3, column 3 (corresponding to pixel point 9) are all 1 (i.e., the first pixel value).
Pixel point 1, pixel point 2, pixel point 4, pixel point 6, pixel point 7, and pixel point 8 are not located at the object center position of any target object; therefore, in the initial object density map 103f, the pixel values at row 1, column 1; row 1, column 2; row 2, column 1; row 2, column 3; row 3, column 1; and row 3, column 2 (corresponding to these pixel points) are all 0 (i.e., the second pixel value).
Therefore, the server may perform a convolution operation on the initial object density map 103f through the target gaussian kernel 101f, so as to obtain a target object density map 104f for the detection target object. The server may acquire the target object density map of each target object in the detection image 102f by a method of acquiring the target object density map 104f of the detection target object.
It can be understood that if a model (for example, an object density detection model) were made to learn the initial object density map directly, learning would be very difficult, because the model would have to predict a pixel value of 1 exactly at the object center position of each target object and 0 at every other position. After the initial object density map is convolved with the target Gaussian kernel to obtain the target object density map, not only does the object center position of the target object carry a response, but the pixel values of the pixels around the object center position of the detection target object may also be non-zero (the specific values are determined by the target Gaussian kernel used). In the target object density map, the closer a pixel is to the object center position of the target object, the larger its pixel value; the farther away it is, the smaller its pixel value. Training the model with the target object density map therefore greatly reduces the difficulty of model learning.
Further, since the target Gaussian kernel is normalized, integrating the target object density map obtained by convolution with the target Gaussian kernel yields 1, which represents that the number of detection target objects is 1, that is, the detection target object corresponds to one person's head.
Furthermore, the server may obtain the target object density map corresponding to each target object in the detection image in the same manner as the target object density map of the detection target object is obtained. The target object density maps corresponding to the target objects all have the same size. The server may superimpose the target object density maps corresponding to all target objects to obtain the density map corresponding to the detection image. The superposition is performed by adding the pixel values at the same pixel position in the target object density maps of the target objects. For example, if the target objects include target object 1 and target object 2, the pixel value at row 1, column 1 in the target object density map of target object 1 may be added to the pixel value at row 1, column 1 in the target object density map of target object 2 to obtain the pixel value at row 1, column 1 in the density map corresponding to the detection image.
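A short sketch of this superposition (with illustrative names) is given below; because each normalized per-object map integrates to 1, the integral of the combined map equals the number of target objects.

```python
import numpy as np

def image_density_map(per_object_density_maps):
    """Pixel-wise sum of the target object density maps of all target objects in
    one detection image."""
    return np.sum(np.stack(per_object_density_maps, axis=0), axis=0)

# Two toy 3x3 per-object maps, each integrating to 1:
m1 = np.full((3, 3), 1.0 / 9.0)
m2 = np.full((3, 3), 1.0 / 9.0)
combined = image_density_map([m1, m2])
head_count = float(combined.sum())   # 2.0 -> two heads in the detection image
```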
The server may train an object density detection model with the density maps of the obtained detection images, where the number of detection images may be multiple. The object density detection model trained with the density maps of the detection images may be called the target detection model. The target detection model may be used to detect the object density of objects in an image.
For example, if the object density of an object to be detected (which may be a user's head) in an image to be detected needs to be measured, the image to be detected may be input into the target detection model, and an object density map for the object to be detected may be generated by the target detection model. The object density map is the density map of the image to be detected; therefore, the number of objects to be detected in the image to be detected (i.e., the number of people in the image) can be obtained by integrating the object density map. The object density of the object to be detected can then be obtained as the ratio of this number to the area of the real scene captured in the image to be detected.
Therefore, with the method provided by the application, a more accurate target detection frame of the detection target object can be obtained from the pixel distance between the detection target object and each auxiliary target object, the imaging distance of each target object with respect to the image capturing apparatus, and the history object detection frame; a more accurate target object density map of the detection target object can then be generated from this target detection frame, and a target detection model with higher precision can be trained with the target object density map.
Referring to fig. 8, fig. 8 is a schematic flow chart of an estimation frame obtaining method provided in the present application. As shown in fig. 8, the method may include:
step S201, inputting an image and all head central points;
specifically, the image may refer to the detection image, and the head center point may refer to an object center position marked on the target object in the detection image. The server may input the detection image and the object center position of each target object labeled in the detection image into a calculation process to generate a target detection frame for each target object in the detection image.
Step S202, N-neighbor initial frame estimation;
Specifically, the server may estimate the initial detection frame of each target object by the N-neighbor initial frame estimation method (i.e., the above-described method based on the pixel distances between the detection target object and the auxiliary target objects); for example, the initial detection frames may include the initial detection frame of the detection target object.
Step S203, applying horizontal prior constraint;
specifically, the server may then apply the horizontal prior constraint to the obtained initial detection frame of the target object to correct the obtained initial detection frame of the target object, so as to obtain a transition detection frame of the target object. For example, the transition detection frame may include the transition detection frame for detecting the target object described above.
Step S204, applying time sequence prior constraint;
specifically, the server may apply the time-series prior constraint to the obtained transition detection frame of the target object to correct the obtained initial detection frame of the target object, so as to obtain an undetermined detection frame of the target object. For example, the pending detection frame may include the pending detection frame for detecting the target object.
Step S205, whether the human head frame is stable or not;
specifically, the human head frame may refer to the obtained undetermined detection frame, and the undetermined detection frame obtained in each iteration may be iteratively corrected as described in step S104. If the undetermined detection frame in the current iteration is already stable, the following step S206 may be executed; if it is not yet stable, the process returns to step S203 for another iteration.
Step S206, outputting an estimation frame;
specifically, when the human head frame is already stable, the undetermined detection frame of the currently obtained target object may be used as the final generated target detection frame of the target object, and the final target detection frame may be output. Here, the target detection frame may be a square detection frame, and thus, outputting the target detection frame may refer to outputting a size (e.g., a side length) of the target detection frame and a pixel coordinate at a center position of the target detection frame.
Therefore, with the method provided by the application, a more accurate target detection frame of the detection target object can be obtained from the pixel distance between the detection target object and each auxiliary target object, the imaging distance of each target object with respect to the image capturing apparatus, and the history object detection frame.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an image data processing apparatus provided in the present application. As shown in fig. 9, the image data processing apparatus 1 may include: the device comprises an image acquisition module 11, a distance determination module 12, a history frame acquisition module 13, a target frame generation module 14 and a density map generation module 15;
an image acquisition module 11, configured to acquire, based on an image capturing apparatus, a detection image including at least two target objects; the at least two target objects are composed of a detection target object and auxiliary target objects; the detection image is labeled with the object center position of each target object;
a distance determination module 12 configured to determine, according to an object center position of each target object, a pixel distance between each detection target object and each auxiliary target object, and an image pickup distance of each target object for the image pickup apparatus;
a history frame acquisition module 13 configured to acquire a history object detection frame of a history object indicated by an object center position of the detection target object from the history image based on the history image obtained by the image capturing apparatus;
a target frame generation module 14 configured to generate a target detection frame for the detection target object based on a pixel distance between the detection target object and each auxiliary target object, an image pickup distance of each target object for the image pickup apparatus, and the history object detection frame;
and the density map generating module 15 is configured to generate a target object density map for the detection target object according to the target detection frame.
For specific functional implementation manners of the image obtaining module 11, the distance determining module 12, the history frame obtaining module 13, the target frame generating module 14, and the density map generating module 15, please refer to steps S101 to S105 in the embodiment corresponding to fig. 3, which is not described herein again.
The distance determining module 12 includes: a pixel distance determination unit 121 and an imaging distance determination unit 122;
a pixel distance determining unit 121, configured to obtain the number of interval pixels between the object center position of the detection target object and the object center position of each auxiliary target object, and determine the pixel distance between the detection target object and each auxiliary target object according to the number of interval pixels to which each auxiliary target object belongs;
an imaging distance determination unit 122 configured to determine a vertical direction distance of each target object in the detection image according to the object center position of each target object, and determine an imaging distance of each target object with respect to the imaging apparatus according to the vertical direction distance to which each target object belongs.
For specific functional implementation of the pixel distance determining unit 121 and the image capturing distance determining unit 122, please refer to step S102 in the corresponding embodiment of fig. 3, which is not described herein again.
The history frame obtaining module 13 includes: a history object sorting unit 131, a history object acquiring unit 132, and a history frame determining unit 133;
a history object sorting unit 131, configured to sort the at least one history object according to a pixel distance between an object center position of the at least one history object in the history image and an object center position of the detection target object, respectively, to obtain at least one sorted history object;
a history object obtaining unit 132, configured to select a reference history object from the sorted at least one history object according to the first object obtaining number; the number of reference history objects is less than or equal to the first object acquisition number;
a history frame determination unit 133 for determining an object detection frame for labeling the reference history object in the history image as a history object detection frame.
For specific functional implementation manners of the history object sorting unit 131, the history object obtaining unit 132, and the history frame determining unit 133, please refer to step S103 in the corresponding embodiment of fig. 3, which is not described herein again.
The target frame generation module 14 includes: an initial frame generating unit 141, a transition frame generating unit 142, and a target frame generating unit 143;
an initial frame generating unit 141 configured to generate an initial detection frame for the detection target object according to the pixel distance between the detection target object and each auxiliary target object, respectively;
a transition frame generation unit 142 configured to generate a transition detection frame for detecting a target object, based on an imaging distance of each target object to the imaging apparatus and the initial detection frame;
a target frame generating unit 143 configured to generate a target detection frame for detecting the target object according to the history object detection frame and the transition detection frame.
For a specific implementation manner of functions of the initial frame generating unit 141, the transition frame generating unit 142, and the target frame generating unit 143, please refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein again.
The initial frame generating unit 141 includes: a first ordering sub-unit 1411, a first object acquiring sub-unit 1412, a distance mean acquiring sub-unit 1413, an initial size acquiring sub-unit 1414, and an initial frame generating sub-unit 1415;
a first sorting subunit 1411, configured to sort, according to a pixel distance between the detection target object and each auxiliary target object, at least one auxiliary target object, so as to obtain at least one sorted auxiliary target object;
a first object obtaining subunit 1412, configured to select a first reference target object from the sorted at least one auxiliary target object according to the second object obtaining number; the number of the first reference target objects is less than or equal to the second object acquisition number;
a distance average acquisition subunit 1413 configured to acquire a distance average of pixel distances between the first reference target object and the detection target object;
an initial size obtaining subunit 1414, configured to weight the distance average according to the frame size weighting coefficient, so as to obtain an initial frame size;
an initial frame generation subunit 1415 is configured to generate an initial detection frame according to the initial frame size.
For a specific implementation manner of functions of the first sorting subunit 1411, the first object obtaining subunit 1412, the distance average obtaining subunit 1413, the initial size obtaining subunit 1414 and the initial frame generating subunit 1415, please refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein again.
The transition frame generating unit 142 includes: a second sorting subunit 1421, a second object obtaining subunit 1422, a size average obtaining subunit 1423, and a transition frame generating subunit 1424;
a second sorting subunit 1421, configured to sort, according to the difference between the shooting distance to which the detection target object belongs and the shooting distance to which each auxiliary target object belongs, at least one auxiliary target object, to obtain at least one sorted auxiliary target object;
a second object obtaining subunit 1422, configured to select, according to the third object obtaining number, a second reference target object from the sorted at least one auxiliary target object; the number of the second reference target objects is less than or equal to the third object acquisition number;
a size average obtaining subunit 1423, configured to obtain a detection frame size average of an object initial detection frame to which the second reference target object belongs;
the transition frame generation subunit 1424 is configured to generate a transition detection frame according to the detection frame size mean value corresponding to the second reference target object and the initial detection frame.
For a specific function implementation manner of the second sorting subunit 1421, the second object obtaining subunit 1422, the size average obtaining subunit 1423, and the transition frame generating subunit 1424, please refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
The transition frame generation subunit 1424 includes: a first coefficient acquiring subunit 14241, a first weighting subunit 14242, a second weighting subunit 14243, and a first frame generating subunit 14244;
a first coefficient obtaining subunit 14241, configured to obtain a first size weighting coefficient of a detection frame size mean value corresponding to the second reference target object, and obtain a second size weighting coefficient of the initial detection frame;
the first weighting subunit 14242 is configured to weight the detection frame size average value corresponding to the second reference target object according to the first size weighting coefficient, so as to obtain a first frame size;
a second weighting subunit 14243, configured to weight the frame size of the initial detection frame according to the second size weighting coefficient, to obtain a second frame size;
a first frame generation subunit 14244, configured to generate a transition detection frame according to the first frame size and the second frame size.
For a specific implementation manner of functions of the first coefficient obtaining sub-unit 14241, the first weighting sub-unit 14242, the second weighting sub-unit 14243, and the first frame generating sub-unit 14244, please refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein again.
The target frame generating unit 143 includes: a frame mean acquisition sub-unit 1431, a second coefficient acquisition sub-unit 1432, a third weighting sub-unit 1433, a fourth weighting sub-unit 1434, and a second frame generation sub-unit 1435;
a frame mean value obtaining subunit 1431, configured to obtain a detection frame size mean value corresponding to the historical object detection frame;
a second coefficient obtaining subunit 1432, configured to obtain a third size weighting coefficient for the detection frame size mean value corresponding to the historical object detection frame, and obtain a fourth size weighting coefficient for the transition detection frame;
a third weighting subunit 1433, configured to weight, according to the third size weighting coefficient, the detection frame size average value corresponding to the historical object detection frame, so as to obtain a third frame size;
a fourth weighting subunit 1434, configured to weight the frame size of the transition detection frame according to a fourth size weighting coefficient, so as to obtain a fourth frame size;
a second frame generation subunit 1435 is configured to generate the target detection frame according to the third frame size and the fourth frame size.
For the specific functional implementation of the frame mean value obtaining subunit 1431, the second coefficient obtaining subunit 1432, the third weighting subunit 1433, the fourth weighting subunit 1434, and the second frame generation subunit 1435, refer to step S104 in the embodiment corresponding to fig. 3; details are not repeated here.
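Both the transition detection frame and the target detection frame described above are weighted combinations of two sizes. A minimal sketch follows, assuming scalar frame sizes and example weighting coefficients; the description does not fix the coefficient values.

```python
def transition_frame_size(neighbor_size_mean, initial_size, w1=0.5, w2=0.5):
    # first/second size weighting coefficients applied to the neighbor detection-frame
    # size mean and to the initial detection frame size
    return w1 * neighbor_size_mean + w2 * initial_size

def target_frame_size(hist_size_mean, transition_size, w3=0.5, w4=0.5):
    # third/fourth size weighting coefficients applied to the historical detection-frame
    # size mean and to the transition detection frame size
    return w3 * hist_size_mean + w4 * transition_size
```

Typically the two coefficients in each pair sum to one so that the combined value stays on the scale of a frame size; this is a design choice assumed here, not something the description prescribes.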
The density map generation module 15 includes: an initial density map generation unit 151, a Gaussian kernel determination unit 152, and a convolution unit 153;
an initial density map generation unit 151, configured to generate an initial object density map for the detection image according to the object center position of each target object;
a Gaussian kernel determination unit 152, configured to determine a Gaussian kernel standard deviation based on the target detection frame, and determine a target Gaussian kernel for the detection target object according to the Gaussian kernel standard deviation;
a convolution unit 153, configured to perform a convolution operation on the initial object density map based on the target Gaussian kernel, to generate a target object density map for the detection target object.
For the specific functional implementation of the initial density map generation unit 151, the Gaussian kernel determination unit 152, and the convolution unit 153, refer to step S105 in the embodiment corresponding to fig. 3; details are not repeated here.
The initial density map generation unit 151 includes: a traversal subunit 1511 and a density map generating subunit 1512;
a traversal subunit 1511, configured to traverse at least two pixel points in the detection image, set a pixel value of a pixel point located at an object center position of each target object that is traversed to a first pixel value, and set a pixel value of a pixel point that is not located at an object center position of each target object that is traversed to a second pixel value;
the density map generating subunit 1512 is configured to generate an initial object density map according to the first pixel value and the second pixel value set for the at least two pixel points.
For the specific functional implementation of the traversal subunit 1511 and the density map generating subunit 1512, refer to step S105 in the embodiment corresponding to fig. 3; details are not repeated here.
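Putting units 151-153 together, a common way to realize this step can be sketched as follows; the mapping from the target detection frame size to the Gaussian standard deviation (the factor beta) is an assumption, since the description only states that the standard deviation is determined from the frame.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(image_shape, centers, frame_sizes, beta=0.3):
    """Sketch of units 151-153: set the first pixel value (1) at every object center and
    the second pixel value (0) elsewhere, then convolve each center with a Gaussian
    kernel whose standard deviation is derived from that object's detection frame."""
    h, w = image_shape
    density = np.zeros((h, w), dtype=np.float32)
    for (row, col), frame_size in zip(centers, frame_sizes):
        # per-object contribution to the initial object density map
        delta = np.zeros((h, w), dtype=np.float32)
        delta[int(row), int(col)] = 1.0
        sigma = beta * float(frame_size)  # assumed frame-size-to-standard-deviation mapping
        # convolution with the target Gaussian kernel for this object
        density += gaussian_filter(delta, sigma=sigma)
    return density
```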
The target frame generation module 14 includes: a first frame generating unit 144, a second frame generating unit 145, a first frame determining unit 146, a third frame generating unit 147, a fourth frame generating unit 148, and a second frame determining unit 149;
a first frame generating unit 144, configured to generate, in the x-th iteration process, a transition detection frame g_x for the detection target object according to the imaging distance of each target object for the imaging apparatus, the historical object detection frame, and the pending detection frame d_(x-1) for the detection target object generated in the (x-1)-th iteration process; if the (x-1)-th iteration process is the first iteration process, the pending detection frame d_(x-1) is obtained based on the pixel distance between the detection target object and each auxiliary target object;
a second frame generating unit 145, configured to generate a pending detection frame d_x for the detection target object according to the historical object detection frame and the transition detection frame g_x;
a first frame determining unit 146, configured to determine the pending detection frame d_x as the target detection frame when the size difference between the frame size of the pending detection frame d_x and the frame size of the pending detection frame d_(x-1) is less than or equal to a size difference threshold;
a third frame generating unit 147, configured to, when the size difference between the frame size of the pending detection frame d_x and the frame size of the pending detection frame d_(x-1) is greater than the size difference threshold, generate, in the (x+1)-th iteration process, a transition detection frame g_(x+1) for the detection target object according to the imaging distance of each target object for the imaging apparatus and the pending detection frame d_x;
a fourth frame generating unit 148, configured to generate a pending detection frame d_(x+1) for the detection target object according to the historical object detection frame and the transition detection frame g_(x+1);
a second frame determining unit 149, configured to determine the pending detection frame d_(x+1) as the target detection frame when the size difference between the frame size of the pending detection frame d_(x+1) and the frame size of the pending detection frame d_x is less than or equal to the size difference threshold.
For the specific functional implementation of the first frame generating unit 144, the second frame generating unit 145, the first frame determining unit 146, the third frame generating unit 147, the fourth frame generating unit 148, and the second frame determining unit 149, refer to step S104 in the embodiment corresponding to fig. 3; details are not repeated here.
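The iteration above can be sketched loosely as follows. The helper functions are the hypothetical ones defined earlier in this description, and the stopping threshold, the maximum iteration count, and the exact inputs fed to each step are assumptions rather than the precise data flow of the embodiment.

```python
def refine_frame_size(initial_size, det_cam_dist, aux_cam_dists, aux_frame_sizes,
                      hist_size_mean, size_diff_threshold=1.0, max_iter=20):
    """Repeat pending frame -> transition frame -> new pending frame until two successive
    pending frame sizes differ by at most the size difference threshold."""
    d_prev = initial_size  # d_(x-1): from the pixel distances on the first pass
    for _ in range(max_iter):
        # transition frame g_x from the imaging distances and the previous pending frame;
        # a fuller implementation would refresh aux_frame_sizes with the auxiliary
        # objects' latest pending frames on every pass
        neighbor_mean = neighbor_frame_size_mean(det_cam_dist, aux_cam_dists, aux_frame_sizes)
        g_x = transition_frame_size(neighbor_mean, d_prev)
        # pending frame d_x from the historical frame size mean and g_x
        d_x = target_frame_size(hist_size_mean, g_x)
        if abs(d_x - d_prev) <= size_diff_threshold:
            return d_x  # converged: d_x is taken as the target detection frame
        d_prev = d_x
    return d_prev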
The apparatus 1 further includes: a model training module 16, an input module 17, and an integration module 18;
a model training module 16, configured to train an object density detection model based on the target object density map, and determine the trained object density detection model as a target detection model;
an input module 17, configured to input an image to be detected including an object to be detected into a target detection model, and generate an object density map for the object to be detected in the target detection model;
an integration module 18, configured to perform an integration operation on the object density map to obtain the number of objects to be detected in the image to be detected.
For the specific functional implementation of the model training module 16, the input module 17, and the integration module 18, refer to step S105 in the embodiment corresponding to fig. 3; details are not repeated here.
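Once a density map is available (as a training target or as model output), the counting step is a plain summation. A toy example using the hypothetical make_density_map sketch above:

```python
# Two annotated objects with assumed centers and detection frame sizes.
density = make_density_map((480, 640),
                           centers=[(100, 200), (300, 450)],
                           frame_sizes=[24, 36])
# Integration operation over the density map: each Gaussian integrates to roughly one,
# so the sum approximates the number of objects in the image.
estimated_count = float(density.sum())
print(round(estimated_count))  # -> 2
```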
Therefore, with the method provided by the present application, a more accurate target detection frame for the detection target object can be obtained by combining the pixel distance between the detection target object and each auxiliary target object, the imaging distance of each target object for the imaging device, and the historical object detection frame, and a more accurate target object density map for the detection target object can then be generated from this target detection frame.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device provided in the present application. As shown in fig. 10, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; furthermore, the computer device 1000 may also include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may further include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory, such as at least one magnetic disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 10, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 may be configured to call the device control application program stored in the memory 1005 to implement the image data processing method described in the embodiment corresponding to fig. 3. It should be understood that the computer device 1000 described in this application can also perform the description of the image data processing apparatus 1 in the embodiment corresponding to fig. 9, which is not repeated here. Likewise, the beneficial effects of the same method are not described again.
Further, it should be noted that the present application also provides a computer-readable storage medium, in which the aforementioned computer program executed by the image data processing apparatus 1 is stored, the computer program including program instructions. When the processor executes the program instructions, the image data processing method described in the embodiment corresponding to fig. 3 can be performed, which is therefore not repeated here. Likewise, the beneficial effects of the same method are not described again. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, refer to the description of the method embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the present application; the scope of protection of the present application is therefore defined by the appended claims.

Claims (15)

1. An image data processing method characterized by comprising:
acquiring a detection image including at least two target objects based on an image pickup apparatus; the at least two target objects are composed of a detection target object and an auxiliary target object; an object center position of each target object is marked in the detection image;
determining a pixel distance between the detection target object and each auxiliary target object respectively and an image pickup distance of each target object for the image pickup device according to the object center position of each target object;
acquiring a history image based on the image pickup apparatus, and acquiring a history object detection frame of a history object indicated by an object center position of the detection target object from the history image;
generating a target detection frame for the detection target object according to the pixel distance between the detection target object and each auxiliary target object, the image pickup distance of each target object for the image pickup apparatus, and the history object detection frame;
and generating a target object density map aiming at the detection target object according to the target detection frame.
2. The method according to claim 1, wherein the determining, in accordance with the object center position of each target object, a pixel distance between the detection target object and each auxiliary target object, respectively, and an imaging distance of each target object for the imaging apparatus includes:
acquiring the number of interval pixels between the object center position of the detection target object and the object center position of each auxiliary target object, and determining the pixel distance between the detection target object and each auxiliary target object according to the number of interval pixels to which each auxiliary target object belongs;
and determining the vertical direction distance of each target object in the detection image according to the object center position of each target object, and determining the image pickup distance of each target object for the image pickup equipment according to the vertical direction distance to which each target object belongs.
3. The method according to claim 1, wherein the acquiring, from the history image, a history object detection frame of the history object indicated by the object center position of the detection target object includes:
sequencing at least one history object according to the pixel distance between the object center position of at least one history object in the history image and the object center position of the detection target object respectively to obtain the sequenced at least one history object;
selecting a reference historical object from the at least one sorted historical object according to the acquisition quantity of the first objects; the number of the reference history objects is less than or equal to the first object acquisition number;
and determining an object detection frame used for marking the reference history object in the history image as the history object detection frame.
4. The method according to claim 1, wherein the generating a target detection frame for the detection target object based on the pixel distance between the detection target object and each of the auxiliary target objects, the imaging distance of each of the target objects for the imaging apparatus, and the history object detection frame, respectively, comprises:
generating an initial detection frame aiming at the detection target object according to the pixel distance between the detection target object and each auxiliary target object respectively;
generating a transition detection frame aiming at the detection target object according to the image pickup distance of each target object aiming at the image pickup equipment and the initial detection frame;
and generating a target detection frame aiming at the detection target object according to the historical object detection frame and the transition detection frame.
5. The method according to claim 4, wherein the generating an initial detection frame for the detection target object according to the pixel distance between the detection target object and each auxiliary target object respectively comprises:
sequencing at least one auxiliary target object according to the pixel distance between the detection target object and each auxiliary target object respectively to obtain at least one sequenced auxiliary target object;
selecting a first reference target object from the at least one sequenced auxiliary target object according to the acquisition quantity of the second objects; the number of the first reference target objects is less than or equal to the second object acquisition number;
obtaining a distance mean of pixel distances between the first reference target object and the detection target object;
weighting the distance average value according to a frame size weighting coefficient to obtain an initial frame size;
and generating the initial detection frame according to the initial frame size.
6. The method according to claim 4, wherein the generating a transition detection frame for the detection target object based on the imaging distance of each target object to the imaging apparatus and the initial detection frame includes:
sequencing at least one auxiliary target object according to the shooting distance difference between the shooting distance to which the detection target object belongs and the shooting distance to which each auxiliary target object belongs to obtain the sequenced at least one auxiliary target object;
selecting a second reference target object from the at least one sequenced auxiliary target object according to the acquisition quantity of the third objects; the number of the second reference target objects is less than or equal to the third object acquisition number;
obtaining a detection frame size mean value of an object initial detection frame to which the second reference target object belongs;
and generating the transition detection frame according to the detection frame size mean value corresponding to the second reference target object and the initial detection frame.
7. The method of claim 6, wherein generating the transition detection frame according to the detection frame size mean value corresponding to the second reference target object and the initial detection frame comprises:
acquiring a first size weighting coefficient of a detection frame size mean value corresponding to the second reference target object, and acquiring a second size weighting coefficient of the initial detection frame;
weighting the detection frame size mean value corresponding to the second reference target object according to the first size weighting coefficient to obtain a first frame size;
weighting the frame size of the initial detection frame according to the second size weighting coefficient to obtain a second frame size;
and generating the transition detection frame according to the first frame size and the second frame size.
8. The method of claim 4, wherein generating a target detection frame for the detection target object according to the historical object detection frame and the transition detection frame comprises:
acquiring a detection frame size mean value corresponding to the historical object detection frame;
acquiring a third size weighting coefficient of a detection frame size mean value corresponding to the historical object detection frame, and acquiring a fourth size weighting coefficient of the transition detection frame;
according to the third size weighting coefficient, weighting the detection frame size mean value corresponding to the historical object detection frame to obtain a third frame size;
weighting the frame size of the transition detection frame according to the fourth size weighting coefficient to obtain a fourth frame size;
and generating the target detection frame according to the third frame size and the fourth frame size.
9. The method of claim 1, wherein generating a target object density map for the detection target object according to the target detection frame comprises:
generating an initial object density map for the detection image according to the object center position of each target object;
determining a Gaussian kernel standard deviation based on the target detection frame, and determining a target Gaussian kernel aiming at the detection target object according to the Gaussian kernel standard deviation;
and performing a convolution operation on the initial object density map based on the target Gaussian kernel to generate the target object density map aiming at the detection target object.
10. The method of claim 9, wherein generating an initial object density map for the detection image based on the object center position of each target object comprises:
traversing at least two pixel points in the detection image, setting the pixel value of the traversed pixel point at the object center position of each target object as a first pixel value, and setting the pixel value of the traversed pixel point which is not at the object center position of each target object as a second pixel value;
and generating the initial object density map according to the first pixel value and the second pixel value set for the at least two pixel points.
11. The method according to claim 1, wherein the generating a target detection frame for the detection target object based on the pixel distance between the detection target object and each of the auxiliary target objects, the imaging distance of each of the target objects for the imaging apparatus, and the history object detection frame, respectively, comprises:
in the x-th iteration process, generating a transition detection frame g_x for the detection target object according to the image pickup distance of each target object for the image pickup apparatus, the historical object detection frame, and the pending detection frame d_(x-1) for the detection target object generated in the (x-1)-th iteration process; if the (x-1)-th iteration process is the first iteration process, the pending detection frame d_(x-1) is obtained based on the pixel distance between the detection target object and each auxiliary target object;
generating a pending detection frame d_x for the detection target object according to the historical object detection frame and the transition detection frame g_x;
when the size difference between the frame size of the pending detection frame d_x and the frame size of the pending detection frame d_(x-1) is less than or equal to a size difference threshold, determining the pending detection frame d_x as the target detection frame;
when the size difference between the frame size of the pending detection frame d_x and the frame size of the pending detection frame d_(x-1) is greater than the size difference threshold, generating, in the (x+1)-th iteration process, a transition detection frame g_(x+1) for the detection target object according to the image pickup distance of each target object for the image pickup apparatus and the pending detection frame d_x;
generating a pending detection frame d_(x+1) for the detection target object according to the historical object detection frame and the transition detection frame g_(x+1);
and when the size difference between the frame size of the pending detection frame d_(x+1) and the frame size of the pending detection frame d_x is less than or equal to the size difference threshold, determining the pending detection frame d_(x+1) as the target detection frame.
12. The method of claim 1, further comprising:
training an object density detection model based on the target object density map, and determining the trained object density detection model as a target detection model;
inputting an image to be detected comprising an object to be detected into the target detection model, and generating an object density map for the object to be detected in the target detection model;
and performing an integral operation on the object density map to obtain the number of objects to be detected in the image to be detected.
13. An image data processing apparatus characterized by comprising:
an image acquisition module, configured to acquire a detection image including at least two target objects based on an image pickup apparatus; the at least two target objects are composed of a detection target object and an auxiliary target object; an object center position of each target object is marked in the detection image;
a distance determination module, configured to determine, according to the object center position of each target object, a pixel distance between the detection target object and each auxiliary target object respectively, and an image pickup distance of each target object for the image pickup apparatus;
a history frame acquisition module configured to acquire a history image based on the image capturing apparatus, and acquire a history object detection frame of a history object indicated by an object center position of the detection target object from the history image;
a target frame generation module configured to generate a target detection frame for the detection target object according to a pixel distance between the detection target object and each auxiliary target object, an imaging distance of each target object for the imaging apparatus, and the history object detection frame;
and a density map generation module, configured to generate a target object density map for the detection target object according to the target detection frame.
14. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1-12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method of any of claims 1-12.
CN202011222364.6A 2020-11-05 2020-11-05 Image data processing method and device and computer readable storage medium Active CN112101303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011222364.6A CN112101303B (en) 2020-11-05 2020-11-05 Image data processing method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112101303A (en) 2020-12-18
CN112101303B (en) 2021-02-05

Family

ID=73785426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011222364.6A Active CN112101303B (en) 2020-11-05 2020-11-05 Image data processing method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112101303B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102633938B1 (en) * 2023-04-21 2024-02-08 (주)와이드큐브 Method for crowd density estimation using cctv video and appartaus therefor

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881022A (en) * 2012-07-20 2013-01-16 西安电子科技大学 Concealed-target tracking method based on on-line learning
CN107273891A (en) * 2017-06-08 2017-10-20 深圳市唯特视科技有限公司 A kind of target category detection method based on click supervised training
CN109446901A (en) * 2018-09-21 2019-03-08 北京晶品特装科技有限责任公司 A kind of real-time humanoid Motion parameters algorithm of embedded type transplanted
KR20190062852A (en) * 2017-11-29 2019-06-07 영남대학교 산학협력단 System, module and method for detecting pedestrian, computer program

Also Published As

Publication number Publication date
CN112101303B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN111723611A (en) Pedestrian re-identification method and device and storage medium
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN112446380A (en) Image processing method and device
CN110163211B (en) Image recognition method, device and storage medium
CN110991380A (en) Human body attribute identification method and device, electronic equipment and storage medium
CN112200056B (en) Face living body detection method and device, electronic equipment and storage medium
CN110222572A (en) Tracking, device, electronic equipment and storage medium
CN112184757A (en) Method and device for determining motion trail, storage medium and electronic device
CN114049512A (en) Model distillation method, target detection method and device and electronic equipment
CN112836625A (en) Face living body detection method and device and electronic equipment
CN111985458A (en) Method for detecting multiple targets, electronic equipment and storage medium
CN112614140A (en) Method and related device for training color spot detection model
CN112101303B (en) Image data processing method and device and computer readable storage medium
WO2021051382A1 (en) White balance processing method and device, and mobile platform and camera
CN110705564A (en) Image recognition method and device
WO2022120996A1 (en) Visual position recognition method and apparatus, and computer device and readable storage medium
CN113205072A (en) Object association method and device and electronic equipment
CN112668596B (en) Three-dimensional object recognition method and device, recognition model training method and device
CN112150529B (en) Depth information determination method and device for image feature points
CN115115552B (en) Image correction model training method, image correction device and computer equipment
CN110751163A (en) Target positioning method and device, computer readable storage medium and electronic equipment
CN114511877A (en) Behavior recognition method and device, storage medium and terminal
CN113343851A (en) Method and related device for training human face aging detection model
CN113642353A (en) Training method of face detection model, storage medium and terminal equipment
CN116630367B (en) Target tracking method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40036297

Country of ref document: HK