CN109815936B - Target object analysis method and device, computer equipment and storage medium
- Publication number
- CN109815936B (application CN201910130040.0A)
- Authority
- CN
- China
- Prior art keywords
- video
- analyzed
- preset
- target
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Image Analysis (AREA)
Abstract
The embodiment of the application provides a target object analysis method and device, computer equipment and storage medium, wherein the method comprises the following steps: determining shooting information of the acquired video to be analyzed; wherein the video to be analyzed contains at least one image of a target object; determining a preset model for processing the video to be analyzed according to the shooting information so as to determine the total number of target objects contained in the video to be analyzed; the preset model is a preset target detection model or a preset target counting model, and the shooting information of the video to be analyzed processed by the preset target detection model is different from the shooting information of the video to be analyzed processed by the preset target counting model.
Description
Technical Field
The embodiment of the application relates to the field of computer vision, in particular to a target object analysis method and device, computer equipment and a storage medium.
Background
Crowd analysis is a popular application field of intelligent security. In the related art, a crowd counting technique based on a deep convolutional neural network can detect the crowd density and the crowd foreground image of video frames, count head-and-shoulder information of the crowd in the video frames, and output a crowd density map according to the head-and-shoulder information. Owing to this implementation principle, the technique is difficult to apply across different video scenes: if the human body regions in a video frame are too large, the output count is too high; similarity between the background color and the target color in a scene can make the foreground segmentation inaccurate; and the camera angle of the video frame also affects the final output.
Disclosure of Invention
In view of this, the embodiments of the present application provide a target object analysis method and apparatus, a computer device, and a storage medium.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a target object analysis method, which comprises the following steps:
determining shooting information of the acquired video to be analyzed; wherein the video to be analyzed contains at least one image of a target object;
determining a preset model for processing the video to be analyzed according to the shooting information so as to determine the total number of target objects contained in the video to be analyzed; the preset model is a preset target detection model or a preset target counting model, and the shooting information of the video to be analyzed processed by the preset target detection model is different from the shooting information of the video to be analyzed processed by the preset target counting model.
In the above method, the shooting information of the video to be analyzed includes: the shooting scene to which the video to be analyzed belongs and/or the shooting period of the video to be analyzed.
In the above method, before the determining, according to the shooting information, a preset model for processing the video to be analyzed, the method further includes:
adopting the preset target detection model or the preset target counting model as an initialization model;
correspondingly, the preset model for processing the video to be analyzed is adjusted according to the shooting information.
In the above method, the adjusting the preset model for processing the video to be analyzed according to the shooting information includes:
if the scene to which the video to be analyzed belongs is contained in a preset scene set and/or the period to which the video to be analyzed belongs is within a preset period, the initialization model is adjusted to be the preset target detection model;
processing the video to be analyzed by adopting the preset target detection model;
and if the scene to which the video to be analyzed belongs is not contained in a preset scene set and the period to which the video to be analyzed belongs is not within a preset period, adjusting the initialization model into the preset target counting model.
In the above method, before determining the preset model for processing the video to be analyzed according to the shooting information, the method further includes:
and decoding the video to be analyzed by adopting a video decoder to obtain continuous multi-frame images.
In the above method, when it is determined that the preset target detection model is used to process the video to be analyzed, determining the total number of target objects contained in the video to be analyzed includes:
scanning each frame of image of the multi-frame image by adopting the preset target detection model to obtain the physical characteristics of each target object;
generating a detection frame of each target object according to the physical characteristics of each target object;
and determining the total number of target objects contained in the video to be analyzed according to the number of the detection frames.
In the above method, the scanning each frame of image of the multi-frame image by using the preset target detection model to obtain the physical characteristics of each target object includes:
and scanning each frame of image by adopting a preset target detection model according to a preset step length, and determining the physical characteristics of each target object appearing in each frame of image.
In the above method, the multi-frame image includes M frame images, M is an integer greater than or equal to 2, and the generating the detection frame of each target object according to the physical feature of each target object includes:
Scanning an ith frame image of the M frame images by using a preset target detection model, and determining physical characteristics of N target objects contained in the ith frame image; wherein i and N are integers greater than 0, and i is less than or equal to M;
if the physical characteristics of the j-th target object of the N target objects are different from the physical characteristics of the target objects in the frame images other than the i-th frame image, generating a detection frame of the j-th target object; wherein j is an integer greater than 0 and less than or equal to N.
In the above method, when it is determined that the preset target counting model is used to process the video to be analyzed, determining the total number of target objects contained in the video to be analyzed includes:
processing the video to be analyzed by using a preset target counting model to obtain a foreground segmentation map of the video to be analyzed and an object group density map of the video to be analyzed;
and determining the total number of target objects contained in the video to be analyzed according to the foreground segmentation map and the target object group density map.
In the above method, the processing the video to be analyzed by using a preset target counting model to obtain a foreground segmentation map of the video to be analyzed and an object group density map of the video to be analyzed includes:
performing edge detection on each frame of image in the multi-frame images of the video to be analyzed by using a preset target counting model, and determining an area covered by the head of each target object in each frame of image;
dividing the target object and the background in each frame of image to obtain a foreground division map of each frame of image;
and generating an object group density map for representing the density of the target object in each frame image according to the area covered by the head of each target object in each frame image.
In the above method, determining the total number of target objects contained in the video to be analyzed according to the number of the detection frames includes:
if the number of the detection frames in the ith frame image of the video to be analyzed is greater than a preset number threshold, switching the preset target detection model into the preset target counting model;
and processing the first residual video which is not processed by the preset target detection model in the video to be analyzed by using the preset target counting model to obtain the total number of the target objects.
In the above method, the processing the first remaining video that is not processed by the preset target detection model in the video to be analyzed by using the preset target counting model to obtain the total number of the target objects includes:
determining a foreground segmentation sub-graph of the first residual video and an object group density sub-graph of the video to be analyzed by utilizing the preset target counting model;
determining a second number of target objects contained in the first residual video according to the foreground segmentation sub-graph and the object group density sub-graph;
the second number is determined as the total number of the target objects.
In the above method, the determining, according to the foreground segmentation map and the target object group density map, a total number of target objects included in the video to be analyzed includes:
determining a third number of target objects contained in an L-th frame image according to a foreground segmentation map of the L-th frame image in the multi-frame image and an object group density map of the target objects in the L-th frame image; wherein L is an integer greater than 0;
if the third number is smaller than a preset number threshold, switching the preset target counting model to the preset target detection model;
and processing second residual videos which are not processed by the preset target counting model in the videos to be analyzed by using the preset target detection model to obtain the total number of the target objects.
In the above method, the processing the second remaining video that is not processed by the preset target counting model in the video to be analyzed by using the preset target detection model to obtain the total number of the target objects includes:
determining a sub-detection frame of each target object in the second residual video by using the preset target detection model;
determining a fourth number of target objects contained in the second residual video according to the number of the sub-detection frames;
the fourth number is determined as the total number of the target objects.
In the above method, the method further comprises:
and generating an alarm event matched with the duration and the numerical range according to the numerical range of the total number of the target objects and the duration for which the number of target objects in the video to be analyzed remains at that total number.
The embodiment of the application provides a target object analysis device, which comprises: the device comprises a first acquisition module and a first determination module, wherein:
the first acquisition module is used for determining shooting information of the acquired video to be analyzed; wherein the video to be analyzed contains at least one image of a target object;
The first determining module is used for determining a preset model for processing the video to be analyzed according to the shooting information so as to determine the total number of target objects contained in the video to be analyzed; the preset model is a preset target detection model or a preset target counting model, and the shooting information of the video to be analyzed processed by the preset target detection model is different from the shooting information of the video to be analyzed processed by the preset target counting model.
In the above apparatus, the shooting information of the video to be analyzed includes: the shooting scene to which the video to be analyzed belongs and/or the shooting period of the video to be analyzed.
In the above apparatus, the apparatus further includes:
the first initialization module is used for adopting the preset target detection model or the preset target counting model as an initialization model;
correspondingly, the first determining module includes: and the first adjusting sub-module is used for adjusting a preset model for processing the video to be analyzed according to the shooting information.
In the above apparatus, the first adjustment submodule includes:
the first judging unit is used for adjusting the initialization model to the preset target detection model if the scene to which the video to be analyzed belongs is contained in a preset scene set and/or the period to which the video to be analyzed belongs is within a preset period;
The first processing unit is used for processing the video to be analyzed by adopting the preset target detection model;
and the second judging unit is used for adjusting the initialization model to the preset target counting model if the scene to which the video to be analyzed belongs is not included in a preset scene set and the period to which the video to be analyzed belongs is not within a preset period.
In the above apparatus, the apparatus further includes:
and the first decoding module is used for decoding the video to be analyzed by adopting a video decoder to obtain continuous multi-frame images.
In the above apparatus, when it is determined that the preset target detection model is used to process the video to be analyzed, the first determining module includes:
the first scanning sub-module is used for scanning each frame of image of the multi-frame image by adopting the preset target detection model to obtain the physical characteristics of each target object;
the first generation sub-module is used for generating a detection frame of each target object according to the physical characteristics of each target object;
and the first determining submodule is used for determining the total number of target objects contained in the video to be analyzed according to the number of the detection frames.
In the above apparatus, the first scanning sub-module includes:
the first scanning unit is used for scanning each frame of image according to a preset step length by adopting a preset target detection model and determining the physical characteristics of each target object appearing in each frame of image.
In the above apparatus, the multi-frame image includes M frame images, M is an integer greater than or equal to 2, and the first generating sub-module includes:
the second scanning unit is used for scanning an ith frame image of the M frame images by using a preset target detection model and determining physical characteristics of N target objects contained in the ith frame image; wherein i and N are integers greater than 0, and i is less than or equal to M;
a first generating unit configured to generate a detection frame of the j-th target object if the physical characteristics of the j-th target object of the N target objects are different from the physical characteristics of the target objects in the frame images other than the i-th frame image; wherein j is an integer greater than 0 and less than or equal to N.
In the above apparatus, when it is determined that the preset target counting model is used to process the video to be analyzed, the first determining module includes:
The second judging sub-module is used for processing the video to be analyzed by utilizing a preset target counting model to obtain a foreground segmentation map of the video to be analyzed and an object group density map of the video to be analyzed;
and the second determining submodule is used for determining the total number of target objects contained in the video to be analyzed according to the foreground segmentation map and the target object group density map.
In the above apparatus, the second judging submodule includes:
the first detection unit is used for carrying out edge detection on each frame of image in the multi-frame images of the video to be analyzed by utilizing a preset target counting model, and determining the area covered by the head of each target object in each frame of image;
the first segmentation unit is used for segmenting the target object and the background in each frame of image to obtain a foreground segmentation map of each frame of image;
and the second generation unit is used for generating an object group density map for representing the density of the target object in each frame image according to the area covered by the head of each target object in each frame image.
In the above apparatus, the first determining submodule includes:
a first switching unit, configured to switch the preset target detection model to the preset target counting model if the number of detection frames in the i-th frame image of the video to be analyzed is greater than a preset number threshold;
And the second processing unit is used for processing the first residual video which is not processed by the preset target detection model in the video to be analyzed by utilizing the preset target counting model to obtain the total number of the target objects.
In the above apparatus, the second processing unit includes:
The first determining subunit is used for determining a foreground segmentation sub-graph of the first residual video and an object group density sub-graph of the video to be analyzed by utilizing the preset target counting model;
a second determining subunit, configured to determine, according to the foreground segmentation sub-graph and the object group density sub-graph, a second number of target objects included in the first residual video;
and a third determining subunit configured to determine the second number as a total number of the target objects.
In the above apparatus, the second determining submodule includes:
a first determining unit, configured to determine a third number of target objects included in an L-th frame image according to a foreground segmentation map of the L-th frame image in the multi-frame image and an object group density map of the target objects in the L-th frame image; wherein L is an integer greater than 0;
the second switching unit is used for switching the preset target counting model into the preset target detection model if the third number is smaller than a preset number threshold;
And the second processing unit is used for processing second residual videos which are not processed by the preset target counting model in the videos to be analyzed by utilizing the preset target detection model to obtain the total number of the target objects.
In the above apparatus, the second processing unit includes:
a fourth determining subunit, configured to determine a sub-detection frame of each target object in the second residual video by using the preset target detection model;
a fifth determining subunit, configured to determine, according to the number of sub-detection frames, a fourth number of target objects included in the second remaining video;
a sixth determining subunit configured to determine the fourth number as a total number of the target objects.
In the above apparatus, the apparatus further includes:
and the first alarm module is used for generating an alarm event matched with the duration and the numerical range according to the numerical range of the total number of the target objects and the duration for which the number of target objects in the video to be analyzed remains at that total number.
Correspondingly, the embodiment of the application provides a computer storage medium, and the computer storage medium stores computer executable instructions, and after the computer executable instructions are executed, the steps in the target object analysis method provided by the embodiment of the application can be realized.
The embodiment of the application provides a computer device, which comprises a memory and a processor, wherein the memory stores computer executable instructions, and the processor can realize the steps in the target object analysis method provided by the embodiment of the application when running the computer executable instructions on the memory.
The embodiment of the application provides a target object analysis method and device, computer equipment and a storage medium. First, shooting information of an acquired video to be analyzed is determined, wherein the video to be analyzed contains at least one image of a target object; then, according to the shooting information, a preset model for processing the video to be analyzed is determined so as to determine the total number of target objects contained in the video to be analyzed. In this way, the total number of target objects in the video to be analyzed can be determined by different preset models under different conditions, and the total number can be determined more accurately even in videos containing fewer target objects.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
FIG. 1A is a schematic diagram of a network architecture according to an embodiment of the present application;
FIG. 1B is a schematic diagram of an implementation flow of a target object analysis method according to an embodiment of the present application;
FIG. 2A is a schematic diagram of a process for implementing a target object analysis method according to an embodiment of the present application;
FIG. 2B is a flowchart illustrating another implementation of the target object analysis method according to an embodiment of the present application;
FIG. 2C is a flowchart illustrating another implementation of the target object analysis method according to an embodiment of the present application;
FIG. 2D is a flowchart illustrating another implementation of the target object analysis method according to an embodiment of the present application;
FIG. 2E is a flowchart illustrating another implementation of the target object analysis method according to the embodiment of the present application;
FIG. 3 is a schematic diagram of a flow chart of an implementation of a target object analysis method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a composition structure of a target object analysis device according to an embodiment of the present application;
fig. 5 is a schematic diagram of a composition structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of the specific technical solutions of the present application will be given with reference to the accompanying drawings in the embodiments of the present application. The following examples are illustrative of the application and are not intended to limit the scope of the application.
In this embodiment, a network architecture is provided first. Fig. 1A is a schematic diagram of the composition structure of the network architecture according to an embodiment of the present application. As shown in fig. 1A, the network architecture includes two or more computer devices 11 to 1N and a server 31, where the computer devices 11 to 1N interact with the server 31 through a network 21. In implementation, the computer device may be any of various types of computer devices with information processing capability, such as a mobile phone, a tablet computer, a desktop computer, or a personal digital assistant.
This embodiment provides a target object analysis method in which a target detection model is used to detect human bodies in videos containing few people, and a target counting model is used to determine the number of target objects in the video to be processed when the number of people is larger, thereby ensuring that the number of people in videos with few people is determined more accurately.
Fig. 1B is a schematic flow chart of an implementation of a target object analysis method according to an embodiment of the present application, as shown in fig. 1B, the method includes the following steps:
step S101, determining the acquired shooting information of the video to be analyzed.
Here, the video to be analyzed includes at least one image of a target object, for example, a video containing multiple persons in a subway scene, or a video containing a flock of sheep in a grazing scene. Step S101 may be implemented by a computer device; further, the computer device may be an intelligent terminal, for example, a mobile terminal device with wireless communication capability such as a mobile phone, a tablet computer, or a notebook computer, or a less portable intelligent terminal device such as a desktop computer. The computer device is used for image recognition or processing.
Step S102, determining a preset model for processing the video to be analyzed according to the shooting information so as to determine the total number of target objects contained in the video to be analyzed.
Here, the shooting information of the video to be analyzed includes the shooting scene to which the video to be analyzed belongs and/or the shooting period of the video to be analyzed. The preset model is a preset target detection model or a preset target counting model, and the shooting information of videos processed by the preset target detection model is different from that of videos processed by the preset target counting model. The preset target detection model may be a model obtained by training a neural network in any target detection manner. For example, the sample image is traversed at a certain pixel interval to obtain head features (face, hair, body, or the like) in the sample image, and a detection frame of each person in the sample image is determined; this detection frame is compared with the detection frames of the persons in the known sample images, thereby completing the training of the preset target detection model. Step S102 may be understood as counting the number of target objects contained in the video to be analyzed with different preset models according to the shooting scene to which the video to be analyzed belongs and/or the shooting period of the video to be analyzed. For example, if the scene to which the video to be analyzed belongs is included in a preset scene set and/or the period to which it belongs is within a preset period (i.e., the video satisfies a preset condition), the initialization model is adjusted to the preset target detection model; if the scene is not included in the preset scene set and the period is not within the preset period (i.e., the video does not satisfy the preset condition), the initialization model is adjusted to the preset target counting model, and if the initialization model already is the preset target counting model, the video to be analyzed continues to be processed with the preset target counting model.
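As an illustrative sketch only (the scene set, period bounds, and all function and variable names below are hypothetical, not taken from the patent), the selection logic of step S102 might look as follows:

```python
from dataclasses import dataclass
from datetime import time

# Hypothetical examples of a preset scene set and a preset (off-peak) period.
PRESET_SCENES = {"subway", "supermarket", "campus"}
PRESET_PERIOD = (time(22, 0), time(6, 0))  # 10 p.m. to 6 a.m.

@dataclass
class ShootingInfo:
    scene: str     # shooting scene the video belongs to
    shot_at: time  # shooting period of the video

def in_preset_period(t: time) -> bool:
    start, end = PRESET_PERIOD
    if start > end:  # the period wraps past midnight
        return t >= start or t <= end
    return start <= t <= end

def select_model(info: ShootingInfo) -> str:
    # Scene in the preset set and/or period within the preset period
    # (video satisfies the preset condition) -> target detection model.
    if info.scene in PRESET_SCENES or in_preset_period(info.shot_at):
        return "detection"
    return "counting"  # otherwise -> target counting model

print(select_model(ShootingInfo("subway", time(23, 30))))  # detection
print(select_model(ShootingInfo("square", time(7, 30))))   # counting
```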
In actual implementation, the computer device outputs the total number of target objects: the total number may be shown on a display screen of the computer device, or the computer device may output the total number to another device, i.e., send it to another device, for example an intelligent terminal of a user.
In the embodiment of the present application, the shooting information of the input video to be analyzed is analyzed to judge whether it matches a specific scene; if so, the number of target objects in the video is determined with the target detection model, which ensures that the total number of target objects in images containing fewer target objects is determined more accurately. Because the target detection model is adopted when the crowd is sparse during off-peak hours, hardware resources are also utilized more efficiently.
The embodiment further provides a target object analysis method, fig. 2A is a schematic flow chart of implementation of the target object analysis method according to the embodiment of the application, and as shown in fig. 2A, the method includes the following steps:
step S201, adopting the preset target counting model or the preset target counting model as an initialization model.
Here, step S201 may be understood as follows: after the video to be analyzed is acquired, the initialization model for determining the number of target objects in the video to be analyzed is the preset target counting model; that is, after the video to be analyzed is acquired, the default model is the preset target counting model. Then the shooting information of the captured video is determined, and whether to switch the preset target counting model to the preset target detection model is decided based on the shooting information.
Step S202, determining shooting information of the acquired video to be analyzed.
Here, after step S202, the shooting information is judged, if the scene to which the video to be analyzed belongs is included in a preset scene set, and the period to which the video to be analyzed belongs is within a preset period, step S203 is entered; if the scene to which the video to be analyzed belongs is not included in the preset scene set, or the period to which the video to be analyzed belongs is not within the preset period, it is determined that the shooting information does not meet the preset condition, and step S205 is performed.
Step S203, if the scene to which the video to be analyzed belongs is included in a preset scene set, and/or the period to which the video to be analyzed belongs is within a preset period, adjusting the initialization model to the preset target detection model.
Here, the scene to which the video to be analyzed belongs and the period to which it belongs, both contained in the shooting information, are judged. If the scene is included in a preset scene set and/or the period is within a preset period, it is determined that the shooting information meets the preset condition, and the preset target counting model serving as the initialization model is adjusted to the preset target detection model. The preset scene set may include subway scenes, supermarket scenes, campus scenes, and the like; a video meeting the preset condition may be understood as a video containing a small number of people. For example, videos shot in a subway scene after ten p.m. are determined to meet the preset condition: the number of people contained in videos of this scene is obviously small, so switching to the target detection model to detect human bodies determines the number of people in the video more accurately. Videos shot in a subway scene during the morning rush hour are determined to be videos that do not meet the preset condition: the number of people in this period is obviously large, and in this embodiment the preset target counting model is used to count the target objects. In this way, the target detection model is used for videos with fewer people and the preset target counting model for videos with more people, so the counted number of people is more accurate. One image of a target object in the video to be analyzed corresponds to one detection frame; for example, if the target object is a person, a detection frame is generated for each person in the video, and the number of people in the video can be determined by counting the number of detection frames.
And step S204, processing the video to be analyzed by adopting the preset target detection model.
Step S205, if the scene to which the video to be analyzed belongs is not included in the preset scene set and the period to which the video to be analyzed belongs is not within the preset period, adjusting the initialization model to the preset target count model.
Here, if the scene to which the video to be analyzed belongs is not included in the preset scene set and the period to which it belongs is not within the preset period, the video contains more target objects. For example, for a subway scene, seven in the morning falls within the rush-hour peak, so if the video to be analyzed was shot in a subway scene at seven a.m., its shooting information is determined not to satisfy the preset condition. Similarly, for a bar scene in the period from ten at night to four in the early morning, the number of people in the video is large, so the shooting information of such a video is determined not to satisfy the preset condition. In step S205, if the scene to which the video to be analyzed belongs is not included in the preset scene set and the period is not within the preset period, there are more target objects in the video, and the initialization model (i.e., the preset target counting model) continues to be used to count the target objects.
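Continuing the hypothetical sketch above, the initialization-and-adjustment flow of steps S201 to S205 could be expressed as:

```python
def init_and_adjust(info: ShootingInfo) -> str:
    model = "counting"             # S201: default initialization model
    selected = select_model(info)  # S202: judge the shooting information
    if selected == "detection":    # S203: preset condition met -> adjust
        model = "detection"
    # S205: condition not met -> keep the counting model and continue
    return model
```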
In this embodiment, the shooting information of the input video to be analyzed is analyzed to judge whether it matches a specific scene; if so, the default preset model is switched to the preset target detection model to determine how many target objects are in the video. Thus the total number of target objects in the video to be analyzed can be determined by different preset models under different conditions, and the total number can be determined more accurately even in images containing fewer target objects.
The present embodiment further provides a target object analysis method, and fig. 2B is a schematic flow chart of another implementation of the target object analysis method according to the embodiment of the present application, as shown in fig. 2B, where the method includes the following steps:
step S221, determining the shooting information of the acquired video to be analyzed.
Step S222, adopting a video decoder to decode the video to be analyzed to obtain continuous multi-frame images.
Here, the multi-frame image includes M-frame images, M being an integer of 2 or more.
Step S222 above provides a manner of decoding the video to be analyzed: after the video to be analyzed is obtained, it is first decoded into continuous multi-frame images, and then whether the shooting information meets the preset condition is judged, so that each frame image is processed with the corresponding preset model.
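The patent does not name a specific decoder; as one common possibility (an assumption, not the patent's implementation), OpenCV's VideoCapture can decode a video file into continuous frames:

```python
import cv2  # pip install opencv-python

def decode_video(path: str) -> list:
    """Decode a video into a list of continuous frame images (step S222)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()  # ok becomes False at the end of the stream
        if not ok:
            break
        frames.append(frame)    # frame is an H x W x 3 BGR array
    cap.release()
    return frames

frames = decode_video("to_analyze.mp4")  # hypothetical file name
print(f"decoded {len(frames)} continuous frames")
```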
Step S223, if the scene to which the video to be analyzed belongs is included in a preset scene set, and/or the period to which the video to be analyzed belongs is within a preset period, adjusting the initialization model to the preset target detection model.
Step S224, scanning each frame of image of the multiple frames of images by using the preset target detection model, so as to obtain the physical characteristics of each target object.
Here, each frame image is scanned with the preset target detection model according to a preset step length, and the physical characteristics of each target object appearing in each frame image are determined. That is, after it is determined that the shooting information of the video meets the preset condition, the preset target counting model is switched to the preset target detection model; each frame of the multi-frame images is then scanned with the preset target detection model according to a certain step length to obtain the physical characteristics of each target object (for example, each frame image is traversed at a certain pixel interval).
Step S225, generating a detection frame of each target object according to the physical characteristics of each target object.
Here, the step S225 may be implemented by the following procedure:
the first step, an ith frame image of the M frame images is scanned by using a preset target detection model, and physical characteristics of N target objects contained in the ith frame image are determined.
Here, i and N are integers greater than 0, and i is equal to or less than M.
And a second step of generating a detection frame of the j-th target object if the physical characteristics of the j-th target object of the N target objects are different from the physical characteristics of the target objects in the other frame images except the i-th frame image.
Here, j is an integer greater than 0 and equal to or less than N. If the physical characteristics of a target object among the N target objects in one frame image differ from the physical characteristics of the target objects in the other frames, the target object has not appeared in the other frames, i.e., no corresponding detection frame has been generated for it in the other frames; a detection frame is therefore generated for the target object in this frame. This ensures that each target object in the video corresponds to exactly one detection frame and that no target object corresponds to multiple detection frames, which guarantees the accuracy of determining the total number of target objects based on the number of detection frames.
Step S226, determining the total number of target objects contained in the video to be analyzed according to the number of the detection frames.
Steps S223 to S226 above provide a manner of implementing, after the video to be analyzed has been decoded, "if the scene to which the video to be analyzed belongs is included in a preset scene set and/or the period to which the video to be analyzed belongs is within a preset period, adjusting the initialization model to the preset target detection model". In this manner, the decoded multi-frame images are detected frame by frame to obtain a detection frame corresponding to each target object, thereby determining the total number of target objects in the video to be analyzed.
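A minimal sketch of steps S224 to S226, with the detector and the feature-comparison function left as hypothetical callables (the patent does not fix their concrete form):

```python
def count_by_detection(frames, detect, same_object) -> int:
    """detect(frame) -> list of physical-feature descriptors in that frame;
    same_object(a, b) -> True if two descriptors describe one target object."""
    seen = []  # descriptors that already have a detection frame
    for frame in frames:                        # scan frame by frame
        for feat in detect(frame):              # S224: physical features
            # S225: generate a detection frame only for a target object whose
            # features differ from those seen in all other frames so far.
            if not any(same_object(feat, s) for s in seen):
                seen.append(feat)
    return len(seen)  # S226: total = number of detection frames
```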
In this embodiment, when there are few people in the video, the initialized preset model, i.e., the preset target counting model, is switched to the target detection model to determine the number of people in the video. In this case the target detection model is the better choice: it effectively compensates for the counting model's inaccurate counts in video scenes with sparse crowds and larger targets, and it also reduces the energy consumption of target analysis.
The present embodiment further provides a target object analysis method, and fig. 2C is a schematic flow chart of another implementation of the target object analysis method according to the embodiment of the present application, as shown in fig. 2C, where the method includes the following steps:
In step S231, the acquired shooting information of the video to be analyzed is determined.
And step S232, decoding the video to be analyzed by adopting a video decoder to obtain continuous multi-frame images.
Step S233, if the scene to which the video to be analyzed belongs is not included in the preset scene set and the period to which the video to be analyzed belongs is not within the preset period, processing the video to be analyzed by using a preset target counting model to obtain a foreground segmentation map of the video to be analyzed and an object group density map of the video to be analyzed.
Here, the step S233 may be implemented by the following procedure:
Firstly, the video to be analyzed is decoded with a video decoder to obtain continuous multi-frame images. Secondly, edge detection is performed on each frame of the multi-frame images with the preset target counting model, and the area covered by the head of each target object in each frame image is determined; for example, if the target object is a person, the pixel area occupied by each person's head in each frame image is determined. Thirdly, the target objects and the background in each frame image are segmented to obtain a foreground segmentation map of each frame image. Finally, an object group density map representing the density of the target objects in each frame image is generated according to the area covered by the head of each target object in each frame image.
Step S234, determining the total number of target objects contained in the video to be analyzed according to the foreground segmentation map and the target object group density map.
Here, if the target objects in the video to be analyzed are a crowd, step S234 may estimate, from the known position of each head, the size of the head at that position to obtain the coverage area of the head, convert that area into the probability that the area is a head (with the probabilities over the area summing to 1), and thereby obtain a crowd density map. After the crowd density map is obtained, the number of people can be obtained by integrating (summing) the density map. Obviously, in this embodiment the density map may also be determined in other ways, and the total number of target objects may be determined based on the density map in other ways; for example, the pixel area occupied by each person's head is determined, and the density map is determined based on that pixel area, thereby obtaining the total number of people.
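As a hedged illustration of this paragraph's density-map reasoning (the head positions and Gaussian spread below are made up for the example), a NumPy sketch:

```python
import numpy as np

def crowd_density_map(shape, head_positions, sigma=4.0):
    """Place a Gaussian normalized to sum to 1 at each known head position,
    so that integrating (summing) the map yields the number of heads."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape)
    for cy, cx in head_positions:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # each head contributes exactly 1 to the total
    return dmap

dmap = crowd_density_map((120, 160), [(30, 40), (60, 100), (90, 20)])
print(round(dmap.sum()))  # 3 people: the count is the map's integral
```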
In this embodiment, whether the video is sparse in target objects is preliminarily determined by judging whether the scene information of the video to be analyzed meets the preset condition; if so, the preset target detection model is adopted, and if not, the preset target counting model is adopted. For example, in a subway scene the crowd is dense and numerous during the morning and evening peaks, so the target counting model can be used, while during off-peak hours the crowd is sparse and the method switches to the target detection model, so hardware resources can be utilized more efficiently.
The present embodiment further provides a target object analysis method, and fig. 2D is a schematic flow chart of another implementation of the target object analysis method according to the embodiment of the present application, as shown in fig. 2D, where the method includes the following steps:
step S241, determining the acquired shooting information of the video to be analyzed.
Step S242, if the scene to which the video to be analyzed belongs is included in a preset scene set, and/or the period to which the video to be analyzed belongs is within a preset period, scanning an i-th frame image of the M frame images by using a preset target detection model, and determining the physical characteristics of N target objects contained in the i-th frame image.
Here, the physical features may be feature points of the body of the target object, such as facial feature points, upper-body feature points, and lower-body feature points, from which the detection frame of the target object can be determined. i and N are integers greater than 0, and i is less than or equal to M. Step S242 may be understood as scanning the video to be analyzed frame by frame to determine the physical characteristics of each target object.
Step S243, if the physical characteristics of the j-th target object of the N target objects are different from the physical characteristics of the target objects in the other frame images except the i-th frame image, generating detection frames of the j target objects.
Here, j is an integer greater than 0 and equal to or less than N.
Steps S242 and S243 above provide a manner of implementing "if the scene to which the video to be analyzed belongs is included in a preset scene set, and/or the period to which the video to be analyzed belongs is within a preset period, switching the preset target counting model to the preset target detection model to process the video to be analyzed, so as to obtain a detection frame of each target object in the video to be analyzed". In this manner, the physical characteristics of each target object are obtained by scanning the decoded video frame by frame, so as to generate the detection frame of each target object.
Step S244, if the number of detection frames in the ith frame image is greater than a preset number threshold, switching the preset target detection model to the preset target count model.
Here, a first number of target objects contained in the i-th frame image is first determined according to the number of detection frames in the i-th frame image; then, if the first number is greater than a preset number threshold, the preset target detection model is switched to the preset target counting model. Step S244 may be understood as follows: if the scene to which the video to be analyzed belongs is included in a preset scene set and/or the period to which it belongs is within a preset period, the preset target counting model is switched to the preset target detection model to detect the video; when a large number of detection frames is detected in the i-th frame image, the method switches directly to the preset target counting model to count the target objects in the remaining multi-frame images. For example, although a video shows a subway scene at ten at night, the passenger flow is still large because it is a holiday; in this case, although the shooting information meets the preset condition and the video is first detected with the preset target detection model, once a large number of people is detected, the method switches to the preset target counting model, which suits scenes with many people.
Step S245, processing the first remaining video that is not processed by the preset target detection model in the video to be analyzed by using the preset target counting model, so as to obtain the total number of the target objects.
Here, when the preset target detection model detects a large number of people, the method switches to the preset target counting model. The remaining video is processed with the preset target counting model as follows: firstly, a foreground segmentation sub-graph of the first residual video and an object group density sub-graph of the video to be analyzed are determined; then, a second number of target objects contained in the first residual video is determined according to the foreground segmentation sub-graph and the object group density sub-graph; finally, the second number is determined as the total number of the target objects. In this way, the model used to count the target objects is switched in time according to the number of target objects in each frame image, so the task of analyzing the target object group is completed more accurately, flexibly, and efficiently. This addresses the fact that in actual security scenes the number and density of people change in real time, the flow of people differs across scenes and times, and the viewing angle of some scenes may change (for example, video shot by a high-speed dome camera), so a single analysis technique can hardly meet actual analysis requirements.
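A sketch of this switch (steps S244 and S245); the per-frame model calls are hypothetical stand-ins, and taking the residual video's peak estimate as the second number is an assumption for illustration:

```python
def detect_then_count(frames, detect_count, count_estimate, threshold):
    """detect_count(frame) -> number of detection frames in the frame;
    count_estimate(frame) -> count from the counting model's density map."""
    total = 0
    for i, frame in enumerate(frames):
        total = detect_count(frame)   # detection model on frame i
        if total > threshold:         # S244: too many detection frames
            # S245: count the first residual video with the counting model
            # and take that (second) number as the total.
            return max(count_estimate(f) for f in frames[i:])
    return total
```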
In this embodiment, if the preset target detection model detects that the number of target objects in a certain frame image is large, the method automatically switches to the target counting model to count the target objects in the remaining frame images, and the number of target objects finally counted by the target counting model is taken as the total number of target objects of the video to be analyzed. In this way, crowd analysis is realized more accurately, flexibly, and efficiently.
In other embodiments, after determining the total number of target objects, the method further comprises the following alerting process:
and generating an alarm event matched with the duration and the numerical range according to the numerical range of the total number of the target objects and the duration for which the number of target objects in the video to be analyzed remains at that total number.
Here, the alarm process may be understood as follows: if the scene of the video to be analyzed is an airport, the total number of target objects is in the hundreds of thousands, and those people stay at the airport for several hours, it may be determined that the crowd at the airport is too dense and that people are stranded; alarm information for an overcrowding event and alarm information for a crowd-retention event are then generated to prompt airport staff to evacuate the crowd.
In this embodiment, a corresponding alarm event is generated for the numerical range of the total number of target objects and for the duration that the count remains at that total, prompting the user to handle the alarm event accordingly, so that events such as overcrowding or crowd retention can be handled effectively.
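An illustrative sketch of the alarm step (the numeric ranges, durations, and event labels are hypothetical examples, not values from the patent):

```python
def match_alarm_events(total: int, duration_s: float) -> list:
    """Match the total count's numeric range and its duration to alarm events."""
    events = []
    if total > 100_000:        # example overcrowding range
        events.append("overcrowding alarm")
    if duration_s > 2 * 3600:  # example retention duration
        events.append("crowd-retention alarm")
    return events

print(match_alarm_events(total=150_000, duration_s=4 * 3600))
# ['overcrowding alarm', 'crowd-retention alarm']
```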
The present embodiment further provides a target object analysis method, and fig. 2E is a schematic flow chart of still another implementation of the target object analysis method according to the embodiment of the present application, as shown in fig. 2E, where the method includes the following steps:
step S251, determining the acquired shooting information of the video to be analyzed.
Step S252, if the scene to which the video to be analyzed belongs is not included in the preset scene set, and the period to which the video to be analyzed belongs is not within the preset period, continuing to perform edge detection on each frame of image in the multi-frame image by using the preset target counting model, and determining the area covered by the head of each target object in each frame of image.
Here, before step S252, a video decoder is first used to decode the video to be analyzed, so as to obtain continuous multi-frame images; in step S252, edge detection is performed for each frame in the multi-frame image to determine a pixel region occupied by the head of each target object.
Step S253, segmenting the target object and the background in each frame of image to obtain a foreground segmentation map of each frame of image.
Here, a foreground segmentation map of each frame image is obtained, so that the target object can be highlighted to facilitate statistics of the target object later.
Step S254, generating an object group density map for representing the density of the target object in each frame image according to the area covered by the head of each target object in each frame image.
Step S255, determining a third number of target objects included in the L-th frame image according to the foreground segmentation map of the L-th frame image in the multi-frame image and the object group density map of the target objects in the L-th frame image.
Here, L is an integer greater than 0; the step S255 may be understood as determining the number, i.e., the third number, of the target objects included in the L-th frame image in the multi-frame image.
Step S256, if the third number is smaller than the preset number threshold, the preset target counting model is switched to the preset target detection model.
Here, step S256 may be understood as follows: if the shooting information of the video does not meet the preset condition, the preset target counting model is used to count the target objects in the video; when fewer target objects are detected in the L-th frame image, the method switches directly to the preset target detection model to detect the remaining multi-frame images and determine the number of target objects therein. For example, although a video shows a subway scene shortly after eight in the morning, subway service is suspended that day, so the passenger flow is small; in this case, although the shooting information does not meet the preset condition and the video is first analyzed with the preset target counting model, once a small number of people is found, the method switches to the preset target detection model, which suits scenes with few people.
Step S257, processing the second remaining video that is not processed by the preset target counting model in the video to be analyzed by using the preset target detection model, to obtain the total number of the target objects.
Here, a sub-detection frame of each target object in the second residual video is first determined by using the preset target detection model; then, a fourth number of target objects contained in the second residual video is determined according to the number of sub-detection frames; finally, the fourth number is determined as the total number of target objects.
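The mirror of the earlier switching sketch, for steps S255 to S257 (same hypothetical stand-ins; taking the residual video's peak detection count as the fourth number is again an illustrative assumption):

```python
def count_then_detect(frames, count_estimate, detect_count, threshold):
    total = 0
    for i, frame in enumerate(frames):
        total = count_estimate(frame)  # S255: third number for frame L
        if total < threshold:          # S256: crowd has become sparse
            # S257: detect the second residual video and take that
            # (fourth) number as the total.
            return max(detect_count(f) for f in frames[i:])
    return total
```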
In this embodiment, if the preset target counting model finds that the number of target objects in a certain frame image is small, the method automatically switches to the target detection model to count the target objects in the remaining frame images, and the number of target objects finally counted by the target detection model is taken as the total number of target objects of the video to be analyzed. This effectively compensates for the crowd counting technique's inaccurate counts in video scenes with sparse crowds and large human targets, and it also reduces the energy consumption of the crowd analysis technique.
In the related art, the crowd density and crowd foreground images in video frames can be computed by means such as deep learning. From these, information such as the number of people and crowd retention in a monitored area can be analyzed; this information can guide and control the flow of people in the monitored area and assist in dispersing high-density crowds, which has great application value for public security in public places such as stations, squares, and shopping malls. However, if the human body regions in a video frame are too large, the output count is too high; similarity between the background color and the target color in a scene makes foreground segmentation inaccurate; and the camera angle of the video frame also affects the final output.
To address the problem that existing crowd counting techniques cannot flexibly, efficiently, and accurately complete crowd analysis tasks across multiple scenes and times, the embodiment of the present application provides a target object analysis method that adopts a more efficient and energy-saving analysis mode for different times in the same scene. For example, the target detection model is used instead of the target counting model when there are few people in the scene, because the target counting models adopted in the related art consume considerable hardware resources. In this case, the target detection model is a better complement: based on deep convolutional neural networks, it is widely applied in security monitoring scenes, where it achieves good detection results; it effectively compensates for the crowd counting technique's inaccurate counts in video scenes with sparse crowds and large human targets, and it also reduces the energy consumption of the crowd analysis technique.
The target object analysis method provided by the embodiment of the application combines the target counting model and the target detection model, and the models can be switched automatically or manually between scenes. For example, the method can switch to the target counting model in high-traffic, high-density scenes such as subways or squares, and switch to the human body detection engine in indoor or more spacious scenes, where targets are larger, occlusion between targets is less, and single-target detection works better, so the obtained number and density are more accurate. Alternatively, at different moments of the same scene, the analysis model is switched according to the number of people present: in a subway scene, for example, the crowd is dense during the morning and evening peaks, so the target counting model can be used, while during off-peak periods the crowd is sparse and the method switches to human body detection, making more efficient use of hardware resources.
Fig. 3 is a schematic flow chart of an implementation of a target object analysis method according to an embodiment of the present application; as shown in fig. 3, the method includes the following steps:
Step S301, determining, according to the initialization parameters of the video to be analyzed, the model engine to be adopted for crowd analysis of the video.
Here, the model engine indicates which model is used to analyze the crowd in the video and may refer to the identification of that model; for example, the engine identification of the target detection model is "1", and the engine identification of the target counting model is "0". The initialization parameters are the shooting information of the video, for example the scene to which the video belongs and the period in which it was captured. Step S301 may be understood as follows: engine initialization mainly creates the models required for crowd analysis and loads the corresponding deep learning models (i.e., the target detection model and the target counting model, which need to be loaded simultaneously), and it specifies the initial working mode of the engine. For instance, if the scene is known to have heavy traffic and a dense crowd, the target counting model can be specified; if the scene has few people and large single targets, the target detection model is set. Prior information about the number of people in the scene therefore needs to be acquired before analysis. During engine initialization, parameters such as the analysis area, event threshold, crowd threshold and head-foot annotation information of the video scene also need to be read; all parameters are read from a configuration file, so the engine's initialization parameters can be adjusted conveniently and flexibly.
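By way of a non-limiting illustration, the engine initialization described above can be sketched in Python as follows; the configuration keys, file format, and function names are hypothetical and are not specified by the embodiment:

```python
# Sketch only: config keys, paths and load_model are illustrative assumptions.
import json

DETECTION_ENGINE = "1"  # engine identification of the target detection model
COUNTING_ENGINE = "0"   # engine identification of the target counting model

def load_model(path):
    # Placeholder for loading a deep learning model; the embodiment does not
    # prescribe a particular framework or loading mechanism.
    ...

def init_engine(config_path):
    """Create both models and choose the initial working mode from a config file."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "detector": load_model(cfg["detection_model_path"]),  # target detection model
        "counter": load_model(cfg["counting_model_path"]),    # target counting model
        "mode": cfg.get("initial_mode", COUNTING_ENGINE),     # prior on scene density
        "analysis_area": cfg.get("analysis_area"),
        "event_threshold": cfg.get("event_threshold"),
        "crowd_threshold": cfg.get("crowd_threshold"),
        "head_foot_annotation": cfg.get("head_foot_annotation"),
    }
```

Keeping every parameter in a configuration file, as the embodiment suggests, lets the initial mode and thresholds be tuned per scene without code changes.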
Step S302, decoding the video to obtain continuous multi-frame images.
Here, step S302 can be understood as decoding, by a video decoder, the video stream data of an offline video or a network camera into continuous video frame data; the continuous multi-frame images can be represented in time sequence as F0, F1, F2, ..., Ft.
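As a minimal sketch of this decoding step, assuming OpenCV (cv2) as the video decoder (the embodiment does not name a specific decoder):

```python
# Sketch of step S302: decode an offline video file or a network camera stream
# into consecutive frames F0, F1, ..., Ft. The use of cv2 is an assumption.
import cv2

def decode_frames(source):
    cap = cv2.VideoCapture(source)  # file path, camera index, or RTSP URL
    try:
        while True:
            ok, frame = cap.read()
            if not ok:              # decoding completed (cf. steps S310/S311)
                break
            yield frame
    finally:
        cap.release()
```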
Step S303, inputting the multi-frame images sequentially into the model indicated by the model engine in step S301.
Here, if the model engine is "1", that is, the indicated model is the target detection model, the process proceeds to step S304; if the model engine is "0", that is, the indicated model is the target counting model, the process proceeds to step S305. If the target detection model is specified during initialization, the analysis pipeline mainly comprises modules for target detection, tracking, and extraction of human body structural information: the number of detection frames can be directly converted into the crowd count, and the tracking module can output the tracking identification and tracking frame of the same human body across several temporally consecutive video frames; whether human attribute information is extracted depends on the specific service requirements. If the target counting model is specified during initialization, the analysis pipeline mainly comprises crowd density detection and foreground segmentation, both of which can be completed by the same deep convolutional neural network.
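Continuing the sketch above, the dispatch of frames to the model indicated by the engine might look as follows; run_detection and run_counting stand in for the two analysis pipelines and are hypothetical:

```python
# Sketch of steps S303-S305: route each decoded frame to the engine's model.
# Note: `total` here simply reflects the latest frame's count; the embodiment's
# accumulation rule depends on which model is active.
def analyze(engine, frames, run_detection, run_counting):
    total = 0
    for frame in frames:
        if engine["mode"] == DETECTION_ENGINE:              # step S304
            boxes = run_detection(engine["detector"], frame)
            total = len(boxes)                              # one detection frame per person
        else:                                               # step S305
            fg_map, density_map = run_counting(engine["counter"], frame)
            total = int(density_map.sum())                  # integrate the density map
        maybe_switch(engine, total)                         # see the policy sketch below
    return total
```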
Step S304, detecting each frame of the input multi-frame images by adopting the preset target detection model to obtain a detection frame for each person.
Here, step S304 may be understood as determining the person images in each frame by any person detection method, for example detecting the face or the body region of each person in each video frame, so as to generate a detection frame for each person; after step S304, the process proceeds to step S306. If switching by a number-of-people threshold is configured, the current count is compared with the threshold after step S304 completes: the target detection model continues to be used below the threshold, and the model switches to the target counting model above it. The rationale is that when the number of people in a scene is below a certain threshold, the target detection model is the more accurate choice. If a time-period switching strategy is configured instead, the target counting model is used during peak-crowd periods (such as the morning and evening peaks of a subway station) and the target detection model is used during off-peak periods. The strategy can thus be set flexibly for the actual scene, striking a good balance between analysis quality and energy consumption.
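The two switching strategies just described can be sketched as a small policy function; the threshold values and peak hours here are illustrative assumptions read from the engine configuration, not values fixed by the embodiment:

```python
# Sketch of the number-threshold and time-period switching strategies.
from datetime import datetime

def maybe_switch(engine, current_count, now=None):
    if engine.get("use_time_policy"):
        # Time-period strategy: counting model during peak hours, detection otherwise.
        hour = (now or datetime.now()).hour
        peak = hour in (7, 8, 9, 17, 18, 19)   # assumed morning/evening peaks
        engine["mode"] = COUNTING_ENGINE if peak else DETECTION_ENGINE
    else:
        # Number-threshold strategy: detection below the threshold, counting above.
        if current_count > engine["crowd_threshold"]:
            engine["mode"] = COUNTING_ENGINE   # dense crowd: counting is more robust
        else:
            engine["mode"] = DETECTION_ENGINE  # sparse crowd: detection is more accurate
```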
Step S305, detecting each frame of the input multi-frame images by adopting the preset target counting model to obtain a foreground segmentation map and an object density map of the video.
Here, the foreground segmentation map is used to separate the crowd in the video from the background, and the object density map is used to indicate the density of the crowd in the video.
Step S306, determining the number of people contained in the current frame image.
Here, if the target detection model is adopted, the number of people contained in the current frame image is determined according to the number of detection frames; if the target counting model is adopted, the number of people contained in the frame image is determined according to the foreground segmentation map and the object density map of that frame.
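A compact sketch of this step, under the common (assumed) convention that the density map is calibrated so that its integral equals the head count:

```python
# Sketch of step S306: count from detection frames, or from the density map.
import numpy as np

def count_people(boxes=None, density_map=None):
    if boxes is not None:
        return len(boxes)                           # one detection frame per person
    return int(round(float(np.sum(density_map))))   # integral of the object density map
```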
Step S307, if the number of people contained in the current frame image determined by the target detection model is greater than the preset number threshold, the target detection model is switched to the target counting model.
Here, conversely, if the number of people contained in the current frame image determined by the target counting model is smaller than the preset number threshold, the target counting model is switched to the target detection model; in this way, when the video contains few people, the target detection model is used to detect the person images, so the obtained count is more accurate.
Step S308, processing the first remaining video in the video to be analyzed that is not processed by the preset target detection model, by using the preset target counting model, to obtain the total number of the target objects.
Step S309, generating an alarm event matching the duration and the numerical range according to the numerical range to which the total number belongs and the duration for which the number of people in the video remains at that total, and outputting the alarm event.
Here, for example, the scene to which the video belongs is a subway entrance and the final count is tens of thousands of people; if the count stays at that level for an hour, the crowd at the subway entrance is too dense, so an over-density event alarm is generated, and because the crowd stays at the entrance for a long time, a retention event alarm is generated as well. The alarm events are output to prompt the relevant personnel to carry out crowd diversion and similar work in response.
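A hedged sketch of this alarm step follows; the event names and the range/duration thresholds are illustrative assumptions, not values fixed by the embodiment:

```python
# Sketch of step S309: raise alarms from the count's numerical range and duration.
def generate_alarms(total_count, duration_seconds, engine):
    events = []
    if total_count > engine["crowd_threshold"]:
        events.append("over-density alarm")           # crowd too dense for the area
        if duration_seconds > engine["event_threshold"]:
            events.append("retention alarm")          # dense crowd lingering too long
    return events                                     # output to prompt crowd diversion
```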
Step S310, analyzing whether the decoding of the video is completed.
Here, if the decoding is completed, step S311 is entered; if the decoding is not completed, the process advances to step S302.
Step S311, if the decoding is completed, the target object analysis is ended.
In this embodiment, the model to be adopted for the video is determined by analyzing the shooting scene of the video, which solves the problem that crowd counting alone is insufficiently accurate across different scenes; combined with human body detection technology, crowd analysis tasks can be completed in more types of scenes, giving full play to the technical advantages of both.
An embodiment of the present application provides a target object analysis device. Fig. 4 is a schematic diagram of the composition structure of the target object analysis device according to the embodiment of the present application; as shown in fig. 4, the device 400 includes: a first acquisition module 401 and a first determination module 402, wherein:
the first acquisition module 401 is configured to determine shooting information of the acquired video to be analyzed; wherein the video to be analyzed contains at least one image of a target object;
the first determining module 402 is configured to determine, according to the capturing information, a preset model for processing the video to be analyzed, so as to determine a total number of target objects included in the video to be analyzed; the preset model is a preset target detection model or a preset target counting model, and the shooting information of the video to be analyzed processed by the preset target detection model is different from the shooting information of the video to be analyzed processed by the preset target counting model.
In the above device, the shooting information of the video to be analyzed includes: and the shooting scene to which the video to be analyzed belongs and/or the shooting period of the video to be analyzed.
In the above apparatus, the apparatus further includes:
the first initialization module is used for adopting the preset target detection model or the preset target counting model as an initialization model;
correspondingly, the first determining module includes: and the first adjusting sub-module is used for adjusting a preset model for processing the video to be analyzed according to the shooting information.
In the above apparatus, the first adjustment submodule includes:
the first judging unit is used for adjusting the initialization model to the preset target detection model if the scene to which the video to be analyzed belongs is contained in a preset scene set and/or the period to which the video to be analyzed belongs is within a preset period;
the first processing unit is used for processing the video to be analyzed by adopting the preset target detection model;
and the second judging unit is used for adjusting the initialization model to the preset target counting model if the scene to which the video to be analyzed belongs is not included in a preset scene set and the period to which the video to be analyzed belongs is not within a preset period.
In the above apparatus, the apparatus further includes:
and the first decoding module is used for decoding the video to be analyzed by adopting a video decoder to obtain continuous multi-frame images.
In the above apparatus, when determining that the video to be analyzed is processed by using the preset target detection model, the first determining module includes:
the first scanning sub-module is used for scanning each frame of image of the multi-frame image by adopting the preset target detection model to obtain the physical characteristics of each target object;
the first generation sub-module is used for generating a detection frame of each target object according to the physical characteristics of each target object;
and the first determining submodule is used for determining the total number of target objects contained in the video to be analyzed according to the number of the detection frames.
In the above apparatus, the first scanning sub-module includes:
the first scanning unit is used for scanning each frame of image according to a preset step length by adopting a preset target detection model and determining the physical characteristics of each target object appearing in each frame of image.
In the above apparatus, the multi-frame image includes M frame images, M is an integer greater than or equal to 2, and the first generating sub-module includes:
The second scanning unit is used for scanning an ith frame image of the M frame images by using a preset target detection model and determining physical characteristics of N target objects contained in the ith frame image; wherein i and N are integers greater than 0, and i is less than or equal to M;
a first generating unit configured to generate detection frames of j target objects if body features of the j target objects of the N target objects are different from body features of target objects in other frame images than the i-th frame image; wherein j is an integer greater than 0 and less than or equal to N.
In the above apparatus, when determining that the video to be analyzed is processed by using the preset target counting model, the first determining module includes:
the second judging sub-module is used for processing the video to be analyzed by utilizing a preset target counting model to obtain a foreground segmentation map of the video to be analyzed and an object group density map of the video to be analyzed;
and the second determining submodule is used for determining the total number of target objects contained in the video to be analyzed according to the foreground segmentation map and the target object group density map.
In the above apparatus, the second judging submodule includes:
The first detection unit is used for carrying out edge detection on each frame of image in the multi-frame images of the video to be analyzed by utilizing a preset target counting model, and determining the area covered by the head of each target object in each frame of image;
the first segmentation unit is used for segmenting the target object and the background in each frame of image to obtain a foreground segmentation map of each frame of image;
and the second generation unit is used for generating an object group density map for representing the density of the target object in each frame image according to the area covered by the head of each target object in each frame image.
In the above apparatus, the first determining submodule includes:
a first switching unit, configured to switch the preset target detection model to the preset target counting model if the number of detection frames in the i-th frame image of the video to be analyzed is greater than a preset number threshold;
and the second processing unit is used for processing the first residual video which is not processed by the preset target detection model in the video to be analyzed by utilizing the preset target counting model to obtain the total number of the target objects.
In the above device, the second processing unit comprises:
The first determining subunit is used for determining a foreground segmentation sub-graph of the first residual video and an object group density sub-graph of the video to be analyzed by utilizing the preset target counting model;
a second determining subunit, configured to determine, according to the foreground segmentation sub-graph and the object group density sub-graph, a second number of target objects included in the first residual video;
and a third determining subunit configured to determine the second number as a total number of the target objects.
In the above apparatus, the second determining submodule includes:
a first determining unit, configured to determine a third number of target objects included in an L-th frame image according to a foreground segmentation map of the L-th frame image in the multi-frame image and an object group density map of the target objects in the L-th frame image; wherein L is an integer greater than 0;
the second switching unit is used for switching the preset target counting model into the preset target detection model if the third number is smaller than a preset number threshold;
and the second processing unit is used for processing second residual videos which are not processed by the preset target counting model in the videos to be analyzed by utilizing the preset target detection model to obtain the total number of the target objects.
In the above apparatus, the second processing unit includes:
a fourth determining subunit, configured to determine a sub-detection frame of each target object in the second residual video by using the preset target detection model;
a fifth determining subunit, configured to determine, according to the number of sub-detection frames, a fourth number of target objects included in the second remaining video;
a sixth determining subunit configured to determine the fourth number as a total number of the target objects.
In the above apparatus, the apparatus further includes:
a first alarm module, configured to generate an alarm event matching the duration and the numerical range according to the numerical range to which the total number of the target objects belongs and the duration for which the number of target objects in the video to be analyzed remains at the total number.
It should be noted that the description of the above device embodiments is similar to the description of the method embodiments, with advantageous effects similar to those of the method embodiments. For technical details not disclosed in the device embodiments of the present application, please refer to the description of the method embodiments of the present application.
It should be noted that, in the embodiment of the present application, if the above-mentioned target object analysis method is implemented in the form of software function modules and sold or used as a separate product, it may also be stored in a computer readable storage medium. Based on this understanding, the technical solution of the embodiments of the present application may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a terminal, a server, etc.) to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
Accordingly, an embodiment of the present application provides a computer storage medium, where computer executable instructions are stored, where the computer executable instructions, after being executed, can implement steps in the target object analysis method provided by the embodiment of the present application.
The embodiment of the application provides a computer device, which comprises a memory and a processor, wherein the memory stores computer executable instructions, and the processor can realize the steps in the target object analysis method provided by the embodiment of the application when running the computer executable instructions on the memory.
The description of the computer device and storage medium embodiments above is similar to that of the method embodiments, with similar beneficial effects. For technical details not disclosed in the computer device and storage medium embodiments of the present application, please refer to the description of the method embodiments of the present application.
Fig. 5 is a schematic diagram of the composition structure of a computer device according to an embodiment of the present application; as shown in fig. 5, the hardware entities of the computer device 500 include: a processor 501, a communication interface 502 and a memory 503, wherein:
The processor 501 generally controls the overall operation of the computer device 500.
The communication interface 502 may enable the computer device to communicate with other terminals or servers over a network.
The memory 503 is configured to store instructions and applications executable by the processor 501, and may also cache data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or processed by the respective modules in the processor 501 and the computer device 500, and may be implemented by a FLASH memory (FLASH) or a random access memory (Random Access Memory, RAM).
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read Only Memory (ROM), a magnetic disk or an optical disk, or the like, which can store program codes.
Alternatively, the above-described integrated units of the present application may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in essence or in a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer or a server, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (28)
1. A method of target object analysis, the method comprising:
Determining shooting information of the acquired video to be analyzed; wherein the video to be analyzed contains at least one image of a target object;
determining a preset model for processing the video to be analyzed according to the shooting information so as to determine the total number of target objects contained in the video to be analyzed; the preset model is a preset target detection model or a preset target counting model, and the shooting information of the video to be analyzed processed by the preset target detection model is different from the shooting information of the video to be analyzed processed by the preset target counting model;
determining that the video to be analyzed is processed by adopting the preset target detection model so as to determine the total number of target objects contained in the video to be analyzed, wherein the method comprises the following steps:
scanning each frame of image of the multi-frame images by adopting the preset target detection model to obtain the physical characteristics of each target object;
generating a detection frame of each target object according to the physical characteristics of each target object;
if the number of the detection frames in the ith frame image of the video to be analyzed is greater than a preset number threshold, switching the preset target detection model into the preset target counting model;
And processing the first residual video which is not processed by the preset target detection model in the video to be analyzed by using the preset target counting model to obtain the total number of the target objects.
2. The method according to claim 1, wherein the capturing information of the video to be analyzed includes: and the shooting scene to which the video to be analyzed belongs and/or the shooting period of the video to be analyzed.
3. The method according to claim 1 or 2, characterized in that before said determining a preset model for processing said video to be analyzed from said shooting information, the method further comprises:
adopting the preset target detection model or the preset target counting model as an initialization model;
correspondingly, according to the shooting information, a preset model for processing the video to be analyzed is adjusted.
4. A method according to claim 3, wherein said adjusting a preset model for processing said video to be analyzed based on said photographing information comprises:
if the scene to which the video to be analyzed belongs is contained in a preset scene set and/or the period to which the video to be analyzed belongs is within a preset period, the initialization model is adjusted to be the preset target detection model;
Processing the video to be analyzed by adopting the preset target detection model;
and if the scene to which the video to be analyzed belongs is not contained in a preset scene set and the period to which the video to be analyzed belongs is not within a preset period, adjusting the initialization model into the preset target counting model.
5. The method according to any one of claims 1 to 4, wherein before determining a preset model for processing the video to be analyzed from the photographing information, the method further comprises:
and decoding the video to be analyzed by adopting a video decoder to obtain continuous multi-frame images.
6. The method according to claim 1, wherein the scanning each frame of the multi-frame image with the preset object detection model to obtain the physical feature of each object includes:
and scanning each frame of image by adopting a preset target detection model according to a preset step length, and determining the physical characteristics of each target object appearing in each frame of image.
7. The method of claim 6, wherein the multi-frame image comprises M-frame images, M being an integer greater than or equal to 2, and wherein generating the detection frame for each target object based on the physical characteristics of each target object comprises:
Scanning an ith frame image of the M frame images by using a preset target detection model, and determining physical characteristics of N target objects contained in the ith frame image; wherein i and N are integers greater than 0, and i is less than or equal to M;
if the physical characteristics of the j-th target object of the N target objects are different from the physical characteristics of the target objects in other frame images except the i-th frame image, generating detection frames of the j target objects; wherein j is an integer greater than 0 and less than or equal to N.
8. The method according to any one of claims 1 to 5, wherein determining to process the video to be analyzed using the preset target counting model to determine the total number of target objects contained in the video to be analyzed comprises:
processing the video to be analyzed by using a preset target counting model to obtain a foreground segmentation map of the video to be analyzed and an object group density map of the video to be analyzed;
and determining the total number of target objects contained in the video to be analyzed according to the foreground segmentation map and the target object group density map.
9. The method according to claim 8, wherein the processing the video to be analyzed using the preset target counting model to obtain the foreground segmentation map of the video to be analyzed and the object group density map of the video to be analyzed includes:
Performing edge detection on each frame of image in the multi-frame images of the video to be analyzed by using a preset target counting model, and determining an area covered by the head of each target object in each frame of image;
dividing the target object and the background in each frame of image to obtain a foreground division map of each frame of image;
and generating an object group density map for representing the density of the target object in each frame image according to the area covered by the head of each target object in each frame image.
10. The method according to claim 1, wherein the processing, by using the preset target counting model, the first remaining video in the video to be analyzed that is not processed by the preset target detection model, to obtain the total number of the target objects, includes:
Determining a foreground segmentation sub-graph of the first residual video and an object group density sub-graph of the video to be analyzed by utilizing the preset target counting model;
determining a second number of target objects contained in the first residual video according to the foreground segmentation sub-graph and the object group density sub-graph;
the second number is determined as the total number of the target objects.
11. The method according to claim 8 or 9, wherein said determining the total number of target objects contained in the video to be analyzed from the foreground segmentation map and the target object group density map comprises:
determining a third number of target objects contained in the L-th frame image according to a foreground segmentation map of the L-th frame image in the multi-frame image of the video to be analyzed and an object group density map of the target objects in the L-th frame image; wherein L is an integer greater than 0;
if the third number is smaller than a preset number threshold, switching the preset target counting model to the preset target detection model;
and processing second residual videos which are not processed by the preset target counting model in the videos to be analyzed by using the preset target detection model to obtain the total number of the target objects.
12. The method according to claim 11, wherein the processing, by using the preset target detection model, the second remaining video in the video to be analyzed that is not processed by the preset target counting model, to obtain the total number of the target objects, includes:
Determining a sub-detection frame of each target object in the second residual video by using the preset target detection model;
determining a fourth number of target objects contained in the second residual video according to the number of the sub-detection frames;
the fourth number is determined as the total number of the target objects.
13. The method according to claim 1, wherein the method further comprises:
and generating an alarm event matching the duration and the numerical range according to the numerical range to which the total number of the target objects belongs and the duration for which the number of target objects in the video to be analyzed remains at the total number.
14. A target object analysis device, the device comprising: the device comprises a first acquisition module and a first determination module, wherein:
the first acquisition module is used for determining shooting information of the acquired video to be analyzed; wherein the video to be analyzed contains at least one image of a target object;
the first determining module is used for determining a preset model for processing the video to be analyzed according to the shooting information so as to determine the total number of target objects contained in the video to be analyzed; the preset model is a preset target detection model or a preset target counting model, and the shooting information of the video to be analyzed processed by the preset target detection model is different from the shooting information of the video to be analyzed processed by the preset target counting model;
When the video to be analyzed is determined to be processed by adopting the preset target detection model, the first determining module comprises:
the first scanning sub-module is used for scanning each frame of image of the multi-frame image by adopting the preset target detection model to obtain the physical characteristics of each target object;
the first generation sub-module is used for generating a detection frame of each target object according to the physical characteristics of each target object;
a first switching unit, configured to switch the preset target detection model to the preset target counting model if the number of detection frames in the i-th frame image of the video to be analyzed is greater than a preset number threshold;
and the second processing unit is used for processing the first residual video which is not processed by the preset target detection model in the video to be analyzed by utilizing the preset target counting model to obtain the total number of the target objects.
15. The apparatus of claim 14, wherein the capturing information of the video to be analyzed comprises: and the shooting scene to which the video to be analyzed belongs and/or the shooting period of the video to be analyzed.
16. The apparatus according to claim 14 or 15, characterized in that the apparatus further comprises:
the first initialization module is used for adopting the preset target detection model or the preset target counting model as an initialization model;
correspondingly, the first determining module includes: and the first adjusting sub-module is used for adjusting a preset model for processing the video to be analyzed according to the shooting information.
17. The apparatus of claim 16, wherein the first adjustment sub-module comprises:
the first judging unit is used for adjusting the initialization model to the preset target detection model if the scene to which the video to be analyzed belongs is contained in a preset scene set and/or the period to which the video to be analyzed belongs is within a preset period;
the first processing unit is used for processing the video to be analyzed by adopting the preset target detection model;
and the second judging unit is used for adjusting the initialization model to the preset target counting model if the scene to which the video to be analyzed belongs is not included in a preset scene set and the period to which the video to be analyzed belongs is not within a preset period.
18. The apparatus according to any one of claims 14 to 17, further comprising:
and the first decoding module is used for decoding the video to be analyzed by adopting a video decoder to obtain continuous multi-frame images.
19. The apparatus of claim 14, wherein the first scanning sub-module comprises:
the first scanning unit is used for scanning each frame of image according to a preset step length by adopting a preset target detection model and determining the physical characteristics of each target object appearing in each frame of image.
20. The apparatus of claim 19, wherein the multi-frame image comprises an M-frame image, M being an integer greater than or equal to 2, the first generation sub-module comprising:
the second scanning unit is used for scanning an ith frame image of the M frame images by using a preset target detection model and determining physical characteristics of N target objects contained in the ith frame image; wherein i and N are integers greater than 0, and i is less than or equal to M;
a first generating unit configured to generate detection frames of j target objects if body features of the j target objects of the N target objects are different from body features of target objects in other frame images than the i-th frame image; wherein j is an integer greater than 0 and less than or equal to N.
21. The apparatus according to any one of claims 14 to 18, wherein when determining to process the video to be analyzed using the preset object count model, the first determining module includes:
the second judging sub-module is used for processing the video to be analyzed by utilizing a preset target counting model to obtain a foreground segmentation map of the video to be analyzed and an object group density map of the video to be analyzed;
and the second determining submodule is used for determining the total number of target objects contained in the video to be analyzed according to the foreground segmentation map and the target object group density map.
22. The apparatus of claim 21, wherein the second determination submodule comprises:
the first detection unit is used for carrying out edge detection on each frame of image in the multi-frame images of the video to be analyzed by utilizing a preset target counting model, and determining the area covered by the head of each target object in each frame of image;
the first segmentation unit is used for segmenting the target object and the background in each frame of image to obtain a foreground segmentation map of each frame of image;
and the second generation unit is used for generating an object group density map for representing the density of the target object in each frame image according to the area covered by the head of each target object in each frame image.
23. The apparatus of claim 14, wherein the second processing unit comprises:
The first determining subunit is used for determining a foreground segmentation sub-graph of the first residual video and an object group density sub-graph of the video to be analyzed by utilizing the preset target counting model;
a second determining subunit, configured to determine, according to the foreground segmentation sub-graph and the object group density sub-graph, a second number of target objects included in the first residual video;
and a third determining subunit configured to determine the second number as a total number of the target objects.
24. The apparatus of claim 21 or 22, wherein the second determination submodule comprises:
a first determining unit, configured to determine a third number of target objects included in an L-th frame image according to a foreground segmentation map of the L-th frame image in the multi-frame image of the video to be analyzed and an object group density map of the target objects in the L-th frame image; wherein L is an integer greater than 0;
the second switching unit is used for switching the preset target counting model into the preset target detection model if the third number is smaller than a preset number threshold;
And the second processing unit is used for processing second residual videos which are not processed by the preset target counting model in the videos to be analyzed by utilizing the preset target detection model to obtain the total number of the target objects.
25. The apparatus of claim 24, wherein the second processing unit comprises:
a fourth determining subunit, configured to determine a sub-detection frame of each target object in the second residual video by using the preset target detection model;
a fifth determining subunit, configured to determine, according to the number of sub-detection frames, a fourth number of target objects included in the second remaining video;
a sixth determining subunit configured to determine the fourth number as a total number of the target objects.
26. The apparatus of claim 14, wherein the apparatus further comprises:
and the first alarm module is used for generating an alarm event matching the duration and the numerical range according to the numerical range to which the total number of the target objects belongs and the duration for which the number of target objects in the video to be analyzed remains at the total number.
27. A computer storage medium having stored thereon computer executable instructions which, when executed, are capable of carrying out the method steps of any one of claims 1 to 13.
28. A computer device comprising a memory having stored thereon computer executable instructions and a processor which when executed performs the method steps of any of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910130040.0A CN109815936B (en) | 2019-02-21 | 2019-02-21 | Target object analysis method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910130040.0A CN109815936B (en) | 2019-02-21 | 2019-02-21 | Target object analysis method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109815936A CN109815936A (en) | 2019-05-28 |
CN109815936B true CN109815936B (en) | 2023-08-22 |
Family
ID=66607122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910130040.0A Active CN109815936B (en) | 2019-02-21 | 2019-02-21 | Target object analysis method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109815936B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110838134B (en) * | 2019-10-10 | 2020-09-29 | 北京海益同展信息科技有限公司 | Target object statistical method and device, computer equipment and storage medium |
CN111338669B (en) * | 2020-02-17 | 2023-10-24 | 深圳英飞拓仁用信息有限公司 | Method and device for updating intelligent function in intelligent analysis box |
CN111383455A (en) * | 2020-03-11 | 2020-07-07 | 上海眼控科技股份有限公司 | Traffic intersection object flow statistical method, device, computer equipment and medium |
CN111445500B (en) * | 2020-04-02 | 2023-06-27 | 中国科学院深圳先进技术研究院 | Analysis method, device, equipment and storage medium for experimental living body behaviors |
CN111582242B (en) * | 2020-06-05 | 2024-03-26 | 上海商汤智能科技有限公司 | Retention detection method, device, electronic equipment and storage medium |
CN111851341A (en) * | 2020-06-29 | 2020-10-30 | 广东荣文科技集团有限公司 | Congestion early warning method, intelligent indicator and related products |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5953055A (en) * | 1996-08-08 | 1999-09-14 | Ncr Corporation | System and method for detecting and analyzing a queue |
CN103839308A (en) * | 2012-11-26 | 2014-06-04 | 中兴通讯股份有限公司 | Population obtaining method, device and system |
CN105160313A (en) * | 2014-09-15 | 2015-12-16 | 中国科学院重庆绿色智能技术研究院 | Method and apparatus for crowd behavior analysis in video monitoring |
CN105205900A (en) * | 2015-10-23 | 2015-12-30 | 华录智达科技有限公司 | Dynamic self-adaptive public transport passenger flow statistic device based on video recognition |
CN105302315A (en) * | 2015-11-20 | 2016-02-03 | 小米科技有限责任公司 | Image processing method and device |
CN105787853A (en) * | 2016-04-14 | 2016-07-20 | 北京中电万联科技股份有限公司 | Public area congestion and stampede emergency early-warning system |
CN105809178A (en) * | 2014-12-31 | 2016-07-27 | 中国科学院深圳先进技术研究院 | Population analyzing method based on human face attribute and device |
CN107491715A (en) * | 2016-06-13 | 2017-12-19 | 北京文安智能技术股份有限公司 | A kind of subway carriage passenger flow statistical method, apparatus and system based on video analysis |
CN107624189A (en) * | 2015-05-18 | 2018-01-23 | 北京市商汤科技开发有限公司 | Method and apparatus for generating forecast model |
CN108009477A (en) * | 2017-11-10 | 2018-05-08 | 东软集团股份有限公司 | Stream of people's quantity detection method, device, storage medium and the electronic equipment of image |
CN108280427A (en) * | 2018-01-24 | 2018-07-13 | 成都鼎智汇科技有限公司 | A kind of big data processing method based on flow of the people |
CN108881821A (en) * | 2018-05-28 | 2018-11-23 | 西南交通大学 | It is a kind of to utilize the self-powered subway carriage degree of crowding display system of subway station wind-force and its method |
CN108985256A (en) * | 2018-08-01 | 2018-12-11 | 曜科智能科技(上海)有限公司 | Based on the multiple neural network demographic method of scene Density Distribution, system, medium, terminal |
CN109034036A (en) * | 2018-07-19 | 2018-12-18 | 青岛伴星智能科技有限公司 | A kind of video analysis method, Method of Teaching Quality Evaluation and system, computer readable storage medium |
CN109063667A (en) * | 2018-08-14 | 2018-12-21 | 视云融聚(广州)科技有限公司 | A kind of video identification method optimizing and method for pushing based on scene |
CN109241871A (en) * | 2018-08-16 | 2019-01-18 | 北京此时此地信息科技有限公司 | A kind of public domain stream of people's tracking based on video data |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006105655A1 (en) * | 2005-04-06 | 2006-10-12 | March Networks Corporation | Method and system for counting moving objects in a digital video stream |
US8520979B2 (en) * | 2008-08-19 | 2013-08-27 | Digimarc Corporation | Methods and systems for content processing |
2019-02-21: application CN201910130040.0A filed in China (CN); granted as CN109815936B, legal status Active.
Also Published As
Publication number | Publication date |
---|---|
CN109815936A (en) | 2019-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109815936B (en) | Target object analysis method and device, computer equipment and storage medium | |
CN110119673B (en) | Non-inductive face attendance checking method, device, equipment and storage medium | |
CN109740516B (en) | User identification method and device, electronic equipment and storage medium | |
EP2864930B1 (en) | Self learning face recognition using depth based tracking for database generation and update | |
CN111444848A (en) | Specific scene model upgrading method and system based on federal learning | |
Ezaki et al. | Improved text-detection methods for a camera-based text reading system for blind persons | |
CN110110593A (en) | Face Work attendance method, device, equipment and storage medium based on self study | |
CN109409238B (en) | Obstacle detection method and device and terminal equipment | |
CN105678347A (en) | Pedestrian detection method and device | |
CN102609724B (en) | Method for prompting ambient environment information by using two cameras | |
CN103098078A (en) | Smile detection systems and methods | |
CN107483894A (en) | Judge to realize the high ferro station video monitoring system of passenger transportation management based on scene | |
WO2016201683A1 (en) | Cloud platform with multi camera synchronization | |
CN109902681B (en) | User group relation determining method, device, equipment and storage medium | |
WO2021022698A1 (en) | Following detection method and apparatus, and electronic device and storage medium | |
CN110572570A (en) | intelligent recognition shooting method and system for multi-person scene and storage medium | |
Heng et al. | How to assess the quality of compressed surveillance videos using face recognition | |
CN112183431A (en) | Real-time pedestrian number statistical method and device, camera and server | |
CN113920585A (en) | Behavior recognition method and device, equipment and storage medium | |
CN117854160A (en) | Human face living body detection method and system based on artificial multi-mode and fine-granularity patches | |
CN111652152A (en) | Crowd density detection method and device, computer equipment and storage medium | |
CN114445766A (en) | People flow detection management method and device and robot | |
He et al. | MTRFN: Multiscale temporal receptive field network for compressed video action recognition at edge servers | |
CN115512263A (en) | Dynamic visual monitoring method and device for falling object | |
CN114445765A (en) | Crowd counting and density estimating method based on coding and decoding structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||