WO2022205632A1 - Target detection method and apparatus, device and storage medium - Google Patents

Target detection method and apparatus, device and storage medium

Info

Publication number
WO2022205632A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
image
target
information
position change
Prior art date
Application number
PCT/CN2021/102202
Other languages
French (fr)
Chinese (zh)
Inventor
韩志伟
刘诗男
杨昆霖
侯军
伊帅
Original Assignee
北京市商汤科技开发有限公司 (Beijing SenseTime Technology Development Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司 (Beijing SenseTime Technology Development Co., Ltd.)
Publication of WO2022205632A1

Classifications

    • G06V 20/52: Scenes; scene-specific elements; context or environment of the image; surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06F 18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T 7/73: Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/44: Extraction of image or video features; local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06T 2207/10016: Image acquisition modality; video; image sequence
    • G06T 2207/20081: Special algorithmic details; training; learning
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/30232: Subject of image; surveillance

Definitions

  • the present disclosure relates to the technical field of image processing, and in particular, to a target detection method, apparatus, device, and storage medium.
  • the present disclosure provides a target detection method and apparatus, device and storage medium to solve the deficiencies in the related art.
  • a target detection method, including: acquiring position change information of at least one pixel in a first image relative to a corresponding pixel in a previous frame of image, where the first image is a frame in a video to be detected; acquiring an image feature of the first image as a first feature; acquiring a second feature based on the position change information; performing enhancement processing on the first feature based on the second feature to generate a fusion feature; and determining a detection result of a target object in the first image according to the fusion feature.
  • a target detection apparatus, including: a first acquisition module configured to acquire position change information of at least one pixel in a first image relative to a corresponding pixel in a previous frame of image, the first image being a frame in a video to be detected; a second acquisition module configured to acquire an image feature of the first image as a first feature and to acquire a second feature based on the position change information; a fusion module configured to perform enhancement processing on the first feature based on the second feature to generate a fusion feature; and a detection module configured to determine a detection result of a target object in the first image according to the fusion feature.
  • an electronic device including a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method of the first aspect when executing the computer instructions.
  • a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method of the first aspect.
  • FIG. 1 is a flowchart of a target detection method shown in an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a first image and a previous frame image thereof shown in an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of position change information of a first image shown in an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a process of target detection shown in an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of a target detection apparatus shown in an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • although the terms first, second, third, etc. may be used in this disclosure to describe various kinds of information, such information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another.
  • for example, first information may also be referred to as second information and, similarly, second information as first information, without departing from the scope of the present disclosure.
  • depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when" or "in response to determining".
  • FIG. 1 shows the flow of the method, which includes steps S101 to S104.
  • the object to be detected targeted by the target detection method may be an image or a video.
  • each frame of the video can be processed in batches, or each frame of the video can be processed sequentially.
  • for ease of description, this embodiment takes a certain frame of the video as the object to be detected.
  • the purpose of target detection is to detect the target object in the object to be detected to obtain the detection result, and the detection result can represent one or more aspects of the information of the target object (for example, the position, number, density and other information of the target object).
  • step S101 the position change information of at least one pixel in the first image relative to the corresponding pixel in the previous frame of image is acquired, and the first image is a frame of image in the video to be detected.
  • the first image is a frame of image in the video to be detected.
  • at least one pixel in the first image corresponds to the same object as the corresponding pixel in the previous frame of image.
  • the video to be detected may be a video recorded for a specific space, and the space may contain the target object and other objects at the same time.
  • the first image and its previous frame may be as shown in FIG. 2. The first image may be any frame from the second frame of the video to be detected onwards (the second frame included), because the first frame may have no previous frame.
  • the video to be detected may be a surveillance video or a drone video, that is, the video to be detected may be a video captured by a fixed surveillance camera, or a video captured by a flying drone.
  • the to-be-detected video to which the first image shown in FIG. 2 and its previous frame image belong is a street view video captured by a drone.
  • image regions containing target objects such as crowds are often large in surveillance video, so detection tasks such as counting people are relatively simple; in drone video, however, the regions containing target objects such as people are often very small, and detection by manual observation is prone to errors. Such errors can be avoided by using the detection method provided in this embodiment.
  • the target object may be at least one of the following: a person, a vehicle, and an animal.
  • the position change between corresponding pixels of the two frames that belong to the same object may be caused by the actual movement of the object in the space captured by the video to be detected, by the movement of the video capture device (for example, a drone), or by a combination of the two.
  • since the position change information represents the position change of corresponding pixels across the two frames, and each corresponding object in the two frames consists of several contiguous pixels, the position change information of all pixels of the same object can be the same.
  • the position change information of the pixel point in the first image shown in FIG. 2 relative to the corresponding pixel point in the previous frame image is shown in FIG. 3 .
  • a pre-trained neural network may be used to obtain position change information.
  • when training the neural network, a large number of video frames can be collected as samples, with the position change information of the corresponding pixels in these frames used as labels. The samples are fed into the neural network to be trained, the output position change information (predicted value) is compared with the labelled position change information (ground truth) to obtain a network loss value, and the network parameters are adjusted according to the loss value. After repeated iterations and continuous optimization, a trained neural network that meets the accuracy requirements is obtained, as sketched below.
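  • purely as a non-limiting illustration of the training loop described above, a minimal PyTorch-style sketch follows; the network, the data loader and the hyper-parameters are placeholders assumed for the example and are not specified by the present disclosure.

```python
import torch
import torch.nn as nn

def train_position_change_net(model, data_loader, num_epochs=10, lr=1e-4, device="cuda"):
    """Train a network that predicts per-pixel position change (e.g. optical flow)
    from two consecutive frames, using labelled position change as ground truth."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # difference between predicted value and ground truth

    for _ in range(num_epochs):
        for prev_frame, cur_frame, change_gt in data_loader:
            prev_frame = prev_frame.to(device)
            cur_frame = cur_frame.to(device)
            change_gt = change_gt.to(device)
            change_pred = model(prev_frame, cur_frame)  # predicted position change, (B, 2, H, W)
            loss = criterion(change_pred, change_gt)    # network loss value
            optimizer.zero_grad()
            loss.backward()                             # adjust network parameters
            optimizer.step()
    return model
```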
  • step S102 an image feature of the first image is acquired as a first feature; and a second feature is acquired based on the position change information.
  • the first feature and the second feature may be acquired in any order: the first feature may be acquired before the second feature, the second feature before the first feature, or both may be acquired simultaneously.
  • a pre-trained neural network may be used to obtain the image feature of the first image as the first feature, for example, the VGG16_bn model may be used to extract the first feature.
  • a pre-trained neural network may be used to obtain the second feature based on the position change information, for example, a backbone model may be used to extract the second feature. It should be understood by those skilled in the art that the above specific manner for obtaining the second feature is only for illustration, which is not limited by the embodiments of the present disclosure.
  • the first feature and the second feature may correspond to feature maps of the same size.
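  • as an illustrative sketch only, the two feature extractors could look as follows; the VGG16_bn truncation point and the small convolutional backbone for the optical-flow branch are assumptions made for the example, not requirements of the disclosure.

```python
import torch.nn as nn
from torchvision.models import vgg16_bn

class ImageFeatureExtractor(nn.Module):
    """First feature: image feature of the first image (VGG16_bn trunk)."""
    def __init__(self):
        super().__init__()
        # keep only the convolutional layers up to (and excluding) the fourth pooling layer,
        # giving a 512-channel map at 1/8 of the input resolution (assumed cut-off)
        self.trunk = vgg16_bn(weights=None).features[:33]

    def forward(self, image):                 # image: (B, 3, H, W)
        return self.trunk(image)              # (B, 512, H/8, W/8)

class FlowFeatureExtractor(nn.Module):
    """Second feature: feature of the position change (optical flow) information."""
    def __init__(self, out_channels=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(2, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, flow):                  # flow: (B, 2, H, W)
        return self.trunk(flow)               # (B, 512, H/8, W/8), same size as the first feature
```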
  • step S103 the first feature is enhanced based on the second feature to generate a fusion feature.
  • the objects in the first image differ from one another in one or more aspects (for example, the crowds, buildings and vehicles in the first image differ in size and appearance), and these differences are reflected in the first feature of the first image. The position change information, in turn, represents differences in motion between the objects (for example, if a person is located at point A in the first image and at point B in the previous frame, the person's position change information in the first image can be determined from the change of point A relative to point B; if a building is located at point C in both the first image and the previous frame, its position change information can be determined from the change of point C relative to point C, that is, the building is static), and these motion differences are reflected in the second feature derived from the position change information. Therefore, using the second feature to enhance the first feature and generate the fusion feature further strengthens the differences between objects that are reflected in the first feature; in other words, the differences between objects become more distinct and refined in the fusion feature.
  • common methods of feature fusion are concatenating two features, which increases the number of channels, or adding two features, which keeps the number of channels unchanged after fusion.
  • in one example, the second feature may be used as a mask and multiplied with the first feature to obtain the fusion feature, as sketched below.
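  • a minimal sketch of such mask-based fusion, assuming both feature maps have the same shape; the sigmoid squashing is an assumption of the example, not part of the disclosure.

```python
import torch

def fuse_features(image_feat: torch.Tensor, flow_feat: torch.Tensor) -> torch.Tensor:
    """Enhance the first feature using the second feature as a mask (element-wise product)."""
    mask = torch.sigmoid(flow_feat)  # map the second feature to (0, 1); illustrative choice
    return image_feat * mask         # fusion feature, same shape and channel count as image_feat
```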
  • step S104 the detection result of the target object in the first image is determined according to the fusion feature.
  • the target object may be one type of object in the first image (for example, a crowd) or several types of objects (for example, crowds and vehicles, or cattle, horses and sheep); the target object can be determined according to the user's selection or automatically according to a preset rule.
  • the detection result can represent the information of the target object in one or more aspects (for example, the location, quantity, density, etc. of the target object), and the coverage of the detection result can be determined according to the user's choice, or can be automatically determined according to preset rules.
  • those skilled in the art should understand that the above definitions of the target object and the detection result are merely illustrative and do not limit the embodiments of the present disclosure.
  • in the embodiments of the present disclosure, the position change information of at least one pixel in the first image relative to the corresponding pixel in the previous frame is acquired; the first feature of the first image and the second feature of the position change information are acquired respectively; the first feature is enhanced based on the second feature to generate a fusion feature; and the detection result of the target object in the first image is finally determined according to the fusion feature. Because the position change information between corresponding pixels of two adjacent frames is used, the temporal information of the video is exploited, which increases the accuracy of the detection result.
  • moreover, in videos such as drone footage the target objects are small and errors are hard to avoid even with manual observation; since the detection method of this embodiment uses the position change information and enhances the first feature when generating the fusion feature, the accuracy of the detection result is increased, that is, a relatively accurate detection result can be obtained.
  • the position change information includes optical flow information.
  • optical flow information represents the instantaneous velocity of pixel motion of spatially moving objects on the observation imaging plane. The optical flow information of the first image may be obtained with the Lucas-Kanade (LK) algorithm.
  • however, the LK algorithm places strong constraints on the video, such as constant brightness, a short interval between adjacent frames and similar motion of adjacent pixels, so its accuracy and efficiency are limited.
  • to obtain optical flow information more efficiently and accurately, deep learning methods may also be used, for example the FlowNet or FlowNet2 model.
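  • for illustration, dense optical flow between the previous frame and the first image can be computed with a classical one-call method such as OpenCV's Farneback algorithm (used here only because it is compact; the disclosure itself refers to LK and FlowNet/FlowNet2).

```python
import cv2
import numpy as np

def dense_optical_flow(prev_frame: np.ndarray, cur_frame: np.ndarray) -> np.ndarray:
    """Per-pixel position change (dx, dy) of the current frame relative to the previous one."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    # positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    # (typical default values, not prescribed by the disclosure)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow  # shape (H, W, 2)
```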
  • the first feature of the first image and the second feature of the position change information may be acquired in the following manner: the image feature of the first image is acquired as the first feature, and the optical flow feature obtained from the optical flow information is used as the second feature.
  • the image feature can represent at least one dimension of the pixels of the first image, and the optical flow feature can represent the position change rate of the pixels of the first image.
  • the first feature may be enhanced based on the second feature in the following manner to generate the fusion feature: first, the position change rate of at least one pixel of the first image is determined according to the second feature; next, for each pixel of the at least one pixel, an enhancement parameter of a target feature element is determined according to the position change rate of that pixel, where the target feature element is the feature element corresponding to that pixel in the first feature; finally, based on each enhancement parameter, differentiated enhancement processing is performed on the corresponding target feature elements of the first feature to generate the fusion feature.
  • the position change information can represent the difference in movement speed between the objects in the first image, and this difference is reflected in the second feature of the position change information; the difference in movement speed between the target object and the other objects is therefore reflected in the second feature. For example, if the target object is a pedestrian, its movement speed is higher than that of other objects such as buildings.
  • the pixels of the first image can be divided into different sets of regions, each set constituting an object, and different objects move at different speeds, that is, the pixels of different objects have different position change rates. The position change rate of each pixel can therefore be determined from the second feature, and pixels with different position change rates represent different objects. Accordingly, the enhancement parameter of a target feature element can be determined according to the position change rate of its pixel, and the target feature element is then enhanced to obtain a fusion sub-feature of the fusion feature; in other words, a fusion sub-feature for that target feature element is obtained.
  • in this way, different feature elements are enhanced to different degrees; that is, differentiated enhancement processing is performed on the feature elements of the first feature as a whole.
  • the enhanced first feature forms the fusion feature; equivalently, all fusion sub-features together constitute the fusion feature.
  • the enhancement parameter can indicate whether to enhance or the degree of enhancement; that is, the pixels of the target object and the pixels of other objects can be distinguished by whether, or how strongly, their feature elements are enhanced, so as to strengthen the difference between the target object and the other objects in the first feature. For example, only the feature elements corresponding to pixels of the target object may be enhanced, or those elements may be enhanced to a higher degree while the elements of other pixels are enhanced to a lower degree. Further, the movement speed of the target object is larger than that of the other objects, and accordingly the position change rate of the pixels of the target object is also larger.
  • the enhancement parameter of the target feature element may be determined according to the position change rate of the pixel point and a preset standard change rate. For example, if the standard change rate is a threshold, the feature elements corresponding to the pixels whose position change rate is greater than the threshold value are enhanced, and the feature elements corresponding to the pixels whose position change rate is less than or equal to the threshold value are not enhanced.
  • alternatively, the standard change rate can be used as a reference value, and the degree of enhancement of a feature element is determined by the relationship between the position change rate of the pixel and this reference value: in response to the position change rate of the pixel being equal to the standard change rate, the enhancement parameter of the target feature element is determined to be a preset standard enhancement parameter; in response to the position change rate of the pixel being greater than the standard change rate, the enhancement parameter of the target feature element is determined to be greater than the standard enhancement parameter; and in response to the position change rate of the pixel being smaller than the standard change rate, the enhancement parameter of the target feature element is determined to be smaller than the standard enhancement parameter.
  • in this way, the position change rate of each pixel is determined from the second feature of the position change information; according to the differences in position change rate, the enhancement parameters of the feature elements corresponding to different pixels are determined, so that the feature elements of only some pixels are enhanced, or all feature elements are enhanced to different degrees. This further strengthens the difference between the target object and the other objects in the first feature, thereby increasing the accuracy and efficiency of the detection result.
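  • a minimal sketch of such differentiated enhancement, assuming the flow has been resized to the feature-map resolution and that the enhancement parameter grows linearly with the position change rate; the linear relation and the default values are assumptions of the example, not part of the disclosure.

```python
import torch

def differential_enhancement(image_feat: torch.Tensor,
                             flow: torch.Tensor,
                             standard_rate: float = 1.0,
                             standard_gain: float = 1.0) -> torch.Tensor:
    """Enhance each feature element according to the position change rate of its pixel.

    image_feat: first feature, (B, C, H, W); flow: position change, (B, 2, H, W)."""
    rate = torch.linalg.norm(flow, dim=1, keepdim=True)  # position change rate per pixel, (B, 1, H, W)
    # enhancement parameter: equal to the standard parameter when rate == standard_rate,
    # greater when the rate is greater, smaller when the rate is smaller
    enhancement = standard_gain * rate / standard_rate
    return image_feat * enhancement                      # fusion feature (per-element fusion sub-features)
```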
  • the detection result of the target object in the first image may be determined according to the fusion feature in the following manner: first, a density map of the target object is generated according to the fusion feature; next, the number of target objects in the first image is determined based on the number of density points in the density map that refer to the target object (for example, by summing the density points).
  • the density map is used to indicate information such as the position, quantity and density of the target object in the first image; it has density points that refer to the target object, and its size may be equal to the size of the feature maps corresponding to the first feature and the second feature. The number of target objects can therefore be determined according to the number of density points that refer to the target object in the density map, that is, by summing the density points.
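  • counting by summation over the density map can be sketched as follows; the tensor layout is an assumption of the example.

```python
import torch

def count_from_density_map(density_map: torch.Tensor) -> float:
    """Estimated number of target objects = sum of the density points in the map.

    density_map: (B, 1, H, W); each pixel holds the estimated object density at that location."""
    return density_map.sum().item()
```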
  • a pre-trained neural network can be used to determine the density map.
  • a decoder model such as the SFA (Stochastic Frontier Approach) decoder can be used to determine the density map. Such a model can take multiple feature maps as input to extract information at different scales, so the determined density map is more accurate.
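  • the decoder is not detailed beyond taking feature maps and regressing a density map; a minimal single-scale upsampling decoder, given purely as an assumed illustration (the SFA decoder referred to above may differ), could be:

```python
import torch.nn as nn

class DensityMapDecoder(nn.Module):
    """Minimal decoder: maps a fusion feature to a non-negative density map."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1), nn.ReLU(inplace=True),  # non-negative density values
        )

    def forward(self, fusion_feat):   # (B, C, H/8, W/8)
        return self.head(fusion_feat)  # density map, (B, 1, H/2, W/2) with these factors
```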
  • for example, when the video to be detected is the street view video to which the first image shown in FIG. 2 belongs and the target object is a person in the street view, the number of pedestrians in the first image, that is, the number of pedestrians at the moment corresponding to the first image, can be determined with the above target detection method.
  • corresponding actions can then be taken according to the number of pedestrians. For example, when the number of pedestrians exceeds a preset number threshold, an alarm message can be issued to alert pedestrians and managers that the street is currently too crowded.
  • the accuracy and efficiency of the detection result can be further improved.
  • in some embodiments, quantity change information of the target objects in the video to be detected may also be generated in the following manner: first, obtain first quantity information of the target objects in the first image and second quantity information of the target objects in a second image, where the first image and the second image are each a frame of the video to be detected; next, obtain first time information of the first image and second time information of the second image, where the first time information is the time of the first image in the video to be detected and the second time information is the time of the second image in the video to be detected (the first time information may be earlier or later than the second time information); finally, determine the quantity change information according to the first quantity information, the first time information, the second quantity information and the second time information, where the quantity change information is used to indicate how the number of target objects in the video to be detected changes over time.
  • the number of second images is not limited and may be one or more; that is, the number of target objects may be obtained for a single frame or for multiple frames. Accordingly, the second time information acquired subsequently may also be one or more, and the quantity change information generated subsequently may concern two images (a first image and one second image) or multiple images (a first image and at least two second images).
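  • a simple sketch of deriving quantity change information from per-frame counts and their time information; the record layout is an illustrative assumption.

```python
from typing import List, Tuple

def quantity_change_info(counts_and_times: List[Tuple[float, float]]) -> List[dict]:
    """Change in the number of target objects over time.

    counts_and_times: (count, time_in_seconds) for the first image and one or
    more second images, in any order."""
    records = sorted(counts_and_times, key=lambda item: item[1])  # order by time
    changes = []
    for (prev_count, prev_t), (cur_count, cur_t) in zip(records, records[1:]):
        changes.append({
            "from_time": prev_t,
            "to_time": cur_t,
            "count_change": cur_count - prev_count,  # how the number changed between the two moments
        })
    return changes
```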
  • the method of acquiring the number of target objects in the second image (that is, the second quantity information) may be the same as or different from the above-mentioned method of acquiring the number of target objects in the first image (that is, the first quantity information); this is not specifically limited in this embodiment.
  • the time in the video to be detected can be a relative time, that is, a time measured from the moment the video starts: for example, if the total duration of the video is 25 minutes, the time of the start of the video is 00:00 and the time of the end of the video is 00:25. The time can also be an absolute time, that is, the actual recording time: for the same 25-minute video, the start time might be 2020.11.13 8:00 and the end time 2020.11.13 8:25.
  • when the video to be detected is the street view video to which the first image shown in FIG. 2 belongs and the target object is a person in the street view, the number of pedestrians in the first image and in at least one second image can therefore be determined, that is, the change in the number of pedestrians in the street view video can be determined.
  • in the present disclosure, by acquiring the number of target objects in other frames of the video to be detected and combining it with the time information of each frame to generate the quantity change information, the changes and trends in the number of target objects within the time period covered by the video can be obtained, which further increases the comprehensiveness of the detection results.
  • for example, the change trend of the number of people over the twelve months of a year can be obtained, so that consumption habits can be analysed and the peak months and quarters of consumption (that is, the peak consumption season) and the trough months and quarters (that is, the consumption off-season) can be identified; the change trend of the number of people during business hours can also be obtained, giving the daily peak and trough times of consumption.
  • the information obtained above can be used as guidance data for business operation or property management, so as to achieve the purpose of scientific management.
  • the change trend of traffic flow before and after holidays can be obtained, so that travel data can be counted, which can then be used as guidance data for expressway management.
  • in some embodiments, the detection result of the target object in the first image may also be determined according to the fusion feature in the following manner: first, a density map of the target object is generated according to the fusion feature; next, the number of target objects within a preset area of the first image is determined according to the position of each target object indicated in the density map and the preset area in the first image.
  • the density map is used to indicate information such as the position, quantity, density, etc. of the target object in the first image, and the size of the density map may be equal to the size of the feature maps corresponding to the first feature and the second feature.
  • the density map may contain the target objects of the first image together with label information such as a position and/or count mark for each target object. The number of target objects can therefore be determined according to the positions of the target objects in the density map, that is, by summing the target objects in the density map.
  • a pre-trained neural network can be used to determine the density map.
  • a decoder model such as the SFA (Stochastic Frontier Approach) decoder can be used to determine the density map. Such a model can take multiple feature maps as input to extract information at different scales, so the determined density map is more accurate.
  • the preset area can be an area where the flow of people is controlled, such as a place with limited capacity into which only a certain number of people are allowed, or a dangerous area such as a construction site from which pedestrians are prohibited, that is, where the flow of people needs to be kept at zero.
  • prompt information may be generated in response to the number of target objects in the preset area being greater than a preset number threshold. For example, if the flow of people in a restricted area exceeds the maximum allowed, an alarm is issued and further entry is prohibited; if pedestrians enter a construction area, an alarm is issued to remind them to leave in time; and in outdoor live games or sports such as football and basketball, the activity area of the players can be monitored and an alarm issued if they enter a foul area.
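  • a sketch of counting target objects inside a preset area and deciding whether to generate prompt information; the rectangular area and the position format are assumptions of the example.

```python
from typing import List, Tuple

def count_in_preset_area(positions: List[Tuple[float, float]],
                         area: Tuple[float, float, float, float],
                         max_allowed: int) -> Tuple[int, bool]:
    """Count target objects whose density-map positions fall inside a preset area.

    positions: (x, y) of each target object taken from the density map;
    area: (x_min, y_min, x_max, y_max); max_allowed: preset number threshold.
    Returns the count and whether prompt information (e.g. an alarm) should be generated."""
    x_min, y_min, x_max, y_max = area
    count = sum(1 for x, y in positions if x_min <= x <= x_max and y_min <= y <= y_max)
    return count, count > max_allowed
```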
  • using the number of target objects in the preset area as the detection result makes it possible to detect and control the flow of people in a specific area, which increases the pertinence and accuracy of the detection and broadens the range of applications of the detection method.
  • FIG. 4 shows a process of object detection according to an embodiment of the present disclosure.
  • the position change information is optical flow information
  • the target detection result is a density map.
  • the process is as follows: first perform optical flow prediction, then perform optical flow feature extraction and image feature extraction respectively, then perform feature fusion with optical flow features and image features to obtain fusion features, and finally use fusion features for density map prediction.
  • specifically, optical flow prediction is performed first: an optical flow extraction network is used to extract optical flow information from the first image and the frame preceding it. A neural network is then used to extract optical flow features from the extracted optical flow information, and another neural network (such as VGG16_bn) is used to extract image features from the first image. The optical flow features are multiplied, as a mask, with the image features to obtain the fusion features. Finally, the fusion features are sent to the decoder (e.g. SFA) to predict the density map.
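  • putting the steps of FIG. 4 together, an end-to-end sketch might look as follows; every module name is a placeholder for the corresponding network described above, and the sigmoid masking is an assumption of the example rather than part of the disclosure.

```python
import torch

def detect_targets(prev_frame, cur_frame,
                   flow_net, flow_backbone, image_backbone, decoder):
    """Optical flow prediction -> feature extraction -> mask fusion -> density map prediction."""
    with torch.no_grad():
        flow = flow_net(prev_frame, cur_frame)               # optical flow information
        flow_feat = flow_backbone(flow)                      # second feature (optical flow feature)
        image_feat = image_backbone(cur_frame)               # first feature (e.g. VGG16_bn)
        fusion_feat = image_feat * torch.sigmoid(flow_feat)  # feature fusion by mask multiplication
        density_map = decoder(fusion_feat)                   # e.g. an SFA-style decoder
    count = density_map.sum().item()                         # number of target objects
    return density_map, count
```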
  • a target detection apparatus is provided.
  • FIG. 5 shows a schematic structural diagram of the apparatus, which includes: a first acquisition module 501 for acquiring position change information of at least one pixel in the first image relative to the corresponding pixel in the previous frame; a second acquisition module 502 for acquiring the image feature of the first image as the first feature and obtaining the second feature based on the position change information;
  • a fusion module 503 for performing enhancement processing on the first feature based on the second feature to generate a fusion feature; and a detection module 504 for determining the detection result of the target object in the first image according to the fusion feature.
  • the position change information includes optical flow information
  • the second obtaining module is configured to: use the optical flow feature obtained from the optical flow information as the second feature.
  • the fusion module is configured to: determine the position change rate of at least one pixel of the first image according to the second feature; for each pixel of the at least one pixel, determine the enhancement parameter of the target feature element according to the position change rate of that pixel, where the target feature element is the feature element corresponding to that pixel in the first feature; and, based on each enhancement parameter, perform differentiated enhancement processing on the corresponding target feature elements of the first feature to generate the fusion feature.
  • the fusion module is further configured to: determine the enhancement parameter of the target feature element according to the position change rate of the pixel point and a preset standard change rate.
  • the fusion module is further configured to: in response to the position change rate of the pixel being equal to the standard change rate, determine that the enhancement parameter of the target feature element is a preset standard enhancement parameter; in response to the position change rate of the pixel being greater than the standard change rate, determine that the enhancement parameter of the target feature element is greater than the standard enhancement parameter; or, in response to the position change rate of the pixel being smaller than the standard change rate, determine that the enhancement parameter of the target feature element is smaller than the standard enhancement parameter.
  • the detection module is configured to: generate a density map of the target object according to the fusion feature; and determine first quantity information of the target objects in the first image based on the number of density points in the density map that refer to the target object.
  • the detection module is further configured to: acquire second quantity information of the target object in a second image, where the second image is a frame of the video to be detected; acquire first time information and second time information, where the first time information is the time of the first image in the video to be detected and the second time information is the time of the second image in the video to be detected; and generate quantity change information according to the first quantity information, the first time information, the second quantity information and the second time information, where the quantity change information is used to indicate how the number of target objects in the video to be detected changes over time.
  • the detection module is configured to: generate a density map of the target object according to the fusion feature; and determine the number of target objects within the preset area of the first image according to the position of each target object indicated in the density map.
  • the detection module is further configured to generate prompt information in response to the number of target objects in the preset area being greater than a preset number threshold.
  • a third aspect of the embodiments of the present disclosure provides an electronic device; referring to FIG. 6, which shows the structure of the device, the device includes a memory and a processor, the memory being used for storing computer instructions that can be run on the processor, and the processor being configured to perform target detection based on the method of the first aspect when executing the computer instructions.
  • a fourth aspect of the embodiments of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method described in the first aspect.
  • first and second are used for descriptive purposes only, and should not be construed as indicating or implying relative importance.
  • the term “plurality” refers to two or more, unless expressly limited otherwise.

Abstract

The present disclosure relates to a target detection method and apparatus, a device and a storage medium. The target detection method comprises: obtaining position change information of at least one pixel in a first image with respect to a corresponding pixel in a previous image frame, the first image being an image frame in a video to be detected; obtaining an image feature of the first image as a first feature; obtaining a second feature on the basis of the position change information; performing enhancement processing on the first feature on the basis of the second feature to generate a fusion feature; and determining the detection result of a target object in the first image according to the fusion feature.

Description

Target detection method, apparatus, device and storage medium
Cross-reference to related applications
This application is based on and claims priority to the Chinese patent application with application No. 202110352206.0, filed on March 31, 2021, the entire content of which is incorporated herein by reference.
Technical field
The present disclosure relates to the technical field of image processing, and in particular, to a target detection method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence technology, objects in images can be detected automatically, reducing labor costs and improving efficiency and accuracy.
Summary
The present disclosure provides a target detection method and apparatus, a device and a storage medium to address the deficiencies in the related art.
According to a first aspect of the embodiments of the present disclosure, there is provided a target detection method, including: acquiring position change information of at least one pixel in a first image relative to a corresponding pixel in a previous frame of image, where the first image is a frame in a video to be detected; acquiring an image feature of the first image as a first feature; acquiring a second feature based on the position change information; performing enhancement processing on the first feature based on the second feature to generate a fusion feature; and determining a detection result of a target object in the first image according to the fusion feature.
According to a second aspect of the embodiments of the present disclosure, there is provided a target detection apparatus, including: a first acquisition module configured to acquire position change information of at least one pixel in a first image relative to a corresponding pixel in a previous frame of image, the first image being a frame in a video to be detected; a second acquisition module configured to acquire an image feature of the first image as a first feature and to acquire a second feature based on the position change information; a fusion module configured to perform enhancement processing on the first feature based on the second feature to generate a fusion feature; and a detection module configured to determine a detection result of a target object in the first image according to the fusion feature.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device including a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method of the first aspect when executing the computer instructions.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method of the first aspect.
It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.
附图说明Description of drawings
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本公开的实施例,并与说明书一起用于解释本公开的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description serve to explain the principles of the disclosure.
图1是本公开实施例示出的目标检测方法的流程图;1 is a flowchart of a target detection method shown in an embodiment of the present disclosure;
图2是本公开实施例示出的第一图像及其前一帧图像的示意图;FIG. 2 is a schematic diagram of a first image and a previous frame image thereof shown in an embodiment of the present disclosure;
图3是本公开实施例示出的第一图像的位置变化信息的示意图;3 is a schematic diagram of position change information of a first image shown in an embodiment of the present disclosure;
图4是本公开实施例示出的目标检测的过程示意图;4 is a schematic diagram of a process of target detection shown in an embodiment of the present disclosure;
图5是本公开实施例示出的目标检测装置的结构示意图;5 is a schematic structural diagram of a target detection apparatus shown in an embodiment of the present disclosure;
图6是本公开实施例示出的电子设备的结构示意图。FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the illustrative examples below are not intended to represent all implementations consistent with this disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in this disclosure and the appended claims, the singular forms "a," "the," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
应当理解,尽管在本公开可能采用术语第一、第二、第三等来描述各种信息,但 这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited by these terms. These terms are only used to distinguish the same type of information from each other. For example, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information, without departing from the scope of the present disclosure. Depending on the context, the word "if" as used herein can be interpreted as "at the time of" or "when" or "in response to determining."
随着人工智能技术的发展,图像中的目标可以自动检测,降低了人工成本,提高了效率和准确率。相关技术中,针对视频的图像帧进行检测时,与普通图像的目标检测完全一致,然而其未对视频的特征进行充分利用,导致检测结果不准确。With the development of artificial intelligence technology, objects in images can be automatically detected, reducing labor costs and improving efficiency and accuracy. In the related art, when detecting an image frame of a video, it is completely consistent with the target detection of an ordinary image, but it does not fully utilize the features of the video, resulting in inaccurate detection results.
基于此,本公开实施例的第一方面提供了一种目标检测方法,请参照附图1,其示出了该方法的流程,包括步骤S101至步骤S104。Based on this, a first aspect of the embodiments of the present disclosure provides a target detection method. Please refer to FIG. 1 , which shows a flow of the method, including steps S101 to S104 .
其中,该目标检测方法所针对的待检测对象可以是图像,也可以是视频。当待检测对象是视频时,可以批量处理视频的每帧图像,或依次处理视频的每帧图像。为方便描述,本实施例以视频的某一帧图像作为待检测对象进行描述。目标检测的目的是对待检测对象中的目标对象进行检测,以获得检测结果,检测结果可以表示目标对象一方面或多方面的信息(例如,目标对象的位置、数量、密度等信息)。The object to be detected targeted by the target detection method may be an image or a video. When the object to be detected is a video, each frame of the video can be processed in batches, or each frame of the video can be processed sequentially. For convenience of description, this embodiment takes a certain frame of video as the object to be detected for description. The purpose of target detection is to detect the target object in the object to be detected to obtain the detection result, and the detection result can represent one or more aspects of the information of the target object (for example, the position, number, density and other information of the target object).
在步骤S101中,获取第一图像中的至少一个像素点相对前一帧图像中的对应像素点的位置变化信息,所述第一图像为待检测视频中的一帧图像。其中,所述第一图像中的至少一个像素点与所述前一帧图像中的对应像素点对应于同一对象。In step S101, the position change information of at least one pixel in the first image relative to the corresponding pixel in the previous frame of image is acquired, and the first image is a frame of image in the video to be detected. Wherein, at least one pixel in the first image corresponds to the same object as the corresponding pixel in the previous frame of image.
其中,待检测视频可以为针对特定的空间录制的视频,该空间内包含目标对象,同时还可以包含其他对象。第一图像和其前一帧图像可如图2所示,第一图像可以为待检测视频中的第二帧图像之后(包括第二帧图像)的任意一帧图像,因为第一帧图像可能会没有前一帧图像。Wherein, the video to be detected may be a video recorded for a specific space, and the space may contain the target object and other objects at the same time. The first image and its previous frame can be as shown in Figure 2, and the first image can be any frame after the second frame image in the video to be detected (including the second frame image), because the first frame image may There will be no previous frame image.
在一个示例中,待检测视频可以为监控视频或无人机视频,也就是说,待检测视频可以为固定的监控摄像头拍摄的视频,或是通过飞行的无人机拍摄的视频。例如,图2中所示出的第一图像和其前一帧图像所属的待检测视频就是通过无人机拍摄的街景视频。监控视频中的包含人群等目标对象的图块往往尺寸较大,对于人群等目标对象的检测任务(例如计数人物)较为简单;无人机视频中的包含人群等目标对象的图块往往尺寸很小,依靠人工观察进行检测容易发生错误,使用本实施例提供的检测方法能够避免上述错误。In one example, the video to be detected may be a surveillance video or a drone video, that is, the video to be detected may be a video captured by a fixed surveillance camera, or a video captured by a flying drone. For example, the to-be-detected video to which the first image shown in FIG. 2 and its previous frame image belong is a street view video captured by a drone. The tiles containing target objects such as crowds in surveillance videos are often large in size, and the detection task of target objects such as crowds (such as counting people) is relatively simple; the tiles containing target objects such as people in drone videos are often very large in size. It is small, and detection by manual observation is prone to errors, and the above errors can be avoided by using the detection method provided in this embodiment.
在一个示例中,目标对象可以为下述至少一种:人物、车辆和动物。In one example, the target object may be at least one of the following: a person, a vehicle, and an animal.
其中,两帧图像对应于同一对象的对应像素点之间的位置变化,可能由于待检测视频对应的空间中的对象的客观移动造成的,也可能由于无人机等视频采集设备的运动造成的,还可能是上述两方面原因共同造成的。由于位置变化信息可以表示两帧图像中的对应像素点的位置变化,而两帧图像中的各个相对应的对象均是由若干连续像素点构成的,因此同一对象的所有像素点的位置变化信息可以是相同的。例如,图2所示出的第一图像中的像素点相对前一帧图像中的对应像素点的位置变化信息如图3所示。本领域技术人员应当理解,以上位置变化信息的具体释义仅为示意,本公开实施例对此不进行限制。Among them, the position change between the corresponding pixels of the two frames of images corresponding to the same object may be caused by the objective movement of the object in the space corresponding to the video to be detected, or may be caused by the movement of video capture devices such as drones. , may also be caused by a combination of the above two reasons. Since the position change information can represent the position change of the corresponding pixels in the two frames of images, and each corresponding object in the two frames of images is composed of several consecutive pixels, the position change information of all the pixels of the same object can be the same. For example, the position change information of the pixel point in the first image shown in FIG. 2 relative to the corresponding pixel point in the previous frame image is shown in FIG. 3 . Those skilled in the art should understand that the above specific interpretation of the location change information is only for illustration, which is not limited by the embodiments of the present disclosure.
本步骤中,可以采用预先训练的神经网络获取位置变化信息。训练神经网络时,可以采集大量的视频帧作为样本,将这些视频帧中的对应像素点的位置变化信息作为标签,然后通过将样本输入待训练的神经网络,比较输出的位置变化信息(预测值)与作为标签的位置变化信息(真值)间的差异,得出网络损失值,并进一步通过网络损失值调整待训练神经网络的网络参数,然后通过反复迭代,不断优化,最终得到符合精度要求的训练完成的神经网络。本领域技术人员应当理解,以上获取位置变化信息的具体方式仅为示意,本公开实施例对此不进行限制。In this step, a pre-trained neural network may be used to obtain position change information. When training a neural network, a large number of video frames can be collected as samples, and the position change information of the corresponding pixels in these video frames can be used as labels, and then the output position change information (predicted value) can be compared by inputting the samples into the neural network to be trained. ) and the position change information (true value) as the label, the network loss value is obtained, and the network parameters of the neural network to be trained are further adjusted by the network loss value, and then through repeated iterations, continuous optimization, and finally the accuracy requirements are obtained. The trained neural network is completed. Those skilled in the art should understand that the above specific manner of acquiring the location change information is only for illustration, which is not limited in the embodiments of the present disclosure.
在步骤S102中,获取所述第一图像的图像特征作为第一特征;基于所述位置变化信息获取第二特征。In step S102, an image feature of the first image is acquired as a first feature; and a second feature is acquired based on the position change information.
其中,获取第一特征和获取第二特征的顺序并无限制,也就是说,可以先获取第一特征,再获取第二特征,也可以先获取第二特征,再获取第一特征,还可以同时获取第一特征和第二特征。There is no restriction on the order of acquiring the first feature and acquiring the second feature, that is, the first feature may be acquired first, and then the second feature may be acquired, or the second feature may be acquired first, and then the first feature may be acquired, or The first feature and the second feature are acquired simultaneously.
本步骤中,可以采用预先训练的神经网络获取所述第一图像的图像特征作为第一特征,例如采用VGG16_bn模型提取第一特征。本领域技术人员应当理解,以上获取第一图像的图像特征的具体方式仅为示意,本公开实施例对此不进行限制。In this step, a pre-trained neural network may be used to obtain the image feature of the first image as the first feature, for example, the VGG16_bn model may be used to extract the first feature. Those skilled in the art should understand that the above specific manner of acquiring the image feature of the first image is only for illustration, which is not limited in the embodiments of the present disclosure.
本步骤中,可以采用预先训练的神经网络基于所述位置变化信息获取第二特征,例如采用backbone模型提取第二特征。本领域技术人员应当理解,以上获取第二特征的具体方式仅为示意,本公开实施例对此不进行限制。In this step, a pre-trained neural network may be used to obtain the second feature based on the position change information, for example, a backbone model may be used to extract the second feature. It should be understood by those skilled in the art that the above specific manner for obtaining the second feature is only for illustration, which is not limited by the embodiments of the present disclosure.
另外,第一特征和第二特征可以对应相同尺寸的特征图。In addition, the first feature and the second feature may correspond to feature maps of the same size.
在步骤S103中,基于所述第二特征对所述第一特征进行增强处理,生成融合特征。In step S103, the first feature is enhanced based on the second feature to generate a fusion feature.
其中,第一图像内的各个对象在一个方面或多个方面存在差异(例如,第一图像 内的人群、建筑物、车辆在外形尺寸等上存在差异),这些差异会体现在第一图像的第一特征中,而位置变化信息可以表示各个对象在运动方面的差异(例如,某个人在第一图像中的位置为A点,该人在前一帧图像中的位置为B点,该人在第一图像中的位置变化信息可以通过A点相对B点的位置变化信息确定;再例如,某个建筑物在第一图像中的位置为C点,该建筑物在前一帧图像中的位置也为C点,该建筑物在第一图像中的位置变化信息可以通过C点相对C点的位置变化信息确定,即该建筑物的运动是静止的),上述运动方面的差异会体现在位置变化信息的第二特征中。因此利用第二特征对第一特征进行增强处理,生成融合特征能够进一步强化各个对象体现在第一特征中的差异,也就是说,体现在融合特征中的各个对象的差异会更加明显和细化。Wherein, each object in the first image is different in one or more aspects (for example, the crowds, buildings, vehicles in the first image are different in external dimensions, etc.), and these differences will be reflected in the first image. In the first feature, the position change information can represent the difference in motion of each object (for example, the position of a person in the first image is point A, the position of the person in the previous frame image is point B, the person The position change information in the first image can be determined by the position change information of point A relative to point B; for another example, the position of a certain building in the first image is point C, and the position of the building in the previous frame of image The position is also point C, the position change information of the building in the first image can be determined by the position change information of point C relative to point C, that is, the movement of the building is static), the difference in the above movement will be reflected in in the second feature of the position change information. Therefore, using the second feature to enhance the first feature and generating the fusion feature can further strengthen the difference of each object reflected in the first feature, that is to say, the difference of each object embodied in the fusion feature will be more obvious and refined. .
特征融合的常用方法是对两个特征拼接从而增加通道数，或者对两个特征做加法维持融合后的通道数不变。在一个示例中，可以将第二特征作为掩膜（mask）与第一特征相乘，得到融合特征。Common methods of feature fusion are to concatenate two features so as to increase the number of channels, or to add two features so that the number of channels remains unchanged after fusion. In one example, the second feature may be used as a mask and multiplied with the first feature to obtain the fusion feature.
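As a purely illustrative sketch of the fusion options mentioned above (channel concatenation, element-wise addition, and mask-style multiplication), assuming first_feature and second_feature are feature maps of the same size; the sigmoid normalisation is an assumption made only so that the mask lies in (0, 1):

    import torch

    first_feature = torch.randn(1, 512, 32, 32)
    second_feature = torch.randn(1, 512, 32, 32)

    # Option 1: concatenation along the channel dimension (channel count doubles).
    fused_concat = torch.cat([first_feature, second_feature], dim=1)   # (1, 1024, 32, 32)

    # Option 2: element-wise addition (channel count unchanged).
    fused_add = first_feature + second_feature                         # (1, 512, 32, 32)

    # Option 3 (the example given in the description): treat the second feature
    # as a mask and multiply it element-wise with the first feature.
    mask = torch.sigmoid(second_feature)       # assumed squashing to (0, 1)
    fusion_feature = first_feature * mask      # (1, 512, 32, 32)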
在步骤S104中,根据所述融合特征确定所述第一图像中目标对象的检测结果。In step S104, the detection result of the target object in the first image is determined according to the fusion feature.
其中，目标对象可以是第一图像中的一种对象（例如，人群），目标对象还可以是第一图像中的多种对象（例如，人群和车流，或者牛、马、羊）；目标对象可以根据用户的选择进行确定，也可以根据预设规则自动确定。检测结果可以表示目标对象在一个方面或多个方面的信息（例如，目标对象的位置、数量、密度等信息），检测结果的涵盖范围可以根据用户的选择进行确定，也可以根据预设规则自动确定。本领域技术人员应当理解，以上目标对象、检测结果的具体释义仅为示意，本公开实施例对此不进行限制。The target object may be one type of object in the first image (for example, a crowd), or multiple types of objects in the first image (for example, crowds and vehicle flows, or cattle, horses and sheep); the target object may be determined according to the user's selection, or automatically determined according to a preset rule. The detection result may represent information of the target object in one or more aspects (for example, the position, quantity, density and other information of the target object), and the coverage of the detection result may be determined according to the user's selection or automatically determined according to a preset rule. Those skilled in the art should understand that the above explanations of the target object and the detection result are merely illustrative and are not limited by the embodiments of the present disclosure.
本公开的实施例中，通过获取第一图像中的至少一个像素点相对于前一帧图像中的对应像素点的位置变化信息，并分别获取第一图像的第一特征和上述位置变化信息的第二特征，以基于第二特征对第一特征进行增强处理，生成融合特征，最后根据融合特征确定第一图像中目标对象的检测结果。由于利用了相邻两帧图像的对应像素点间的位置变化信息，因此利用了视频的时域信息，可以增加检测结果的准确性。In the embodiments of the present disclosure, the position change information of at least one pixel in the first image relative to the corresponding pixel in the previous frame of image is obtained, the first feature of the first image and the second feature of the above position change information are obtained respectively, the first feature is enhanced based on the second feature to generate a fusion feature, and finally the detection result of the target object in the first image is determined according to the fusion feature. Since the position change information between corresponding pixels of two adjacent frames of images is used, the temporal information of the video is utilized, which can increase the accuracy of the detection result.
而且，无人机视频等待检测视频中，目标对象的尺寸较小，即使人工观察，都难以避免发生错误，但是本实施例中的检测方法，由于利用了位置变化信息，而且生成融合特征时对第一特征进行了增强处理，因此增加了检测结果的准确性，即能够获取较为准确的检测结果。Moreover, in videos to be detected such as UAV videos, the size of the target object is small, and errors are difficult to avoid even with manual observation. The detection method in this embodiment, however, uses the position change information and enhances the first feature when generating the fusion feature, so the accuracy of the detection result is increased, that is, a relatively accurate detection result can be obtained.
本公开的一些实施例中，所述位置变化信息包括光流信息。其中，光流信息表示空间运动物体在观察成像平面上的像素运动的瞬时速度。因此在获取第一图像的光流信息时，可以采用LK算法（Lucas Kanade算法）获取，LK算法对视频有较大约束，例如亮度恒定、需要相邻帧时间很短以及需要相邻像素有相似的运动等约束，因此LK算法精度和效率都较低。为了更加高效且高精度地获取光流信息，也可以利用深度学习的方法获取，例如，采用FlowNet模型或FlowNet2模型获取光流信息。In some embodiments of the present disclosure, the position change information includes optical flow information. The optical flow information represents the instantaneous velocity of pixel motion of a spatially moving object on the observation imaging plane. Therefore, when acquiring the optical flow information of the first image, the LK algorithm (Lucas-Kanade algorithm) may be used. The LK algorithm imposes strong constraints on the video, such as constant brightness, a very short time between adjacent frames, and similar motion of adjacent pixels, so its accuracy and efficiency are relatively low. In order to obtain optical flow information more efficiently and with higher accuracy, a deep learning method may also be used, for example, a FlowNet model or a FlowNet2 model may be used to obtain the optical flow information.
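For illustration only, dense optical flow between the previous frame and the first image can be estimated with a classical method from OpenCV; the Farneback algorithm is shown below as a readily available stand-in (a FlowNet/FlowNet2 network would be used for the learned variant described above), and the file names are assumptions.

    import cv2

    # Assumed input files: the previous frame and the current (first) image.
    prev_gray = cv2.cvtColor(cv2.imread("frame_t-1.jpg"), cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(cv2.imread("frame_t.jpg"), cv2.COLOR_BGR2GRAY)

    # Dense optical flow: one 2-vector (dx, dy) per pixel, i.e. the per-pixel
    # position change information between the two adjacent frames.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    # flow has shape (H, W, 2); its magnitude approximates each pixel's
    # instantaneous speed on the imaging plane.
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])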
基于此，可以按照下述方式获取所述第一图像的第一特征以及所述位置变化信息的第二特征：获取所述第一图像中的图像特征作为所述第一特征，以及基于从所述光流信息中获取的光流特征作为所述第二特征。Based on this, the first feature of the first image and the second feature of the position change information may be acquired in the following manner: acquiring an image feature of the first image as the first feature, and using an optical flow feature obtained from the optical flow information as the second feature.
图像特征能够表征第一图像的像素点的至少一个维度的特征,光流特征能够表征第一图像的像素点的位置变化率。The image feature can represent the feature of at least one dimension of the pixel point of the first image, and the optical flow feature can represent the position change rate of the pixel point of the first image.
本公开的一些实施例中，可以按照下述方式基于所述第二特征对所述第一特征进行增强处理，生成融合特征：首先，根据所述第二特征确定所述第一图像的至少一个像素点的位置变化率；接下来，针对所述至少一个像素点中的每个像素点，根据所述像素点的位置变化率确定目标特征元素的增强参数，其中，所述目标特征元素为所述第一特征中与所述像素点对应的特征元素；最后，基于每个所述增强参数，对所述第一特征中对应的所述目标特征元素进行差别化增强处理，生成融合特征。In some embodiments of the present disclosure, the first feature may be enhanced based on the second feature in the following manner to generate the fusion feature: first, the position change rate of at least one pixel of the first image is determined according to the second feature; next, for each pixel of the at least one pixel, an enhancement parameter of a target feature element is determined according to the position change rate of the pixel, where the target feature element is the feature element in the first feature corresponding to the pixel; finally, based on each enhancement parameter, differential enhancement processing is performed on the corresponding target feature element in the first feature to generate the fusion feature.
其中，位置变化信息可以表示第一图像中各个对象在运动速度上的差异，且运动速度的差异会体现在位置变化信息的第二特征中，因此目标对象与其他对象在运动速度上的差异会体现在第二特征中，例如，目标对象为行人，则目标对象的运动速度大于其他对象，例如建筑物。The position change information can represent the difference in movement speed of each object in the first image, and this difference in movement speed is reflected in the second feature of the position change information; therefore, the difference in movement speed between the target object and other objects is reflected in the second feature. For example, if the target object is a pedestrian, the movement speed of the target object is greater than that of other objects, such as buildings.
在一个示例中，第一图像中的像素点被划分为不同的区域集合，每个区域集合构成一个对象，不同对象的运动速度不同，也就是不同对象包含的像素点的位置变化率不同。因此，通过第二特征能够确定出不同的像素点的位置变化率，且位置变化率不同的像素点代表的对象不同，因此可以根据像素点的位置变化率确定目标特征元素的增强参数，并进一步对目标特征元素进行增强，以得到融合特征的融合子特征，换言之，得到针对目标特征元素的融合子特征。由于不同对象所包含的像素点对应的特征元素的增强参数不同，因此对不同特征元素的增强程度不同，即从整体上呈现出对第一特征中特征元素进行差别化增强处理的现象，差别化增强处理后的第一特征形成融合特征，或者说全部的融合子特征则可构成融合特征。In one example, the pixels in the first image are divided into different sets of regions, each set of regions constituting an object; different objects move at different speeds, that is, the pixels contained in different objects have different position change rates. Therefore, the position change rates of different pixels can be determined through the second feature, and pixels with different position change rates represent different objects, so the enhancement parameter of the target feature element can be determined according to the position change rate of the pixel, and the target feature element can be further enhanced to obtain a fusion sub-feature of the fusion feature, in other words, a fusion sub-feature for the target feature element. Since the enhancement parameters of the feature elements corresponding to pixels contained in different objects are different, the degrees of enhancement of different feature elements are different; that is, on the whole, differential enhancement processing is performed on the feature elements of the first feature. The first feature after differential enhancement processing forms the fusion feature, or in other words, all the fusion sub-features together constitute the fusion feature.
其中，增强参数可以表示增强与否或增强程度，也就是说，针对目标对象的像素点和其他对象的像素点，可以通过增强与否或增强程度进行区分，以强化目标对象与其他对象体现在第一特征中的区别。例如，可以只增强目标对象的像素点对应的特征元素，或者还可以较高程度的增强目标对象的像素点对应的特征元素，较低程度的增强其他像素点对应的特征元素。进一步来说，目标对象的运动速度较之其他对象更大，相应的，目标对象中像素点的位置变化率较之其他对象中像素点的位置变化率也更大。因此可以只增强位置变化率较大的像素点对应的特征元素，或较大程度增强位置变化率较大的像素点对应的特征元素，较低程度增强其他像素点对应的特征元素。The enhancement parameter may indicate whether to enhance or the degree of enhancement; that is, the pixels of the target object and the pixels of other objects can be distinguished by whether they are enhanced or by the degree of enhancement, so as to strengthen the difference between the target object and other objects reflected in the first feature. For example, only the feature elements corresponding to the pixels of the target object may be enhanced, or the feature elements corresponding to the pixels of the target object may be enhanced to a higher degree while the feature elements corresponding to other pixels are enhanced to a lower degree. Furthermore, the movement speed of the target object is greater than that of other objects; correspondingly, the position change rate of the pixels in the target object is also greater than that of the pixels in other objects. Therefore, only the feature elements corresponding to pixels with a larger position change rate may be enhanced, or the feature elements corresponding to pixels with a larger position change rate may be enhanced to a greater degree while the feature elements corresponding to other pixels are enhanced to a lesser degree.
在一个示例中，可以根据所述像素点的位置变化率和预设的标准变化率，确定所述目标特征元素的增强参数。例如，标准变化率为一阈值，增强位置变化率大于该阈值的像素点对应的特征元素，不增强位置变化率小于或等于该阈值的像素点对应的特征元素。再例如，标准变化率可以作为一个参考值，根据像素点的位置变化率与该参考值的大小关系确定特征元素的增强程度：响应于所述像素点的位置变化率与所述标准变化率相等，确定所述目标特征元素的增强参数为预设的标准增强参数；或响应于所述像素点的位置变化率大于所述标准变化率，确定所述目标特征元素的增强参数大于所述标准增强参数；或响应于所述像素点的位置变化率小于所述标准变化率，确定所述目标特征元素的增强参数小于所述标准增强参数。In one example, the enhancement parameter of the target feature element may be determined according to the position change rate of the pixel and a preset standard change rate. For example, the standard change rate may be a threshold: the feature elements corresponding to pixels whose position change rate is greater than the threshold are enhanced, and the feature elements corresponding to pixels whose position change rate is less than or equal to the threshold are not enhanced. For another example, the standard change rate may serve as a reference value, and the degree of enhancement of the feature element is determined according to the relationship between the position change rate of the pixel and this reference value: in response to the position change rate of the pixel being equal to the standard change rate, the enhancement parameter of the target feature element is determined to be a preset standard enhancement parameter; or in response to the position change rate of the pixel being greater than the standard change rate, the enhancement parameter of the target feature element is determined to be greater than the standard enhancement parameter; or in response to the position change rate of the pixel being less than the standard change rate, the enhancement parameter of the target feature element is determined to be less than the standard enhancement parameter.
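A minimal numerical sketch of the two strategies described above (thresholding against a standard change rate, or scaling the enhancement relative to it) follows; the specific threshold value and scaling rule are assumptions made only for illustration.

    import torch

    first_feature = torch.randn(1, 512, 32, 32)
    # Per-pixel position change rate derived from the second feature, assumed here
    # to already be aligned with the feature-map resolution.
    change_rate = torch.rand(1, 1, 32, 32)
    standard_rate = 0.5                      # assumed preset standard change rate

    # Strategy 1: binary enhancement - only elements whose pixels move faster
    # than the standard rate are enhanced; the others are left unchanged.
    enhance_mask = (change_rate > standard_rate).float()
    fusion_binary = first_feature * (1.0 + enhance_mask)

    # Strategy 2: graded enhancement - the enhancement parameter grows with the
    # ratio of the pixel's change rate to the standard rate (an equal rate maps
    # to the assumed standard enhancement parameter of 1.0).
    enhance_param = change_rate / standard_rate
    fusion_graded = first_feature * enhance_param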
本公开的实施例中，通过位置变化信息的第二特征确定像素点的位置变化率，并根据像素点的位置变化率的不同，确定不同的像素点对应的特征元素的增强参数，进而对部分特征元素进行增强，或对全部特征元素进行不同程度的增强，从而进一步强化了目标对象与其他对象体现在第一特征中的差异，进而增加了目标对象检测结果的准确性和效率。In the embodiments of the present disclosure, the position change rate of each pixel is determined through the second feature of the position change information, and according to the different position change rates of the pixels, different enhancement parameters are determined for the feature elements corresponding to the pixels, so that some of the feature elements are enhanced, or all feature elements are enhanced to different degrees. This further strengthens the difference between the target object and other objects reflected in the first feature, and in turn increases the accuracy and efficiency of the detection result of the target object.
本公开的一些实施例中，可以按照下述方式根据融合特征确定所述第一图像中目标对象的检测结果：首先，根据所述融合特征生成目标对象的密度图；接下来，基于所述密度图中指代目标对象的密度点的数量（例如对密度点进行求和），确定所述第一图像中的目标对象的数量。In some embodiments of the present disclosure, the detection result of the target object in the first image may be determined according to the fusion feature in the following manner: first, a density map of the target object is generated according to the fusion feature; next, the number of target objects in the first image is determined based on the number of density points in the density map that refer to the target object (for example, by summing the density points).
其中，所述密度图用于指示所述第一图像中的目标对象的位置、数量、密度等信息，密度图中具有指代目标对象的密度点，密度图的尺寸可以和第一特征以及第二特征对应的特征图的尺寸相等。因此可以根据密度图中指代目标对象的密度点的数量确定目标对象的数量，即通过对密度点进行求和便可确定目标对象的数量。The density map is used to indicate information such as the position, quantity and density of the target object in the first image; the density map has density points that refer to the target object, and the size of the density map may be equal to the size of the feature maps corresponding to the first feature and the second feature. Therefore, the number of target objects can be determined according to the number of density points in the density map that refer to the target object, that is, the number of target objects can be determined by summing the density points.
其中，可以采用预先训练的神经网络确定密度图，例如采用诸如随机前沿方法（Stochastic Frontier Approach，SFA）的decoder模型确定密度图，这种模型可以使用多个特征图作为输入，从而提取不同尺度的特征，因此确定的密度图较为准确。本领域技术人员应当理解，以上生成密度图的具体方式仅为示意，本公开实施例对此不进行限制。A pre-trained neural network may be used to determine the density map, for example, a decoder model such as the Stochastic Frontier Approach (SFA) may be used. Such a model can use multiple feature maps as input so as to extract features at different scales, so the determined density map is relatively accurate. Those skilled in the art should understand that the above specific manner of generating the density map is only illustrative and is not limited by the embodiments of the present disclosure.
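As an illustrative sketch only, once a decoder has produced a density map from the fusion feature, the number of target objects in the first image can be estimated by summing the map; the decoder below is a placeholder, not the SFA decoder itself.

    import torch
    import torch.nn as nn

    fusion_feature = torch.randn(1, 512, 32, 32)

    # Placeholder decoder: upsampling convolutions ending in a single-channel,
    # non-negative density map (a real implementation could use an SFA-style
    # decoder fed with feature maps at several scales).
    decoder = nn.Sequential(
        nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 1, 1), nn.ReLU(),
    )

    density_map = decoder(fusion_feature)          # (1, 1, 64, 64)
    estimated_count = density_map.sum().item()     # summing density points gives the count
    print(f"estimated number of target objects: {estimated_count:.1f}")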
在一个示例中，待检测视频为图2所示出的第一图像所属的街景视频，目标对象为街景中的人物，可以基于上述目标检测方法确定出第一图像中的行人数量，也就是能够确定出第一图像对应的时间的行人数量。具体应用时，可以根据行人数量做出相应动作，例如当行人数量过多，超过预设的数量阈值时，可以发出警报信息进行报警，以提示行人和管理人员目前街道过于拥挤。In one example, the video to be detected is the street view video to which the first image shown in FIG. 2 belongs, and the target object is a person in the street view. The number of pedestrians in the first image, that is, the number of pedestrians at the time corresponding to the first image, can be determined based on the above target detection method. In a specific application, corresponding actions may be taken according to the number of pedestrians; for example, when the number of pedestrians is too large and exceeds a preset number threshold, alarm information may be issued to remind pedestrians and managers that the street is currently too crowded.
由于经济的发展，目前人群聚集越来越频繁，因此将人群计数作为检测结果，进而进行报警等操作，能够防止由于人群密集发生踩踏等危险事件。Due to economic development, crowds gather more and more frequently at present. Therefore, using the crowd count as the detection result and then performing operations such as alarming can prevent dangerous events such as stampedes caused by dense crowds.
本公开的实施例中,通过生成密度图,进而确定目标对象的数量,也就是以目标对象的数量作为检测结果,能够进一步提高检测结果的准确性和效率。In the embodiment of the present disclosure, by generating a density map and then determining the number of target objects, that is, taking the number of target objects as the detection result, the accuracy and efficiency of the detection result can be further improved.
本公开的一些实施例中，还可以按照下述方式生成待检测视频中的目标对象的数量变化信息：首先，获取第一图像中的目标对象的第一数量信息，获取第二图像中的目标对象的第二数量信息，其中，所述第一图像和所述第二图像分别为所述待检测视频中的一帧图像；接下来，获取第一图像的第一时间信息和第二图像的第二时间信息，其中，所述第一时间信息为所述第一图像在所述待检测视频中的时间，所述第二时间信息为所述第二图像在所述待检测视频中的时间（例如，第一时间信息可以早于或晚于第二时间信息）；最后，根据所述第一数量信息、第一时间信息、第二数量信息和第二时间信息，确定数量变化信息，其中，所述数量变化信息用于表示待检测视频中的目标对象在不同时刻的数量变化。In some embodiments of the present disclosure, quantity change information of the target object in the video to be detected may also be generated in the following manner: first, obtaining first quantity information of the target object in the first image and second quantity information of the target object in a second image, where the first image and the second image are each a frame of image in the video to be detected; next, obtaining first time information of the first image and second time information of the second image, where the first time information is the time of the first image in the video to be detected and the second time information is the time of the second image in the video to be detected (for example, the first time information may be earlier or later than the second time information); finally, determining the quantity change information according to the first quantity information, the first time information, the second quantity information and the second time information, where the quantity change information is used to indicate the change in the quantity of the target object in the video to be detected at different moments.
其中，第二图像的数量不做限制，可以是一个，也可以是多个，也就是说，可以获取一帧图像的目标对象的数量，也可以获取多帧图像的目标对象的数量。相对应的，后续获取的第二时间信息也可以是一个或多个，进而后续生成的数量变化信息可以是针对两个图像（第一图像和一个第二图像），也可以是针对多个图像（第一图像和至少两个第二图像）。The number of second images is not limited and may be one or more; that is, the number of target objects in one frame of image may be obtained, or the numbers of target objects in multiple frames of images may be obtained. Correspondingly, the subsequently obtained second time information may also be one or more, and the subsequently generated quantity change information may be for two images (the first image and one second image) or for multiple images (the first image and at least two second images).
其中，获取第二图像中目标对象的数量（即，第二数量信息）的方式可以与上述获取第一图像中目标对象的数量（即，第一数量信息）的方式相同，也可以与上述获取第一图像中目标对象的数量的方式不同，本实施例对此无意进行具体限制。The manner of obtaining the number of target objects in the second image (that is, the second quantity information) may be the same as or different from the above manner of obtaining the number of target objects in the first image (that is, the first quantity information), which is not specifically limited in this embodiment.
其中，待检测视频的时间，可以是相对时间，也就是相对于视频开始的时刻的时间，例如，视频的总时长为25min，则视频的起始时刻的时间为0:00，视频的结束时刻的时间为00:25；待检测视频的时间，还可以是绝对时间，也就是视频录制时的绝对时间，例如，视频的总时长仍为25min，视频从2020.11.13.8:00开始录制，则视频的起始时刻的时间为2020.11.13.8:00，视频的结束时刻的时间为2020.11.13.8:25。The time of the video to be detected may be a relative time, that is, the time relative to the moment at which the video starts; for example, if the total duration of the video is 25 minutes, the time of the start moment of the video is 00:00 and the time of the end moment of the video is 00:25. The time of the video to be detected may also be an absolute time, that is, the absolute time at which the video was recorded; for example, if the total duration of the video is still 25 minutes and the video was recorded starting from 8:00 on 2020.11.13, the time of the start moment of the video is 2020.11.13 8:00 and the time of the end moment of the video is 2020.11.13 8:25.
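A simple hedged sketch of assembling the quantity change information: given per-frame counts produced by the detection method above and each frame's (relative or absolute) time in the video, the change in the number of target objects over time can be tabulated. The count_objects helper below is a hypothetical stand-in for the density-map based counting described earlier.

    from datetime import datetime, timedelta

    def count_objects(frame_index):
        # Placeholder: a real implementation would run the detection pipeline
        # described above on the given frame and sum its density map.
        return 0.0

    def quantity_change_info(frame_indices, fps=25.0, start_time=None):
        # Return a list of (time, count) pairs for the requested frames.
        records = []
        for idx in frame_indices:
            seconds = idx / fps
            if start_time is None:                   # relative time, e.g. 0:00 .. 0:25
                time_info = timedelta(seconds=seconds)
            else:                                    # absolute recording time
                time_info = start_time + timedelta(seconds=seconds)
            records.append((time_info, count_objects(idx)))
        return records

    # Example: a first image at frame 0 and second images at frames 750 and 1500
    # of a video assumed to have been recorded from 2020-11-13 08:00.
    info = quantity_change_info([0, 750, 1500],
                                start_time=datetime(2020, 11, 13, 8, 0))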
在一个示例中，待检测视频为图2所示出的第一图像所属的街景视频，目标对象为街景中的人物，因此可以确定出第一图像和至少一个第二图像中的行人数量，也就是能够确定出街景视频中的行人数量的变化。In one example, the video to be detected is the street view video to which the first image shown in FIG. 2 belongs, and the target object is a person in the street view; therefore, the numbers of pedestrians in the first image and in at least one second image can be determined, that is, the change in the number of pedestrians in the street view video can be determined.
本公开的实施例中，通过获取待检测视频中的其他帧的图像中目标对象的数量，进一步结合每帧图像的时间信息生成待检测视频的数量变化信息，因此可以在待检测视频对应的时间段内，获得目标对象的数量变化及趋势，从而进一步增加检测结果的全面性。In the embodiments of the present disclosure, by obtaining the numbers of target objects in images of other frames of the video to be detected, and further combining the time information of each frame of image to generate the quantity change information of the video to be detected, the change and trend in the number of target objects can be obtained within the time period corresponding to the video to be detected, thereby further increasing the comprehensiveness of the detection result.
例如，针对一个商业街区，可以获取一年中12个月的人流数量变化趋势，从而可以分析人们的消费习惯，进而得出消费的高峰月份、季度（即消费旺季），和消费的低谷月份、季度（即消费淡季）；同理，针对该商业街区，还可以获取每天营业的时间内的人流数量变化趋势，从而得出每天消费的高峰时间和低谷时间。上述得出的这些信息可以作为商业经营或物业管理的指导数据，从而能够达到科学管理的目的。For example, for a commercial block, the trend of the change in the number of visitors over the 12 months of a year can be obtained, so that people's consumption habits can be analyzed and the peak months and quarters of consumption (that is, the peak consumption season) and the trough months and quarters of consumption (that is, the off-season) can be derived; similarly, for the commercial block, the trend of the change in the number of visitors during business hours each day can also be obtained, so as to derive the peak and trough times of daily consumption. The information obtained above can serve as guidance data for business operation or property management, so as to achieve the purpose of scientific management.
再例如,针对高速公路,可以获取节假日前后的车流量变化趋势,从而可以统计出行数据,进而作为高速管理的指导数据。For another example, for expressways, the change trend of traffic flow before and after holidays can be obtained, so that travel data can be counted, which can then be used as guidance data for expressway management.
本公开的一些实施例中，还可以按照下述方式根据融合特征确定所述第一图像中目标对象的检测结果，包括：首先，根据所述融合特征生成目标对象的密度图；接下来，根据所述密度图中指示的每个目标对象的位置以及所述第一图像中的预设区域，确定所述第一图像中的预设区域内的目标对象的数量。In some embodiments of the present disclosure, the detection result of the target object in the first image may also be determined according to the fusion feature in the following manner: first, a density map of the target object is generated according to the fusion feature; next, the number of target objects in a preset area of the first image is determined according to the position of each target object indicated in the density map and the preset area in the first image.
其中，所述密度图用于指示所述第一图像中的目标对象的位置、数量、密度等信息，密度图的尺寸可以和第一特征以及第二特征对应的特征图的尺寸相等。例如，密度图中可以具有第一图像中的目标对象，且为每个目标对象标注位置和/或计数标志等标注信息。因此可以根据密度图中目标对象的位置确定目标对象的数量，即通过对密度图中的目标对象进行求和便可确定目标对象的数量。The density map is used to indicate information such as the position, quantity and density of the target object in the first image, and the size of the density map may be equal to the size of the feature maps corresponding to the first feature and the second feature. For example, the density map may contain the target objects in the first image, and each target object may be annotated with annotation information such as a position and/or a counting mark. Therefore, the number of target objects can be determined according to the positions of the target objects in the density map, that is, the number of target objects can be determined by summing the target objects in the density map.
其中，可以采用预先训练的神经网络确定密度图，例如采用诸如随机前沿方法（Stochastic Frontier Approach，SFA）的decoder模型确定密度图，这种模型可以使用多个特征图作为输入，从而提取不同尺度的特征，因此确定的密度图较为准确。本领域技术人员应当理解，以上生成密度图的具体方式仅为示意，本公开实施例对此不进行限制。A pre-trained neural network may be used to determine the density map, for example, a decoder model such as the Stochastic Frontier Approach (SFA) may be used. Such a model can use multiple feature maps as input so as to extract features at different scales, so the determined density map is relatively accurate. Those skilled in the art should understand that the above specific manner of generating the density map is only illustrative and is not limited by the embodiments of the present disclosure.
其中，预设区域可以是控制人流量的区域，例如某些限流场所，只允许一定数量的人进入，再例如，施工区域等某些危险区域，禁止行人进入，即人流量需要控制为0。The preset area may be an area in which the flow of people is controlled, for example, certain flow-restricted places where only a certain number of people are allowed to enter, or certain dangerous areas such as construction areas where pedestrians are prohibited from entering, that is, the flow of people needs to be controlled to 0.
在确定预设区域内的目标对象的数量后，可以响应于所述预设区域内的目标对象的数量大于预设的数量阈值，生成提示信息。例如，限流场所的人流量超过了要求的最高人流量，进行报警，以禁止行人继续进入；再例如，施工区域进入行人后，进行报警，并提示行人及时离开；再例如，在一些户外的真人游戏中，可以对游戏人员的活动区域进行监视，若进入犯规区域，则进行报警；再例如，在足球、篮球等运动项目中，可以对运动员的活动区域进行监视，若进入犯规区域，则进行报警。After the number of target objects in the preset area is determined, prompt information may be generated in response to the number of target objects in the preset area being greater than a preset number threshold. For example, when the flow of people in a flow-restricted place exceeds the required maximum, an alarm is issued to prevent pedestrians from continuing to enter; for another example, when a pedestrian enters a construction area, an alarm is issued and the pedestrian is prompted to leave in time; for another example, in some outdoor live-action games, the activity area of the players may be monitored, and an alarm is issued if a player enters a foul area; for another example, in sports such as football and basketball, the activity area of the athletes may be monitored, and an alarm is issued if an athlete enters a foul area.
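The following is an illustrative sketch (not the disclosed implementation) of counting target objects inside a preset area of the density map and generating prompt information when a preset number threshold is exceeded; the region coordinates and the threshold value are assumptions.

    import numpy as np

    density_map = np.random.rand(64, 64).astype(np.float32)   # stand-in density map

    # Assumed preset area given as (top, bottom, left, right) in density-map
    # coordinates, and an assumed preset number threshold for that area.
    preset_area = (10, 40, 20, 60)
    number_threshold = 5.0

    top, bottom, left, right = preset_area
    count_in_area = float(density_map[top:bottom, left:right].sum())

    if count_in_area > number_threshold:
        # Prompt information: in practice this could trigger an audible alarm,
        # a push notification to managers, etc.
        print(f"ALERT: {count_in_area:.1f} target objects in the preset area "
              f"(threshold {number_threshold}).")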
本公开的实施例中，将预设区域的目标对象的数量作为检测结果，能够实现对特定区域的人流检测和人流控制，增加了检测的针对性和准确性，从而使该检测方法的应用范围更加广泛。In the embodiments of the present disclosure, using the number of target objects in the preset area as the detection result can realize people-flow detection and people-flow control for a specific area, which increases the pertinence and accuracy of the detection, so that the detection method has a wider range of applications.
请参照附图4，其示出了根据本公开一个实施例的目标检测的过程。其中，位置变化信息为光流信息，目标检测结果为密度图。该过程为：首先进行光流预测，接下来分别进行光流特征提取和图像特征提取，然后将光流特征与图像特征进行特征融合以获得融合特征，最后利用融合特征进行密度图预测。在一个实施例中，首先进行光流预测，即利用光流提取网络从第一图像和第一图像的前一帧图像中提取光流信息；接下来从提取的光流信息中，利用神经网络提取光流特征，以及从第一图像中利用神经网络（例如VGG16_bn）提取图像特征，然后，将光流特征作为掩膜与图像特征相乘，以获得融合特征；最后把融合特征送入到decoder（例如，SFA）来预测密度图。Please refer to FIG. 4, which shows a target detection process according to an embodiment of the present disclosure, in which the position change information is optical flow information and the target detection result is a density map. The process is as follows: first, optical flow prediction is performed; next, optical flow feature extraction and image feature extraction are performed respectively; then, the optical flow feature and the image feature are fused to obtain a fusion feature; finally, density map prediction is performed using the fusion feature. In one embodiment, optical flow prediction is performed first, that is, an optical flow extraction network is used to extract optical flow information from the first image and the previous frame of the first image; next, an optical flow feature is extracted from the extracted optical flow information using a neural network, and an image feature is extracted from the first image using a neural network (for example, VGG16_bn); then, the optical flow feature is used as a mask and multiplied with the image feature to obtain the fusion feature; finally, the fusion feature is fed into a decoder (for example, SFA) to predict the density map.
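Purely as a schematic summary of FIG. 4, the pipeline can be sketched as below, with each stage represented by a placeholder module; flow_net, image_net, flow_feature_net and decoder stand for the optical flow extraction network, the VGG16_bn-style image encoder, the flow feature encoder and the SFA-style decoder respectively, and are assumptions rather than concrete implementations.

    import torch
    import torch.nn as nn

    class TargetDetector(nn.Module):
        # Illustrative end-to-end sketch of the FIG. 4 pipeline.

        def __init__(self, flow_net, image_net, flow_feature_net, decoder):
            super().__init__()
            self.flow_net = flow_net                  # optical flow prediction
            self.image_net = image_net                # image feature extraction (e.g. VGG16_bn)
            self.flow_feature_net = flow_feature_net  # optical flow feature extraction
            self.decoder = decoder                    # density map prediction (e.g. SFA)

        def forward(self, prev_frame, first_image):
            flow = self.flow_net(prev_frame, first_image)      # optical flow prediction
            first_feature = self.image_net(first_image)        # first feature
            second_feature = self.flow_feature_net(flow)       # second feature
            mask = torch.sigmoid(second_feature)               # assumed normalisation
            fusion_feature = first_feature * mask              # mask-style fusion
            density_map = self.decoder(fusion_feature)         # density map
            count = density_map.sum(dim=(1, 2, 3))             # number of target objects
            return density_map, count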
根据本公开实施例的第二方面，提供一种目标检测装置，请参照附图5，其示出了该装置的结构示意图，包括：第一获取模块501，用于获取第一图像中的至少一个像素点相对前一帧图像中的对应像素点的位置变化信息；第二获取模块502，用于获取所述第一图像的图像特征作为第一特征以及基于所述位置变化信息获取第二特征；融合模块503，用于基于所述第二特征对所述第一特征进行增强处理，生成融合特征；检测模块504，用于根据所述融合特征确定所述第一图像中目标对象的检测结果。According to a second aspect of the embodiments of the present disclosure, a target detection apparatus is provided. Please refer to FIG. 5, which shows a schematic structural diagram of the apparatus, including: a first acquisition module 501, configured to acquire position change information of at least one pixel in a first image relative to the corresponding pixel in the previous frame of image; a second acquisition module 502, configured to acquire an image feature of the first image as a first feature and acquire a second feature based on the position change information; a fusion module 503, configured to perform enhancement processing on the first feature based on the second feature to generate a fusion feature; and a detection module 504, configured to determine a detection result of the target object in the first image according to the fusion feature.
在一个实施例中,所述位置变化信息包括光流信息,所述第二获取模块用于:将从所述光流信息中获取的光流特征作为所述第二特征。In one embodiment, the position change information includes optical flow information, and the second obtaining module is configured to: use the optical flow feature obtained from the optical flow information as the second feature.
在一个实施例中，所述融合模块用于：根据所述第二特征确定所述第一图像的至少一个像素点的位置变化率；针对所述至少一个像素点中的每个像素点，根据所述像素点的位置变化率确定目标特征元素的增强参数，其中，所述目标特征元素为所述第一特征中与所述像素点对应的特征元素；基于每个所述增强参数，对所述第一特征中对应的所述目标特征元素进行差别化增强处理，生成所述融合特征。In one embodiment, the fusion module is configured to: determine the position change rate of at least one pixel of the first image according to the second feature; for each pixel of the at least one pixel, determine an enhancement parameter of a target feature element according to the position change rate of the pixel, where the target feature element is the feature element in the first feature corresponding to the pixel; and based on each enhancement parameter, perform differential enhancement processing on the corresponding target feature element in the first feature to generate the fusion feature.
在一个实施例中,所述融合模块还用于:根据所述像素点的位置变化率和预设的标准变化率,确定所述目标特征元素的增强参数。In one embodiment, the fusion module is further configured to: determine the enhancement parameter of the target feature element according to the position change rate of the pixel point and a preset standard change rate.
在一个实施例中，所述融合模块还用于：响应于所述像素点的位置变化率与所述标准变化率相等，确定所述目标特征元素的增强参数为预设的标准增强参数；或响应于所述像素点的位置变化率大于所述标准变化率，确定所述目标特征元素的增强参数大于所述标准增强参数；或响应于所述像素点的位置变化率小于所述标准变化率，确定所述目标特征元素的增强参数小于所述标准增强参数。In one embodiment, the fusion module is further configured to: in response to the position change rate of the pixel being equal to the standard change rate, determine that the enhancement parameter of the target feature element is a preset standard enhancement parameter; or in response to the position change rate of the pixel being greater than the standard change rate, determine that the enhancement parameter of the target feature element is greater than the standard enhancement parameter; or in response to the position change rate of the pixel being less than the standard change rate, determine that the enhancement parameter of the target feature element is less than the standard enhancement parameter.
在一个实施例中，所述检测模块用于：根据所述融合特征生成所述目标对象的密度图；基于所述密度图中指代所述目标对象的密度点的数量，确定所述第一图像中的所述目标对象的第一数量信息。In one embodiment, the detection module is configured to: generate a density map of the target object according to the fusion feature; and determine first quantity information of the target object in the first image based on the number of density points in the density map that refer to the target object.
在一个实施例中，所述检测模块还用于：获取第二图像中的所述目标对象的第二数量信息，其中，所述第二图像为所述待检测视频中的一帧图像；获取第一时间信息和第二时间信息，其中，所述第一时间信息为所述第一图像在所述待检测视频中的时间，所述第二时间信息为所述第二图像在所述待检测视频中的时间；根据所述第一数量信息、所述第一时间信息、所述第二数量信息和所述第二时间信息，生成数量变化信息，其中，所述数量变化信息用于表示待检测视频中的目标对象在不同时刻的数量变化。In one embodiment, the detection module is further configured to: acquire second quantity information of the target object in a second image, where the second image is a frame of image in the video to be detected; acquire first time information and second time information, where the first time information is the time of the first image in the video to be detected and the second time information is the time of the second image in the video to be detected; and generate quantity change information according to the first quantity information, the first time information, the second quantity information and the second time information, where the quantity change information is used to indicate the change in the quantity of the target object in the video to be detected at different moments.
在一个实施例中，所述检测模块用于：根据所述融合特征生成所述目标对象的密度图；根据所述密度图中指示的每个所述目标对象的位置，确定所述第一图像中的预设区域内的所述目标对象的数量。In one embodiment, the detection module is configured to: generate a density map of the target object according to the fusion feature; and determine the number of target objects in a preset area of the first image according to the position of each target object indicated in the density map.
在一个实施例中,所述检测模块还用于:响应于所述预设区域内的目标对象的数 量大于预设的数量阈值,生成提示信息。In one embodiment, the detection module is further configured to generate prompt information in response to the number of target objects in the preset area being greater than a preset number threshold.
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在第一方面有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。Regarding the apparatus in the above-mentioned embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment of the method related to the first aspect, and will not be described in detail here.
本公开实施例的第三方面提供了一种电子设备，请参照附图6，其示出了该设备的结构，所述设备包括存储器、处理器，所述存储器用于存储可在处理器上运行的计算机指令，所述处理器用于在执行所述计算机指令时基于第一方面所述的方法对目标进行检测。A third aspect of the embodiments of the present disclosure provides an electronic device. Please refer to FIG. 6, which shows the structure of the device. The device includes a memory and a processor, the memory is configured to store computer instructions executable on the processor, and the processor is configured to detect a target based on the method described in the first aspect when executing the computer instructions.
本公开实施例的第四方面提供了一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现第一方面所述的方法。A fourth aspect of the embodiments of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method described in the first aspect.
在本公开中,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性。术语“多个”指两个或两个以上,除非另有明确的限定。In the present disclosure, the terms "first" and "second" are used for descriptive purposes only, and should not be construed as indicating or implying relative importance. The term "plurality" refers to two or more, unless expressly limited otherwise.
本领域技术人员在考虑说明书及实践这里公开的公开后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common general knowledge or techniques in the technical field not disclosed by this disclosure . The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限制。It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (21)

  1. 一种目标检测方法,包括:A target detection method, comprising:
    获取第一图像中的至少一个像素点相对前一帧图像中的对应像素点的位置变化信息,所述第一图像为待检测视频中的一帧图像;Acquiring position change information of at least one pixel in the first image relative to the corresponding pixel in the previous frame of image, where the first image is a frame of image in the video to be detected;
    获取所述第一图像的图像特征作为第一特征;acquiring the image feature of the first image as the first feature;
    基于所述位置变化信息获取第二特征;obtaining a second feature based on the position change information;
    基于所述第二特征对所述第一特征进行增强处理，生成融合特征；以及performing enhancement processing on the first feature based on the second feature to generate a fusion feature; and
    根据所述融合特征确定所述第一图像中目标对象的检测结果。determining a detection result of the target object in the first image according to the fusion feature.
  2. 根据权利要求1所述的目标检测方法,其中,所述位置变化信息包括光流信息,基于所述位置变化信息获取第二特征,包括:The target detection method according to claim 1, wherein the position change information includes optical flow information, and obtaining the second feature based on the position change information includes:
    将从所述光流信息中获取的光流特征作为所述第二特征。The optical flow feature obtained from the optical flow information is used as the second feature.
  3. 根据权利要求1或2所述的目标检测方法,其中,基于所述第二特征对所述第一特征进行增强处理,生成所述融合特征,包括:The target detection method according to claim 1 or 2, wherein the first feature is enhanced based on the second feature to generate the fusion feature, comprising:
    根据所述第二特征确定所述第一图像的至少一个像素点的位置变化率;determining a position change rate of at least one pixel of the first image according to the second feature;
    针对所述至少一个像素点中的每个像素点，根据所述像素点的位置变化率确定目标特征元素的增强参数，其中，所述目标特征元素为所述第一特征中与所述像素点对应的特征元素；for each pixel of the at least one pixel, determining an enhancement parameter of a target feature element according to the position change rate of the pixel, wherein the target feature element is a feature element in the first feature corresponding to the pixel;
    基于每个所述增强参数,对所述第一特征中对应的所述目标特征元素进行差别化增强处理,生成所述融合特征。Based on each of the enhancement parameters, differential enhancement processing is performed on the corresponding target feature elements in the first feature to generate the fusion feature.
  4. 根据权利要求3所述的目标检测方法,其中,根据所述像素点的位置变化率确定目标特征元素的增强参数,包括:The target detection method according to claim 3, wherein determining the enhancement parameter of the target feature element according to the position change rate of the pixel, comprising:
    根据所述像素点的位置变化率和预设的标准变化率,确定所述目标特征元素的增强参数。The enhancement parameter of the target feature element is determined according to the position change rate of the pixel point and the preset standard change rate.
  5. 根据权利要求4所述的目标检测方法,其中,根据所述像素点的位置变化率和预设的标准变化率,确定所述目标特征元素的增强参数,包括:The target detection method according to claim 4, wherein determining the enhancement parameter of the target feature element according to the position change rate of the pixel point and a preset standard change rate, comprising:
    响应于所述像素点的位置变化率与所述标准变化率相等,确定所述目标特征元素的增强参数为预设的标准增强参数;或In response to the position change rate of the pixel being equal to the standard change rate, determining that the enhancement parameter of the target feature element is a preset standard enhancement parameter; or
    响应于所述像素点的位置变化率大于所述标准变化率,确定所述目标特征元素的增强参数大于所述标准增强参数;或In response to the position change rate of the pixel point being greater than the standard change rate, determining that the enhancement parameter of the target feature element is greater than the standard enhancement parameter; or
    响应于所述像素点的位置变化率小于所述标准变化率,确定所述目标特征元素的增强参数小于所述标准增强参数。In response to the position change rate of the pixel point being smaller than the standard change rate, it is determined that the enhancement parameter of the target feature element is smaller than the standard enhancement parameter.
  6. 根据权利要求1至5任意一项所述的目标检测方法,其中,根据所述融合特征确定所述第一图像中目标对象的检测结果,包括:The target detection method according to any one of claims 1 to 5, wherein determining the detection result of the target object in the first image according to the fusion feature comprises:
    根据所述融合特征生成所述目标对象的密度图;generating a density map of the target object according to the fusion feature;
    基于所述密度图中指代所述目标对象的密度点的数量,确定所述第一图像中的所述目标对象的第一数量信息。First quantity information of the target object in the first image is determined based on the number of density points that refer to the target object in the density map.
  7. 根据权利要求6所述的目标检测方法,还包括:The target detection method according to claim 6, further comprising:
    获取第二图像中的所述目标对象的第二数量信息,其中,所述第二图像为所述待检测视频中的一帧图像;acquiring second quantity information of the target object in a second image, wherein the second image is a frame of image in the video to be detected;
    获取第一时间信息和第二时间信息，其中，所述第一时间信息为所述第一图像在所述待检测视频中的时间，所述第二时间信息为所述第二图像在所述待检测视频中的时间；acquiring first time information and second time information, wherein the first time information is the time of the first image in the video to be detected, and the second time information is the time of the second image in the video to be detected;
    根据所述第一数量信息、所述第一时间信息、所述第二数量信息和所述第二时间信息，生成数量变化信息，其中，所述数量变化信息用于表示所述待检测视频中的所述目标对象在不同时刻的数量变化。generating quantity change information according to the first quantity information, the first time information, the second quantity information and the second time information, wherein the quantity change information is used to indicate a change in the quantity of the target object in the video to be detected at different moments.
  8. 根据权利要求1至5任意一项所述的目标检测方法,其中,根据所述融合特征确定所述第一图像中目标对象的检测结果,包括:The target detection method according to any one of claims 1 to 5, wherein determining the detection result of the target object in the first image according to the fusion feature comprises:
    根据所述融合特征生成所述目标对象的密度图;generating a density map of the target object according to the fusion feature;
    根据所述密度图中指示的每个所述目标对象的位置,确定所述第一图像中的预设区域内的所述目标对象的数量。According to the position of each of the target objects indicated in the density map, the number of the target objects in the preset area in the first image is determined.
  9. 根据权利要求8所述的目标检测方法,还包括:The target detection method according to claim 8, further comprising:
    响应于所述预设区域内的所述目标对象的数量大于预设的数量阈值,生成提示信息。In response to the number of the target objects in the preset area being greater than a preset number threshold, prompt information is generated.
  10. 一种目标检测装置,包括:A target detection device, comprising:
    第一获取模块,用于获取第一图像中的至少一个像素点相对前一帧图像中的对应像素点的位置变化信息,所述第一图像为待检测视频中的一帧图像;a first acquisition module, configured to acquire the position change information of at least one pixel in the first image relative to the corresponding pixel in the previous frame of image, where the first image is a frame of image in the video to be detected;
    第二获取模块,用于获取所述第一图像的图像特征作为第一特征以及基于所述位置变化信息获取第二特征;a second acquisition module, configured to acquire an image feature of the first image as a first feature and acquire a second feature based on the position change information;
    融合模块,用于基于所述第二特征对所述第一特征进行增强处理,生成融合特征;a fusion module, configured to perform enhancement processing on the first feature based on the second feature to generate a fusion feature;
    检测模块,用于根据所述融合特征确定所述第一图像中目标对象的检测结果。A detection module, configured to determine the detection result of the target object in the first image according to the fusion feature.
  11. 一种电子设备,所述设备包括存储器、处理器,所述存储器用于存储可在处理器上运行的计算机指令,所述处理器用于在执行所述计算机指令时实现:An electronic device, the device comprising a memory and a processor, the memory for storing computer instructions that can be executed on the processor, the processor for implementing when executing the computer instructions:
    获取第一图像中的至少一个像素点相对前一帧图像中的对应像素点的位置变化信息,所述第一图像为待检测视频中的一帧图像;Acquiring position change information of at least one pixel in the first image relative to the corresponding pixel in the previous frame of image, where the first image is a frame of image in the video to be detected;
    获取所述第一图像的图像特征作为第一特征;acquiring the image feature of the first image as the first feature;
    基于所述位置变化信息获取第二特征;obtaining a second feature based on the position change information;
    基于所述第二特征对所述第一特征进行增强处理，生成融合特征；performing enhancement processing on the first feature based on the second feature to generate a fusion feature;
    根据所述融合特征确定所述第一图像中目标对象的检测结果。determining a detection result of the target object in the first image according to the fusion feature.
  12. 根据权利要求11所述的电子设备,其中,所述位置变化信息包括光流信息,基于所述位置变化信息获取第二特征时,所述处理器在执行所述计算机指令时实现:The electronic device according to claim 11, wherein the position change information includes optical flow information, and when acquiring the second feature based on the position change information, the processor implements when executing the computer instructions:
    将从所述光流信息中获取的光流特征作为所述第二特征。The optical flow feature obtained from the optical flow information is used as the second feature.
  13. 根据权利要求11或12所述的电子设备,其中,基于所述第二特征对所述第一特征进行增强处理,生成所述融合特征时,所述处理器在执行所述计算机指令时实现:The electronic device according to claim 11 or 12, wherein, when the first feature is enhanced based on the second feature, and the fusion feature is generated, the processor implements when executing the computer instructions:
    针对所述至少一个像素点中的每个像素点，根据所述像素点的位置变化率确定目标特征元素的增强参数，其中，所述目标特征元素为所述第一特征中与所述像素点对应的特征元素；for each pixel of the at least one pixel, determining an enhancement parameter of a target feature element according to the position change rate of the pixel, wherein the target feature element is a feature element in the first feature corresponding to the pixel;
    基于每个所述增强参数,对所述第一特征中对应的所述目标特征元素进行差别化增强处理,生成所述融合特征。Based on each of the enhancement parameters, differential enhancement processing is performed on the corresponding target feature elements in the first feature to generate the fusion feature.
  14. 根据权利要求13所述的电子设备,其中,根据所述像素点的位置变化率确定目标特征元素的增强参数时,所述处理器在执行所述计算机指令时实现:The electronic device according to claim 13, wherein, when determining the enhancement parameter of the target feature element according to the position change rate of the pixel point, the processor implements when executing the computer instruction:
    根据所述像素点的位置变化率和预设的标准变化率,确定所述目标特征元素的增强参数。The enhancement parameter of the target feature element is determined according to the position change rate of the pixel point and the preset standard change rate.
  15. 根据权利要求14所述的电子设备，其中，根据所述像素点的位置变化率和预设的标准变化率，确定所述目标特征元素的增强参数时，所述处理器在执行所述计算机指令时实现：The electronic device according to claim 14, wherein, when determining the enhancement parameter of the target feature element according to the position change rate of the pixel and the preset standard change rate, the processor implements, when executing the computer instructions:
    响应于所述像素点的位置变化率与所述标准变化率相等,确定所述目标特征元素的增强参数为预设的标准增强参数;或In response to the position change rate of the pixel being equal to the standard change rate, determining that the enhancement parameter of the target feature element is a preset standard enhancement parameter; or
    响应于所述像素点的位置变化率大于所述标准变化率,确定所述目标特征元素的增强参数大于所述标准增强参数;或In response to the position change rate of the pixel point being greater than the standard change rate, determining that the enhancement parameter of the target feature element is greater than the standard enhancement parameter; or
    响应于所述像素点的位置变化率小于所述标准变化率,确定所述目标特征元素的增强参数小于所述标准增强参数。In response to the position change rate of the pixel point being smaller than the standard change rate, it is determined that the enhancement parameter of the target feature element is smaller than the standard enhancement parameter.
  16. 根据权利要求11至15任意一项所述的电子设备,其中,根据所述融合特征确定所述第一图像中目标对象的检测结果时,所述处理器在执行所述计算机指令时实现:The electronic device according to any one of claims 11 to 15, wherein, when the detection result of the target object in the first image is determined according to the fusion feature, the processor implements when executing the computer instructions:
    根据所述融合特征生成所述目标对象的密度图;generating a density map of the target object according to the fusion feature;
    基于所述密度图中指代所述目标对象的密度点的数量,确定所述第一图像中的所述 目标对象的第一数量信息。First quantity information of the target object in the first image is determined based on the number of density points that refer to the target object in the density map.
  17. 根据权利要求16所述的电子设备,所述处理器在执行所述计算机指令时还实现:The electronic device of claim 16, the processor, when executing the computer instructions, further implements:
    获取第二图像中的所述目标对象的第二数量信息,其中,所述第二图像为所述待检测视频中的一帧图像;acquiring second quantity information of the target object in a second image, wherein the second image is a frame of image in the video to be detected;
    获取第一时间信息和第二时间信息，其中，所述第一时间信息为所述第一图像在所述待检测视频中的时间，所述第二时间信息为所述第二图像在所述待检测视频中的时间；acquiring first time information and second time information, wherein the first time information is the time of the first image in the video to be detected, and the second time information is the time of the second image in the video to be detected;
    根据所述第一数量信息、所述第一时间信息、所述第二数量信息和所述第二时间信息，生成数量变化信息，其中，所述数量变化信息用于表示所述待检测视频中的所述目标对象在不同时刻的数量变化。generating quantity change information according to the first quantity information, the first time information, the second quantity information and the second time information, wherein the quantity change information is used to indicate a change in the quantity of the target object in the video to be detected at different moments.
  18. 根据权利要求11至15任意一项所述的电子设备,其中,根据所述融合特征确定所述第一图像中目标对象的检测结果时,所述处理器在执行所述计算机指令时实现:The electronic device according to any one of claims 11 to 15, wherein, when the detection result of the target object in the first image is determined according to the fusion feature, the processor implements when executing the computer instructions:
    根据所述融合特征生成所述目标对象的密度图;generating a density map of the target object according to the fusion feature;
    根据所述密度图中指示的每个所述目标对象的位置,确定所述第一图像中的预设区域内的所述目标对象的数量。According to the position of each of the target objects indicated in the density map, the number of the target objects in the preset area in the first image is determined.
  19. 根据权利要求18所述的电子设备,所述处理器在执行所述计算机指令时还实现:19. The electronic device of claim 18, the processor, when executing the computer instructions, further implements:
    响应于所述预设区域内的所述目标对象的数量大于预设的数量阈值,生成提示信息。In response to the number of the target objects in the preset area being greater than a preset number threshold, prompt information is generated.
  20. 一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现权利要求1至9任一所述的方法。A computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method according to any one of claims 1 to 9.
  21. 一种计算机程序,所述计算机程序存储在计算机可读介质上,其中当所述计算机程序被处理器执行时实现权利要求1至9任一所述的方法。A computer program stored on a computer-readable medium, wherein the method of any one of claims 1 to 9 is implemented when the computer program is executed by a processor.
PCT/CN2021/102202 2021-03-31 2021-06-24 Target detection method and apparatus, device and storage medium WO2022205632A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110352206.0A CN113011371A (en) 2021-03-31 2021-03-31 Target detection method, device, equipment and storage medium
CN202110352206.0 2021-03-31

Publications (1)

Publication Number Publication Date
WO2022205632A1 true WO2022205632A1 (en) 2022-10-06

Family

ID=76387771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/102202 WO2022205632A1 (en) 2021-03-31 2021-06-24 Target detection method and apparatus, device and storage medium

Country Status (3)

Country Link
CN (1) CN113011371A (en)
TW (1) TW202240471A (en)
WO (1) WO2022205632A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011371A (en) * 2021-03-31 2021-06-22 北京市商汤科技开发有限公司 Target detection method, device, equipment and storage medium
CN113901909B (en) * 2021-09-30 2023-10-27 北京百度网讯科技有限公司 Video-based target detection method and device, electronic equipment and storage medium
CN114528923B (en) * 2022-01-25 2023-09-26 山东浪潮科学研究院有限公司 Video target detection method, device, equipment and medium based on time domain context

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709447A (en) * 2016-12-21 2017-05-24 华南理工大学 Abnormal behavior detection method in video based on target positioning and characteristic fusion
US20190120955A1 (en) * 2017-10-20 2019-04-25 Texas Instruments Incorporated System and method for camera radar fusion
CN110874853A (en) * 2019-11-15 2020-03-10 上海思岚科技有限公司 Method, device and equipment for determining target motion and storage medium
CN111695627A (en) * 2020-06-11 2020-09-22 腾讯科技(深圳)有限公司 Road condition detection method and device, electronic equipment and readable storage medium
CN111784735A (en) * 2020-04-15 2020-10-16 北京京东尚科信息技术有限公司 Target tracking method, device and computer readable storage medium
CN112580545A (en) * 2020-12-24 2021-03-30 山东师范大学 Crowd counting method and system based on multi-scale self-adaptive context network
CN113011371A (en) * 2021-03-31 2021-06-22 北京市商汤科技开发有限公司 Target detection method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229336B (en) * 2017-12-13 2021-06-04 北京市商汤科技开发有限公司 Video recognition and training method and apparatus, electronic device, program, and medium
CN110176027B (en) * 2019-05-27 2023-03-14 腾讯科技(深圳)有限公司 Video target tracking method, device, equipment and storage medium
CN111428551B (en) * 2019-12-30 2023-06-16 杭州海康威视数字技术股份有限公司 Density detection method, density detection model training method and device

Also Published As

Publication number Publication date
TW202240471A (en) 2022-10-16
CN113011371A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
WO2022205632A1 (en) Target detection method and apparatus, device and storage medium
KR102189262B1 (en) Apparatus and method for collecting traffic information using edge computing
JP7036863B2 (en) Systems and methods for activity monitoring using video data
Tu et al. Automatic behaviour analysis system for honeybees using computer vision
US10997428B2 (en) Automated detection of building entrances
Bjerge et al. Real‐time insect tracking and monitoring with computer vision and deep learning
Lim et al. Automated classroom monitoring with connected visioning system
Dujon et al. Machine learning to detect marine animals in UAV imagery: Effect of morphology, spacing, behaviour and habitat
US20160314353A1 (en) Virtual turnstile system and method
Chang et al. Video analytics in smart transportation for the AIC'18 challenge
JP6789876B2 (en) Devices, programs and methods for tracking objects using pixel change processed images
CN112733690A (en) High-altitude parabolic detection method and device and electronic equipment
CN114721403B (en) Automatic driving control method and device based on OpenCV and storage medium
CN115100732A (en) Fishing detection method and device, computer equipment and storage medium
CN113935395A (en) Training of object recognition neural networks
CN109063790A (en) Object identifying model optimization method, apparatus and electronic equipment
CN111797831A (en) BIM and artificial intelligence based parallel abnormality detection method for poultry feeding
CN104077571B (en) A kind of crowd's anomaly detection method that model is serialized using single class
Kay et al. The Caltech Fish Counting dataset: a benchmark for multiple-object tracking and counting
CN108960165A (en) A kind of stadiums population surveillance method based on intelligent video identification technology
Alghyaline A real-time street actions detection
Follmann et al. Detecting animals in infrared images from camera-traps
CN113822367B (en) Regional behavior analysis method, system and medium based on human face
Hu et al. Multi-level trajectory learning for traffic behavior detection and analysis
CN112541403B (en) Indoor personnel falling detection method by utilizing infrared camera

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21934281

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21934281

Country of ref document: EP

Kind code of ref document: A1