CN113140005B - Target object positioning method, device, equipment and storage medium


Info

Publication number
CN113140005B
Authority
CN
China
Prior art keywords
pixel
target
images
image
map
Prior art date
Legal status
Active
Application number
CN202110474194.9A
Other languages
Chinese (zh)
Other versions
CN113140005A (en)
Inventor
杨昆霖
李昊鹏
刘诗男
侯军
伊帅
Current Assignee
Shanghai Sensetime Technology Development Co Ltd
Original Assignee
Shanghai Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Technology Development Co Ltd
Priority to CN202110474194.9A
Publication of CN113140005A
Application granted
Publication of CN113140005B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a computer-readable storage medium for locating a target object. At least two frames of target images, including an image to be detected, are obtained from continuously acquired multi-frame images. For each first pixel in the feature map of the image to be detected, a second pixel is determined in the feature map of each frame of target image based on the pixel position of the first pixel, the fusion weight of each second pixel is determined according to its similarity to the first pixel, and the features of the second pixels are fused based on these weights to obtain the feature at that pixel position in a target feature map. The position of the target object in the image to be detected is then determined based on the target feature map. By fusing the information of the image to be detected with that of its adjacent images, the association between adjacent images is fully exploited, avoiding the loss of positioning accuracy caused by the interference of moving objects in the images.

Description

Target object positioning method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a target object positioning method, apparatus, device, and storage medium.
Background
In fields such as security and surveillance, a target object in a video or image usually needs to be located; the target object can then be tracked, counted, or analyzed for behavior based on the positioning result, so an accurate positioning result is key to the accuracy of subsequent processing. Currently, when a target object in an image is located, a single frame is input into a pre-trained neural network, which predicts the position of the target object in that image. However, because moving objects may be present, the image may suffer from defects such as blurring, which interfere with the positioning of the target object and affect the positioning accuracy.
Disclosure of Invention
The disclosure provides a target object positioning method, device, equipment and storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided a target object positioning method, the method including:
acquiring at least two frames of target images from continuously acquired multi-frame images, wherein the at least two frames of target images comprise an image to be detected;
for each first pixel in the feature map of the image to be detected, determining a second pixel in the feature map of the at least two frames of target images based on the pixel position of the first pixel;
Determining a fusion weight of the second pixel based on the similarity of the second pixel and the first pixel;
fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic map so as to determine the target characteristic map;
and determining the position of the target object in the image to be detected based on the positioning prediction of the target feature map.
In some embodiments, the higher the similarity of the second pixel to the first pixel, the greater the fusion weight of the second pixel.
In some embodiments, determining the fusion weight of the second pixel based on the similarity of the second pixel to the first pixel comprises:
obtaining a first vector characterizing features of the first pixel and a second vector characterizing features of the second pixel;
and carrying out normalization processing on the product of the first vector and the second vector to obtain the fusion weight of the second pixel.
In some embodiments, fusing the second pixel based on the fusion weight to obtain a feature of the pixel position in the target feature map, so as to determine the target feature map, including:
Obtaining a second vector characterizing features of the second pixel;
and carrying out weighted summation on the second vector based on the fusion weight to obtain a third vector representing the characteristic of the pixel position in the target characteristic map so as to determine the target characteristic map.
In some embodiments, the second pixel comprises a pixel within a target pixel region, the target pixel region being a region surrounding the pixel location or a region adjacent to the pixel location in a feature map of the at least two frames of target images.
In some embodiments, the at least two frames of target images are derived based on:
performing downsampling processing on multi-frame images continuously acquired by an image acquisition device;
and, for each frame of image obtained by the downsampling, extracting an image area including the target object from the image to obtain one frame of target image of the at least two frames of target images.
In some embodiments, determining the location of the target object in the image to be detected based on the localization prediction of the target feature map comprises:
determining a positioning probability map corresponding to the target image according to the target feature map, wherein the positioning probability map is used for indicating the probability that pixel points in the target image are key points of the target object, and the key points are used for positioning the target object;
And determining the position of the key point in the target image based on the positioning probability map so as to determine the position of the target object in the target image.
In some embodiments, determining the location of the keypoint in the target image based on the localization probability map comprises:
performing mean pooling on the positioning probability map to obtain a first probability map;
performing maximum pooling on the first probability map to obtain a second probability map;
and determining, as the key points, pixel points whose probability is the same in the first probability map and the second probability map and is greater than a preset threshold value.
In some embodiments, the method is implemented by a pre-trained neural network that is trained based on:
acquiring at least two frames of sample images from continuously acquired multi-frame images, wherein the at least two frames of sample images comprise target sample images carrying labeling information, the labeling information is used for indicating whether pixel points in the target sample images are key points of a target object or not, and the key points are used for positioning the target object;
inputting the at least two frames of sample images into a neural network, so that the following steps are performed by the neural network:
for each pixel in the feature map of the target sample image, determining a target pixel in the feature maps of the at least two frames of sample images based on the pixel position of the pixel; determining a fusion weight of the target pixel based on the similarity of the target pixel and the pixel; fusing the target pixels based on the fusion weights to obtain the feature of the pixel position in a sample target feature map, so as to determine the sample target feature map; and determining a sample positioning probability map corresponding to the sample image based on the sample target feature map, wherein the sample positioning probability map is used for indicating the probability that each pixel point in the target sample image is a key point of the target object;
and constructing a loss function based on the difference of the sample positioning probability map and a real positioning probability map corresponding to the target sample image, and training the neural network by taking the loss function as an optimization target, wherein the real positioning probability map is determined based on the labeling information.
According to a second aspect of embodiments of the present disclosure, there is provided a target object positioning apparatus, the apparatus comprising:
the acquisition module is used for acquiring at least two frames of target images from the continuously acquired multi-frame images, wherein the at least two frames of target images comprise images to be detected;
The target feature map determining module is used for determining second pixels in the feature maps of the at least two frames of target images respectively according to the pixel positions of the first pixels aiming at each first pixel in the feature map of the image to be detected; determining a fusion weight of the second pixel based on the similarity of the second pixel and the first pixel; fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic map so as to determine the target characteristic map;
and the positioning module is used for determining the position of the target object in the image to be detected based on the positioning prediction of the target feature map.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, including a processor, a memory, and computer instructions stored in the memory and executable by the processor, where the processor executes the computer instructions to implement the method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed, implements the method of the first aspect described above.
In the embodiments of the present disclosure, when a target object in an image is located, at least two frames of target images including an image to be detected can be obtained from continuously acquired multi-frame images. For each first pixel in the image to be detected, a second pixel is determined in the feature map of each frame of target image based on the pixel position of the first pixel, the fusion weight of each second pixel is determined according to the similarity between the first pixel and the second pixel, and the features of the second pixels are fused based on the fusion weights to obtain the feature at that pixel position in the target feature map, thereby obtaining the target feature map. Because the target feature map fuses information from the image to be detected and its adjacent images, locating the target object in the image to be detected based on this target feature map makes full use of the association between adjacent images and avoids the loss of positioning accuracy caused by the interference of moving objects in the images.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is a schematic diagram of predicting a center position of a human head through a neural network in accordance with an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a target object positioning method according to an embodiment of the present disclosure.
Fig. 3 (a) is a schematic diagram of a method for fusing multiple frame feature maps to obtain a target feature map according to an embodiment of the disclosure.
Fig. 3 (b) is a schematic diagram of a method for fusing multiple frame feature maps to obtain a target feature map according to an embodiment of the disclosure.
Fig. 4 is a schematic structural diagram of a neural network according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram of a method for fusing multiple frame feature maps to obtain a target feature map according to an embodiment of the disclosure.
Fig. 6 is a schematic structural diagram of a neural network according to an embodiment of the present disclosure.
Fig. 7 is a schematic logic structure diagram of a target object positioning device according to an embodiment of the disclosure.
Fig. 8 is a schematic diagram of a logic structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In order to better understand the technical solutions in the embodiments of the present disclosure and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
In fields such as security and surveillance, a target object in a video or image usually needs to be located, and the target object can then be tracked, counted, or analyzed for behavior based on the positioning result; an accurate positioning result is key to the accuracy of subsequent processing. When locating the target object in an image, the positioning can be achieved by locating key points of the target object. For example, a neural network can be trained in advance to output a key point localization map corresponding to the image, in which pixel points that are key points are marked as 1 and pixel points that are not key points are marked as 0; the position of the target object in the image can then be determined based on this key point localization map. Taking crowd positioning as an example, crowd positioning can be achieved by locating the position of the head center point: as shown in Fig. 1, an original image can be input into a neural network, and the neural network directly outputs a localization map of head center points, in which pixel points that are head center points are marked as 1 and pixel points that are not are marked as 0.
Currently, when locating a target object in an image, a single frame image is typically input into a pre-trained neural network, through which the position of the target object in the image is predicted. However, because moving objects, such as moving people or objects, may exist in the image, the image may have bad phenomena such as blurring, and the like, which cause interference to the positioning of the target object, and seriously affect the positioning accuracy.
In order to improve positioning accuracy when locating a target object in an image, the embodiments of the present disclosure provide a target object positioning method. When locating the target object in an image to be detected, one or more adjacent images can be combined with the image to be detected, and the feature maps of the adjacent images are fused with the feature map of the image to be detected to obtain a target feature map. Specifically, for each pixel in the image to be detected, one or more pixels can be determined in the neighborhood of that pixel position in each adjacent image, a fusion weight is determined based on the similarity between those pixels and the pixel of the image to be detected, and the pixels determined in each adjacent image are fused based on the fusion weights to obtain the feature at that pixel position in the target feature map. The position of the target object in the image to be detected can then be determined based on the localization prediction of the target feature map. By considering the temporal relationship between the image to be detected and the adjacent images and fusing the information of multiple frames into the target feature map, the association between the images can be fully utilized, the loss of positioning precision caused by the interference of moving objects is avoided, and the positioning precision can be improved.
The target object positioning method in the embodiments of the present disclosure may be performed by various electronic devices, for example, a notebook computer, a server, a mobile phone, a tablet, and the like.
The target object of the embodiments of the present disclosure may be various objects that need to be identified from an image for positioning, for example, the target object may be a person, a vehicle, an animal, or the like. By the method, the target object in the image can be positioned, and further subsequent counting, tracking, behavior analysis and other processing can be performed on the target object in the image.
Specifically, as shown in fig. 2, the method comprises the following steps:
s202, acquiring at least two frames of target images from continuously acquired multi-frame images, wherein the at least two frames of target images comprise images to be detected;
s204, for each first pixel in the feature map of the image to be detected, determining a second pixel in the feature map of the at least two frames of target images based on the pixel position of the first pixel;
s206, determining fusion weights of the second pixels based on the similarity between the second pixels and the first pixels;
s208, fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic map so as to determine the target characteristic map;
S210, determining the position of a target object in the image to be detected based on positioning prediction of the target feature map.
In step S202, at least two frames of target images may be acquired from multiple frames of images continuously acquired by an image acquisition device, for example, from the video frames of a video captured by the device. The at least two frames of target images include an image to be detected, i.e., the image in which the target object needs to be located. The image to be detected may be any one of the at least two frames of target images and may be preset as required, for example, the first acquired frame, an intermediate frame, or the last frame of the at least two frames. The at least two frames of target images may be continuously acquired images or non-continuously acquired images. In order to obtain a more accurate positioning result when the other images among the at least two frames assist in locating the target object, the content of the other images should preferably not be exactly the same as that of the image to be detected, i.e., there should be some differences; at the same time, the content should not differ too much, so that the other images remain broadly consistent with the image to be detected. Therefore, the other images are preferably separated from the image to be detected by a certain number of frames, and this interval is preferably kept within a certain range, so that the other images are close to the image to be detected overall while still having some differences.
In step S204, after the at least two frames of target images are acquired, feature extraction may be performed on each frame of target image to obtain its feature map, so as to obtain the feature maps of the at least two frames of target images. The feature extraction may be implemented by a preset neural network or in other ways, which is not limited by the embodiments of the present disclosure. Each frame of feature map may include features of multiple channels, and different channels may represent different types of image features, for example color features or contour features. After the feature maps of the at least two frames of target images are obtained, for each first pixel in the feature map of the image to be detected, one or more pixels, referred to as second pixels, may be determined in the feature map of each frame of target image based on the pixel position of the first pixel. A second pixel may be a pixel with a high degree of association with the first pixel; for example, one or more second pixels may be determined in a neighboring area surrounding or adjacent to that pixel position in the feature map of each frame of target image. The second pixel may be the pixel at that pixel position in the feature map of the target image, or may be a plurality of pixels at that pixel position and its neighboring pixel positions.
In step S206 and step S208, after the second pixels are determined in the feature map of each frame of target image based on the pixel position, the fusion weight of each second pixel may be determined based on the similarity between the second pixel and the first pixel, where this similarity may be determined based on the similarity of their features. After the fusion weight of each second pixel is determined, the second pixels can be fused based on the fusion weights to obtain the feature of the pixel at that pixel position in the fused target feature map. For example, assuming that the first pixel is the pixel in the first row and first column of the image to be detected, the feature of the pixel in the first row and first column of the target feature map may be determined based on the above steps. The features at the other pixel positions in the target feature map can be determined in the same way in turn, so that the whole target feature map is obtained by fusion.
For example, as shown in fig. 3 (a), assume there are three frames of target images, namely image A, image B, and image C, where image A is the image to be detected, and the feature maps corresponding to the three frames are A', B', and C', respectively. For the first pixel in the first row and first column of the feature map A' of the image to be detected (the gray pixel in the figure), a second pixel can be determined in each of the feature maps A', B', and C', namely the pixel in the first row and first column of each map. The fusion weight of each second pixel can then be determined according to its similarity to the first pixel, and the features of the pixels in the first row and first column of A', B', and C' are fused based on their respective fusion weights to obtain the feature of the pixel in the first row and first column of the target feature map D'. The features of the pixels at the other pixel positions in the target feature map may be determined by the same method, so as to obtain the features of the whole target feature map.
For another example, as shown in fig. 3 (b), for the first pixel in the first row and first column of the feature map A' of the image to be detected (pixel P in the figure), a pixel region (the gray region in the figure) may be selected at the corresponding pixel position in each of the feature maps A', B', and C', and all pixels in the three pixel regions are used as second pixels. The fusion weights of the pixels in the three regions are then determined based on their similarity to the pixel in the first row and first column of A', and the features of the pixels in the three regions are fused based on these fusion weights to obtain the feature of the pixel located at the same pixel position as pixel P in the target feature map. The features of the pixels at the other pixel positions in the target feature map can be determined by the same method to obtain the features of the whole frame of the target feature map.
Of course, there are many ways to specifically determine the second pixel, and in specific applications, the second pixel may be set based on actual requirements.
In step S210, after the target feature map is obtained, positioning prediction may be performed on the target feature map to determine the position of the target object in the image to be detected, so as to locate the target object.
When the target object in the image to be detected is located, the target feature map is obtained by fusing the information of multiple adjacent frames; the association between adjacent images is thus fully utilized to obtain an enhanced target feature map, and locating the target object based on this target feature map makes the positioning result more accurate.
In some embodiments, each of steps S202 to S210 may be completed by a pre-trained neural network. For example, after the at least two frames of target images are acquired, they may be input into the pre-trained neural network, which extracts the feature maps of the at least two frames of target images. For each first pixel in the feature map of the image to be detected, the neural network then determines second pixels in the feature maps of the target images based on the pixel position of the first pixel, determines the fusion weight of each second pixel based on its similarity to the first pixel, and fuses the features of the second pixels based on the fusion weights to obtain the feature of the target feature map at that pixel position; the whole target feature map is obtained by the same method. The neural network can then predict the position of the target object in the target image based on the obtained target feature map.
In some embodiments, the structure of the pre-trained neural network may be as shown in fig. 4, and includes a first sub-network, a second sub-network, and a third sub-network. The first sub-network is used for extracting features from the target images (e.g., target image 1, target image 2, and target image 3 in the figure) to obtain the feature map corresponding to each frame of target image (e.g., feature maps 1, 2, and 3 in the figure). The second sub-network is used for, for each first pixel in the feature map of the image to be detected (e.g., target image 2 in the figure), determining second pixels in the feature maps of the target images based on the pixel position of the first pixel, determining the fusion weight corresponding to each second pixel in the feature map of each target image based on the similarity between the first pixel and the second pixel, and fusing the features of the second pixels based on the fusion weights to obtain the feature of the target feature map at that pixel position. The third sub-network is used for predicting the position of the target object in the target image based on the target feature map; for example, the third sub-network can obtain, based on the target feature map, a positioning probability map corresponding to the image to be detected, which indicates the probability that each pixel point in the image to be detected is a key point of the target object (the key points being usable to locate the target object), and then determine the position of the target object in the image to be detected according to the positioning probability map.
In some embodiments, the neural network may be trained as follows. At least two frames of sample images can be obtained from continuously acquired multi-frame images, where the at least two frames of sample images include a target sample image carrying labeling information; the labeling information indicates whether each pixel point in the target sample image is a key point of the target object, and the key points can be used to locate the target object. Based on the labeling information, a real positioning probability map can be obtained, which indicates the true probability that each pixel point in the target sample image is a key point. For each pixel in the feature map of the target sample image, the neural network can determine target pixels in the feature maps of the at least two frames of sample images based on the pixel position of the pixel, determine the fusion weight of each target pixel according to its similarity to the pixel, and fuse the target pixels based on the determined fusion weights to obtain the feature of that pixel position in a sample target feature map, so as to determine the sample target feature map. A sample positioning probability map corresponding to the sample image can then be determined based on the sample target feature map, which indicates the predicted probability that each pixel point in the target sample image is a key point of the target object. A loss can be constructed based on the difference between the sample positioning probability map and the real positioning probability map corresponding to the target sample image, for example their cross-entropy loss, and the neural network is trained with this loss as the optimization target.
In order to predict a more accurate positioning result from the fused target feature map, when at least two frames of feature maps are fused into the target feature map, the final target feature map can be determined based on a local attention mechanism. When the target feature map is determined based on local attention, the feature at a certain pixel position in the target feature map can be obtained by fusing the features of multiple pixels in the neighboring pixel region of that pixel position in the feature map of the image to be detected and the features of multiple pixels in the neighboring region of that pixel position in the feature maps of the other target images. By performing feature fusion over the neighboring areas of pixels, the target feature map fuses not only the temporal information between multiple frames but also the spatial association information between adjacent pixels of the same frame, which improves positioning accuracy; compared with a global attention mechanism, it also reduces redundancy and the amount of computation. Based on this, in step S204, when the second pixels are determined in the feature maps of the target images based on the pixel position of the first pixel, in some embodiments a target pixel area may be determined in the feature map of each target image based on that pixel position, and the pixels in the target pixel area are used as the second pixels. The target pixel area is a pixel area surrounding the pixel position or an area adjacent to the pixel position, so that the second pixels have a high degree of association with the first pixel.
For example, as shown in fig. 5, for any pixel position (such as a pixel position P in the figure) in a feature map of an image to be detected, one or more target pixel regions (such as gray regions in the figure) may be determined in the feature map of each frame of the target image based on the pixel position, then a fusion weight of pixels in the target pixel region may be determined based on a similarity between pixels in the target pixel region and pixels in the pixel position in the image to be detected, and features of each pixel point in the target pixel region in the at least two frames of feature maps may be fused based on the fusion weight of each pixel point to obtain a feature corresponding to the pixel position in the target feature map.
In some embodiments, when determining the target pixel region in the feature map of the target image of each frame based on the pixel position, as shown in fig. 5, one target pixel region may be determined in the feature map centering on the pixel position. For example, an n×n square area may be determined with the pixel position as the center, or a circular area may be determined with the pixel position as the center, so long as the target pixel area is a neighboring area surrounding the pixel position, which may be specifically set according to actual requirements. Of course, the target pixel region may not be centered on the pixel position, and may include only the pixel position, for example.
In step S206, when determining the fusion weight of the second pixel based on the similarity between the second pixel and the first pixel, in some embodiments a second vector characterizing the second pixel and a first vector characterizing the first pixel may be obtained, and the product of the second vector and the first vector may then be normalized to obtain the fusion weight of the second pixel. For example, the product of the first vector and the second vector may be computed and then input into a softmax function, which outputs the fusion weight.
In some embodiments, the higher the degree of similarity of the second pixel to the first pixel, the greater the fusion weight of the second pixel. Since the higher the similarity between the second pixel and the first pixel, the more likely it is that both represent the same object in three-dimensional space, the greater the fusion weight corresponding to the second pixel should be.
In step S208, when the second pixels are fused based on the fusion weights to obtain the feature of the pixel position in the target feature map, in some embodiments, the second vectors representing the feature of the second pixels may be obtained, and then the second vectors are weighted and summed according to the fusion weights corresponding to the second pixels to obtain a third vector representing the feature of the pixel position in the target feature map, and in a similar manner, the third vector representing the feature of other pixel positions in the target feature map may be determined to obtain the whole target feature map.
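The two steps above (computing normalized-product fusion weights and taking the weighted sum) can be sketched in NumPy as follows; the function name, array sizes, and random data are illustrative assumptions rather than part of the disclosure.

```python
import numpy as np

def fuse_pixel(first_vec, second_vecs):
    """Fuse the features of the second pixels into the target-feature vector
    for one pixel position.

    first_vec:   (c,)   feature vector of the first pixel (image to be detected)
    second_vecs: (m, c) feature vectors of the m second pixels gathered from
                 the feature maps of the target images
    """
    # similarity of each second pixel to the first pixel: inner product
    scores = second_vecs @ first_vec                 # (m,)
    # normalize the products into fusion weights (softmax)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                # (m,), sums to 1
    # weighted sum of the second-pixel features -> third vector
    third_vec = weights @ second_vecs                # (c,)
    return third_vec

# toy example: 3 second pixels with 4-channel features (illustrative only)
rng = np.random.default_rng(0)
first = rng.standard_normal(4)
seconds = rng.standard_normal((3, 4))
print(fuse_pixel(first, seconds).shape)              # (4,)
```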
Of course, in some embodiments, since the prediction of the positioning result from the target feature map may be completed by a neural network, in order to keep the network parameters as small as possible, the number of channels of the fused target feature map may be kept consistent with the number of channels of each frame of feature map among the feature maps of the at least two frames of target images, so as to avoid increasing the number of channels of the target feature map and thus the network parameters of the neural network.
In some embodiments, downsampling may be performed on the multiple frames of images continuously acquired by the image acquisition device to obtain the at least two frames of target images. For example, assuming the image acquisition device captures 60 frames per second, the 60 frames may be downsampled to 5 frames, the 5 frames are taken as target images, and the intermediate frame among them is taken as the image to be detected. Downsampling the continuously acquired images to obtain the target images ensures that the scenes in the target images are largely similar while still differing somewhat, rather than being identical. In some embodiments, in order to reduce the amount of computation, save computing resources, and improve the efficiency of locating the target object in the image, when the target images are obtained from the multiple frames of images continuously acquired by the image acquisition device, the area including the target object may be extracted from each original image to obtain the target image. In this way, the areas of the original image that do not include the target object are cropped away, reducing the amount of computation in the positioning process.
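A minimal sketch of this preprocessing is given below, assuming the frames are available as NumPy arrays; the sampling rate, the fixed crop box, and the helper name are illustrative assumptions (in practice the crop region would come from detecting where the target objects appear).

```python
import numpy as np

def sample_and_crop(frames, src_fps=60, target_fps=5, roi=(0, 0, 256, 256)):
    """Downsample a frame sequence in time and crop the region that
    contains the target objects.

    frames: list of HxWx3 arrays captured by the image acquisition device
    roi:    (top, left, height, width) of the region containing the targets
    """
    step = max(1, src_fps // target_fps)        # e.g. keep every 12th frame
    sampled = frames[::step]
    top, left, h, w = roi
    return [f[top:top + h, left:left + w] for f in sampled]

# e.g. 60 synthetic frames captured in one second -> 5 cropped target images
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(60)]
targets = sample_and_crop(frames)
print(len(targets), targets[0].shape)            # 5 (256, 256, 3)
```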
In some embodiments, in step S210, when determining the position of the target object in the target image according to the target feature map, a positioning probability map corresponding to the target image may be determined according to the target feature map. The positioning probability map indicates the probability that each pixel point in the target image is a key point of the target object, where a key point is a point that can identify or represent the target object; for example, taking a person as the target object, the key point may be the person's head center point, body center point, and so on. The positions of the key points in the target image may then be determined based on the positioning probability map, so as to determine the position of the target object in the target image.
Of course, since the positioning probability map predicted by the neural network may contain some noise, individual pixel points in the target image may receive a relatively high predicted probability and thus be misjudged as key points. In some implementations, in order to suppress the noise in the positioning probability map, pooling processing may be performed on it to reduce the interference of noise. For example, mean pooling may be applied to the positioning probability map to obtain a first probability map, and mean pooling followed by maximum pooling may be applied to obtain a second probability map (of course, the maximum pooling may also be applied directly to the first probability map to obtain the second probability map). The pixel points whose probability is the same in the first probability map and the second probability map are then determined as target pixel points, and whether the predicted probability of a target pixel point is greater than a preset threshold is judged; if so, the target pixel point is determined to be a key point.
For example, for the positioning probability map corresponding to the target image output by the neural network, mean pooling may be performed with a convolution kernel of a certain size and stride (for example, a 3x3 kernel with stride 1) to obtain the mean-pooled first probability map, and maximum pooling may then be performed on the first probability map with a kernel of a certain size and stride (for example, a 3x3 kernel with stride 1) to obtain the max-pooled second probability map. The first probability map and the second probability map are then compared, and the points with the same probability in both maps are determined as target pixel points, i.e., peak pixel points. Whether the probability of each target pixel point is greater than a preset threshold is then judged; if so, the target pixel point is considered a key point. In this way, the influence of noise can be eliminated and the peak pixel points can be accurately determined, so that the finally determined key points are more accurate.
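A possible PyTorch realization of this pooling-based peak selection is sketched below; the padding used to keep the map size, the threshold value, and the tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def extract_keypoints(prob_map, threshold=0.5):
    """Select key points from a localization probability map.

    prob_map: (1, 1, H, W) tensor of per-pixel key-point probabilities
    returns:  (1, 1, H, W) binary key-point map
    """
    # 3x3 mean pooling, stride 1: suppress isolated noisy responses
    first = F.avg_pool2d(prob_map, kernel_size=3, stride=1, padding=1)
    # 3x3 max pooling, stride 1, applied to the smoothed map
    second = F.max_pool2d(first, kernel_size=3, stride=1, padding=1)
    # a pixel is a peak if it keeps its value under max pooling,
    # and it is a key point if that value also exceeds the threshold
    peaks = (first == second) & (first > threshold)
    return peaks.float()

probs = torch.rand(1, 1, 64, 64)
keypoint_map = extract_keypoints(probs)
print(int(keypoint_map.sum()), "key points found")
```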
After the position of the target object in the target image is determined according to the position of the key point, the positioning result can be output in the form of a key point positioning chart, for example, the pixel point which is the key point in the target image can be expressed as 1, the pixel point which is not the key point can be expressed as 0, the key point positioning chart is obtained, and the target object in the target image can be further subjected to subsequent processing such as counting, tracking and the like according to the key point positioning chart.
In order to further explain the positioning method of the target object in the embodiment of the present application, the following is explained in connection with a specific embodiment.
In the field of video surveillance, people in a surveillance video or image usually need to be located so that subsequent processing such as counting, tracking, and behavior analysis can be performed; an accurate positioning result is key to the accuracy of the subsequent processing. At present, a single frame of the image to be detected is usually input into a pre-trained neural network, which predicts the positions of the people in the image. When moving objects are present in the image, for example moving people, the image may be blurred, which affects the accuracy of the final prediction. Based on this, the embodiments of the present application provide a video-based positioning method: multiple frames of a video segment are input into a neural network, the feature maps of the multiple frames are fused based on a local attention mechanism, and the crowd positions are predicted based on the fused feature map, so as to improve the positioning accuracy.
Specifically, the method comprises a neural network training stage and a neural network prediction stage.
The neural network training phase comprises the following steps:
1. A plurality of crowd video segments are collected, with scenes as diverse as possible, which may include places with large flows of people such as squares, shopping malls, subway stations, and tourist attractions. After the videos are collected, they are downsampled to 5 frames per second, and the downsampled video frames are cropped so that the crowd region of interest is retained. The cropped video frames are then annotated, marking the position of each head center in each frame.
Because, for each person in a video frame, the annotator marks only one pixel point as the head center point, the number of head center points in an image is small, which is not conducive to training the convolutional neural network. In order to obtain a better training result, for each head center point annotated by the annotator, one or more adjacent pixel points can additionally be marked, so as to obtain a real positioning map Y for training the neural network. For example, for each video frame I with height H and width W, let the annotated head center points be a_i, i = 1, ..., n, where a_i is the coordinate of a head center point and n is the number of people in the frame. The real positioning map Y (also of height H and width W) used to train the convolutional neural network can be determined according to equations (1) to (3):

P(x) = Σ_{i=1}^{n} δ(x − a_i)                      (1)
Y = P * K                                          (2)
δ(x) = 1 if x = 0, and δ(x) = 0 otherwise          (3)

where x denotes a coordinate in the image, * denotes the convolution operation, K is a convolution kernel, e.g. K = [0,1,0; 1,1,1; 0,1,0], n is the number of heads, a_i is a head center point, and δ(·) is the impulse function used to place a unit impulse at each annotated head center point.
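A small sketch of how the real positioning map Y could be built from the annotated head centers, following equations (1) to (3); the use of SciPy's 2-D convolution and the final clipping to {0, 1} (to keep Y binary where neighborhoods overlap) are implementation assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def build_positioning_map(head_centers, height, width):
    """Build the real positioning map Y from annotated head center points.

    head_centers: list of (row, col) coordinates a_i
    """
    # impulse map P: 1 at every annotated head center, 0 elsewhere
    impulses = np.zeros((height, width), dtype=np.float32)
    for r, c in head_centers:
        impulses[r, c] = 1.0
    # cross-shaped kernel K marks the center point and its 4 neighbors
    K = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]], dtype=np.float32)
    Y = convolve2d(impulses, K, mode="same")
    return np.clip(Y, 0, 1)   # keep Y binary where neighborhoods overlap

Y = build_positioning_map([(10, 12), (30, 40)], 64, 64)
print(int(Y.sum()))           # 10: each head contributes a 5-pixel cross
```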
2. the three adjacent frames in the video frames obtained by the method are input into a convolutional neural network, wherein the structure of the convolutional neural network is shown in fig. 6, and the convolutional neural network comprises a feature extraction module, a local attention module and a positioning prediction module. One frame of the input adjacent three-frame images can be preset as an image to be detected, for example, an intermediate frame is used as the image to be detected. The feature extraction module can be 13 layers before the VGG-16 network pre-trained on the ImageNet, and three 512 channels and feature graphs with the size of original figures 1/8 can be obtained after feature extraction is carried out on three adjacent frames of video frames through the feature extraction module.
3. The three feature maps are input into the local attention module. The local attention module typically has three inputs: a query map, a key map, and a value map. The query map is the feature map corresponding to the image to be detected, and each of the three feature maps forms a pair of key map and value map (that is, the key map and the value map are the same frame of feature map), giving 3 pairs of key maps and value maps in total. Suppose the query map is Q ∈ R^{h×w×c} and the key maps and value maps are K_i, V_i ∈ R^{h×w×c}, where h is the height of the feature map, w is the width, c is the number of channels, and N is the number of pairs of key maps and value maps; since there are three feature maps, N is 3 and i takes the values 1 to 3. For each pixel position (x, y) in the query map Q, a square neighborhood N(x, y) = { (a, b) : |x − a| ≤ k and |y − b| ≤ k } is generated, where (a, b) are the coordinates of pixel points in the square neighborhood and k is the radius of the neighborhood. The local attention output of the fused feature map at the pixel position (x, y) is then calculated by the following equation (4):

F(x, y) = Σ_{i=1}^{N} Σ_{(a,b)∈N(x,y)} softmax( Q(x, y)^T K_i(a, b) ) · V_i(a, b)        (4)

where Q(x, y)^T is the transposed vector characterizing the feature of the pixel with coordinates (x, y) in the query map, K_i(a, b) is the vector characterizing the feature of the pixel with coordinates (a, b) in the i-th key map, and V_i(a, b) is the vector characterizing the feature of the pixel with coordinates (a, b) in the i-th value map. The similarity of two pixels is determined by taking the inner product of the two vectors, the inner products are converted into fusion weights by the softmax function (normalized over all i and all (a, b) in the neighborhood), and the weighted sum of the value vectors gives the vector characterizing the feature of the pixel with coordinates (x, y) in the target feature map.
Equation (4) is applied in the same way at every pixel position to obtain the local attention output of the entire fused feature map. The local attention module outputs a fused feature map of the same size as the input feature maps (1/8 of the original image).
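The local attention fusion of equation (4) can be sketched as follows. For clarity this version gathers the k-radius neighborhood with padding and unfold operations and treats each input feature map as both key map and value map, as described above; the tensor layout and the neighborhood radius are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_attention(query, keys_values, k=1):
    """Fuse N feature maps into one via local attention (cf. equation (4)).

    query:        (C, H, W) feature map of the image to be detected
    keys_values:  list of N (C, H, W) feature maps (each serves as both
                  key map and value map)
    k:            neighborhood radius, so each pixel attends to a
                  (2k+1) x (2k+1) window in every frame
    """
    C, H, W = query.shape
    win = 2 * k + 1
    neigh = []
    for kv in keys_values:
        # unfold gathers, for every pixel, its (2k+1)^2 neighborhood
        cols = F.unfold(kv.unsqueeze(0), kernel_size=win, padding=k)  # (1, C*win*win, H*W)
        neigh.append(cols.view(C, win * win, H * W))
    neigh = torch.cat(neigh, dim=1)                   # (C, N*win*win, H*W)

    q = query.view(C, 1, H * W)                       # (C, 1, H*W)
    scores = (q * neigh).sum(dim=0)                   # (N*win*win, H*W): inner products
    weights = F.softmax(scores, dim=0)                # fusion weights per pixel
    fused = (neigh * weights.unsqueeze(0)).sum(dim=1) # (C, H*W): weighted sum
    return fused.view(C, H, W)

maps = [torch.randn(512, 32, 32) for _ in range(3)]
fused = local_attention(maps[1], maps, k=1)           # middle frame as the query
print(fused.shape)                                    # torch.Size([512, 32, 32])
```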
4. The fused feature map is input into the positioning prediction module. The positioning prediction module first uses a three-layer convolutional neural network (kernel size 3, dilation rate 2, 512 channels per layer) to further extract features from the fused feature map, then uses three transposed convolutions (kernel size 4, stride 2, with 256, 128, and 64 channels respectively) to restore the feature map to the original image size, with an ordinary convolution layer (kernel size 3, dilation rate 2, with 256, 128, and 64 channels respectively) after each transposed convolution for further feature extraction, and finally uses a 1×1 convolution to convert the number of channels of the feature map to 1, obtaining the positioning probability map. Let the predicted positioning probability map be Ŷ and the real positioning map be Y. The localization cross-entropy loss L can be calculated according to equation (5):

L = − Σ_x [ λ · Y(x) · log Ŷ(x) + (1 − Y(x)) · log(1 − Ŷ(x)) ]        (5)
Where λ is the positive sample weight, responsible for balancing the positive and negative samples, and may be set to 100.
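A sketch of the positioning prediction module and the weighted cross-entropy loss of equation (5) in PyTorch; the layer ordering, the ReLU/sigmoid activations, the clamping inside the loss, and the mean reduction are implementation assumptions not fixed by the description above.

```python
import torch
import torch.nn as nn

class LocalizationHead(nn.Module):
    """Positioning prediction module: dilated convolutions, three transposed
    convolutions back to the original resolution, then a 1x1 convolution that
    produces the localization probability map."""
    def __init__(self, in_ch=512):
        super().__init__()
        def conv(cin, cout):            # 3x3 conv, dilation rate 2
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=2, dilation=2),
                nn.ReLU(inplace=True))
        def up(cin, cout):              # 4x4 transposed conv, stride 2
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.ReLU(inplace=True))
        self.body = nn.Sequential(
            conv(in_ch, 512), conv(512, 512), conv(512, 512),  # feature refinement
            up(512, 256), conv(256, 256),                      # 1/8 -> 1/4
            up(256, 128), conv(128, 128),                      # 1/4 -> 1/2
            up(128, 64),  conv(64, 64),                        # 1/2 -> full size
            nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid())     # probability map

    def forward(self, fused):           # fused: (B, 512, H/8, W/8)
        return self.body(fused)         # (B, 1, H, W)

def localization_loss(pred, target, lam=100.0, eps=1e-6):
    """Weighted cross-entropy of equation (5): positives weighted by lam."""
    pred = pred.clamp(eps, 1 - eps)
    loss = -(lam * target * torch.log(pred) + (1 - target) * torch.log(1 - pred))
    return loss.mean()

head = LocalizationHead()
prob = head(torch.randn(1, 512, 32, 32))
print(prob.shape)                                     # torch.Size([1, 1, 256, 256])
print(localization_loss(prob, torch.zeros_like(prob)).item())
```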
5. After the loss function is obtained, the network parameters are optimized by stochastic gradient descent. Assuming the network parameters at step i are θ_i, the network parameters θ_{i+1} at step i+1 are calculated by the following equation (6):

θ_{i+1} = θ_i − γ · ∇_θ L(θ_i)        (6)
where γ is the learning rate, set to 0.0001. The above steps are repeated until the network parameters no longer change.
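Using the built-in SGD optimizer, one training step implementing the update of equation (6) looks roughly as follows; the tiny stand-in network and the dummy data are placeholders for the feature extraction, local attention, and positioning prediction modules described above, purely for illustration.

```python
import torch
import torch.nn as nn

# stand-in network; in the full method this would be the feature extraction,
# local attention, and positioning prediction modules chained together
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(8, 1, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(net.parameters(), lr=1e-4)   # gamma = 0.0001

image = torch.randn(1, 3, 64, 64)       # dummy input frame
target = torch.zeros(1, 1, 64, 64)      # dummy real positioning map Y

pred = net(image)
loss = nn.functional.binary_cross_entropy(pred, target)

optimizer.zero_grad()
loss.backward()       # gradients of the loss with respect to theta_i
optimizer.step()      # theta_{i+1} = theta_i - gamma * grad_theta(L)
```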
The neural network prediction stage is specifically as follows:
after preprocessing such as downsampling crowd video to be detected and cutting out crowd areas of interest, inputting three adjacent frames in video frames obtained by preprocessing into a trained convolutional neural network, and outputting a predicted positioning probability map of an intermediate frame by the convolutional neural network.
And then carrying out the following non-maximum suppression steps on the positioning probability map to obtain a final key point positioning map:
First, average pooling with a kernel size of 3 and a stride of 1 is performed on the predicted probability map to suppress noise, and then maximum pooling with a kernel size of 3 and a stride of 1 is performed on the pooled probability map. The average-pooled map and the max-pooled map are compared, and the pixel points with the same probability in the two maps are taken as target pixel points. Finally, each target pixel point is compared with a preset threshold: pixels whose value is greater than the threshold are set to 1, and the rest are set to 0, giving the final key point localization map. The positions of the crowd in the image can be determined based on this key point localization map.
By using multiple frames of video images to locate people, the temporal information in the video can be mined, so the crowd positioning accuracy is higher than with traditional single-image-based positioning. Meanwhile, by using a local self-attention mechanism to fuse the spatio-temporal information of the feature maps of multiple video frames, the local association between pixels in the video images is captured, so more sufficient information can be mined, and a good positioning result can be obtained even when moving objects are present in the video.
The embodiment of the present disclosure further provides a target object positioning device, as shown in fig. 7, the target object positioning device 70 includes:
an acquisition module 71, configured to acquire at least two frames of target images from continuously acquired multi-frame images, where the at least two frames of target images include an image to be detected;
a target feature map determining module 72, configured to determine, for each first pixel in the feature map of the image to be detected, a second pixel in the feature maps of the at least two frames of target images based on a pixel position where the first pixel is located; determining a fusion weight of the second pixel based on the similarity of the second pixel and the first pixel; fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic map so as to determine the target characteristic map;
A positioning module 73, configured to determine a position of the target object in the image to be detected based on the positioning prediction of the target feature map.
In some embodiments, the higher the similarity of the second pixel to the first pixel, the greater the fusion weight of the second pixel.
In some embodiments, the target feature map determining module is configured to, when determining the fusion weight of the second pixel based on the similarity between the second pixel and the first pixel, specifically:
obtaining a first vector characterizing features of the first pixel and a second vector characterizing features of the second pixel;
and carrying out normalization processing on the product of the first vector and the second vector to obtain the fusion weight of the second pixel.
In some embodiments, the target feature map determining module is configured to fuse the second pixel based on the fusion weight to obtain a feature of the pixel position in the target feature map, so as to determine the target feature map, where the target feature map determining module is specifically configured to:
obtaining a second vector characterizing features of the second pixel;
and carrying out weighted summation on the second vector based on the fusion weight to obtain a third vector representing the characteristic of the pixel position in the target characteristic map so as to determine the target characteristic map.
In some embodiments, the second pixel comprises a pixel within a target pixel region, the target pixel region being a region surrounding the pixel location or a region adjacent to the pixel location in a feature map of the at least two frames of target images.
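To make the fusion described in the preceding paragraphs concrete, the following PyTorch-style sketch computes, for every first pixel of the image to be detected, normalized dot products with the second pixels in a local window of each target frame's feature map, and takes the weighted sum as the fused feature. The window size, the use of softmax as the normalization, the tensor layout, and the function name are all assumptions for illustration.

import torch
import torch.nn.functional as F

def fuse_local_window(feat_ref: torch.Tensor, feats: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Fuse features over a local spatio-temporal window.

    feat_ref: (C, H, W) feature map of the image to be detected.
    feats:    (T, C, H, W) feature maps of the T target frames.
    k:        side length of the local window (assumed value, odd).
    Returns a fused (C, H, W) target feature map.
    """
    T, C, H, W = feats.shape
    pad = k // 2
    # Gather the k*k second pixels around every pixel position of each frame.
    neigh = F.unfold(feats, kernel_size=k, padding=pad)          # (T, C*k*k, H*W)
    neigh = neigh.view(T, C, k * k, H * W)                       # (T, C, k*k, H*W)
    neigh = neigh.permute(3, 0, 2, 1).reshape(H * W, T * k * k, C)  # (H*W, T*k*k, C)

    # First-pixel vectors of the image to be detected.
    query = feat_ref.view(C, H * W).t().unsqueeze(1)             # (H*W, 1, C)

    # Fusion weight: normalized product of the first-pixel vector and each
    # second-pixel vector (softmax normalization assumed).
    weights = torch.softmax((query * neigh).sum(-1), dim=-1)     # (H*W, T*k*k)

    # Weighted sum of the second-pixel vectors gives the fused feature.
    fused = (weights.unsqueeze(-1) * neigh).sum(1)               # (H*W, C)
    return fused.t().view(C, H, W)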
In some embodiments, the at least two frames of target images are derived based on:
performing downsampling processing on multi-frame images continuously acquired by an image acquisition device;
and extracting an image area including the target object in the image aiming at each frame of image obtained by downsampling to obtain one frame of target image in the at least two frames of target images.
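A minimal sketch of this preprocessing, assuming bilinear downsampling and a fixed rectangular crop (the scale factor, the crop coordinates, and the function name are illustrative assumptions):

import torch
import torch.nn.functional as F

def preprocess_frames(frames: torch.Tensor, scale: float = 0.5,
                      roi=(0, 0, 512, 512)) -> torch.Tensor:
    """Downsample consecutively acquired frames and crop the area containing the target.

    frames: (T, C, H, W) images continuously acquired by the image acquisition device.
    scale:  downsampling factor (example value).
    roi:    (y0, x0, y1, x1) image area that includes the target object (example value).
    """
    # Downsampling of the multi-frame images.
    frames = F.interpolate(frames, scale_factor=scale, mode="bilinear", align_corners=False)
    # Extract the image area that includes the target object.
    y0, x0, y1, x1 = roi
    return frames[:, :, y0:y1, x0:x1]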
In some embodiments, the positioning module is configured to determine, based on a positioning prediction of the target feature map, a position of a target object in the image to be detected, specifically configured to:
determining a positioning probability map corresponding to the target image according to the target feature map, wherein the positioning probability map is used for indicating the probability that pixel points in the target image are key points of the target object, and the key points are used for positioning the target object;
and determining the position of the key point in the target image based on the positioning probability map so as to determine the position of the target object in the target image.
In some embodiments, the positioning module is configured to, when determining the location of the keypoint in the target image based on the positioning probability map, specifically:
carrying out mean value pooling treatment on the positioning probability map to obtain a first probability map;
carrying out maximum pooling treatment on the first probability map to obtain a second probability map;
and determining the pixel points with the same probability and larger than a preset threshold value in the first probability map and the second probability map as the key points.
In some embodiments, the method is implemented by a pre-trained neural network that is trained based on:
acquiring at least two frames of sample images from continuously acquired multi-frame images, wherein the at least two frames of sample images comprise target sample images carrying labeling information, the labeling information is used for indicating whether pixel points in the target sample images are key points of a target object or not, and the key points are used for positioning the target object;
inputting the at least two frames of sample images into a neural network to realize the following steps through the neural network:
for each pixel in the feature map of the target sample image, determining a target pixel in the feature maps of the at least two frames of sample images based on the pixel position of the pixel; determining a fusion weight of the target pixel based on the similarity of the target pixel and the pixel; fusing the target pixels based on the fusion weights to obtain the features of the pixel positions in a sample target feature map so as to determine the sample target feature map; determining a sample positioning probability map corresponding to the sample image based on the sample target feature map, wherein the sample positioning probability map is used for indicating the probability that each pixel point in the target sample image is a key point of the target object;
And constructing a loss function based on the difference of the sample positioning probability map and a real positioning probability map corresponding to the target sample image, and training the neural network by taking the loss function as an optimization target, wherein the real positioning probability map is determined based on the labeling information.
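As an illustration of this training procedure, the sketch below performs one optimization step with a pixel-wise weighted cross-entropy loss in the spirit of equation (5). The model interface, the exact loss form, and the default positive-sample weight are assumptions rather than the patented implementation.

import torch
import torch.nn as nn

def training_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                  sample_frames: torch.Tensor, gt_map: torch.Tensor,
                  pos_weight: float = 100.0) -> float:
    """One optimization step of the positioning network.

    sample_frames: (1, T, C, H, W) stack of the at least two sample frames.
    gt_map:        (1, 1, H, W) real positioning map built from the labeling
                   information (1 at key points, 0 elsewhere).
    pos_weight:    positive-sample weight λ from equation (5).
    """
    pred = model(sample_frames)            # predicted probability map Ŷ, assumed in (0, 1)
    eps = 1e-6
    # Pixel-wise weighted binary cross-entropy (assumed form of equation (5)).
    loss = -(pos_weight * gt_map * torch.log(pred + eps)
             + (1.0 - gt_map) * torch.log(1.0 - pred + eps)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                       # e.g. torch.optim.SGD, as in formula (6)
    return loss.item()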
The embodiment of the disclosure further provides an electronic device, as shown in fig. 8, where the electronic device includes a processor 81, a memory 82, and computer instructions stored in the memory 82 and executable by the processor 81, and when the processor 81 executes the computer instructions, the method in any of the embodiments is implemented.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the previous embodiments.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
From the foregoing description of embodiments, it will be apparent to those skilled in the art that the present embodiments may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be embodied in essence or what contributes to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present specification.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the functions of the modules may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present disclosure. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing is merely a specific implementation of the embodiments of this disclosure, and it should be noted that, for a person skilled in the art, several improvements and modifications may be made without departing from the principles of the embodiments of this disclosure, and these improvements and modifications should also be considered as protective scope of the embodiments of this disclosure.

Claims (9)

1. A method of locating a target object, the method comprising:
acquiring at least two frames of target images from continuously acquired multi-frame images, wherein the at least two frames of target images comprise images to be detected;
for each first pixel in the feature map of the image to be detected, determining a second pixel in the feature map of the at least two frames of target images based on the pixel position of the first pixel; the second pixels comprise pixels in target pixel areas, and the target pixel areas are areas surrounding the pixel positions or adjacent areas of the pixel positions in the feature images of the at least two frames of target images;
determining a fusion weight of the second pixel based on the similarity between the second pixel and the first pixel, wherein the higher the similarity between the second pixel and the first pixel is, the larger the fusion weight of the second pixel is;
fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic map so as to determine the target characteristic map;
determining a positioning probability map corresponding to the target image according to the target feature map, wherein the positioning probability map is used for indicating the probability that pixel points in the target image are key points of the target object, and the key points are used for positioning the target object;
And determining the position of the key point in the target image based on the positioning probability map so as to determine the position of the target object in the target image.
2. The method of claim 1, wherein determining the fusion weight of the second pixel based on the similarity of the second pixel to the first pixel comprises:
obtaining a first vector characterizing features of the first pixel and a second vector characterizing features of the second pixel;
and carrying out normalization processing on the product of the first vector and the second vector to obtain the fusion weight of the second pixel.
3. The method according to any one of claims 1-2, wherein fusing the second pixel based on the fusion weight to obtain a feature of the pixel location in a target feature map, to determine the target feature map, includes:
obtaining a second vector characterizing features of the second pixel;
and carrying out weighted summation on the second vector based on the fusion weight to obtain a third vector representing the characteristic of the pixel position in the target characteristic map so as to determine the target characteristic map.
4. The method according to any of claims 1-2, wherein the at least two frames of target images are derived based on:
Performing downsampling processing on multi-frame images continuously acquired by an image acquisition device;
and extracting an image area including the target object in the image aiming at each frame of image obtained by downsampling to obtain one frame of target image in the at least two frames of target images.
5. The method of claim 1, wherein determining the location of the keypoint in the target image based on the localization probability map comprises:
carrying out mean value pooling treatment on the positioning probability map to obtain a first probability map;
carrying out maximum pooling treatment on the first probability map to obtain a second probability map;
and determining the pixel points with the same probability and larger than a preset threshold value in the first probability map and the second probability map as the key points.
6. The method according to claim 1, characterized in that it is implemented by means of a pre-trained neural network, which is trained on the basis of:
acquiring at least two frames of sample images from continuously acquired multi-frame images, wherein the at least two frames of sample images comprise target sample images carrying labeling information, the labeling information is used for indicating whether pixel points in the target sample images are key points of a target object or not, and the key points are used for positioning the target object;
Inputting the at least two frames of sample images into a neural network to realize the following steps through the neural network:
for each pixel in the feature images of the target sample images, determining a target pixel in the feature images of the at least two frames of sample images based on the pixel position of the pixel; determining a fusion weight of the target pixel based on the similarity of the target pixel and the pixel; fusing the target pixels based on the fusion weights to obtain the characteristics of the pixel positions in a sample target characteristic map so as to determine the sample target characteristic map; determining a sample positioning probability map corresponding to the sample image based on the sample target feature map, wherein the sample positioning probability map is used for indicating the probability that each pixel point in the target sample image is a key point of a target object;
and determining a loss based on the difference of the sample positioning probability map and a real positioning probability map corresponding to the target sample image, and optimizing the neural network based on the loss, wherein the real positioning probability map is determined based on the labeling information.
7. A target object positioning apparatus, the apparatus comprising:
The acquisition module is used for acquiring at least two frames of target images from the continuously acquired multi-frame images, wherein the at least two frames of target images comprise images to be detected;
the target feature map determining module is used for determining second pixels in the feature maps of the at least two frames of target images respectively according to the pixel positions of the first pixels aiming at each first pixel in the feature map of the image to be detected; determining a fusion weight of the second pixel based on the similarity of the second pixel and the first pixel; fusing the second pixels based on the fusion weight to obtain the characteristics of the pixel positions in the target characteristic map so as to determine the target characteristic map; the second pixels comprise pixels in target pixel areas, and the target pixel areas are areas surrounding the pixel positions or adjacent areas of the pixel positions in the feature images of the at least two frames of target images; the higher the similarity between the second pixel and the first pixel is, the larger the fusion weight of the second pixel is; the positioning module is used for determining a positioning probability map corresponding to the target image according to the target feature map, wherein the positioning probability map is used for indicating the probability that pixel points in the target image are key points of the target object, and the key points are used for positioning the target object; and determining the position of the key point in the target image based on the positioning probability map so as to determine the position of the target object in the target image.
8. An electronic device comprising a processor, a memory, a computer program stored in the memory for execution by the processor, the processor implementing the method of any of claims 1-6 when executing the computer program.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed, implements the method according to any of claims 1-6.