CN111914809A - Target object positioning method, image processing method, device and computer equipment

Info

Publication number
CN111914809A
Authority
CN
China
Prior art keywords
image
target object
size
sample
current image
Prior art date
Legal status
Pending
Application number
CN202010836717.5A
Other languages
Chinese (zh)
Inventor
谷月阳
彭瑾龙
王亚彪
汪铖杰
李季檩
黄飞跃
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010836717.5A
Publication of CN111914809A
Legal status: Pending

Classifications

    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24133 Classification techniques based on distances to prototypes
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Neural network learning methods
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

The application relates to a target object positioning method, an image processing method, an apparatus, and computer equipment. The method comprises the following steps: acquiring a current image and a reference image, the reference image being a frame that precedes the current image in the image frame sequence and contains the target object; determining cross-correlation features between the current image and the reference image; determining the center point position of the target object in the current image according to the cross-correlation features; determining the size of the target object in the current image according to the cross-correlation features; and locating the target object in the current image based on the center point position and the size. By adopting the method, the efficiency of target object positioning can be improved.

Description

Target object positioning method, image processing method, device and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a target object positioning method, an image processing method, an apparatus, and a computer device.
Background
With the rapid development of computer technology and image processing technology, more and more fields need to locate a target object in an image during image processing, so that subsequent applications can proceed from the located target object. For example, target tracking, a typical such application, is of great importance in fields such as security monitoring, smart cities, and autonomous driving.
In conventional target object positioning methods, prior boxes are usually introduced: a plurality of prior boxes are predicted for the target, and one of them is selected as the final positioning result. This leads to an excessive amount of computation, which reduces the efficiency of target object positioning.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a target object positioning method, an image processing method, an apparatus, and a computer device capable of improving positioning efficiency.
A method of locating a target object, the method comprising:
acquiring a current image and a reference image; the reference image is a frame that precedes the current image in the image frame sequence where the current image is located and that contains the target object;
determining cross-correlation characteristics between the current image and the reference image;
determining the position of the central point of a target object in the current image according to the cross-correlation characteristics;
determining the size of a target object in the current image according to the cross-correlation characteristics;
based on the center point position and size, the target object is located in the current image.
A target object locating apparatus, the apparatus comprising:
the acquisition module is used for acquiring a current image and a reference image; the reference image is a frame that precedes the current image in the image frame sequence where the current image is located and that contains the target object;
the determining module is used for determining the cross-correlation characteristics between the current image and the reference image;
the determining module is further used for determining the position of the central point of the target object in the current image according to the cross-correlation characteristics;
the determining module is further used for determining the size of the target object in the current image according to the cross-correlation characteristics;
and the positioning module is used for positioning the target object in the current image based on the position and the size of the central point.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a current image and a reference image; the reference image is a frame that precedes the current image in the image frame sequence where the current image is located and that contains the target object;
determining cross-correlation characteristics between the current image and the reference image;
determining the position of the central point of a target object in the current image according to the cross-correlation characteristics;
determining the size of a target object in the current image according to the cross-correlation characteristics;
based on the center point position and size, the target object is located in the current image.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a current image and a reference image; the reference image is a frame that precedes the current image in the image frame sequence where the current image is located and that contains the target object;
determining cross-correlation characteristics between the current image and the reference image;
determining the position of the central point of a target object in the current image according to the cross-correlation characteristics;
determining the size of a target object in the current image according to the cross-correlation characteristics;
based on the center point position and size, the target object is located in the current image.
With the above target object positioning method and apparatus, computer device, and computer-readable storage medium, the current image and the reference image are acquired; the cross-correlation features between the current image and the reference image are determined first; the center point position of the target object in the current image and the size of the target object in the current image are then determined according to the cross-correlation features; and the target object is located in the current image based on the center point position and the size. In this way, only one center point position and one size are output in the current image to locate the region where the target object is located. This avoids the cumbersome procedure of conventional target object positioning methods, which output a plurality of candidate boxes and then select a target box from them; the amount of computation is reduced and the target object positioning speed is improved.
A method of image processing, the method comprising:
acquiring an image sample pair, a position prediction network, and a size prediction network; the image sample pair comprises a target sample and a reference sample; the target sample and the reference sample contain the same sample object; the training label of the target sample represents the labeled region of the sample object in the target sample;
determining cross-correlation features between the target sample and the reference sample;
obtaining the position of a predicted central point of the sample object in the target sample according to the cross-correlation characteristics through a position prediction network;
obtaining the predicted size of the sample object in the target sample according to the cross-correlation characteristics through a size prediction network;
locating a prediction region of the sample object in the target sample based on the prediction center point position and the prediction size;
training the position prediction network and the size prediction network in the direction of minimizing the difference between the prediction region and the labeled region; the position prediction network and the size prediction network obtained by training are jointly used for finding the position of the target object in a target image.
An image processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring an image sample pair, a position prediction network, and a size prediction network; the image sample pair comprises a target sample and a reference sample; the target sample and the reference sample contain the same sample object; the training label of the target sample represents the labeled region of the sample object in the target sample;
a determining module for determining cross-correlation characteristics between the target sample and the reference sample;
the prediction module is used for obtaining the position of a predicted central point of the sample object in the target sample according to the cross-correlation characteristics through a position prediction network;
the prediction module is also used for obtaining the predicted size of the sample object in the target sample according to the cross-correlation characteristics through a size prediction network;
a positioning module for positioning a prediction region of the sample object in the target sample based on the prediction center point position and the prediction size;
the training module is used for training the position prediction network and the size prediction network in the direction of minimizing the difference between the prediction region and the labeled region; the position prediction network and the size prediction network obtained by training are jointly used for finding the position of the target object in a target image.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an image sample pair, a position prediction network, and a size prediction network; the image sample pair comprises a target sample and a reference sample; the target sample and the reference sample contain the same sample object; the training label of the target sample represents the labeled region of the sample object in the target sample;
determining cross-correlation features between the target sample and the reference sample;
obtaining the position of a predicted central point of the sample object in the target sample according to the cross-correlation characteristics through a position prediction network;
obtaining the predicted size of the sample object in the target sample according to the cross-correlation characteristics through a size prediction network;
locating a prediction region of the sample object in the target sample based on the prediction center point position and the prediction size;
training the position prediction network and the size prediction network in the direction of minimizing the difference between the prediction region and the labeled region; the position prediction network and the size prediction network obtained by training are jointly used for finding the position of the target object in a target image.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an image sample pair, a position prediction network, and a size prediction network; the image sample pair comprises a target sample and a reference sample; the target sample and the reference sample contain the same sample object; the training label of the target sample represents the labeled region of the sample object in the target sample;
determining cross-correlation features between the target sample and the reference sample;
obtaining the position of a predicted central point of the sample object in the target sample according to the cross-correlation characteristics through a position prediction network;
obtaining the predicted size of the sample object in the target sample according to the cross-correlation characteristics through a size prediction network;
locating a prediction region of the sample object in the target sample based on the prediction center point position and the prediction size;
training the position prediction network and the size prediction network in the direction of minimizing the difference between the prediction region and the labeled region; the position prediction network and the size prediction network obtained by training are jointly used for finding the position of the target object in a target image.
With the above image processing method and apparatus, computer device, and computer-readable storage medium, the cross-correlation features between the target sample and the reference sample are obtained; the predicted center point position of the sample object in the target sample is obtained from the cross-correlation features through the position prediction network, and the predicted size of the sample object in the target sample is obtained from the cross-correlation features through the size prediction network; the prediction region of the sample object is then located in the target sample based on the predicted center point position and the predicted size; and the position prediction network and the size prediction network are trained in the direction of minimizing the difference between the prediction region and the labeled region. As a result, when the trained position prediction network and size prediction network locate the target object in a target image, only one center point position and one size are output in the target image to locate the region where the target object is located. This avoids the cumbersome procedure of conventional target object positioning methods, which output a plurality of candidate boxes and then select a target box from them; the model's amount of computation is reduced and the target object positioning speed is improved.
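To make the training objective concrete, the following is a minimal sketch of one training step, assuming a PyTorch implementation (the patent names no framework). The L1 loss on the box parameters and the direct regression interface of the two hypothetical networks are illustrative assumptions; the text above only requires training in the direction of minimizing the difference between the prediction region and the labeled region.

```python
import torch
import torch.nn.functional as F

def train_step(position_net, size_net, xcorr_features, label_box, optimizer):
    """One illustrative update: predict a region from the cross-correlation
    features of a (target sample, reference sample) pair, then minimize the
    difference between the prediction region and the labeled region.

    xcorr_features: cross-correlation features of the sample pair.
    label_box: tensor (cx, cy, w, h) describing the labeled region.
    The L1 loss is an assumed instantiation of "minimizing the difference".
    """
    pred_center = position_net(xcorr_features)  # predicted (cx, cy)
    pred_size = size_net(xcorr_features)        # predicted (w, h)
    pred_box = torch.cat([pred_center, pred_size], dim=-1)

    loss = F.l1_loss(pred_box, label_box)       # difference between regions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```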
Drawings
FIG. 1 is a diagram of an exemplary implementation of a target object location method;
FIG. 2 is a flowchart illustrating a method for locating a target object according to one embodiment;
FIG. 3 is a diagram of a reference image in one embodiment;
FIG. 4 is a diagram of a current image in one embodiment;
FIG. 5 is a schematic illustration of center points and dimensions in one embodiment;
FIG. 6 is a schematic diagram of an interface for target object location in one embodiment;
FIG. 7 is a diagram illustrating the acquisition of a center point in one embodiment;
FIG. 8 is a diagram illustrating the structure of a target object location model in one embodiment;
FIG. 9 is a diagram showing the structure of a target object localization model in another embodiment;
FIG. 10 is a diagram showing the structure of a target object localization model in yet another embodiment;
FIG. 11 is a schematic diagram of an interface for target object positioning in another embodiment;
FIG. 12 is a flowchart illustrating a target object locating method according to another embodiment;
FIG. 13 is a flowchart illustrating an image processing method according to an embodiment;
FIG. 14 is a block diagram of a target object locating device in one embodiment;
FIG. 15 is a block diagram showing the configuration of an image processing apparatus according to an embodiment;
FIG. 16 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The solutions provided in the embodiments of the present application relate to artificial intelligence technologies such as machine learning, and are specifically described through the following embodiments:
the target object positioning method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 acquires a current image and a reference image, and sends the current image and the reference image to the server 104, the server 104 firstly determines a cross-correlation characteristic between the current image and the reference image, then determines a center point position of a target object in the current image according to the cross-correlation characteristic, determines a size of the target object in the current image according to the cross-correlation characteristic, and finally positions the target object in the current image based on the center point position and the size.
The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device, and the server 104 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud storage, network services, cloud communication, big data, and artificial intelligence platforms. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
The image processing method provided by this application can also be applied to the application environment shown in fig. 1. The terminal 102 obtains an image sample pair, a position prediction network, and a size prediction network, and uploads them to the server 104; the image sample pair comprises a target sample and a reference sample that contain the same sample object, and the training label of the target sample represents the labeled region of the sample object in the target sample. The server 104 first determines the cross-correlation features between the target sample and the reference sample; it then inputs the cross-correlation features into the position prediction network to obtain the predicted center point position of the sample object in the target sample, and inputs them into the size prediction network to obtain the predicted size of the sample object in the target sample; finally, it trains the position prediction network and the size prediction network based on the predicted center point position, the predicted size, and the training label. The position prediction network and the size prediction network obtained by training are used to predict the center point position and the size, respectively, of the target object in a target image, so as to locate the target region of the target object in the target image.
In an embodiment, as shown in fig. 2, a target object positioning method is provided, and this embodiment is mainly illustrated by applying the method to the computer device (terminal 102 or server 104) in fig. 1, and includes the following steps:
step 202, acquiring a current image and a reference image; the reference image is an image of a target object included in one frame before the current image in the image frame sequence in which the current image is located.
The current image and the reference image are both frames in an image frame sequence. An image frame sequence is a set of images arranged in chronological order; it may specifically be a video frame sequence in a video, or a plurality of frames continuously captured by an image acquisition device. The current image and the reference image contain the same target object; the target object has already been located in the reference image, whereas it is yet to be located in the current image. The target object is an object that can be tracked in the image frame sequence, and may specifically be an independent living being or object, such as a natural person, an animal, a vehicle, or a virtual character, or a specific part thereof, such as a face or a hand.
It can be understood that, since the current image and the reference image come from the same image frame sequence, and the reference image is an image frame processed before the current image, the target object to be tracked can be specified through the reference image and then located in the image frames following it, thereby realizing tracking of the target object across the image frame sequence.
Specifically, the reference image may be the first frame of the image frame sequence or an intermediate frame. When the reference image is the first frame, the target object can be located based on a user operation, so that subsequent image frames can obtain the target object to be tracked from the reference image. When the reference image is an intermediate frame, it may be the frame immediately preceding the current image, so that the current image can obtain the target object to be tracked from the target object located in that preceding frame.
In a specific embodiment, the reference image may be a complete frame of the image frame sequence, or a subject region cropped from that frame based on the position of the target object. Taking the position of the target object as the center, the area within a designated range is used as the subject region, so that subsequent image frames can quickly obtain the target object to be tracked. For example, in fig. 3, both (a) and (b) may serve as reference images, where (b) is obtained by cropping the subject region 302 out of (a).
In a specific embodiment, the current image may be a complete frame of the image frame sequence, or a region of interest selected according to the tracking result of the previous image. Considering that the position of the target object changes little between adjacent frames, the current image is cropped around the position of the target object in the previous frame according to a designated search range, yielding a region of interest that narrows the search range in the current image. For example, in fig. 4, both (a) and (b) may serve as the current image, where (b) is the region of interest 402 selected in (a).
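As a concrete illustration of the two cropping strategies above, the sketch below crops a square region around the previously located target position. It is a hypothetical helper in Python; the context factor of twice the larger target dimension is an assumption, since the text only speaks of a "designated range" or "designated search range".

```python
import numpy as np

def crop_around_target(frame: np.ndarray, center_xy, size_wh, context: float = 2.0):
    """Crop a square subject region / region of interest centered on the
    previously located target position.

    frame: H x W x C image array.
    center_xy: (cx, cy) of the target in the previous frame.
    size_wh: (w, h) of the target in the previous frame.
    context: how much surrounding context to keep (assumed value).
    """
    h, w = frame.shape[:2]
    cx, cy = center_xy
    side = int(round(context * max(size_wh)))   # square crop side length
    x1 = max(0, int(cx) - side // 2)
    y1 = max(0, int(cy) - side // 2)
    x2 = min(w, x1 + side)
    y2 = min(h, y1 + side)
    return frame[y1:y2, x1:x2]
```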
Specifically, a target object positioning application runs on the computer device, the computer device can start the target object positioning application according to user operation, and the target object positioning application acquires an image frame sequence and extracts a reference image and a current image from the image frame sequence.
Step 204, cross-correlation features between the current image and the reference image are determined.
Wherein the cross-correlation features can be used to characterize the degree of similarity between the two frame images.
In the present application, the target tracking problem is converted into a similarity comparison problem. After the reference image is obtained, each subsequent image frame is compared, in terms of similarity or difference, with the first region where the target object is located in the reference image, so as to find in the subsequent frame a second region whose similarity or difference with the first region satisfies the screening condition; this second region is the region where the target object is located in the subsequent frame.
In one embodiment, step 204 includes: acquiring a first image characteristic of a current image and a second image characteristic of a reference image; and performing correlation operation on the first image characteristic and the second image characteristic to obtain a cross-correlation characteristic between the first image characteristic and the second image characteristic.
Wherein the cross-correlation feature between the current image and the reference image may be a cross-correlation feature between a first image feature of the current image and a second image feature of the reference image, the cross-correlation feature may be quantized by performing a correlation operation on the first image feature and the second image feature. In this case, the cross-correlation feature may also be used to characterize the degree of similarity between the first image feature and the second image feature.
Note that the second image feature may be an image feature extracted from the first region. When the reference image is a subject region, the second image feature is extracted from that region directly; when the reference image is a complete frame, the subject region may first be cropped out based on the position of the target object, and the second image feature then extracted from it.
In particular, the image features may include texture features, color features, gradient features, spatial relationship features, and the like. Texture features describe the surface properties of objects in an image. Color features describe the color of each object in the image. Gradient features describe the shape and structure of objects in the image. Spatial relationship features refer to the mutual spatial positions or relative directional relationships among multiple targets segmented from the image; these relationships can further be divided into connection/adjacency relationships, overlap relationships, inclusion/containment relationships, and the like.
Specifically, after acquiring the reference image and the current image, the computer device obtains the first image feature of the current image and the second image feature of the reference image respectively, and performs a correlation operation on the first image feature and the second image feature to obtain the cross-correlation feature. The cross-correlation feature can be calculated by the following formula:
R = φ(Z) ⋆ φ(X)
where R is the cross-correlation feature, φ(X) is the first image feature, φ(Z) is the second image feature, and ⋆ denotes the correlation operation.
In this embodiment, a similarity measurement is introduced by drawing on the correlation operation from the field of signal processing, so that the similarity between two image features can be measured effectively and conveniently.
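A minimal sketch of this correlation operation, assuming PyTorch: the second image feature φ(Z) is used as a convolution kernel slid over the first image feature φ(X). A depthwise (per-channel) variant is shown here; the patent does not specify whether the response R is single-channel or multi-channel, so this layout is an assumption.

```python
import torch
import torch.nn.functional as F

def cross_correlation(feat_x: torch.Tensor, feat_z: torch.Tensor) -> torch.Tensor:
    """Correlate phi(Z) (reference feature, used as the kernel) with phi(X)
    (current-image feature) to obtain the cross-correlation feature R.

    feat_x: (B, C, Hx, Wx) first image feature map.
    feat_z: (B, C, Hz, Wz) second image feature map.
    Returns R of shape (B, C, Hx - Hz + 1, Wx - Wz + 1).
    """
    b, c, hz, wz = feat_z.shape
    # Fold the batch into channels so each sample is correlated with its
    # own reference kernel via a grouped convolution.
    x = feat_x.reshape(1, b * c, feat_x.size(2), feat_x.size(3))
    kernel = feat_z.reshape(b * c, 1, hz, wz)
    r = F.conv2d(x, kernel, groups=b * c)  # conv2d computes correlation (no kernel flip)
    return r.reshape(b, c, r.size(2), r.size(3))
```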
Step 206, determining the position of the center point of the target object in the current image according to the cross-correlation features.
The center point position is the position where the center point of the target object is located. The center point of the target object may be the center point of the smallest bounding box enclosing the target object; the bounding box may be a polygon, a circle, or the like, and its center point may be, for example, the intersection of a polygon's diagonals or the center of a circle. For example, referring to fig. 5 and taking a virtual object as an example, the center point may be the diagonal intersection 502 of the smallest quadrangle enclosing the virtual object.
Specifically, the computer device determines the position of the center point of the target object in the current image through the cross-correlation features.
Step 208, determining the size of the target object in the current image according to the cross-correlation features.
The size of the target object may be the size of the smallest bounding box enclosing the target object, such as the width and height of a quadrangle or the diameter of a circle. With continued reference to FIG. 5, the size may be the width 504 and height 506 of the smallest quadrangle enclosing the virtual object.
Specifically, the computer device determines the size of the target object in the current image based on the cross-correlation features.
Step 210, a target object is positioned in the current image based on the center point position and size.
Specifically, the computer device determines a bounding box constructed by the position and the size of the central point in the current image, and the area in the bounding box is the area where the target object is located.
For example, referring to FIG. 6, it can be seen that a center point location 602 and a size (width 604, height 606) are output in the current image to locate the region where the target object is located.
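The conversion from the single (center point, size) output to the bounding box region is plain arithmetic; a short hedged sketch:

```python
def box_from_center(cx: float, cy: float, width: float, height: float):
    """Turn the one predicted center point and size into the corner
    coordinates of the region where the target object is located."""
    return (cx - width / 2.0, cy - height / 2.0,
            cx + width / 2.0, cy + height / 2.0)
```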
With the above target object positioning method, the current image and the reference image are acquired; the cross-correlation features between them are determined first; the center point position and the size of the target object in the current image are then determined according to the cross-correlation features; and the target object is located in the current image based on the center point position and the size. In this way, only one center point position and one size are output in the current image to locate the region where the target object is located, avoiding the cumbersome procedure of conventional target object positioning methods that output a plurality of candidate boxes and then select a target box from them; the model's amount of computation is reduced and the positioning speed is improved.
In one embodiment, determining the position of the center point of the target object in the current image according to the cross-correlation features comprises: generating a pixel point classification graph and a pixel point position distribution graph according to the cross-correlation characteristics; the pixel points in the pixel point classification graph have pixel values representing the classification categories of the pixel points and correspond to the pixel points in the current image; pixel points belonging to the target category in the pixel point position distribution diagram have pixel values representing the distance between the pixel points and the center point of the target object, and correspond to the pixel points in the current image; and selecting the central point of the target object based on the pixel point classification graph and the pixel point position distribution graph to obtain the central point position of the target object in the current image.
Wherein the cross-correlation feature may be a cross-correlation feature map.
The pixel point classification graph is used for describing classification categories to which all the pixel points in the cross-correlation characteristic graph belong. The classification category may include a target category and a non-target category, and when a pixel belongs to the target category, it indicates that the pixel is a pixel constituting the target object, and when a pixel belongs to the non-target category, it indicates that the pixel is not a pixel constituting the target object. The pixels in the pixel classification map have pixel values representing the classification categories to which the pixels belong, the pixel value of the target category may be a non-zero value, for example, 1, and the pixel value of the non-target category may be 0.
The pixel point position distribution map describes how the pixel points of the target category in the cross-correlation feature map are distributed around the center point of the target object. Pixel points of the target category in this map have pixel values characterizing their distance from the center point of the target object; the pixel value is inversely related to that distance. For example, the farther a pixel point is from the center point of the target object, the smaller its pixel value.
Specifically, the computer device generates a pixel point classification graph and a pixel point position distribution graph according to the cross-correlation characteristic graph, selects a central point of the target object based on the pixel point classification graph and the pixel point position distribution graph, and outputs the central point to the current image to obtain the position of the central point of the target object in the current image.
In one embodiment, selecting the center point of the target object based on the pixel point classification map and the pixel point position distribution map to obtain the center point position of the target object in the current image includes: multiplying the pixel point classification map and the pixel point position distribution map element-wise to obtain a prediction result map; selecting, from the prediction result map, the pixel point with the largest pixel value as the center point of the target object; and determining the target position to which the center point maps in the current image, thereby obtaining the center point position of the target object in the current image.
The pixel point classification map and the pixel point position distribution map are both obtained from the cross-correlation feature map, so their resolutions are the same. Element-wise multiplication refers to multiplying the pixel values at the same position in the pixel point classification map and the pixel point position distribution map.
For example, referring to fig. 7, fig. 7 is a schematic diagram of acquiring the center point in one embodiment. The pixel point classification map and the pixel point position distribution map are multiplied element-wise to obtain the prediction result map. In the prediction result map, the pixel value of pixel points of the non-target category is 0, and pixel points closer to the center point have larger pixel values, so the pixel point with the largest pixel value is the center point of the target object.
In this embodiment, the pixel point classification map and the pixel point position distribution map are generated according to the cross-correlation feature map, and the position of the central point of the target object in the current image is determined based on the pixel point classification map and the pixel point position distribution map, so that the accuracy of predicting the position of the central point is improved.
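A sketch of this selection step, assuming PyTorch: the two maps are multiplied element-wise, and the argmax gives the center point in feature-map coordinates.

```python
import torch

def select_center(cls_map: torch.Tensor, dist_map: torch.Tensor):
    """Element-wise product of the pixel point classification map and
    the pixel point position distribution map, then argmax.

    cls_map, dist_map: (H, W) tensors of identical resolution.
    Returns (row, col) of the center point in feature-map coordinates.
    """
    score = cls_map * dist_map        # prediction result map
    idx = int(torch.argmax(score))    # flat index of the largest pixel value
    return idx // score.size(1), idx % score.size(1)
```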
In one embodiment, the cross-correlation feature is a cross-correlation feature map whose resolution differs from that of the current image. Determining the size of the target object in the current image according to the cross-correlation features then includes: predicting, according to the cross-correlation feature map, the size of the target object in the current image and the resolution adaptation offset of the center point position of the target object in the current image. Locating the target object in the current image based on the center point position and the size includes: offsetting the center point position according to the resolution adaptation offset; and locating, based on the offset center point position and the size, the target region where the target object is located in the current image.
Specifically, when determining the center point position, the computer device first determines the center point position of the target object in the cross-correlation feature map, and then maps it into the current image to obtain the center point position of the target object in the current image. However, the center point position obtained in this way may deviate from the true position. This is because the resolution of the cross-correlation feature map differs from that of the current image; when the center point position in the cross-correlation feature map is mapped into the current image, non-integer coordinates may be rounded due to the up-sampling or down-sampling involved, introducing a certain error.
For this reason, the resolution adaptation offset of the center point position of the target object in the current image is predicted from the cross-correlation feature map, so that the center point position predicted from the cross-correlation feature map can be corrected by this offset. The resolution adaptation offset is compensation data for the pixel position shift caused by the change in image resolution.
In the embodiment, the position of the central point is corrected according to the resolution adaptation offset, and the accuracy of predicting the position of the central point of the target object is improved.
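A sketch of the offset-corrected mapping from feature-map coordinates back to image coordinates; the total stride value of 8 and the (dx, dy) channel layout of the offset map are assumptions tied to the backbone configuration, not values given by the patent.

```python
import torch

def to_image_coords(row: int, col: int, offset_map: torch.Tensor, stride: int = 8):
    """Map the center point from feature-map to image coordinates and
    correct it with the predicted resolution adaptation offset.

    offset_map: (2, H, W) tensor of per-pixel (dx, dy) offsets.
    stride: total downsampling factor of the feature extractor (assumed).
    """
    dx = float(offset_map[0, row, col])
    dy = float(offset_map[1, row, col])
    return col * stride + dx, row * stride + dy
```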
In one embodiment, predicting the size of the target object in the current image and the resolution adaptation offset of the center point position of the target object in the current image according to the cross-correlation features includes: predicting, for each pixel point in the cross-correlation feature map, a corresponding size of the target object in the current image; selecting, from these sizes, the size predicted at the pixel point of the cross-correlation feature map to which the center point position maps, thereby obtaining the size of the target object in the current image; and predicting the resolution adaptation offset of the center point position of the target object in the current image based on the resolution of the cross-correlation feature map and the resolution of the current image.
When determining the size of the target object, the computer device can predict, with each pixel point in the cross-correlation feature map taken as a center point, a corresponding size of the target object, and then, in combination with the center point position, screen out the size corresponding to the center point position from the per-pixel sizes.
Specifically, the computer device first obtains a pixel point classification graph and a pixel point position distribution graph according to the cross-correlation characteristic graph, further obtains a central point position of the target object based on the pixel point classification graph and the pixel point position distribution graph, screens out a size corresponding to the central point position from sizes of the target object corresponding to each pixel point of the cross-correlation characteristic graph by using the central point position, and outputs the screened size to the current image to obtain the size of the target object in the current image.
In this embodiment, the size corresponding to the central point position is screened out according to the central point position, so that the accuracy of size prediction of the target object is improved.
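A matching sketch of the size screening step: the per-pixel size predictions are indexed at the chosen center point. The (width, height) channel layout is an assumption.

```python
import torch

def select_size(size_map: torch.Tensor, row: int, col: int):
    """Read the size predicted at the chosen center point from the
    per-pixel size map.

    size_map: (2, H, W) tensor; channel 0 = width, channel 1 = height
    (the channel order is assumed, not specified in the text).
    """
    return float(size_map[0, row, col]), float(size_map[1, row, col])
```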
The image feature extraction, central point position prediction and size prediction processes involved in the scheme provided by the embodiment of the application can be realized by a neural network based on machine learning, and are specifically described by the following embodiments.
In one embodiment, acquiring a first image feature of a current image and a second image feature of a reference image comprises: acquiring two feature extraction networks which have the same model structure and share model parameters; respectively inputting the current image and the reference image into one of the two feature extraction networks; and outputting the first image characteristic of the current image and the second image characteristic of the reference image in parallel and respectively through the two characteristic extraction networks.
The feature extraction network is a machine learning model trained to have image feature extraction capability. The image features extracted by general-purpose machine learning models with image feature extraction capability, such as the ResNet, ResNet-50, and Inception models, meet the requirements that the target object positioning method provided in this application places on image features, so such a general-purpose model can be used as the feature extraction network of the target object positioning method provided in this application.
In a specific embodiment, the first image feature may be a first image feature map and the second image feature may be a second image feature map. The resolution of the first image feature map is different from that of the current image, and a corresponding relationship exists between pixel points in the first image feature map and pixel points in the current image, and the corresponding relationship is related to feature extraction parameters (such as convolution kernel size, step length and the like) used by the feature extraction network. Similarly, the resolution of the second image feature map is different from the resolution of the reference image, and a corresponding relationship exists between pixel points in the second image feature map and pixel points in the reference image, and the corresponding relationship is also related to the feature extraction parameters used by the feature extraction network.
In a specific embodiment, the resolution of the first image feature map and the second image feature map may be increased in consideration of the accuracy of the subsequent target object positioning, such as the accuracy of locating the center point position of the target object. Taking the ResNet-50 model as the feature extraction network as an example, conv_layer4 in the ResNet-50 model can be removed and the convolution stride of conv_layer2 set to 1, so as to increase the resolution of the first and second image feature maps obtained through the feature extraction network.
Specifically, two feature extraction networks are provided, and the two feature extraction networks are completely consistent in model structure and model parameters. The computer device inputs the current image and the reference image into the two feature extraction networks respectively, so that the two feature extraction networks respectively extract a first image feature of the current image and a second image feature of the reference image.
For example, referring to fig. 8, fig. 8 is a schematic structural diagram of a target object location model in an embodiment. It can be seen that the current image and the reference image are input into the two feature extraction networks, respectively, one of which outputs the first image feature and the other of which outputs the second image feature.
It is to be understood that if the reference image is a first image of a sequence of image frames, during the process of performing target tracking on the sequence of image frames, the feature extraction network may extract a second image feature of the reference image once, and then locate the target object in a subsequent image frame based on the extracted second image feature.
In this embodiment, the two feature extraction networks extract image features in parallel, which improves the efficiency of image feature extraction and thereby speeds up target object positioning.
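A sketch of the twin arrangement in PyTorch: because the two branches share structure and parameters, a single backbone module applied to both inputs is equivalent to two parameter-sharing networks. The torchvision ResNet-50 truncation below only approximates the modification described above (the stride change in conv_layer2 is omitted), so treat the exact layers as assumptions.

```python
import torch.nn as nn
import torchvision

class SiameseExtractor(nn.Module):
    """Two feature extraction branches with identical structure and
    shared model parameters, realized as one shared backbone."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Keep only the early stages, analogous to removing conv_layer4
        # to obtain higher-resolution feature maps.
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )

    def forward(self, current_img, reference_img):
        # The same module processes both images, so the "two networks"
        # share every parameter by construction.
        return self.backbone(current_img), self.backbone(reference_img)
```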
With continued reference to fig. 8, it can be seen that, after the first image feature and the second image feature are obtained through the two feature extraction networks, the first image feature and the second image feature are subjected to correlation operation to obtain the cross-correlation feature.
Wherein the cross-correlation feature may be a cross-correlation feature map. The cross-correlation feature map is used to characterize the degree of similarity between the first image feature map and the second image feature map. The pixel points in the cross-correlation characteristic graph have pixel values representing similarity or difference.
Because the first image feature map and the second image feature map are extracted by the twin feature extraction networks, their resolutions are the same. The similarity or difference between the first image feature map and the second image feature map is computed position by position, so that the pixel points in the cross-correlation feature map have pixel values characterizing similarity or difference.
The resolution of the cross-correlation feature map may be the same as that of the first and second image feature maps, which differs from the resolution of the current image; and since a correspondence exists between pixel points in the first image feature map and pixel points in the current image, the cross-correlation feature map likewise differs in resolution from the current image while its pixel points correspond to pixel points in the current image.
In one embodiment, determining the position of the center point of the target object in the current image according to the cross-correlation features comprises: inputting the cross-correlation characteristics into a position prediction network, and obtaining a pixel point classification graph and a pixel point position distribution graph through the position prediction network; the pixel points in the pixel point classification graph have pixel values representing the classification categories of the pixel points and correspond to the pixel points in the current image; pixel points belonging to the target category in the pixel point position distribution diagram have pixel values representing the distance between the pixel points and the center point of the target object, and correspond to the pixel points in the current image; processing the pixel point classification graph and the pixel point position distribution graph through a position prediction network to obtain the position of a central point of a target object in the current image; determining the size of the target object in the current image according to the cross-correlation characteristics, comprising: and inputting the cross-correlation characteristics into a size prediction network, and outputting the size of the target object in the current image through the size prediction network.
Wherein the location prediction network is a machine learning model with the ability to predict the location of the center point of the target object through sample learning. The size prediction network is a machine learning model with the ability to predict the size of a target object through sample learning.
Specifically, the resolution of the pixel point classification map is the same as that of the cross-correlation feature map, the resolution of the cross-correlation feature map is different from that of the current image, and a corresponding relationship exists between a pixel point in the cross-correlation feature map and a pixel point in the current image, that is, the resolution of the pixel point classification map is different from that of the current image, and a corresponding relationship exists between a pixel point in the pixel point classification map and a pixel point in the current image.
Specifically, the resolution of the pixel position distribution map is the same as that of the cross-correlation characteristic map, the resolution of the cross-correlation characteristic map is different from that of the current image, and a corresponding relationship exists between a pixel point in the cross-correlation characteristic map and a pixel point in the current image, that is, the resolution of the pixel position distribution map is different from that of the current image, and a corresponding relationship exists between a pixel point in the pixel position distribution map and a pixel point in the current image.
Specifically, the computer device respectively inputs the cross-correlation characteristics between the current image and the reference image into a position prediction network and a size prediction network, performs deep learning operation through the position prediction network to obtain the position of the central point of the target object in the current image, and performs deep learning operation through the size prediction network to obtain the size of the target object in the current image. The deep learning operation includes a deep operation of a neural network such as a convolution operation.
Wherein the location prediction network comprises two branches: a category predicted branch and a location predicted branch. The category prediction branch is used for obtaining a pixel point classification graph according to cross-correlation characteristic prediction, and the position prediction branch is used for obtaining a position distribution graph of pixel points of a target category according to the cross-correlation characteristic prediction. The position prediction network determines the central point position of the target object based on the pixel point classification graph and the pixel point position distribution graph.
In a specific embodiment, the position prediction network multiplies the pixel point classification graph and the pixel point position distribution graph position-wise to obtain a prediction result graph, selects the pixel point with the maximum pixel value from the prediction result graph as the center point of the target object, and maps the center point position to the current image to obtain the center point position of the target object in the current image.
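As an illustrative sketch (not part of the original disclosure), the position-wise multiplication and center point selection described above can be expressed in Python as follows, assuming the two graphs are given as 2-D arrays of equal resolution and that the downsampling stride between the cross-correlation feature map and the current image is known:

    import numpy as np

    def select_center(cls_map, loc_map, stride=8):
        """Multiply the pixel point classification graph and the pixel point
        position distribution graph position-wise, then take the pixel with
        the maximum response as the center point. `stride` is an assumed
        downsampling factor used to map the feature-map coordinate back to
        current-image coordinates."""
        score = cls_map * loc_map                        # prediction result graph
        y, x = np.unravel_index(np.argmax(score), score.shape)
        return x * stride, y * stride                    # center point in the current image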
In a specific embodiment, the size of the target object is predicted through a size prediction network, and the size is output to the current image to obtain the size of the target object in the current image.
It will be appreciated that in the current image, a center point position and a size are ultimately output to locate the region where the target object is located.
For example, referring to fig. 9, fig. 9 is a schematic structural diagram of a target object location model in an embodiment. The cross-correlation characteristics are input into the position prediction network and the size prediction network respectively, the pixel point classification graph and the pixel point position distribution graph are obtained through the position prediction network, the center point of the target object is further obtained based on the pixel point classification graph and the pixel point position distribution graph, and the center point position is mapped to the current image to obtain the center point position of the target object in the current image. And obtaining the size of the target object through a size prediction network, and outputting the size to the current image to obtain the size of the target object in the current image. And finally, positioning the area where the target object is located by combining the position and the size of the central point of the target object in the current image.
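For orientation only, the structure of fig. 9 can be approximated by the following PyTorch-style sketch; the Siamese backbone, the depthwise cross-correlation and the two prediction heads are assumptions consistent with the description, not the exact patented architecture:

    import torch.nn as nn
    import torch.nn.functional as F

    class TargetLocator(nn.Module):
        def __init__(self, backbone, pos_head, size_head):
            super().__init__()
            self.backbone = backbone    # shared feature extraction network
            self.pos_head = pos_head    # position prediction network
            self.size_head = size_head  # size prediction network

        def forward(self, current, reference):
            f_cur = self.backbone(current)      # first image feature
            f_ref = self.backbone(reference)    # second image feature
            corr = self.xcorr(f_cur, f_ref)     # cross-correlation feature map
            cls_map, loc_map = self.pos_head(corr)  # classification / distribution graphs
            size_map = self.size_head(corr)         # per-pixel size prediction
            return cls_map, loc_map, size_map

        @staticmethod
        def xcorr(x, kernel):
            # depthwise cross-correlation: the reference feature acts as the kernel
            b, c, h, w = kernel.shape
            out = F.conv2d(x.reshape(1, b * c, *x.shape[2:]),
                           kernel.reshape(b * c, 1, h, w), groups=b * c)
            return out.reshape(b, c, *out.shape[2:])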
In this embodiment, the center point position and the size of the target object in the current image are determined through the position prediction network and the size prediction network respectively, so that the area where the target object is located is positioned in the current image; this avoids the complicated operation, in the traditional target object positioning method, of outputting a plurality of candidate frames and then selecting the target frame from the candidate frames, which reduces the model calculation amount and improves the target object positioning speed.
In one embodiment, the method further comprises: processing the pixel point classification graph and the pixel point position distribution graph through a position prediction network to obtain the position of a central point of a target object; passing the location of the center point to a size prediction network; inputting the cross-correlation characteristics into a size prediction network, and outputting the size of the target object in the current image through the size prediction network, wherein the method comprises the following steps: inputting the cross-correlation characteristic diagram into a size prediction network, and predicting the size of each pixel point in the cross-correlation characteristic diagram corresponding to a target object through the size prediction network; and screening out the size corresponding to the position of the central point by combining the position of the central point through a size prediction network and outputting the screened size.
The size prediction network can predict the size of a target object corresponding to each pixel point serving as a central point in the cross-correlation characteristic graph, and the size corresponding to the central point position is obtained by screening from the size of the target object corresponding to each pixel point in combination with the central point position.
Specifically, the computer device respectively inputs the cross-correlation characteristic graph into a position prediction network and a size prediction network, a pixel point classification graph and a pixel point position distribution graph are obtained through the position prediction network, a central point position of a target object is further obtained based on the pixel point classification graph and the pixel point position distribution graph, the central point position is mapped to the current image to obtain a central point position of the target object in the current image, and the central point position is transmitted to the size prediction network; and obtaining the size of each pixel point in the cross-correlation characteristic graph corresponding to the target object through a size prediction network, screening the size corresponding to the central point position by combining the central point position, and outputting the screened size to the current image to obtain the size of the target object in the current image.
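A minimal sketch of the screening step, assuming the size map has shape (2, H, W) with separate width and height channels and that the center point is given in feature-map coordinates (both assumptions made here for illustration):

    def screen_size(size_map, center):
        """Pick the (width, height) predicted at the center point position.
        size_map: array of shape (2, H, W); channel 0 = width, channel 1 = height.
        center:   (x, y) coordinate of the center point on the feature map."""
        x, y = center
        return size_map[0, y, x], size_map[1, y, x]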
In this embodiment, the position of the central point of the target object is obtained through the position prediction network, and the position of the central point is transmitted to the size prediction network, so that the size prediction network screens out the size corresponding to the position of the central point, and the accuracy of size prediction of the target object is improved.
In one embodiment, the cross-correlation feature is a cross-correlation feature map, the resolution of the cross-correlation feature map being different from the resolution of the current image; the method further comprises the following steps: processing the pixel point classification graph and the pixel point position distribution graph through a position prediction network to obtain the position of a central point of a target object; passing the location of the center point to a size prediction network; inputting the cross-correlation characteristics into a size prediction network, and outputting the size of the target object in the current image through the size prediction network, wherein the method comprises the following steps: inputting the cross-correlation characteristic diagram into a size prediction network, predicting the size of each pixel point in the cross-correlation characteristic diagram corresponding to the target object through the size prediction network, and predicting the resolution adaptation offset of the central point position of the target object in the current image; screening out the size corresponding to the central point position through a size prediction network in combination with the central point position and outputting the screened size; the method further comprises the following steps: passing the resolution adaptation offset to a location prediction network; and offsetting the position of the central point according to the resolution adaptation offset through a position prediction network, and outputting the offset position of the central point.
Wherein the size prediction network may further be trained to predict a resolution adaptation offset.
Specifically, the size prediction network includes two branches: a size prediction branch and an offset prediction branch. The computer device inputs the cross-correlation feature map into the size prediction network; the size prediction branch performs a deep learning operation on the cross-correlation feature map to obtain the size of the target object corresponding to each pixel point in the cross-correlation feature map, and the offset prediction branch performs a deep learning operation on the cross-correlation feature map to obtain the resolution adaptation offset of the center point position in the current image.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a target object location model in an embodiment. It can be seen that the cross-correlation features are respectively input into the position prediction network and the size prediction network, and the size and the resolution adaptation offset of the target object are obtained through the size prediction network.
Specifically, the computer device processes the pixel point classification graph and the pixel point position distribution graph through the position prediction network to obtain the center point position of the target object, and maps the center point position to the current image to obtain the center point position of the target object in the current image. However, the center point position obtained in this manner may deviate. The pixel point classification graph and the pixel point position distribution graph are determined from the first image feature map and the second image feature map, whose resolutions differ from that of the current image, and an image feature map obtained through a feature extraction network usually has only integer coordinates; therefore, when the center point position of the target object is mapped to the current image, the mapped center point position may contain an error.
For example, assume that the coordinate of a pixel in the current image is (50, 0) and the feature extraction network down-samples by a factor of 8: the exact coordinate of the pixel in the first image feature map would be (6.25, 0), but since the first image feature map has only integer coordinates, the actual coordinate of the pixel in the first image feature map is (6, 0).
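The quantization error in this example corresponds to a fractional offset of 0.25, which is exactly what the resolution adaptation offset is meant to recover. A small numeric sketch (the stride of 8 is inferred from 50 / 6.25):

    stride = 8                              # inferred downsampling factor
    x_image = 50                            # pixel coordinate in the current image
    x_exact = x_image / stride              # 6.25, exact feature-map coordinate
    x_grid = int(x_exact)                   # 6, integer grid coordinate actually stored
    offset = x_exact - x_grid               # 0.25, the resolution adaptation offset
    x_back = (x_grid + offset) * stride     # 50.0, error-free mapping back to the image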
Based on the method, the resolution adaptation offset of the central point position of the target object in the current image is predicted through the size prediction network, the central point position obtained by processing the pixel point classification diagram and the pixel point position distribution diagram through the position prediction network is corrected according to the resolution adaptation offset, and the accuracy of central point position prediction of the target object is improved.
Compared with the traditional target object positioning method of outputting a plurality of candidate frames and then selecting the target frame from the candidate frames, this reduces the model calculation amount and improves the target object positioning efficiency.
In the embodiment, the resolution adaptation offset is determined through the size prediction network, and the position of the central point of the target object in the current image is corrected according to the resolution adaptation offset, so that the positioning accuracy of the target object is improved.
In one embodiment, the pixel point classification graph and the pixel point position distribution graph are processed through a position prediction network to obtain the position of a central point of a target object, and the position of the central point is mapped to a current image; inputting the cross-correlation characteristic diagram into a size prediction network, predicting the size of each pixel point in the cross-correlation characteristic diagram corresponding to a target object through the size prediction network, predicting the resolution adaptation offset of the central point position of the target object in the current image, and outputting the size of each pixel point corresponding to the target object and the resolution adaptation offset into the current image; offsetting the position of the central point according to the resolution adaptation offset to obtain the offset position of the central point; screening out the size corresponding to the central point position from the sizes of the pixel points corresponding to the target object according to the central point position; and positioning the area where the target object is located by combining the position of the center point after the deviation in the current image and the screened size.
In this embodiment, the center point position predicted by the position prediction network, the size of the target object corresponding to each pixel point predicted by the size prediction network, and the resolution adaptation offset are all output to the current image; the center point position is offset according to the resolution adaptation offset, and the area where the target object is located is positioned in the current image based on the offset center point position and the screened size, so that the accuracy of locating the target object is improved.
It should be noted that, the specific training process of the feature extraction network, the location prediction network and the size prediction network may refer to the detailed description in the subsequent embodiments.
In one embodiment, the target object is a tracking object; acquiring a current image and a reference image, comprising: acquiring a video to be tracked; determining a first frame video frame including a tracking object in a video to be tracked to obtain a reference image; sequentially acquiring video frames from a video frame next to a first frame video frame to be used as tracking images; and determining a target preselection area in the tracking image and intercepting the target preselection area to obtain a current image according to the position information of the tracking object in the previous frame of the tracking image.
The video to be tracked can be previously recorded video data or video data recorded in real time. The tracking object is the object to be positioned in each image frame. A tracking image is an image frame in which the object is to be positioned.
Specifically, the first frame video frame of the video to be tracked is used as the reference image, and the tracking object is sequentially positioned in each subsequent video frame, thereby tracking the tracking object in the video to be tracked. Considering that the position change of the target object between two adjacent frames is small, the target preselection area is intercepted from the tracking image according to a specified search range, centered on the position of the target object in the previous frame image, to obtain the current image and reduce the search range.
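A sketch of intercepting the target preselection area, assuming the search range is a fixed multiple of the previous-frame size (the factor 2.0 is an illustrative choice, not specified in the text):

    def crop_search_region(frame, prev_center, prev_size, scale=2.0):
        """Cut the target preselection area out of the tracking image (an H x W
        or H x W x C array), centered on the previous-frame position of the
        tracking object and clipped to the image boundary."""
        H, W = frame.shape[:2]
        cx, cy = prev_center
        w, h = prev_size
        hw, hh = int(w * scale / 2), int(h * scale / 2)
        x0, x1 = max(0, int(cx) - hw), min(W, int(cx) + hw)
        y0, y1 = max(0, int(cy) - hh), min(H, int(cy) + hh)
        return frame[y0:y1, x0:x1]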
In the embodiment, in the target tracking process, the current image is intercepted by referring to the position of the target object in the previous frame of image, the search range of the current image is reduced, and the target object tracking speed is improved.
The application also provides an application scene to which the above target object positioning method is applied. The application scene may specifically be a single-target tracking scene: given the position of a target object in one frame image of an image frame sequence, the position of the target object in each subsequent frame image is output, so that the target object is tracked in the image frame sequence. Referring to fig. 11, taking a pedestrian as an example, the pedestrian is tracked in an image frame sequence.
Specifically, as shown in fig. 12, the target object positioning method is applied to the application scenario as follows:
step 1202, a current image and a reference image are obtained.
And the reference image is an image of a target object in one frame before the current image in the image frame sequence in which the current image is located.
Step 1204, obtaining two feature extraction networks with the same model structure and shared model parameters, inputting the current image and the reference image each into one of the two feature extraction networks, and outputting the first image feature of the current image and the second image feature of the reference image in parallel through the two feature extraction networks respectively.
And 1206, performing correlation operation on the first image characteristic and the second image characteristic to obtain a cross-correlation characteristic between the first image characteristic and the second image characteristic.
Wherein the cross-correlation feature is used to characterize a degree of similarity between the first image feature and the second image feature.
And step 1208, inputting the cross-correlation characteristics into a position prediction network, and obtaining a pixel point classification graph and a pixel point position distribution graph through the position prediction network.
The pixel points in the pixel point classification graph have pixel values representing the classification categories and correspond to the pixel points in the current image; the pixel points belonging to the target category in the pixel point position distribution map have pixel values representing distances from the center point of the target object, and correspond to the pixel points in the current image.
Step 1210, multiplying the pixel point classification graph and the pixel point position distribution graph position-wise through the position prediction network to obtain a prediction result graph, selecting the pixel point with the maximum pixel value from the prediction result graph as the center point of the target object, and transmitting the center point position to the size prediction network.
Step 1212, inputting the cross-correlation feature map into a size prediction network, predicting the size of each pixel point in the cross-correlation feature map corresponding to the target object through the size prediction network, predicting the resolution adaptation offset of the central point position of the target object in the current image, screening the size corresponding to the central point position through the size prediction network in combination with the central point position, and outputting the screened size.
Step 1214, the resolution adaptation offset is transmitted to the position prediction network, the position of the center point is shifted according to the resolution adaptation offset through the position prediction network, and the shifted position of the center point is output.
Step 1216, locate the target object in the current image based on the shifted center point position and the screened size.
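Steps 1202 to 1216 amount to the following tracking loop; `locate` is a hypothetical wrapper around the trained networks that returns the offset center point position and the screened size in current-image coordinates:

    def track(frames, locate):
        """Single-target tracking over an image frame sequence.
        frames[0] is the reference image containing the target object."""
        reference = frames[0]
        results = []
        for frame in frames[1:]:
            center, size = locate(frame, reference)  # steps 1204 to 1214
            results.append((center, size))           # step 1216: one center point, one size
        return results

In practice each `frame` would first be cropped to the target preselection area around the previous result, as described in the earlier embodiment.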
In this embodiment, only one center point position and one size are output in the current image to position the area where the target object is located, avoiding the complicated operation, in the traditional target object positioning method, of outputting a plurality of candidate frames and then selecting the target frame from the candidate frames, which reduces the model calculation amount and improves the target object positioning efficiency.
In an embodiment, as shown in fig. 13, an image processing method is provided, and this embodiment is mainly exemplified by applying the method to the computer device (terminal 102 or server 104) in fig. 1, and includes the following steps:
step 1302, acquiring an image sample pair, a position prediction network and a size prediction network; the image sample pair comprises a target sample and a reference sample; the target sample and the reference sample comprise the same sample object; the training label of the target sample is used for representing the labeled area of the sample object in the target sample.
Wherein the image sample pairs are the data set used to train the position prediction network and the size prediction network. The target sample is one frame image in an image frame sequence. The target sample and the reference sample comprise the same sample object; the position of the sample object in the reference sample is known, while the sample object in the target sample has not yet been located. The sample object is an object that can be tracked in the image frame sequence, and may specifically be an independent living body or object, such as a natural person, an animal, a vehicle or a virtual character, or a specific part, such as a face or a hand.
In one embodiment, the reference sample is an image, in the image frame sequence where the target sample is located, of one frame before the target sample that includes the sample object.
It will be appreciated that since the sample object in the reference sample is located, the sample object to be tracked may be selected by the reference sample and located in the target sample.
In one particular embodiment, the reference sample is from the same sequence of image frames as the target sample, and the reference sample is an image frame processed before the target sample. The reference sample may specifically be a first frame image of the image frame sequence, or may be an intermediate frame image of the image frame sequence. When the reference sample is the first frame image, the sample object may be selected based on a user operation so that the subsequent image frame may acquire the sample object to be tracked through the reference sample.
In a specific embodiment, the target sample may be a complete image of one frame in the image frame sequence, or may be a region of interest selected according to a tracking result of an image of a previous frame.
At step 1304, a cross-correlation characteristic sample between the target sample and the reference sample is determined.
Wherein the cross-correlation feature sample reflects a degree of similarity between the target sample and the reference sample.
The process of obtaining the cross-correlation characteristic sample between the target sample and the reference sample may specifically refer to the process of determining the cross-correlation characteristic between the current image and the reference image in the foregoing embodiment.
Step 1306, inputting the cross-correlation characteristic samples into a position prediction network to obtain the position of the predicted central point of the sample object in the target sample.
Here, the process of specifically obtaining the predicted central point position in the target sample may specifically refer to the process of determining the central point position of the target object through the position prediction network in the foregoing embodiment.
Step 1308, inputting the cross-correlation characteristic samples into a size prediction network to obtain the predicted size of the sample object in the target sample.
Here, the process of specifically obtaining the predicted size in the target sample may specifically refer to the process of determining the size of the target object through the size prediction network in the foregoing embodiment.
Step 1310, training a position prediction network and a size prediction network based on the predicted central point position, the predicted size and the training labels; the position prediction network and the size prediction network obtained by training are used for predicting, respectively, the central point position and the size of the target object in a target image, so as to locate the target area of the target object in the target image.
Specifically, the computer device trains the position prediction network and the size prediction network based on the predicted center point position, the predicted size, and the training label to improve the prediction accuracy of the position prediction network on the center point position of the target object and the prediction accuracy of the size prediction network on the size of the target object.
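A minimal training-step sketch under the same assumptions as the earlier model sketch; the loss construction of the later embodiments is abstracted into a hypothetical `loss_fn`:

    def train_step(pos_net, size_net, corr_sample, labels, optimizer, loss_fn):
        """One optimization step for the position and size prediction networks."""
        cls_map, loc_map = pos_net(corr_sample)     # predicted classification / distribution graphs
        size_map, offsets = size_net(corr_sample)   # predicted sizes and adaptation offsets
        loss = loss_fn(cls_map, loc_map, size_map, offsets, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()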
In this embodiment, the cross-correlation feature between the target sample and the reference sample is obtained; the position prediction network obtains the predicted central point position of the sample object in the target sample according to the cross-correlation feature, and the size prediction network obtains the predicted size of the sample object in the target sample according to the cross-correlation feature; the predicted area of the sample object in the target sample is located based on the predicted central point position and the predicted size, and the position prediction network and the size prediction network are trained in the direction of minimizing the difference between the predicted area and the labeled area. As a result, when the trained position prediction network and size prediction network locate the target object in a target image, only one central point position and one size are output in the target image to locate the area where the target object is located, avoiding the complicated operation, in the conventional target object positioning method, of outputting a plurality of candidate frames and then selecting the target frame from the candidate frames, which reduces the model calculation amount and improves the target object positioning efficiency.
In one embodiment, the training label of the target sample comprises a marking central point position and a marking size of the sample object in the target sample; inputting the cross-correlation feature samples into a position prediction network to obtain the predicted central point position of the sample object in the target sample comprises: inputting the cross-correlation feature sample into the position prediction network to obtain a pixel point classification sample graph and a pixel point position distribution sample graph, and then obtaining the predicted central point position of the sample object in the target sample according to the pixel point classification sample graph and the pixel point position distribution sample graph; the pixel points in the pixel point classification sample graph have pixel values representing the classification categories of the pixel points and correspond to the pixel points in the target sample; pixel points belonging to the target category in the pixel point position distribution sample graph have pixel values representing the distance from the marking central point position and correspond to the pixel points in the target sample; training the position prediction network and the size prediction network based on the predicted central point position, the predicted size and the training labels comprises: training the position prediction network and the size prediction network based on the pixel point classification sample graph, the pixel point position distribution sample graph, the predicted central point position, the predicted size and the training label.
The process of obtaining the pixel classification sample map, the pixel location distribution sample map, and the predicted center point location may specifically refer to the process of determining the pixel classification map and the pixel location distribution map through the location prediction network, and determining the center point location of the target object based on the pixel classification map and the pixel location distribution map in the foregoing embodiment.
The pixel points belonging to the target category in the pixel point position distribution sample graph have pixel values representing the distance between the pixel points and the marking central point position, and the pixel values can be calculated through the following formula:
(The formula is published as an embedded image in the original document and is not reproduced here.)

wherein the pixel value of a pixel point of the target category in the pixel point position distribution sample graph is determined by the distance between that pixel point and the marking central point position; x_g, y_g denote the marking central point position; w and h denote the width and height of the pixel point position distribution sample graph; K is a proportionality coefficient; and i, j denote the coordinates of a pixel point of the target category in the pixel point position distribution sample graph.
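Because the formula itself is only published as an image, the following label-generation sketch is an assumption: it realizes the listed symbols with a distance-decaying (Gaussian-like) pixel value, which is one common choice and not necessarily the patented form:

    import numpy as np

    def position_label_map(xg, yg, w, h, K=1.0):
        """Assumed label: target-category pixel values decay with the squared
        distance between pixel (i, j) and the marking center point (xg, yg),
        scaled by the proportionality coefficient K and normalized by the map
        size w x h. The exact patented formula is not reproduced here."""
        j, i = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
        d2 = (i - xg) ** 2 + (j - yg) ** 2
        return np.exp(-K * d2 / (w * h))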
Specifically, the computer device trains a position prediction network and a size prediction network based on the pixel point classification sample map, the pixel point position distribution sample map, the prediction center point position, the prediction size, and the training labels.
Specifically, the training label comprises the marking central point position and the marking size of the sample object in the target sample. The category of each pixel point in the cross-correlation feature sample is determined according to the marking central point position and the marking size; the position prediction network is trained in the direction of reducing the difference between the pixel point classification sample graph and the categories so determined, improving the accuracy of the position prediction network in predicting the category of each pixel point in the cross-correlation feature sample.

The position distribution of the target-category pixel points in the cross-correlation feature sample is likewise determined according to the marking central point position and the marking size; the position prediction network is trained in the direction of reducing the difference between the pixel point position distribution sample graph and the position distribution so determined, improving the accuracy of the position prediction network in predicting the position distribution of the target-category pixel points in the cross-correlation feature sample.
And, according to the marking central point position and the marking size, the size of the sample object corresponding to the marking center point is determined; the size prediction network is trained in the direction of reducing the difference between the predicted size at the marking center point and the determined size of the sample object, improving the accuracy of the size prediction network in predicting the size corresponding to the marking center point.
In this embodiment, based on the pixel point classification sample map, the pixel point position distribution sample map, the predicted central point position, the predicted size, and the training label, the position prediction network and the size prediction network are trained, so that the prediction accuracy of the position prediction network on the central point position of the target object and the prediction accuracy of the size prediction network on the size of the target object are improved.
In one embodiment, inputting the cross-correlation feature samples into a size prediction network to obtain the predicted size of the sample object in the target sample comprises: inputting the cross-correlation feature samples into the size prediction network to obtain resolution adaptation offset samples and the predicted size corresponding to the marking central point position of the sample object in the target sample. Training the position prediction network and the size prediction network based on the pixel point classification sample graph, the pixel point position distribution sample graph, the predicted central point position, the predicted size and the training label comprises: determining, according to the training label, the training labels corresponding to the pixel points of the pixel point classification sample graph and the training labels corresponding to the pixel points belonging to the target category in the pixel point position distribution sample graph; constructing a first loss function based on the difference between the pixel value of each pixel point of the pixel point classification sample graph and the corresponding training label, and constructing a second loss function based on the difference between the pixel values of the pixel points belonging to the target category in the pixel point position distribution sample graph and the corresponding training labels; jointly training the position prediction network according to the first loss function and the second loss function; offsetting the predicted central point position according to the resolution adaptation offset; positioning the predicted area of the sample object in the target sample according to the offset predicted central point position and the predicted size; constructing a third loss function according to the difference between the predicted area and the labeled area; and training the size prediction network according to the third loss function.
Specifically, the category of each pixel point in the cross-correlation characteristic sample is determined according to the labeling central point position and the labeling size, a loss function is constructed according to the difference between the pixel point classification sample graph and the category of each pixel point determined according to the labeling central point position and the labeling size, the position prediction network is trained through the loss function, and the accuracy of the position prediction network in predicting the category of each pixel point in the cross-correlation characteristic sample is improved.
The method comprises the steps of determining the position distribution of target category pixel points in a cross-correlation characteristic sample according to the position of a labeling central point and a labeling size, constructing a loss function according to the difference between a pixel point position distribution sample graph and the position distribution of the target category pixel points determined according to the position of the labeling central point and the labeling size, training a position prediction network through the loss function, and improving the accuracy of the position prediction of the target category pixel points in the cross-correlation characteristic sample by the position prediction network.
In a particular embodiment, the location prediction network may employ a cross-entropy loss function or the like.
For a size prediction network, the following loss function may be employed for training:
L_box = Loss(gt, box)

wherein gt is the labeled region and box is the predicted region;

gt = (x_g, y_g, w_g, h_g)

wherein x_g, y_g denote the labeled center point position, and w_g, h_g denote the labeled size;
box = (x_c + Δx, y_c + Δy, w_c, h_c)
wherein x_c, y_c denote the predicted center point position; w_c, h_c denote the predicted size, mapped to (0, +∞) by an exponential function; and Δx, Δy denote the resolution adaptation offsets, mapped to (-1, 1) by a hyperbolic tangent function.
In a specific embodiment, the size prediction network may use an iou (intersection over union) loss function, etc.
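Putting the three losses of this embodiment together, a PyTorch-style sketch might look as follows; the binary cross-entropy and L1 choices are illustrative stand-ins where the text only says "cross-entropy loss function or the like" and "IoU loss function, etc.", and the box decoding follows the exp/tanh mappings given above:

    import torch
    import torch.nn.functional as F

    def decode_box(xc, yc, dx, dy, wc, hc):
        """Predicted region per the mappings above: exp for sizes, tanh for offsets."""
        cx, cy = xc + torch.tanh(dx), yc + torch.tanh(dy)
        w, h = torch.exp(wc), torch.exp(hc)
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

    def iou(a, b):
        """IoU of two boxes given as (x0, y0, x1, y1) tensors."""
        ix0, iy0 = torch.max(a[0], b[0]), torch.max(a[1], b[1])
        ix1, iy1 = torch.min(a[2], b[2]), torch.min(a[3], b[3])
        inter = (ix1 - ix0).clamp(min=0) * (iy1 - iy0).clamp(min=0)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / (union + 1e-6)

    def total_loss(cls_map, cls_gt, loc_map, loc_gt, target_mask, pred_box, gt_box):
        l1 = F.binary_cross_entropy_with_logits(cls_map, cls_gt)   # first loss
        l2 = F.l1_loss(loc_map[target_mask], loc_gt[target_mask])  # second loss, target pixels only
        l3 = 1.0 - iou(pred_box, gt_box)                           # third loss: IoU-based
        return l1 + l2 + l3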
It can be understood that a size corresponding to the sample object is predicted for every pixel point in the cross-correlation feature sample, but when the size prediction network is trained through the loss function, only the loss value of the size corresponding to the marking central point position is calculated, which improves the accuracy of the size prediction corresponding to the center point position.
In this embodiment, the loss functions reduce the difference between the pixel value of each pixel point in the pixel point classification sample graph and the corresponding training label, and the difference between the pixel values of the pixel points belonging to the target category in the pixel point position distribution sample graph and the corresponding training labels, improving the prediction accuracy of the position prediction network for the center point position of the target object; and by reducing the difference between the predicted area and the labeled area, the loss function improves the prediction accuracy of the size prediction network for the size of the target object.
It should be understood that, although the steps in the flowcharts of fig. 2, 12, and 13 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in fig. 2, 12, and 13 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turns or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 14, there is provided a target object positioning apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: an acquisition module 1402, a determination module 1404, and a positioning module 1406, wherein:
an obtaining module 1402, configured to obtain a current image and a reference image; the reference image is an image of a target object in one frame before the current image in the image frame sequence where the current image is located;
a determining module 1404 for determining cross-correlation features between the current image and the reference image;
a determining module 1404, configured to determine a position of a center point of the target object in the current image according to the cross-correlation feature;
a determining module 1404, further configured to determine a size of the target object in the current image according to the cross-correlation feature;
a location module 1406 for locating a target object in the current image based on the center point location and the size.
In one embodiment, the determining module 1404 is further configured to: acquiring a first image characteristic of a current image and a second image characteristic of a reference image; performing correlation operation on the first image characteristic and the second image characteristic to obtain a cross-correlation characteristic between the first image characteristic and the second image characteristic; the cross-correlation feature is used to characterize a degree of similarity between the first image feature and the second image feature.
In one embodiment, the determining module 1404 is further configured to: acquiring two feature extraction networks which have the same model structure and share model parameters; respectively inputting the current image and the reference image into one of the two feature extraction networks; and outputting the first image characteristic of the current image and the second image characteristic of the reference image in parallel and respectively through the two characteristic extraction networks.
In one embodiment, the determining module 1404 is further configured to: generating a pixel point classification graph and a pixel point position distribution graph according to the cross-correlation characteristics; the pixel points in the pixel point classification graph have pixel values representing the classification categories of the pixel points and correspond to the pixel points in the current image; pixel points belonging to the target category in the pixel point position distribution diagram have pixel values representing the distance between the pixel points and the center point of the target object, and correspond to the pixel points in the current image; and selecting the central point of the target object based on the pixel point classification graph and the pixel point position distribution graph to obtain the central point position of the target object in the current image.
In one embodiment, the determining module 1404 is further configured to: multiplying the pixel point classification graph and the pixel point position distribution graph position-wise to obtain a prediction result graph; selecting the pixel point with the maximum pixel value from the prediction result graph as the center point of the target object; and mapping the center point to the corresponding target position in the current image to obtain the center point position of the target object in the current image.
In one embodiment, the cross-correlation feature is a cross-correlation feature map, the resolution of the cross-correlation feature map being different from the resolution of the current image; a determining module 1404, further configured to: predicting the size of the target object in the current image and the resolution adaptation offset of the central point position of the target object in the current image according to the cross-correlation characteristic graph; a positioning module 1406 further configured to: offsetting the position of the center point according to the resolution adaptation offset; and positioning a target area where the target object is located in the current image based on the position and the size of the center point after the deviation.
In one embodiment, the determining module 1404 is further configured to: respectively and correspondingly predicting the size of the target object in the current image based on each pixel point in the cross-correlation characteristic graph; selecting the size of the center point position mapped to the corresponding prediction size of the pixel point of the cross-correlation characteristic graph from the sizes to obtain the size of the target object in the current image; and predicting the resolution adaptation offset of the central point position of the target object in the current image based on the resolution of the cross-correlation feature map and the resolution of the current image.
In one embodiment, the determining module 1404 is further configured to: inputting the cross-correlation characteristics into a position prediction network, and obtaining a pixel point classification graph and a pixel point position distribution graph through the position prediction network; the pixel points in the pixel point classification graph have pixel values representing the classification categories of the pixel points and correspond to the pixel points in the current image; pixel points belonging to the target category in the pixel point position distribution diagram have pixel values representing the distance between the pixel points and the center point of the target object, and correspond to the pixel points in the current image; processing the pixel point classification graph and the pixel point position distribution graph through a position prediction network to obtain the position of a central point of a target object in the current image; a determining module 1404, further configured to: and inputting the cross-correlation characteristics into a size prediction network, and outputting the size of the target object in the current image through the size prediction network.
In one embodiment, the cross-correlation feature is a cross-correlation feature map, the resolution of the cross-correlation feature map being different from the resolution of the current image; a determining module 1404, further configured to: processing the pixel point classification graph and the pixel point position distribution graph through a position prediction network to obtain the position of a central point of a target object; passing the location of the center point to a size prediction network; a determining module 1404, further configured to: inputting the cross-correlation characteristic diagram into a size prediction network, predicting the size of each pixel point in the cross-correlation characteristic diagram corresponding to the target object through the size prediction network, and predicting the resolution adaptation offset of the central point position of the target object in the current image; screening out the size corresponding to the central point position through a size prediction network in combination with the central point position and outputting the screened size; an offset module to: passing the resolution adaptation offset to a location prediction network; and offsetting the position of the central point according to the resolution adaptation offset through a position prediction network, and outputting the offset position of the central point.
In one embodiment, the target object is a tracking object; an obtaining module 1402, further configured to: acquiring a video to be tracked; determining a first frame video frame including a tracking object in a video to be tracked to obtain a reference image; sequentially acquiring video frames from a video frame next to a first frame video frame to be used as tracking images; and determining a target preselection area in the tracking image and intercepting the target preselection area to obtain a current image according to the position information of the tracking object in the previous frame of the tracking image.
For specific limitations of the target object positioning device, reference may be made to the above limitations of the target object positioning method, which are not described herein again. The modules in the target object positioning device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In the target object positioning device, the current image and the reference image are obtained, the cross-correlation characteristic between the current image and the reference image is firstly determined, then the central point position of the target object in the current image is determined according to the cross-correlation characteristic, the size of the target object in the current image is determined according to the cross-correlation characteristic, and the target object is positioned in the current image based on the central point position and the size, so that only one central point position and one size are output in the current image to position the area where the target object is located, the complicated operation of outputting a plurality of candidate frames and further selecting the target frame from the candidate frames in the traditional target object positioning method is avoided, the model calculation amount is reduced, and the target object positioning efficiency is improved.
In one embodiment, as shown in fig. 15, an image processing apparatus is provided, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: an acquisition module 1502, a determination module 1504, a prediction module 1506, and a training module 1508, wherein:
an obtaining module 1502 for obtaining an image sample pair, a location prediction network, and a size prediction network; the image sample pair comprises a target sample and a reference sample; the target sample and the reference sample comprise the same sample object; the training label of the target sample is used for representing the labeling area of the sample object in the target sample;
a determining module 1504 for determining cross-correlation feature samples between the target sample and the reference sample;
the prediction module 1506 is configured to input the cross-correlation feature sample into a position prediction network to obtain a predicted central point position of the sample object in the target sample;
the prediction module 1506 is further configured to input the cross-correlation feature samples into a size prediction network to obtain a predicted size of the sample object in the target sample;
a training module 1508, configured to train the position prediction network and the size prediction network based on the predicted central point position, the predicted size and the training labels; the position prediction network and the size prediction network obtained by training are used for predicting, respectively, the central point position and the size of the target object in a target image, so as to locate the target area of the target object in the target image.
In one embodiment, the training label of the target sample comprises a marking central point position and a marking size of the sample object in the target sample; the prediction module 1506 is further configured to: inputting the cross-correlation feature sample into the position prediction network to obtain a pixel point classification sample graph and a pixel point position distribution sample graph, and then obtaining the predicted central point position of the sample object in the target sample according to the pixel point classification sample graph and the pixel point position distribution sample graph; the pixel points in the pixel point classification sample graph have pixel values representing the classification categories of the pixel points and correspond to the pixel points in the target sample; pixel points belonging to the target category in the pixel point position distribution sample graph have pixel values representing the distance between the pixel points and the marking central point position, and correspond to the pixel points in the target sample; the training module 1508 is further configured to: training the position prediction network and the size prediction network based on the pixel point classification sample graph, the pixel point position distribution sample graph, the predicted central point position, the predicted size and the training label.
In one embodiment, the prediction module 1506 is further configured to: inputting the cross-correlation feature samples into the size prediction network to obtain resolution adaptation offset samples and the predicted size corresponding to the marking central point position of the sample object in the target sample; the training module 1508 is further configured to: determining, according to the training label, the training labels corresponding to the pixel points of the pixel point classification sample graph and the training labels corresponding to the pixel points belonging to the target category in the pixel point position distribution sample graph; constructing a first loss function based on the difference between the pixel value of each pixel point of the pixel point classification sample graph and the corresponding training label, and constructing a second loss function based on the difference between the pixel values of the pixel points belonging to the target category in the pixel point position distribution sample graph and the corresponding training labels; jointly training the position prediction network according to the first loss function and the second loss function; offsetting the predicted central point position according to the resolution adaptation offset; positioning the predicted area of the sample object in the target sample according to the offset predicted central point position and the predicted size; constructing a third loss function according to the difference between the predicted area and the labeled area; and training the size prediction network according to the third loss function.
For specific limitations of the image processing apparatus, reference may be made to the above limitations of the image processing method, which are not described herein again. The respective modules in the image processing apparatus described above may be wholly or partially implemented by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In the image processing device, the cross-correlation feature between the target sample and the reference sample is obtained; the position prediction network first obtains the predicted central point position of the sample object in the target sample according to the cross-correlation feature, and the size prediction network obtains the predicted size of the sample object in the target sample according to the cross-correlation feature; the predicted area of the sample object in the target sample is then located based on the predicted central point position and the predicted size, and the position prediction network and the size prediction network are trained in the direction of minimizing the difference between the predicted area and the labeled area. Thus, when the trained position prediction network and size prediction network locate the target object in a target image, only one central point position and one size are output in the target image to locate the area where the target object is located, avoiding the complicated operation, in the traditional target object positioning method, of outputting a plurality of candidate frames and then selecting the target frame from the candidate frames, which reduces the model calculation amount and improves the target object positioning efficiency.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 16. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing object positioning data and/or image processing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a target object localization method and/or an image processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 16 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A method for locating a target object, the method comprising:
acquiring a current image and a reference image; the reference image is an image containing the target object in a frame preceding the current image in the image frame sequence to which the current image belongs;
determining cross-correlation features between the current image and the reference image;
determining the center point position of the target object in the current image according to the cross-correlation features;
determining the size of the target object in the current image according to the cross-correlation features;
and locating the target object in the current image based on the center point position and the size.
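By way of non-limiting illustration, the five steps of claim 1 could be arranged as in the following Python sketch. Every name in it is an illustrative placeholder rather than anything prescribed by this application; illustrative stand-ins for the individual helpers are sketched after claims 2, 3, 5, 7, and 9 below.

    # Illustrative pipeline for claim 1; all callables are assumed placeholders.
    def locate_target(current_image, reference_image,
                      extract_features, cross_correlation,
                      predict_center, predict_size):
        feat_current = extract_features(current_image)      # first image feature
        feat_reference = extract_features(reference_image)  # second image feature
        xcorr = cross_correlation(feat_current, feat_reference)
        cx, cy = predict_center(xcorr)  # center point position in the current image
        w, h = predict_size(xcorr)      # size of the target object
        # Locate the target object as the box of the predicted size around the center.
        return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)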
2. The method of claim 1, wherein determining cross-correlation features between the current image and the reference image comprises:
acquiring a first image characteristic of the current image and a second image characteristic of the reference image;
and performing a correlation operation on the first image feature and the second image feature to obtain the cross-correlation features between the first image feature and the second image feature; the cross-correlation features are used to characterize a degree of similarity between the first image feature and the second image feature.
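The claim does not prescribe a concrete correlation operator. One common reading, borrowed from Siamese trackers, is a depthwise cross-correlation in which the reference feature slides over the current feature as a convolution kernel; the PyTorch sketch below reflects that assumption and is not a statement of the claimed method.

    import torch
    import torch.nn.functional as F

    def cross_correlation(feat_current, feat_reference):
        # feat_current:   (B, C, Hc, Wc) first image feature
        # feat_reference: (B, C, Hr, Wr) second image feature, Hr <= Hc, Wr <= Wc
        b, c, h, w = feat_reference.shape
        # Treat each (batch, channel) plane of the reference as its own kernel.
        kernel = feat_reference.reshape(b * c, 1, h, w)
        x = feat_current.reshape(1, b * c, feat_current.shape[2], feat_current.shape[3])
        out = F.conv2d(x, kernel, groups=b * c)  # high response = high similarity
        return out.reshape(b, c, out.shape[2], out.shape[3])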
3. The method of claim 2, wherein the obtaining the first image feature of the current image and the second image feature of the reference image comprises:
acquiring two feature extraction networks which have the same model structure and share model parameters;
inputting the current image and the reference image respectively into the two feature extraction networks, one image per network;
and outputting, through the two feature extraction networks respectively and in parallel, a first image feature of the current image and a second image feature of the reference image.
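A minimal PyTorch sketch of such twin networks follows; the backbone layers and input sizes are illustrative stand-ins, and parameter sharing is obtained here by simply reusing one module for both branches, which is the usual Siamese construction.

    import torch
    import torch.nn as nn

    # Illustrative backbone; the application does not specify the architecture.
    backbone = nn.Sequential(
        nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    )

    current_image = torch.randn(1, 3, 255, 255)    # assumed input sizes
    reference_image = torch.randn(1, 3, 127, 127)
    # Same structure and shared parameters: both branches are the same module,
    # so any gradient update applies to both at once.
    first_image_feature = backbone(current_image)
    second_image_feature = backbone(reference_image)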
4. The method of claim 1, wherein the determining the center point position of the target object in the current image according to the cross-correlation features comprises:
generating a pixel point classification map and a pixel point position distribution map according to the cross-correlation features; pixel points in the pixel point classification map have pixel values representing their classification categories and correspond to pixel points in the current image; pixel points belonging to a target category in the pixel point position distribution map have pixel values representing their distances from the center point of the target object and correspond to pixel points in the current image;
and selecting the center point of the target object based on the pixel point classification map and the pixel point position distribution map, to obtain the center point position of the target object in the current image.
5. The method according to claim 4, wherein the selecting a center point of the target object based on the pixel point classification map and the pixel point position distribution map to obtain a position of the center point of the target object in the current image comprises:
multiplying the pixel point classification map and the pixel point position distribution map element-wise, position by position, to obtain a prediction result map;
selecting the pixel point with the maximum pixel value in the prediction result map as the center point of the target object;
and determining the target position to which the center point maps in the current image, to obtain the center point position of the target object in the current image.
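A sketch of this selection step follows, assuming PyTorch tensors and an assumed feature-map stride of 8 for mapping back to current-image coordinates (the application does not fix this value):

    import torch

    def select_center(cls_map, dist_map, stride=8):
        # cls_map:  (H, W) pixel point classification map
        # dist_map: (H, W) pixel point position distribution map
        pred = cls_map * dist_map    # element-wise product: prediction result map
        idx = torch.argmax(pred)     # pixel point with the maximum pixel value
        y, x = divmod(idx.item(), pred.shape[1])
        # Map the selected feature-map location back to current-image coordinates.
        return x * stride, y * stride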
6. The method of claim 1, wherein the cross-correlation features form a cross-correlation feature map having a resolution different from the resolution of the current image; and the determining the size of the target object in the current image according to the cross-correlation features comprises:
predicting, according to the cross-correlation feature map, the size of the target object in the current image and a resolution adaptation offset of the center point position of the target object in the current image;
the locating the target object in the current image based on the center point position and the size comprises:
shifting the center point position according to the resolution adaptation offset;
and locating, based on the shifted center point position and the size, a target area where the target object is located in the current image.
7. The method of claim 6, wherein the predicting the size of the target object in the current image and the resolution adaptation offset of the center point position of the target object in the current image according to the cross-correlation feature map comprises:
predicting, for each pixel point in the cross-correlation feature map, a corresponding size of the target object in the current image;
selecting, from the predicted sizes, the size predicted at the pixel point of the cross-correlation feature map to which the center point position maps, to obtain the size of the target object in the current image;
and predicting a resolution adaptation offset of the center point position of the target object in the current image based on the resolution of the cross-correlation feature map and the resolution of the current image.
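The following sketch combines claims 6 and 7 under the same assumptions (PyTorch-style tensors, illustrative stride of 8): a size is read off at the pixel point the center maps to, and a sub-pixel resolution adaptation offset shifts the center before it is scaled back to image resolution.

    def locate_with_offset(size_map, offset_map, center_idx, stride=8):
        # size_map:   (2, H, W) predicted (w, h) per pixel point of the feature map
        # offset_map: (2, H, W) resolution adaptation offset (dx, dy) per pixel point
        # center_idx: (x, y) center point location on the cross-correlation feature map
        x, y = center_idx
        w, h = size_map[:, y, x]      # size predicted at the center's pixel point
        dx, dy = offset_map[:, y, x]  # offset compensating the coarser resolution
        cx, cy = (x + dx) * stride, (y + dy) * stride  # shifted, image-resolution center
        return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)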
8. The method of claim 1, wherein the determining the center point position of the target object in the current image according to the cross-correlation features comprises:
inputting the cross-correlation features into a position prediction network, and obtaining a pixel point classification map and a pixel point position distribution map through the position prediction network; pixel points in the pixel point classification map have pixel values representing their classification categories and correspond to pixel points in the current image; pixel points belonging to a target category in the pixel point position distribution map have pixel values representing their distances from the center point of the target object and correspond to pixel points in the current image;
processing the pixel point classification map and the pixel point position distribution map through the position prediction network to obtain the center point position of the target object in the current image;
and the determining the size of the target object in the current image according to the cross-correlation features comprises:
inputting the cross-correlation features into a size prediction network, and outputting the size of the target object in the current image through the size prediction network.
9. The method of claim 8, wherein the cross-correlation features form a cross-correlation feature map having a resolution different from the resolution of the current image;
the method further comprises the following steps:
processing the pixel point classification map and the pixel point position distribution map through the position prediction network to obtain the center point position of the target object;
communicating the center point location to the size prediction network;
the inputting the cross-correlation features into a size prediction network and outputting the size of the target object in the current image through the size prediction network comprises:
inputting the cross-correlation feature map into the size prediction network, predicting, through the size prediction network, a size of the target object corresponding to each pixel point in the cross-correlation feature map, and predicting a resolution adaptation offset of the center point position of the target object in the current image;
and selecting, through the size prediction network and in combination with the center point position, the size corresponding to the center point position, and outputting the selected size;
the method further comprises the following steps:
communicating the resolution adaptation offset to the position prediction network;
and shifting, through the position prediction network, the center point position according to the resolution adaptation offset, and outputting the shifted center point position.
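One way to realize the two networks of claims 8 and 9 is as small convolutional heads on top of the cross-correlation feature map; the PyTorch sketch below uses 1x1 convolutions and sigmoid activations purely as illustrative choices, since the application does not specify the head architecture.

    import torch
    import torch.nn as nn

    class PositionPredictionNet(nn.Module):
        def __init__(self, channels=128):
            super().__init__()
            self.cls = nn.Conv2d(channels, 1, 1)   # pixel point classification map
            self.dist = nn.Conv2d(channels, 1, 1)  # pixel point position distribution map

        def forward(self, xcorr):
            return torch.sigmoid(self.cls(xcorr)), torch.sigmoid(self.dist(xcorr))

    class SizePredictionNet(nn.Module):
        def __init__(self, channels=128):
            super().__init__()
            self.size = nn.Conv2d(channels, 2, 1)    # per-pixel (w, h) of the target
            self.offset = nn.Conv2d(channels, 2, 1)  # resolution adaptation offset

        def forward(self, xcorr):
            return self.size(xcorr), self.offset(xcorr)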
10. The method of claim 1, wherein the target object is a tracking object; the acquiring the current image and the reference image includes:
acquiring a video to be tracked;
determining the first video frame that includes the tracking object in the video to be tracked, to obtain the reference image;
and sequentially acquiring video frames, starting from the video frame following that first video frame, as tracking images; determining a target preselection area in each tracking image according to the position information of the tracking object in the frame preceding that tracking image, and cropping the target preselection area to obtain the current image.
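A sketch of this tracking flow follows, assuming numpy-style frames of shape (H, W, C); detect (finds the tracking object in a frame) and locate (the method of claim 1) are assumed callables, and the 2x preselection margin is an illustrative choice rather than a value from the application.

    def expand_box(box, margin, shape):
        # Grow (x1, y1, x2, y2) by `margin` times its size, clipped to the frame.
        x1, y1, x2, y2 = box
        w, h = (x2 - x1) * margin, (y2 - y1) * margin
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        return (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
                min(shape[1], int(cx + w / 2)), min(shape[0], int(cy + h / 2)))

    def track_video(frames, detect, locate):
        reference, prev_box, results = None, None, []
        for frame in frames:
            if reference is None:
                box = detect(frame)   # first frame containing the tracking object
                if box is not None:
                    reference, prev_box = frame, box
                continue
            # Crop the target preselection area around the previous position.
            x1, y1, x2, y2 = expand_box(prev_box, margin=2.0, shape=frame.shape)
            current = frame[y1:y2, x1:x2]
            bx1, by1, bx2, by2 = locate(current, reference)
            prev_box = (bx1 + x1, by1 + y1, bx2 + x1, by2 + y1)  # back to frame coords
            results.append(prev_box)
        return results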
11. An image processing method, characterized in that the method comprises:
acquiring an image sample pair, a position prediction network and a size prediction network; the image sample pair comprises a target sample and a reference sample; the target sample and the reference sample comprise the same sample object; the training label of the target sample is used for representing the labeling area of the sample object in the target sample;
determining a cross-correlation feature sample between the target sample and the reference sample;
inputting the cross-correlation feature sample into the position prediction network to obtain a predicted center point position of the sample object in the target sample;
inputting the cross-correlation feature sample into the size prediction network to obtain a predicted size of the sample object in the target sample;
and training the position prediction network and the size prediction network based on the predicted center point position, the predicted size, and the training label; the trained position prediction network and size prediction network are respectively used to predict the center point position and the size of a target object in a target image, so as to locate a target area of the target object in the target image.
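A single training step under this scheme might look as follows; the loss functions (binary cross-entropy for the classification map, L1 for the distribution and size maps) and the label encoding are assumptions, since the application leaves them open.

    import torch
    import torch.nn.functional as F

    def train_step(target_sample, reference_sample, label,
                   pos_net, size_net, cross_correlation, backbone, optimizer):
        # Cross-correlation feature sample between the target and reference samples.
        xcorr = cross_correlation(backbone(target_sample), backbone(reference_sample))
        cls_map, dist_map = pos_net(xcorr)   # predicted center point maps
        size_map, _ = size_net(xcorr)        # predicted size per pixel point
        # Supervise both networks against the labeled region of the sample object;
        # `label` is an assumed dict of target maps derived from the annotation.
        loss = (F.binary_cross_entropy(cls_map, label["cls"])
                + F.l1_loss(dist_map, label["dist"])
                + F.l1_loss(size_map, label["size"]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()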
12. An apparatus for locating a target object, the apparatus comprising:
the acquisition module is used for acquiring a current image and a reference image; the reference image is an image containing the target object in a frame preceding the current image in the image frame sequence to which the current image belongs;
a determination module for determining cross-correlation features between the current image and the reference image;
the determination module is further configured to determine the center point position of the target object in the current image according to the cross-correlation features;
the determination module is further configured to determine the size of the target object in the current image according to the cross-correlation features;
and a positioning module for locating the target object in the current image based on the center point position and the size.
13. An image processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an image sample pair, a position prediction network and a size prediction network; the image sample pair comprises a target sample and a reference sample; the target sample and the reference sample comprise the same sample object; the training label of the target sample is used for representing the labeling area of the sample object in the target sample;
a determination module for determining a cross-correlation feature sample between the target sample and the reference sample;
a prediction module for inputting the cross-correlation feature sample into the position prediction network to obtain a predicted center point position of the sample object in the target sample;
the prediction module is further configured to input the cross-correlation feature sample into the size prediction network to obtain a predicted size of the sample object in the target sample;
and a training module for training the position prediction network and the size prediction network based on the predicted center point position, the predicted size, and the training label; the trained position prediction network and size prediction network are respectively used to predict the center point position and the size of a target object in a target image, so as to locate a target area of the target object in the target image.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer program.
15. A computer-readable storage medium in which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202010836717.5A 2020-08-19 2020-08-19 Target object positioning method, image processing method, device and computer equipment Pending CN111914809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010836717.5A CN111914809A (en) 2020-08-19 2020-08-19 Target object positioning method, image processing method, device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010836717.5A CN111914809A (en) 2020-08-19 2020-08-19 Target object positioning method, image processing method, device and computer equipment

Publications (1)

Publication Number Publication Date
CN111914809A true CN111914809A (en) 2020-11-10

Family

ID=73278891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010836717.5A Pending CN111914809A (en) 2020-08-19 2020-08-19 Target object positioning method, image processing method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN111914809A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112907625A (en) * 2021-02-05 2021-06-04 齐鲁工业大学 Target following method and system applied to four-footed bionic robot
WO2022155967A1 (en) * 2021-01-25 2022-07-28 京东方科技集团股份有限公司 Method for detecting object in real-time by utilizing object real-time detection model and optimization method

Similar Documents

Publication Publication Date Title
US11200424B2 (en) Space-time memory network for locating target object in video content
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN112766244B (en) Target object detection method and device, computer equipment and storage medium
EP3971772B1 (en) Model training method and apparatus, and terminal and storage medium
US11354906B2 (en) Temporally distributed neural networks for video semantic segmentation
CN109960742B (en) Local information searching method and device
CN110838125B (en) Target detection method, device, equipment and storage medium for medical image
CN110765882B (en) Video tag determination method, device, server and storage medium
CN113255915B (en) Knowledge distillation method, device, equipment and medium based on structured instance graph
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN112348828A (en) Example segmentation method and device based on neural network and storage medium
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN115457492A (en) Target detection method and device, computer equipment and storage medium
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN115577768A (en) Semi-supervised model training method and device
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN114299101A (en) Method, apparatus, device, medium, and program product for acquiring target region of image
CN112069412B (en) Information recommendation method, device, computer equipment and storage medium
CN112183303A (en) Transformer equipment image classification method and device, computer equipment and medium
CN111652181A (en) Target tracking method and device and electronic equipment
CN115147469A (en) Registration method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination