CN114764870A - Object positioning model processing method, object positioning device and computer equipment

Object positioning model processing method, object positioning device and computer equipment

Info

Publication number
CN114764870A
CN114764870A (application CN202111646817.2A)
Authority
CN
China
Prior art keywords
regression
image
region
classification
training
Prior art date
Legal status
Pending
Application number
CN202111646817.2A
Other languages
Chinese (zh)
Inventor
彭瑾龙
蒋正锴
罗泽坤
王昌安
李剑
王亚彪
汪铖杰
李季檩
黄飞跃
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Publication of CN114764870A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application relates to an object positioning model processing method and apparatus, an object positioning method and apparatus, a computer device, and a storage medium. The object positioning model processing method comprises the following steps: determining a regression region based on the image features of a training sample image and a regression network of an object positioning model; calculating a regression accuracy based on a target object labeling region corresponding to the training sample image, and calculating a regression loss based on the regression accuracy; determining a classification confidence based on the image features of the training sample image and a classification network of the object positioning model, and calculating a classification loss based on the classification confidence; updating the classification loss based on the regression accuracy and updating the regression loss based on the classification confidence; and training the object positioning model according to the updated classification loss and the updated regression loss to obtain a trained object positioning model, which is used for performing object positioning on an input image. The method can improve positioning accuracy.

Description

Object positioning model processing method, object positioning device and computer equipment
The present application claims priority to the Chinese patent application entitled "object localization model processing, object localization method, apparatus, and computer device", filed with the Chinese Patent Office on 13/01/2021 under application number 2021100429568, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an object location model processing method, apparatus, computer device, and storage medium, and to an object location method, apparatus, computer device, and storage medium.
Background
With the rapid development of computer technology and image processing technology, more and more fields involve locating a target object in an image during image processing, so that subsequent applications can be carried out based on the located target object. For example, target recognition technology has very important applications in fields such as automatic driving, smart cities, and smart homes.
In the conventional technology, when an object is positioned, the classification network and the regression network of the object positioning model are independent of each other, and the region with the highest classification confidence is often not the region with the highest regression accuracy, resulting in low positioning accuracy.
Disclosure of Invention
In view of the above, it is necessary to provide an object positioning model processing method, apparatus, computer device, and storage medium capable of improving positioning accuracy, as well as an object positioning method, apparatus, computer device, and storage medium.
A method of object localization model processing, the method comprising:
acquiring a training sample image including a target object;
determining a regression region corresponding to the training sample image based on the image features of the training sample image and a regression network of an object positioning model;
calculating regression accuracy of the regression region based on a target object labeling region corresponding to a training sample image, and calculating regression loss based on the regression accuracy;
determining a classification confidence of the regression region based on image features of the training sample images and a classification network of the object localization model, and calculating a classification loss based on the classification confidence;
updating the classification loss based on the regression accuracy and updating the regression loss based on the classification confidence;
training the object positioning model according to the updated classification loss and the updated regression loss until the training stopping condition is met, and obtaining the trained object positioning model;
wherein the trained object positioning model is used for performing object positioning on an input image.
An object localization model processing apparatus, the apparatus comprising:
the training sample acquisition module is used for acquiring a training sample image comprising a target object;
the regression region determining module is used for determining a regression region corresponding to the training sample image based on the image characteristics of the training sample image and the regression network of the object positioning model;
the regression loss calculation module is used for calculating the regression accuracy of the regression region based on the target object labeling region corresponding to the training sample image and calculating the regression loss based on the regression accuracy;
the classification loss calculation module is used for determining the classification confidence coefficient of the regression region based on the image characteristics of the training sample image and the classification network of the object positioning model, and calculating the classification loss based on the classification confidence coefficient;
an update module to update the classification loss based on the regression accuracy and update the regression loss based on the classification confidence;
the training module is used for training the object positioning model according to the updated classification loss and the updated regression loss until the training stopping condition is met, and obtaining the trained object positioning model; the trained object positioning model is used for positioning an object of an input image.
In one embodiment, the determining a regression region corresponding to the training sample image based on the image features of the training sample image and a regression network of the object location model includes:
determining a prediction central point position corresponding to a target object in the training sample image and a prediction size corresponding to the target object based on the image characteristics of the training sample image and a regression network of an object positioning model;
and determining a regression region according to the position of the predicted central point and the predicted size.
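For illustration, a minimal Python sketch of this step, assuming an (x1, y1, x2, y2) box format and hypothetical names:

```python
def box_from_center(center_xy, size_wh):
    """Convert a predicted center point and size into a regression region.

    center_xy: (cx, cy), the predicted center of the target object.
    size_wh:   (w, h), the predicted width and height.
    Returns the region as (x1, y1, x2, y2).
    """
    cx, cy = center_xy
    w, h = size_wh
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

# Example: center (50, 40) with size (20, 30) gives the box (40, 25, 60, 55).
print(box_from_center((50, 40), (20, 30)))
```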
In one embodiment, the determining a regression region corresponding to the training sample image based on the image features of the training sample image and a regression network of the object location model includes:
acquiring the position of an anchor frame corresponding to the training sample image;
determining the offset corresponding to the anchor frame based on the image characteristics of the training sample image and a regression network of an object positioning model;
determining a regression region based on the location of the anchor frame and the offset of the anchor frame.
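A minimal sketch of this anchor-based variant, assuming the common (dx, dy, dw, dh) offset parameterization (the text does not fix a specific encoding):

```python
import math

def decode_anchor(anchor, offset):
    """Apply a predicted offset to an anchor box to obtain the regression region.

    anchor: (ax, ay, aw, ah), anchor center and size.
    offset: (dx, dy, dw, dh), offsets predicted by the regression network.
    Returns the region as center and size (x, y, w, h).
    """
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offset
    x = ax + dx * aw       # shift the center in proportion to the anchor size
    y = ay + dy * ah
    w = aw * math.exp(dw)  # scale the size exponentially
    h = ah * math.exp(dh)
    return (x, y, w, h)
```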
In one embodiment, before locating the object to be recognized from the image to be recognized according to the classification confidence of each regression region of the image to be recognized, the method further comprises:
inputting the first cross-correlation identification feature into a regression accuracy prediction network of the object positioning model to obtain a regression accuracy prediction value for each regression region of the image to be recognized;
and the locating the object to be recognized from the image to be recognized according to the classification confidence of each regression region of the image to be recognized comprises:
multiplying the regression accuracy prediction value of each regression region of the image to be recognized by the corresponding classification confidence to obtain a target positioning score for each regression region of the image to be recognized;
and locating the object to be recognized from the image to be recognized according to the target positioning scores of the regression regions of the image to be recognized.
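A minimal sketch of this scoring step, assuming NumPy arrays of per-region values and hypothetical names:

```python
import numpy as np

def locate_object(reg_accuracy_pred, cls_confidence, regions):
    """Multiply each region's regression accuracy prediction by its
    classification confidence to get a target positioning score, then
    return the region with the highest score."""
    scores = np.asarray(reg_accuracy_pred) * np.asarray(cls_confidence)
    best = int(np.argmax(scores))
    return regions[best], float(scores[best])
```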
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above object localization model processing method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned object localization model processing method.
According to the object positioning model processing method and apparatus, computer device, and storage medium, a training sample image including a target object is first acquired; a regression region corresponding to the training sample image is determined based on image features of the training sample image and a regression network of an object positioning model; a regression accuracy of the regression region is calculated based on a target object labeling region corresponding to the training sample image, and a regression loss is calculated based on the regression accuracy; a classification confidence of the regression region is determined based on the image features of the training sample image and the classification network of the object positioning model, and a classification loss is calculated based on the classification confidence; the classification loss is then updated based on the regression accuracy and the regression loss is updated based on the classification confidence; and finally the object positioning model is trained according to the updated classification loss and the updated regression loss until a training stop condition is met, to obtain a trained object positioning model. Because the losses of the regression network and the classification network are respectively updated in the training process, the classification confidence is taken into account in the regression loss of the regression network, and the regression accuracy is taken into account in the classification loss of the classification network. Therefore, after the object positioning model is trained with the updated losses, the inconsistency between the classification network and the regression network is reduced in the trained model: a regression region with accurate regression has a higher classification confidence, and a regression region with a higher classification confidence regresses more accurately, thereby improving positioning accuracy.
A method of object localization, the method comprising:
acquiring an input image;
obtaining a trained object positioning model; the object positioning model comprises a classification network and a regression network; the object positioning model is obtained through training with a target regression loss of the regression network and a target classification loss of the classification network; the target regression loss is obtained by updating an initial regression loss with the classification confidence of the regression region; the target classification loss is obtained by updating an initial classification loss with the regression accuracy of the regression region; the regression region is determined based on image features of a training sample image and the regression network of the object positioning model; the regression accuracy of the regression region is calculated based on a target object labeling region corresponding to the training sample image; the classification confidence of the regression region is determined based on the image features of the training sample image and the classification network; the initial classification loss is calculated based on the classification confidence; and the initial regression loss is calculated based on the regression accuracy;
determining a plurality of regression regions for the input image based on image features of the input image and the regression network;
determining a classification confidence of each regression region of the input image based on the image features of the input image and the classification network;
and carrying out object positioning on the input image according to the classification confidence of each regression region of the input image to obtain the region of the target object.
An object positioning device, the device comprising:
the image acquisition module is used for acquiring an input image;
the model acquisition module is used for acquiring the trained object positioning model; the object positioning model comprises a classification network and a regression network; the object positioning model is obtained through training with a target regression loss of the regression network and a target classification loss of the classification network; the target regression loss is obtained by updating an initial regression loss with the classification confidence of the regression region; the target classification loss is obtained by updating an initial classification loss with the regression accuracy of the regression region; the regression region is determined based on image features of a training sample image and the regression network of the object positioning model; the regression accuracy of the regression region is calculated based on a target object labeling region corresponding to the training sample image; the classification confidence of the regression region is determined based on the image features of the training sample image and the classification network; the initial classification loss is calculated based on the classification confidence; and the initial regression loss is calculated based on the regression accuracy;
a regression region determination module for determining a plurality of regression regions of the input image based on the image features of the input image and the regression network;
a confidence determination module for determining a classification confidence of each regression region of the input image based on the image features of the input image and the classification network;
and the positioning module is used for positioning the object of the input image according to the classification confidence of each regression region of the input image to obtain the region where the target object is located.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the object localization method described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the object localization method as described above.
According to the object positioning method and apparatus, computer device, and storage medium, the object positioning model is obtained by training with the target regression loss of the regression network and the target classification loss of the classification network. In the training process, the losses of the regression network and the classification network are respectively updated, so that the classification confidence is taken into account in the regression loss of the regression network, and the regression accuracy is taken into account in the classification loss of the classification network. Therefore, after the object positioning model is trained with the updated losses, the inconsistency between the classification network and the regression network is alleviated in the trained model: a regression region with accurate regression has a higher classification confidence, and a regression region with a higher classification confidence regresses more accurately, thereby improving positioning accuracy.
Drawings
FIG. 1 is a diagram of an embodiment of an application environment for an object localization model processing method and an object localization method;
FIG. 2 is a flowchart illustrating a method for processing an object location model in accordance with one embodiment;
FIG. 3 is a diagram illustrating the structure of an object location model in one embodiment;
FIG. 4 is a schematic diagram illustrating locations of a labeled region and a regression region of a target object according to an embodiment;
FIG. 5 is a schematic illustration of center points and dimensions in one embodiment;
FIG. 6 is a flowchart illustrating a method for locating objects in one embodiment;
FIG. 7 is a flowchart illustrating an object location method according to another embodiment;
FIG. 8 is a diagram illustrating output results of an embodiment when an object location model is used for object recognition;
FIG. 9 is an overall flowchart of an object location method in one embodiment;
FIG. 10 is a diagram illustrating the result of human body recognition according to an embodiment;
FIG. 11 is a block diagram showing an example of the structure of an object localization model processing apparatus;
FIG. 12 is a block diagram of an object locating device in one embodiment;
FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level technologies and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) technology is a science that studies how to make machines "see". More specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking, and measurement of targets, and performs further image processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence and the like, and is specifically explained by the following embodiment:
The object positioning model processing method provided by the application can be applied to the application environment shown in fig. 1. The terminal and the server can independently execute the object positioning model processing method, and the terminal and the server can also cooperatively execute the object positioning model processing method. For example, the terminal first acquires a training sample image containing a target object, sends the training sample image to the server, and the server trains an object positioning model according to the training sample image.
Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be an independent physical server, a server cluster or a distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud storage, network services, cloud communication, big data, and an artificial intelligence platform. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The object positioning method provided by the present application can also be applied to the application environment shown in fig. 1. The terminal and the server can each independently execute the object positioning method, or can execute it cooperatively. For example, the server may transmit the trained object localization model to the terminal in response to a request from the terminal, and the terminal performs object localization on an input image based on the trained model.
In one embodiment, as shown in fig. 2, an object location model processing method is provided, which is described by taking the method as an example applied to a computer device, the computer device being the terminal 102 or the server 104 in fig. 1, and the object location model processing method includes the following steps:
at step 202, a training sample image including a target object is acquired.
The target object is an object that can be located in an image, and may specifically be an independent living being or thing, such as a natural person, an animal, a vehicle, or a virtual character, or may be a specific part, such as a face or a hand. The training sample image is used for supervised training of the object positioning model; because the training is supervised, the training sample image carries a target object labeling region and a target object labeling category. The labeling category refers to the category of the target object.
It should be noted that the positioning in this application refers to determining the region where the target object is located from the image. In one embodiment, the positioning may be to detect the target object from a single image, where the target object is a detection object. In another embodiment, the localization may be a recognition of a target object from a sequence of video frames, in which case the target object is an object to be recognized.
Specifically, when the target object is a detection object, the computer equipment can acquire an image containing the target object in a mode of photographing the target object, and the image containing the target object can be used as a training sample image after being manually marked out of the area where the target object is located; the computer device can also acquire an image of an area where the target object is marked and the target object is located from a third-party computer device in a wired or wireless mode to serve as a training sample image.
When the target object is an object to be identified, the computer equipment can acquire a video frame sequence containing the target object in a video shooting mode of the target object, and the video frames in the video frame sequence containing the target object can be used as training sample images after being manually marked out of the region where the target object is located; the computer equipment can also acquire a video frame sequence which comprises the target object and marks the area where the target object is located from third-party computer equipment in a wired or wireless mode, and one frame of image is selected from the video frame sequence to serve as a training sample image.
And 204, determining a regression area corresponding to the training sample image based on the image characteristics of the training sample image and the regression network of the object positioning model.
An object localization model refers to a machine learning model that can perform target object localization on a given image or video frame sequence. The object localization model includes a regression network and a classification network. The regression network is used for performing boundary regression and determining the regression region corresponding to the training sample image. A regression region here refers to a region obtained by performing boundary regression; there may be one or more regression regions, where "more" means at least two. The classification network is used for classifying the image content in the regression region, and the classification result may be "target object present" or "target object absent". The classification network may be any general network usable for classification, for example an SVM (Support Vector Machine).
The image features of the training sample image may include texture features, color features, gradient features, spatial relationship features, and the like. Texture features describe the surface properties of objects in the image. Color features describe the color of each object in the image. Gradient features describe the shape and structure of objects in the image. Spatial relationship features refer to the mutual spatial positions or relative direction relationships among multiple targets segmented from the image; these relationships can be divided into connection/adjacency relationships, overlap relationships, inclusion/containment relationships, and the like.
Specifically, the computer device may extract image features from the training sample image, and perform boundary regression based on the image features of the training sample image and the regression network of the object location model, thereby determining a regression region corresponding to the training sample image.
In one embodiment, the object location model further comprises a feature extraction network, and the computer device inputs the training sample image into the feature extraction network of the object location model to obtain the image features of the training sample image.
In one embodiment, when the target object is a detection object, after extracting the image features of the training sample image, the computer device may input the image features into a regression network of the object location model to determine a regression region corresponding to the training sample image.
In another embodiment, when the target object is an object to be identified, the computer device, when acquiring a training sample image including the target object, also simultaneously acquires a training reference image including the same target object as the training sample image, determines a first cross-correlation training feature between the training sample image and the training reference image based on an image feature of the training sample image and an image feature of the training reference image, and inputs the first cross-correlation training feature into a regression network to determine a regression region corresponding to the training sample image.
And step 206, calculating regression accuracy of the regression region based on the target object labeling region corresponding to the training sample image, and calculating regression loss based on the regression accuracy.
The target object labeling area (GT) is an area where a target object is marked on the training sample image. The regression accuracy is used for representing the position accuracy of the regression region, and the regression accuracy and the position accuracy are in positive correlation, namely the larger the regression accuracy is, the more accurate the position of the regression region is. The regression loss is used for representing the difference between the target object labeling area and the regression area, and the larger the difference between the target object labeling area and the regression area is, the larger the regression loss is.
Specifically, the computer device may determine a regression accuracy of the regression region based on a degree of coincidence between the regression region and the target object labeling region, and further calculate a regression loss based on the regression accuracy.
It is to be understood that when the training sample image corresponds to a plurality of regression regions, the computer device may determine the regression accuracy of each regression region based on the degree of coincidence between each regression region and the target object labeling region, respectively. Accordingly, the computer device may calculate the regression loss for each regression region based on the regression accuracy for each regression region separately.
In one embodiment, the computer device may obtain the intersection region between the target object labeling region and the regression region, obtain the union region between the target object labeling region and the regression region, and determine the regression accuracy according to the ratio of the intersection region to the union region corresponding to the regression region.
In one embodiment, after calculating the regression accuracy, the computer device may calculate the regression loss as (1 - regression accuracy).
And step 208, determining the classification confidence of the regression region based on the image features of the training sample image and the classification network of the object positioning model, and calculating the classification loss based on the classification confidence.
The classification confidence is used for representing the possibility that the target object exists in a given regression region: the higher the classification confidence of a regression region, the more likely the target object exists in that regression region, and the more accurate the classification of the regression region. The classification loss is used for representing classification accuracy; the classification loss and the classification accuracy are negatively correlated, that is, the higher the classification accuracy, the smaller the classification loss.
Specifically, after extracting the image features of the training sample image, the computer device may determine the classification confidence of each regression region based on the image features and the classification network of the object location model. For each regression region, the computer device determines whether the regression region is a positive sample or a negative sample, calculates the classification loss based on the classification confidence and a positive sample label when the regression region is a positive sample, and calculates the classification loss based on the classification confidence and a negative sample label when the regression region is a negative sample.
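A minimal sketch of one possible classification loss, assuming a binary cross-entropy form (the text does not name a specific loss):

```python
import math

def classification_loss(confidence, is_positive):
    """Binary cross-entropy between a regression region's classification
    confidence and its positive (1) or negative (0) sample label.
    BCE is an assumed choice; the text only requires a loss computed
    from the classification confidence and the sample label."""
    label = 1.0 if is_positive else 0.0
    eps = 1e-7  # keep the confidence away from 0 and 1 to avoid log(0)
    p = min(max(confidence, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
```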
In one embodiment, when the target object is a detection object, the computer device may input image features of the training sample image into a classification network of the object localization model to determine a classification confidence for each regression region corresponding to the training sample image.
In another embodiment, when the target object is an object to be recognized, as can be seen from the description of the above embodiment, when the computer device acquires the training sample image including the target object, it also acquires the training reference image including the same target object as the training sample image, and then the computer device may determine a second cross-correlation training feature between the training sample image and the training reference image based on the image features of the training sample image and the image features of the training reference image, and input the second cross-correlation training feature into the classification network to determine the classification confidence of each regression region corresponding to the training sample image. It can be understood that the second cross-correlation training feature is generally a different feature from the first cross-correlation training feature of the above embodiment, and the pixel points at the same positions of the first cross-correlation training feature and the second cross-correlation training feature correspond to the same feature in the training sample image.
Step 210, the classification loss is updated based on the regression accuracy and the regression loss is updated based on the classification confidence.
Specifically, the computer device may multiply the regression accuracy of each regression region by the classification loss of the corresponding regression region to update the classification loss of each regression region, thereby establishing a regression-assisted connection so that a regression region with accurate regression has a higher classification confidence. Likewise, the computer device may multiply the classification confidence of each regression region by the regression loss of the corresponding regression region to update the regression loss of each regression region, thereby establishing a classification-assisted connection so that a regression region with a higher classification confidence regresses more accurately. Through the regression-assisted connection and the classification-assisted connection, the inconsistency between the classification network and the regression network is reduced.
For example, assuming that a training sample image corresponds to two regression regions A and B, the loss updates for A and B are shown in Table 1 below:

TABLE 1

                              Regression region A    Regression region B
Regression accuracy           X1                     X2
Regression loss               V1                     V2
Classification confidence     Y1                     Y2
Classification loss           W1                     W2
Updated regression loss       Y1*V1                  Y2*V2
Updated classification loss   X1*W1                  X2*W2
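A minimal sketch of the cross-update shown in Table 1, with hypothetical names:

```python
def cross_update_losses(reg_accuracy, reg_loss, cls_confidence, cls_loss):
    """Per-region loss update as in Table 1: the updated regression loss is
    Y*V (classification confidence times regression loss) and the updated
    classification loss is X*W (regression accuracy times classification
    loss). All arguments are per-region lists of equal length."""
    updated_reg = [y * v for y, v in zip(cls_confidence, reg_loss)]
    updated_cls = [x * w for x, w in zip(reg_accuracy, cls_loss)]
    return updated_reg, updated_cls

# For regions A and B: updated_reg == [Y1*V1, Y2*V2], updated_cls == [X1*W1, X2*W2].
```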
And 212, training the object positioning model according to the updated classification loss and the updated regression loss until the training stopping condition is met, and obtaining the trained object positioning model.
Specifically, the computer device may superimpose the updated classification loss and the updated regression loss to obtain a composite loss, and then train the object location model according to the composite loss until a training stop condition is satisfied, to obtain the trained object location model. In the training process, stochastic gradient descent, AdaGrad (Adaptive Gradient), AdaDelta (an improved AdaGrad algorithm), RMSProp (another improved AdaGrad algorithm), Adam (Adaptive Moment Estimation), and the like may be used to adjust the network parameters of the object location model.
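A minimal PyTorch-style sketch of one training step, assuming the updated losses are scalar tensors from the model's forward pass; Adam is just one of the optimizers listed above:

```python
import torch

def train_step(optimizer, updated_cls_loss, updated_reg_loss):
    """Superimpose the updated losses into a composite loss and take one
    gradient step on the object location model's parameters."""
    composite_loss = updated_cls_loss + updated_reg_loss
    optimizer.zero_grad()
    composite_loss.backward()
    optimizer.step()
    return composite_loss.item()

# e.g. optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```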
The trained object positioning model is used for carrying out object positioning on the input image. In one embodiment, after the computer device acquires the input image, the computer device may determine multiple regression regions of the input image based on the image features of the input image and the regression network of the trained object location model, and determine the classification confidence of each regression region based on the image features of the input image and the classification network of the trained object location model, and the computer device selects the regression region with the highest classification confidence as a target region corresponding to a target object in the input image, that is, a region where the predicted target object is located.
According to the object positioning model processing method, a training sample image including a target object is first acquired; a regression region corresponding to the training sample image is determined based on image features of the training sample image and a regression network of an object positioning model; a regression accuracy of the regression region is calculated based on a target object labeling region corresponding to the training sample image, and a regression loss is calculated based on the regression accuracy; a classification confidence of the regression region is determined based on the image features of the training sample image and the classification network of the object positioning model, and a classification loss is calculated based on the classification confidence; the classification loss is then updated based on the regression accuracy and the regression loss is updated based on the classification confidence; and finally the object positioning model is trained according to the updated classification loss and the updated regression loss until a training stop condition is met, to obtain a trained object positioning model. Because the losses of the regression network and the classification network are respectively updated in the training process, the classification confidence is taken into account in the regression loss of the regression network, and the regression accuracy is taken into account in the classification loss of the classification network. Therefore, after the object positioning model is trained with the updated losses, the inconsistency between the classification network and the regression network is alleviated in the trained model: a regression region with accurate regression has a higher classification confidence, and a regression region with a higher classification confidence regresses more accurately, thereby improving positioning accuracy.
In one embodiment, an object localization model processing method is provided, comprising the following steps 1.1-1.8:
1.1, acquiring a training sample image comprising a target object.
1.2, determining a regression area corresponding to the training sample image based on the image characteristics of the training sample image and the regression network of the object positioning model.
1.3, calculating the regression accuracy of the regression region based on the target object labeling region corresponding to the training sample image, and calculating the regression loss based on the regression accuracy.
And 1.4, determining the classification confidence of the regression region based on the image features of the training sample image and the classification network of the object positioning model, and calculating the classification loss based on the classification confidence.
1.5, updating the classification loss based on the regression accuracy and updating the regression loss based on the classification confidence.
1.6, determining the regression accuracy prediction value of the regression area based on the regression accuracy prediction network of the object positioning model.
Specifically, in this embodiment, the object location model further includes a regression accuracy prediction network, and the regression accuracy prediction network is configured to predict the regression accuracy of each regression region.
In one embodiment, when the target object is a detection object, the computer device may input the image features of the training sample image into the regression accuracy prediction network to obtain the regression accuracy prediction values of the respective regression regions.
In another embodiment, when the target object is an object to be recognized, as can be seen from the above embodiments, the computer device may determine a first cross-correlation training feature between the training sample image and the training reference image based on the image features of the training sample image and the image features of the training reference image, and then the computer device may input the first cross-correlation training feature into the regression accuracy prediction network to obtain the regression accuracy prediction value of each regression region.
And 1.7, determining the regression accuracy loss based on the regression accuracy predicted value of the regression area and the regression accuracy of the regression area.
The regression accuracy loss is used for representing the difference between the regression accuracy predicted value and the actual regression accuracy of the regression area, the larger the difference is, the larger the regression accuracy loss is, and otherwise, the smaller the difference is, the smaller the regression accuracy loss is.
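A minimal sketch assuming a squared-error form (the text only requires the loss to grow with the difference):

```python
def regression_accuracy_loss(predicted_iou, actual_iou):
    """Loss for the regression accuracy prediction network: larger when the
    predicted IoU is further from the IoU actually achieved by the
    regression region. The squared-error form is an assumption."""
    return (predicted_iou - actual_iou) ** 2
```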
1.8, training the object positioning model according to the regression accuracy loss, the updated classification loss and the updated regression loss.
Specifically, the computer device may superimpose the regression accuracy loss, the updated classification loss, and the updated regression loss to obtain a composite loss based on which the object location model is trained.
In one embodiment, after acquiring the input image, the computer device may determine a plurality of regression regions of the input image based on the image features of the input image and the regression network of the trained object location model, determine a regression accuracy prediction value for each regression region based on the image features and the regression accuracy prediction network, and determine a classification confidence for each regression region based on the image features and the classification network. The computer device then multiplies the classification confidence of each regression region by its regression accuracy prediction value to obtain a target location score for each regression region, and takes the regression region with the highest target location score as the target region corresponding to the target object in the input image, that is, the region where the predicted target object is located.
In the above embodiment, the computer device trains the object location model through the regression accuracy loss, the updated classification loss, and the updated regression loss. On the one hand, because the losses of the regression network and the classification network are respectively updated in the training process, the classification confidence is taken into account in the regression loss of the regression network and the regression accuracy is taken into account in the classification loss of the classification network, so that the inconsistency between the classification network and the regression network is reduced. On the other hand, because the regression accuracy loss of the regression accuracy prediction network is added to the overall loss, the regression accuracy is further taken into account in the trained object location model, thereby further improving the location accuracy.
In an embodiment, an object location model processing method is provided, where in the embodiment, a target object is an object to be recognized, and the object location model processing method specifically includes the following steps 2.1 to 2.11:
and 2.1, acquiring a training sample image comprising the target object and a training reference image corresponding to the training sample image.
Wherein the training reference image and the training sample image comprise the same target object.
In one embodiment, the computer device may obtain an image frame sequence of a region where the target object is located, obtain a frame of image from the image frame sequence that is earlier in time as a training reference image, and obtain a frame of image that is later in time than the training reference image as a training sample image. The image frame sequence is a set of a series of images in chronological order, and specifically may be a segment of video frame sequence in a video, or may be a multi-frame image continuously acquired by an image acquisition device.
2.2, acquiring the image characteristics of the training sample image and the image characteristics of the training reference image based on the characteristic extraction network of the object positioning model.
The feature extraction network is a machine learning model trained to have image feature extraction capability. The image features extracted by general machine learning models with image feature extraction capability, such as ResNet-50, VGG16, and MobileNetV2, meet the requirements of the object positioning method provided in the present application, so such a general model can be used as the feature extraction network of the object positioning method provided in the present application.
In one embodiment, since the computer device needs to extract features from the training sample image and the training reference image at the same time, a twin network (Siamese network) may be used for feature extraction, i.e., the feature extraction networks in the object localization model are two feature extraction networks with the same model structure and shared model parameters. In this embodiment, step 2.2 specifically includes: respectively inputting the training sample image and the training reference image into the two feature extraction networks; and outputting the image features of the training sample image and the image features of the training reference image in parallel through the two feature extraction networks.
In a specific embodiment, the image features of the training sample image may form a first image feature map, and the image features of the training reference image may form a second image feature map. The resolution of the first image feature map differs from that of the training sample image, and there is a correspondence between pixel points in the first image feature map and pixel points in the training sample image; this correspondence is related to the feature extraction parameters (such as convolution kernel size, stride, and the like) used by the feature extraction network. Similarly, the resolution of the second image feature map differs from that of the training reference image, and there is a correspondence between pixel points in the second image feature map and pixel points in the training reference image, which is also related to the feature extraction parameters used by the feature extraction network.
Specifically, two feature extraction networks are provided, and the two feature extraction networks are completely consistent in model structure and model parameters. The computer equipment respectively inputs the training sample image and the training reference image into the two feature extraction networks, so that the two feature extraction networks respectively extract the image features of the training sample image and the second image features of the training reference image.
For example, referring to fig. 3, fig. 3 is a schematic structural diagram of an object location model in an embodiment. It can be seen that the training sample image 302 and the training reference image 304 are respectively input into two feature extraction networks, one of which outputs the image features of the training sample image, and the other of which outputs the image features of the training reference image.
In the embodiment, the two feature extraction networks extract the image features in parallel, so that the extraction efficiency of the image features is improved, and the efficiency of positioning the target object is further improved.
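A minimal PyTorch sketch of such a weight-sharing twin extractor; the small convolutional backbone is a stand-in for backbones such as ResNet-50, VGG16 or MobileNetV2:

```python
import torch.nn as nn

class SiameseExtractor(nn.Module):
    """Two feature extraction branches with the same structure and shared
    parameters, applied to the training sample image and the training
    reference image."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, sample_img, reference_img):
        # The same weights produce both feature maps (parameter sharing),
        # so the two branches behave as one twin network.
        return self.backbone(sample_img), self.backbone(reference_img)
```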
And 2.3, determining a first cross-correlation training feature between the training sample image and the training reference image based on the image feature of the training sample image and the image feature of the training reference image.
The first cross-correlation training feature, that is, the first cross-correlation feature mentioned in the above embodiments, is referred to as the first cross-correlation training feature in this embodiment because the first cross-correlation training feature is a feature determined in the process of training the object positioning model. Similarly, the second cross-correlation feature mentioned in the above embodiments is referred to as a second cross-correlation training feature in the present embodiment. The cross-correlation features fuse the image features of the training sample image and the image features of the training reference image, and can be used for representing the similarity degree between the training sample image and the training reference image.
In one embodiment, the step 2.3 specifically includes: respectively performing first convolution operation on the image features of the training sample image and the image features of the training reference image based on the object positioning model to obtain first training intermediate features of the training sample image and first reference intermediate features of the training reference image; and performing cross-correlation operation on the first training intermediate feature and the first reference intermediate feature to obtain a first cross-correlation training feature between the training sample image and the training reference image.
The image features of the training sample image and the image features of the training reference image may both form a feature map, and the first convolution operation may specifically be to input the feature map of the training sample image and the feature map of the training reference image into two independent convolution layers in the target object positioning network, respectively, and perform dimension reduction on the feature map of the training sample image and the feature map of the training reference image through the two convolution layers to reduce the size of the feature maps, so as to obtain a first training intermediate feature of the training sample image and a first reference intermediate feature of the training reference image. It will be appreciated that the first training intermediate feature and the first reference intermediate feature are both in the form of feature maps.
The cross-correlation operation refers to a correlation convolution operation (convolution) performed in a certain quantization range. Specifically, after obtaining a first training intermediate feature of a training sample image and a first reference intermediate feature of a training reference image, the computer device performs channel-by-channel cross-correlation convolution on the first training intermediate feature and the first reference intermediate feature to obtain a first cross-correlation training feature between the training sample image and the training reference image.
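A minimal sketch of the channel-by-channel cross-correlation, expressed as a grouped convolution with the reference intermediate feature as the kernel; the tensor shapes are assumptions:

```python
import torch.nn.functional as F

def depthwise_xcorr(sample_feat, reference_feat):
    """Channel-by-channel cross-correlation: slide the first reference
    intermediate feature over the first training intermediate feature,
    one channel at a time (groups = number of channels).

    sample_feat:    (1, C, Hs, Ws) first training intermediate feature.
    reference_feat: (1, C, Hr, Wr) first reference intermediate feature,
                    with Hr <= Hs and Wr <= Ws.
    Returns the (1, C, Hs - Hr + 1, Ws - Wr + 1) cross-correlation feature.
    """
    c = sample_feat.size(1)
    kernel = reference_feat.view(c, 1, *reference_feat.shape[2:])
    return F.conv2d(sample_feat, kernel, groups=c)
```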
And 2.4, inputting the first cross-correlation training characteristics into a regression network to determine a regression area corresponding to the training sample image.
And 2.5, calculating the regression accuracy of the regression region based on the target object labeling region corresponding to the training sample image, and calculating the regression loss based on the regression accuracy.
In one embodiment, the regression accuracy of the regression region may be determined by calculating IoU (Intersection-over-unity ratio) between the regression region and the target object labeling region corresponding to the training sample image. IoU, the degree of coincidence between the target object labeling region and the regression region can be reflected, the higher the degree of coincidence, the higher the regression accuracy, so the IOU can be used as the regression accuracy of the regression region. Specifically, the computer equipment acquires an intersection region between a target object labeling region and a regression region; acquiring a union region between a target object labeling region and a regression region; and determining regression accuracy according to the area ratio of the intersection area and the union area corresponding to the regression area. Wherein the area ratio is IoU, specifically referring to the following formula (1):
IoU = area(A ∩ B) / area(A ∪ B)        formula (1)
Where area(·) represents the area, A refers to the target object labeling region, and B refers to the regression region; "∩" denotes intersection and "∪" denotes union.
Fig. 4 is a schematic diagram illustrating positions of a target object labeling area and a regression area in a specific embodiment. Box A represents the position of the target object labeling area in the training sample image. Box B represents the position of the regression area in the training sample image. As can be seen from fig. 4, the overlapping portion of A and B, i.e., the intersection, occupies 6 pixels (the pixels in the 5th row, 4th to 6th columns, and in the 6th row, 4th to 6th columns), and the union of A and B occupies 18 pixels, so IoU = 6/18 ≈ 0.33.
After computing IoU between the regression region and the target object labeling region corresponding to the training sample image, the computer device may compute the regression loss based on IoU. In one embodiment, the regression loss may be calculated as -ln(IoU). In another embodiment, the regression loss may be calculated as 1 - IoU.
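A minimal NumPy sketch of formula (1) and the two regression loss variants above; the (x1, y1, x2, y2) box representation and the epsilon guard against log(0) are assumptions of this example, not requirements of the text:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU between a labeling region A and a regression region B,
    each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def regression_loss(iou_value, variant="ln"):
    # The two variants mentioned in the text: -ln(IoU) and 1 - IoU.
    eps = 1e-6  # avoid log(0) when the regions do not overlap
    return -np.log(iou_value + eps) if variant == "ln" else 1.0 - iou_value
```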
And 2.6, inputting the first cross-correlation training characteristics into a regression accuracy prediction network to determine a regression accuracy prediction value of the regression area.
And 2.7, determining the regression accuracy loss based on the regression accuracy predicted value of the regression area and the regression accuracy of the regression area.
With reference to fig. 3, after obtaining the image features of the training sample image and the training reference image, the computer device inputs the image features of the training sample image into the first convolution layer and the image features of the training reference image into the second convolution layer. The first convolution layer performs a convolution operation on the image features of the training sample image to obtain the first training intermediate feature of the training sample image, and the second convolution layer performs a convolution operation on the image features of the training reference image to obtain the first reference intermediate feature of the training reference image. A cross-correlation operation is then performed on the first training intermediate feature and the first reference intermediate feature to obtain the first cross-correlation training feature, which is input into the regression network and the regression accuracy prediction network, respectively.
And 2.8, determining a second cross-correlation training feature between the training sample image and the training reference image based on the image feature of the training sample image and the image feature of the training reference image.
In one embodiment, the computer device may perform a second convolution operation on the image features of the training sample image and the image features of the training reference image respectively based on the object positioning model to obtain second training intermediate features of the training sample image and second reference intermediate features of the training reference image; and performing cross-correlation operation on the second training intermediate feature and the second reference intermediate feature to obtain a second cross-correlation training feature between the training sample image and the training reference image.
The second convolution operation may specifically be that the feature map of the training sample image and the feature map of the training reference image are input into another two independent convolution layers in the target object positioning network, and the feature map of the training sample image and the feature map of the training reference image are subjected to dimension reduction through the two convolution layers to reduce the size of the feature maps, so as to obtain a second training intermediate feature of the training sample image and a second reference intermediate feature of the training reference image. It is to be understood that the second training intermediate feature and the second reference intermediate feature herein are both in the form of feature maps. It will also be appreciated that the parameters of the two separate convolutional layers may be the same or different from those of the two separate convolutional layers of the above embodiments, and the resulting characteristics may be the same or different.
And after the computer equipment obtains the second training intermediate feature of the training sample image and the second reference intermediate feature of the training reference image, performing channel-by-channel cross-correlation convolution on the second training intermediate feature and the second reference intermediate feature to obtain a second cross-correlation training feature between the training sample image and the training reference image.
With continued reference to fig. 3, the computer device inputs the image features of the training sample image into a third convolutional layer, and simultaneously inputs the image features of the training reference image into a fourth convolutional layer, the third convolutional layer performs convolution operation on the image features of the training sample image to obtain second training intermediate features of the training sample image, and the fourth convolutional layer performs convolution operation on the image features of the training reference image to obtain second reference intermediate features of the training reference image. And then performing cross-correlation operation on the second training intermediate feature and the second reference intermediate feature to obtain a second cross-correlation training feature, and inputting the second cross-correlation training feature into a classification network.
And 2.9, inputting the second cross-correlation training features into a classification network to determine the classification confidence of the regression region, and calculating the classification loss based on the classification confidence.
In one embodiment, the computer device may calculate the Focal Loss as the classification loss based on the classification confidence. The Focal Loss is calculated as follows, where α is a balance factor for balancing positive and negative samples, γ is a focusing factor that down-weights easy samples, y is the sample label, and y' is the classification confidence:
FL(y') = -α · (1 - y')^γ · log(y'),          when y = 1
FL(y') = -(1 - α) · (y')^γ · log(1 - y'),    when y = 0        formula (2)
Based on the above formula, this embodiment first needs to determine whether each regression region is a positive sample or a negative sample. The computer device obtains the intersection region between the target object labeling region and the regression region, obtains the union region between the target object labeling region and the regression region, and computes the area ratio of the intersection region to the union region (see the IoU calculation formula (1) in the above embodiment); this area ratio is IoU between the regression region and the target object labeling region.
Further, for each regression region: when IoU corresponding to the regression region is greater than a first preset threshold, the regression region can be determined to be a positive sample, and the classification loss is determined according to the positive sample label and the classification confidence corresponding to the training sample image; since the positive sample label is 1, the classification loss of a positive sample follows the y = 1 case of the Focal Loss calculation formula (2). When IoU corresponding to the regression region is smaller than a second preset threshold, the regression region can be determined to be a negative sample, and the classification loss is determined according to the negative sample label and the classification confidence corresponding to the training sample image; since the negative sample label is 0, the classification loss of a negative sample follows the y = 0 case of formula (2). The first preset threshold and the second preset threshold may be set empirically, for example, the first preset threshold is set to 0.7 and the second preset threshold is set to 0.3.
In one embodiment, when IoU of a certain regression region is greater than the second preset threshold and smaller than the first preset threshold, the regression region is a sample whose classification is ambiguous. Such samples contribute little to training the classification network, so in order to improve training efficiency and accuracy, no classification loss may be calculated for them.
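A NumPy sketch combining formula (2) with the IoU-based label assignment described above; the α and γ values and the function names are assumptions of this example (the text only gives 0.7 and 0.3 as example thresholds):

```python
import numpy as np

def focal_loss(y_pred, label, alpha=0.25, gamma=2.0):
    """Focal Loss of formula (2); alpha and gamma here are common defaults,
    not values fixed by this embodiment."""
    y_pred = np.clip(y_pred, 1e-6, 1.0 - 1e-6)  # numerical stability
    if label == 1:   # positive sample, y = 1 case of formula (2)
        return -alpha * (1.0 - y_pred) ** gamma * np.log(y_pred)
    else:            # negative sample, y = 0 case of formula (2)
        return -(1.0 - alpha) * y_pred ** gamma * np.log(1.0 - y_pred)

def classification_loss(iou_value, confidence, t_pos=0.7, t_neg=0.3):
    """Assign a label from IoU and compute the focal loss; ambiguous regions
    (t_neg <= IoU <= t_pos) are skipped, as described above."""
    if iou_value > t_pos:
        return focal_loss(confidence, label=1)
    if iou_value < t_neg:
        return focal_loss(confidence, label=0)
    return None  # ambiguous sample: no classification loss is calculated
```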
2.10, updating the classification loss based on the regression accuracy and updating the regression loss based on the classification confidence.
In one embodiment, the regression accuracy loss may be a BCE Loss (Binary Cross Entropy Loss) calculated as the following formula (3), where N is the number of training sample images in a batch, y_i represents the label of training sample image i (1 for the positive class, 0 for the negative class), and p_i is the probability that training sample image i is predicted to be positive:
Loss = -(1/N) · Σ_{i=1}^{N} [ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]        formula (3)
BCE Loss is commonly used together with the sigmoid function, which is shown in the following formula (4):
sigmoid(x) = 1 / (1 + e^(-x))        formula (4)
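A minimal NumPy sketch of formulas (3) and (4); applying the sigmoid inside the loss and clipping for numerical stability are implementation choices of this example, not requirements of the text:

```python
import numpy as np

def sigmoid(x):
    # Formula (4): squashes raw network outputs (logits) into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(logits, labels):
    """Binary cross entropy of formula (3), computed from raw logits.

    logits: shape (N,), raw outputs of the regression accuracy prediction network
    labels: shape (N,), targets y_i
    """
    p = np.clip(sigmoid(np.asarray(logits, dtype=np.float64)), 1e-6, 1.0 - 1e-6)
    labels = np.asarray(labels, dtype=np.float64)
    return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
```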
With continued reference to fig. 3, it can be seen that, after inputting the first cross-correlation training feature into the regression network and the regression accuracy prediction network respectively, and after inputting the second cross-correlation training feature into the classification network, the computer device calculates the IoU loss as the regression loss according to the regression accuracy of the regression branch, and updates this regression loss with the classification confidence of the classification branch through the classification auxiliary connection A. It calculates the focal loss as the classification loss according to the classification confidence of the classification branch, and updates this classification loss with the regression accuracy of the regression branch through the regression auxiliary connection B. It also calculates the binary cross entropy loss as the regression accuracy loss according to the regression accuracy prediction of the regression accuracy prediction network and the regression accuracy of the regression branch.
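The precise form of the mutual update through auxiliary connections A and B is not spelled out in this passage; the following is a plausible sketch only, assuming each branch's loss is re-weighted by the other branch's output:

```python
def update_losses(regression_loss, classification_loss, iou_value, confidence):
    """Illustrative mutual update; the weighting form is an assumption.

    - classification auxiliary connection A: the regression loss of a region
      is re-weighted by that region's classification confidence;
    - regression auxiliary connection B: the classification loss of a region
      is re-weighted by that region's regression accuracy (IoU).
    """
    updated_regression_loss = confidence * regression_loss
    updated_classification_loss = iou_value * classification_loss
    return updated_regression_loss, updated_classification_loss
```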
And 2.11, training the object positioning model according to the regression accuracy loss, the updated classification loss and the updated regression loss until the training stopping condition is met, and obtaining the trained object positioning model.
In this embodiment, when the target object is an object to be identified, the features of the training sample image and the training reference image are extracted separately through the twin network, and the cross-correlation features between the two images are obtained. These cross-correlation features fuse the features of the training sample image and the training reference image and reflect the similarity between them, so the target recognition problem can be converted into a similarity comparison problem, and the target object positioning network obtained through training can quickly and accurately locate the target object in the input image when performing target positioning.
In one embodiment, determining the regression region corresponding to the training sample image based on the image features of the training sample image and the regression network of the object localization model comprises: determining a predicted central point position corresponding to a target object in a training sample image and a predicted size corresponding to the target object based on the image characteristics of the training sample image and a regression network of an object positioning model; and determining a regression area according to the position of the predicted central point and the predicted size.
Specifically, in the present embodiment, the regression region is determined based on an anchor-free strategy. The computer device may determine each pixel position in the feature map input into the regression network as a predicted central point position. For each predicted central point position, the computer device further determines the predicted size corresponding to the target object, and determines the bounding box constructed from the predicted central point position and its corresponding predicted size as a regression region.
The predicted central point position corresponding to the target object is the predicted position of the target object's central point. The central point of the target object may be the central point of the smallest bounding box that can enclose the target object; the bounding box may be a polygon, a circle, and the like, and the central point of the bounding box may be, for example, the diagonal intersection of a polygon or the center of a circle. For example, referring to fig. 5, taking a virtual object as an example, the central point may be the diagonal intersection 502 of the smallest quadrangle that can enclose the virtual object.
The predicted size corresponding to the target object may be the size of the smallest bounding box that can enclose the target object, such as the width and height of a quadrangle or the diameter of a circle. With continued reference to FIG. 5, the dimensions may be the width 504 and height 506 of the smallest quadrangle that can enclose the virtual object.
In this embodiment, the anchor-free strategy avoids the complicated step of setting prior frames (anchor frames) and improves the efficiency of determining regression regions, thereby improving the efficiency of model training.
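A sketch of this anchor-free decoding under stated assumptions: the regression head outputs a 2-channel map of predicted widths and heights, and a feature-map stride maps positions back to image coordinates. Neither the channel layout nor the stride value is fixed by the text:

```python
import numpy as np

def decode_anchor_free(size_pred, stride=8):
    """Decode regression regions from an anchor-free regression head.

    size_pred: (2, H, W) predicted (width, height) of the target for the box
               centered at each feature-map position; layout and stride are
               assumptions of this sketch.
    returns:   (H*W, 4) boxes as (x1, y1, x2, y2) in input-image coordinates.
    """
    _, h, w = size_pred.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Map each feature-map position back to a predicted central point position.
    cx = (xs + 0.5) * stride
    cy = (ys + 0.5) * stride
    bw, bh = size_pred[0], size_pred[1]
    boxes = np.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], axis=-1)
    return boxes.reshape(-1, 4)
```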
In one embodiment, determining the regression region corresponding to the training sample image based on the image features of the training sample image and the regression network of the object location model comprises: acquiring the position of an anchor frame corresponding to a training sample image; determining the offset corresponding to the anchor frame based on the image characteristics of the training sample image and the regression network of the object positioning model; the regression region is determined based on the location of the anchor frame and the offset of the anchor frame.
In this embodiment, an anchor-based strategy is used to determine the regression regions. A plurality of anchor frames are set in advance in the training sample image, and the offset corresponding to each anchor frame is determined based on the image features of the training sample image and the regression network of the object localization model. Since the position of each anchor frame is fixed and known, the computer device may determine a plurality of regression regions based on the position of each anchor frame and its respective offset.
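The text does not specify the offset parameterization, so the following sketch uses the common R-CNN-style (dx, dy, dw, dh) encoding as an assumption:

```python
import numpy as np

def decode_anchor_based(anchors, offsets):
    """anchors: (N, 4) fixed anchor frames (x1, y1, x2, y2);
    offsets: (N, 4) predicted (dx, dy, dw, dh). The parameterization below
    is one widely used choice, not mandated by the patent text."""
    ax = (anchors[:, 0] + anchors[:, 2]) / 2
    ay = (anchors[:, 1] + anchors[:, 3]) / 2
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    cx = ax + offsets[:, 0] * aw          # shift the anchor center
    cy = ay + offsets[:, 1] * ah
    w = aw * np.exp(offsets[:, 2])        # rescale the anchor size
    h = ah * np.exp(offsets[:, 3])
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
```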
In one embodiment, the method further includes a target recognition step, and the target recognition step specifically includes:
1. The method comprises the steps of obtaining a video to be identified, determining a first frame of video frame including an object to be identified in the video to be identified, obtaining a reference image to be identified, and sequentially obtaining video frames from a video frame next to the first frame of video frame to serve as the image to be identified.
The video to be identified may be recorded video data or video data recorded in real time.
Specifically, a first frame of video frame of the video to be recognized is used as a reference image, and the objects to be recognized are sequentially positioned in each subsequent frame of video frame, so that the objects to be recognized are recognized in the video to be recognized.
2. The method comprises the steps of inputting an image to be recognized and a reference image to be recognized into an object positioning model respectively, and obtaining a first cross-correlation recognition characteristic and a second cross-correlation recognition characteristic between the image to be recognized and the reference image to be recognized based on the object positioning model.
The first cross-correlation identification feature and the second cross-correlation identification feature are cross-correlation features, and are referred to as cross-correlation identification features herein for distinguishing from cross-correlation features obtained when the object positioning model is trained in the above embodiments, and a specific obtaining manner of the cross-correlation identification features may refer to the description in the above embodiments, which is not described herein again.
3. The first cross-correlation identification features are input into a regression network of the object localization model to determine a plurality of regression regions of the image to be identified.
4. And inputting the second cross-correlation identification features into a classification network of the object positioning model to determine the classification confidence of each regression region of the image to be identified.
5. And positioning the object to be recognized from the image to be recognized according to the classification confidence of each regression region of the image to be recognized.
In one embodiment, the computer device may select the regression region with the highest classification confidence to be determined as the region where the object to be recognized is located in the image to be recognized, so as to locate the object to be recognized from the image to be recognized.
In another embodiment, the method further includes a target identification step, and the target identification step specifically includes:
1. the method comprises the steps of obtaining a video to be identified, determining a first frame of video frame including an object to be identified in the video to be identified, obtaining a reference image to be identified, and sequentially obtaining video frames from a video frame next to the first frame of video frame to be used as the image to be identified.
2. The method comprises the steps of respectively inputting an image to be recognized and a reference image to be recognized into an object positioning model, and obtaining a first cross-correlation recognition feature and a second cross-correlation recognition feature between the image to be recognized and the reference image to be recognized based on the object positioning model.
3. The first cross-correlation identification features are input into a regression network of the object localization model to determine a plurality of regression regions of the image to be identified.
4. And inputting the first cross-correlation identification characteristics into a regression accuracy prediction network of the object positioning model to obtain regression accuracy prediction values corresponding to all regression areas of the image to be identified.
5. And inputting the second cross-correlation identification features into a classification network of the object positioning model to determine the classification confidence of each regression region of the image to be identified.
6. And correspondingly multiplying the regression accuracy predicted value of each regression region of the image to be recognized with the classification confidence coefficient to obtain a target positioning score corresponding to each regression region of the image to be recognized.
For example, assume the image to be recognized has two regression regions, regression region A and regression region B; the regression accuracy prediction value of regression region A is A1 and its classification confidence is A2, while the regression accuracy prediction value of regression region B is B1 and its classification confidence is B2. Then the localization score of regression region A is A1 × A2, and the localization score of regression region B is B1 × B2 (a minimal code sketch of this score fusion follows this list).
7. And positioning the object to be recognized from the image to be recognized according to the target positioning scores corresponding to the regression regions of the image to be recognized.
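A minimal sketch of steps 6 and 7 above, assuming the regression regions, regression accuracy prediction values, and classification confidences are available as arrays:

```python
import numpy as np

def locate(boxes, iou_preds, confidences):
    """boxes: (N, 4) regression regions; iou_preds: (N,) regression accuracy
    prediction values; confidences: (N,) classification confidences."""
    scores = np.asarray(iou_preds) * np.asarray(confidences)  # target localization scores
    best = int(np.argmax(scores))
    return boxes[best], scores[best]

# With the example above: the region whose product (A1*A2 vs B1*B2) is larger wins.
```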
In an embodiment, as shown in fig. 6, an object positioning method is provided, which is described by taking an example that the method is applied to a computer device, where the computer device may be the terminal 102 or the server 104 in fig. 1, and the method specifically includes the following steps:
step 602, an input image is acquired.
And step 604, acquiring the trained object positioning model.
Acquiring a trained object positioning model; the object positioning model comprises a classification network and a regression network; the object positioning model is obtained through the training of target regression loss of a regression network and target classification loss of a classification network; the target regression loss is obtained by updating the initial regression loss through the classification confidence coefficient of the regression region; the target classification loss is obtained by updating the initial classification loss through the regression accuracy of the regression region; the regression region is determined based on the image features of the training sample images and a regression network of the object positioning model; the regression accuracy of the regression region is calculated based on a target object labeling region corresponding to the training sample image; the classification confidence of the regression region is determined based on the image features of the training sample images and the classification network; the initial classification loss is calculated based on the classification confidence; the initial regression loss is calculated based on the regression accuracy.
Step 606, determining a plurality of regression regions of the input image based on the image features of the input image and the regression network.
Step 608, determining a classification confidence of each regression region of the input image based on the image features of the input image and the classification network.
And step 610, positioning the target object from the input image according to the classification confidence of each regression region of the input image to obtain the region where the target object is located.
For the explanation of the above steps 602-610, reference may be made to the description in the above embodiments, which are not repeated herein.
According to the object positioning method, the object positioning model is trained with the target regression loss of the regression network and the target classification loss of the classification network. During training, the losses of the regression network and the classification network are each updated so that the regression loss of the regression network takes the classification confidence into account, and the classification loss of the classification network takes the regression accuracy into account. After the model is trained with the updated losses, the inconsistency between the classification network and the regression network is therefore alleviated in the trained model: regression regions that regress accurately receive higher classification confidence, and regression regions with higher classification confidence regress more accurately, which improves positioning accuracy.
In one embodiment, the target object is an object to be identified; acquiring the input image includes: acquiring an input image and a reference image corresponding to the input image; the reference image is an image of a target object in one frame before the input image in the image frame sequence in which the input image is positioned; determining a plurality of regression regions of the input image based on the image features of the input image and the regression network comprises: determining a first cross-correlation feature between the input image and the reference image based on the image features of the input image and the image features of the reference image; inputting the first cross-correlation features into a regression network to determine a plurality of regression regions of the input image; determining a classification confidence for each regression region of the input image based on the image features of the input image and the classification network comprises: determining a second cross-correlation feature between the input image and the reference image based on the image features of the input image and the image features of the reference image; the second cross-correlated features are input to a classification network to determine classification confidences for respective regression regions of the input image.
The input image and the reference image are both single frames in an image frame sequence. The image frame sequence is a set of images arranged in chronological order; it may specifically be a video frame sequence in a video, or multiple frames of images continuously acquired by an image acquisition device. The input image and the reference image contain the same target object: the target object in the input image is to be positioned, while the target object in the reference image has already been positioned. The target object is the object to be recognized in the image frame sequence, and may be an independent living body or object, such as a natural person, an animal, a vehicle, or a virtual character, or a specific part, such as a face or a hand.
It will be appreciated that since the input image and the reference image come from the same image frame sequence, and the reference image is a frame processed prior to the input image, the target object to be identified can be selected through the reference image and then located in subsequent image frames, enabling recognition of the target object throughout the image frame sequence.
Specifically, the reference image may be the first frame image of the image frame sequence, or an intermediate frame image. When the reference image is the first frame image, the target object can be positioned based on a user operation, so that subsequent image frames can acquire the target object to be recognized through the reference image. When the reference image is an intermediate frame image, it may be the frame immediately preceding the input image, so that the input image can acquire the target object to be recognized through the target object positioned in that previous frame.
In a specific embodiment, the reference image may be a complete frame of the image frame sequence, or a body region cropped from that frame based on the position of the target object. Taking the position of the target object as the center, the region within a designated range is taken as the body region, so that subsequent image frames can quickly acquire the target object to be identified.
In a specific embodiment, the input image may be a complete frame of the image frame sequence, or a region of interest selected according to the recognition result of the previous frame. Considering that the position of the target object changes little between adjacent frames, the position of the target object in the previous frame is taken as the center, and the input image is cropped according to a specified search range to obtain the region of interest, thereby narrowing the search range of the current frame.
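A minimal sketch of this region-of-interest cropping, assuming a square search range centered on the previous frame's target position; the clamping at image borders is an implementation choice of the example:

```python
import numpy as np

def crop_roi(frame, prev_center, search_size):
    """frame: (H, W, 3) image; prev_center: (cx, cy) target center from the
    previous frame; search_size: side length of the square search range."""
    h, w = frame.shape[:2]
    cx, cy = prev_center
    x1 = int(max(0, cx - search_size / 2))
    y1 = int(max(0, cy - search_size / 2))
    x2 = int(min(w, cx + search_size / 2))
    y2 = int(min(h, cy + search_size / 2))
    return frame[y1:y2, x1:x2]  # narrowed search region for the current frame
```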
In one embodiment, a target object positioning application runs on the computer device. The computer device can start the target object positioning application according to a user operation, and the application acquires the image frame sequence and extracts the reference image and the input image from it.
In another embodiment, an object localization method is provided, in which the object localization model includes a regression network, a classification network, and a regression accuracy prediction network. The object localization model is obtained through training with the regression accuracy loss of the regression accuracy prediction network, the target regression loss of the regression network, and the target classification loss of the classification network. The target regression loss of the regression network and the target classification loss of the classification network are obtained in the same manner as in the above embodiment, and the regression accuracy loss is determined based on the regression accuracy prediction value of the regression region and the regression accuracy of the regression region; the regression accuracy prediction value is determined based on the image features of the training sample images and the regression accuracy prediction network.
Referring to fig. 7, the present embodiment specifically includes the following steps:
step 702, acquiring an input image and a reference image corresponding to the input image; the reference image is an image of a target object included in one frame preceding the input image in the image frame sequence in which the input image is located.
Step 704, determining a first cross-correlation feature between the input image and the reference image based on the image features of the input image and the image features of the reference image.
Step 706, the first cross-correlation feature is input to a regression network to determine a plurality of regression regions of the input image.
Step 708, inputting the first cross-correlation feature into a regression accuracy prediction network to determine a regression accuracy prediction value for each regression region of the input image.
Step 710, determining a second cross-correlation feature between the input image and the reference image based on the image features of the input image and the image features of the reference image.
Step 712, inputting the second cross-correlation features into a classification network to determine a classification confidence of each regression region of the input image.
And 714, correspondingly multiplying the regression accuracy prediction value of each regression area of the input image with the classification confidence coefficient to obtain the target positioning score of each regression area of the input image.
And 716, performing object positioning on the input image according to the target positioning scores of all the regression regions of the input image to obtain the region where the target object is located.
For example, fig. 8 is a schematic diagram of the output result when the object location model is used for target recognition. Referring to fig. 8, after obtaining the first cross-correlation feature and the second cross-correlation feature, the computer device inputs the first cross-correlation feature into the regression network and the regression accuracy prediction network, respectively, to obtain multiple regression regions and an intersection-over-union (IoU) prediction value for each regression region, and inputs the second cross-correlation feature into the classification network to obtain the classification confidence of each regression region. It multiplies the classification confidence of each regression region by its intersection-over-union prediction value to obtain the localization score of each regression region, finally selects one regression region as the final prediction region according to the localization scores, and locates the target object 602 in the input image according to the prediction region.
Fig. 9 is an overall flowchart of an object location method in one embodiment. Referring to fig. 9, when the process starts, the video frame in which the target object first appears in the video is determined as the reference image, and video frames are sequentially obtained from the next frame onward as input images. The image features of the reference image and the input image are extracted separately by the twin network feature extraction module (i.e., the feature extraction network in the above embodiments). Based on the extracted image features, a plurality of regression regions and the confidence score of each regression region are obtained by the classification-regression complementation module (i.e., the classification network and the regression network in the above embodiments), and an IoU predicted value is obtained by the IoU localization module (i.e., the regression accuracy prediction network in the above embodiments). A localization score is determined based on the IoU predicted value and the classification confidence, the region of the target object is determined by the localization score, and it is then determined whether the video is finished; if the video is finished, the flow ends.
Fig. 10 is a schematic diagram illustrating the result of recognizing a human body in one embodiment. Referring to fig. 10, in the first image frame 110, the area 112 where the target object 111 is located is detected, and in subsequent image frames of the first image frame 110, the object locating method provided by the embodiments of the present application is used to recognize the target object 111. For example: the area where the target object 111 is located in the second image frame 120 is 121, the area where the target object 111 is located in the third image frame 130 is 131, and so on.
It should be noted that the first image frame 110, the second image frame 120, and the third image frame 130 may be three adjacent image frames in sequence; alternatively, the first image frame 110, the second image frame 120, and the third image frame 130 may be arranged in sequence with several interval frames between every two of them.
The application also provides an application scenario to which the above object positioning model processing method and object positioning method are applied. In this application scenario, human body recognition needs to be performed on a video shot by a camera device on an automatic driving device. Specifically, the object positioning model processing method and the object positioning method are applied in this scenario as follows:
Firstly, a server obtains training sample images and training reference images of human bodies; the server then obtains a trained object positioning model by executing steps 2.2 to 2.11 in the above embodiment, and sends the object positioning model to a terminal. After acquiring a video containing a human body shot by the camera device, the terminal determines the video frame in which the human body first appears as the reference image, sequentially acquires video frames from the frame next to the reference image as images to be identified, and, for each frame of image to be identified, locates the area where the human body is located by executing the following steps, thereby recognizing the human body in the shot video.
1. And determining a first cross-correlation characteristic between the image to be recognized and the reference image based on the image characteristic of the image to be recognized and the image characteristic of the reference image.
2. The first cross-correlation features are input into a regression network to determine a plurality of regression regions for the image to be identified.
3. And inputting the first cross-correlation characteristic into a regression accuracy prediction network to determine regression accuracy prediction values of all regression areas of the image to be recognized.
4. And determining a second cross-correlation characteristic between the image to be recognized and the reference image based on the image characteristic of the image to be recognized and the image characteristic of the reference image.
5. And inputting the second cross-correlation characteristics into a classification network to determine the classification confidence of each regression region of the image to be recognized.
6. And correspondingly multiplying the regression accuracy prediction value of each regression region of the image to be recognized with the classification confidence to obtain the target positioning score of each regression region of the image to be recognized.
7. And positioning the target object from the image to be recognized according to the target positioning scores of all regression regions of the image to be recognized to obtain the region where the target object is located.
It should be understood that although the various steps in the flow charts of figs. 1-10 are shown in an order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1-10 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 11, an apparatus 1100 for processing an object localization model is provided, which may be a part of a computer device using a software module or a hardware module, or a combination of the two modules, and specifically includes:
a training sample acquisition module 1102 for acquiring a training sample image including a target object;
a regression region determining module 1104, configured to determine a regression region corresponding to the training sample image based on the image features of the training sample image and a regression network of the object location model;
a regression loss calculation module 1106, configured to calculate a regression accuracy of the regression region based on the target object labeling region corresponding to the training sample image, and calculate a regression loss based on the regression accuracy;
a classification loss calculation module 1108, configured to determine a classification confidence of the regression region based on the image features of the training sample image and the classification network of the object location model, and calculate a classification loss based on the classification confidence;
an update module 1110 for updating the classification loss based on the regression accuracy and updating the regression loss based on the classification confidence;
a training module 1112, configured to train an object location model according to the updated classification loss and the updated regression loss, and obtain a trained object location model until a training stop condition is met; the trained object positioning model is used for positioning an object of the input image.
In one embodiment, the above apparatus further comprises: a regression accuracy loss determination module for determining a regression accuracy prediction value of the regression region based on a regression accuracy prediction network of the object positioning model; determining regression accuracy loss based on the regression accuracy predicted value of the regression region and the regression accuracy of the regression region; the training module is further configured to train the object-locating model based on the regression accuracy loss, the updated classification loss, and the updated regression loss.
In one embodiment, the target object is an object to be identified; the training sample acquisition module is also used for acquiring a training sample image comprising a target object and a training reference image corresponding to the training sample image; the training reference image and the training sample image comprise the same target object; the regression region determining module is further used for determining a first cross-correlation training feature between the training sample image and the training reference image based on the image feature of the training sample image and the image feature of the training reference image, and inputting the first cross-correlation training feature into a regression network to determine a regression region corresponding to the training sample image; the regression accuracy loss determination module is further used for inputting the first cross-correlation training feature into a regression accuracy prediction network to determine a regression accuracy prediction value of the regression region; the classification loss calculation module is also used for determining a second cross-correlation training feature between the training sample image and the training reference image based on the image feature of the training sample image and the image feature of the training reference image; inputting the second cross-correlation training features into a classification network to determine a classification confidence of the regression region; the first cross-correlation training feature and the second cross-correlation training feature are used for representing the similarity degree between the training sample image and the training reference image.
In one embodiment, the apparatus further comprises: the characteristic extraction module is used for acquiring the image characteristics of the training sample image and the image characteristics of the training reference image based on the object positioning model; the regression region determining module is further configured to perform convolution operation on the image features of the training sample image and the image features of the training reference image respectively based on the object positioning model to obtain a first training intermediate feature of the training sample image and a first reference intermediate feature of the training reference image, and perform cross-correlation operation on the first training intermediate feature and the first reference intermediate feature based on the object positioning model to obtain a first cross-correlation training feature between the training sample image and the training reference image.
In one embodiment, the object localization model comprises two feature extraction networks with the same model structure and sharing model parameters; the feature extraction module is to: respectively inputting a training sample image and a training reference image into two feature extraction networks; and outputting the image features of the training sample image and the image features of the training reference image in parallel and respectively through the two feature extraction networks.
In one embodiment, the regression loss calculation module is configured to obtain an intersection region between the target object labeling region and the regression region; acquiring a union region between a target object labeling region and a regression region; and determining regression accuracy according to the area ratio of the intersection region and the union region corresponding to the regression region.
In one embodiment, the classification loss calculation module is configured to obtain an intersection region between the target object labeling region and the regression region; acquiring a union region between a target object labeling region and a regression region; acquiring the area ratio of an intersection region and a union region corresponding to the regression region; when the area ratio is larger than a first preset threshold value, determining classification loss according to the positive sample label and the classification confidence corresponding to the training sample image; and when the area ratio is smaller than a second preset threshold, determining the classification loss according to the negative sample label corresponding to the training sample image and the classification confidence.
In one embodiment, the regression region determination module is configured to determine a predicted central point position corresponding to the target object and a predicted size corresponding to the target object in the training sample image based on the image features of the training sample image and a regression network of the object location model; and determining a regression area according to the position and the size of the predicted central point.
In one embodiment, the regression region determination module is configured to obtain a position of an anchor frame corresponding to the training sample image; determining the offset corresponding to the anchor frame based on the image characteristics of the training sample image and the regression network of the object positioning model; the regression region is determined based on the location of the anchor frame and the offset of the anchor frame.
In one embodiment, the target object is an object to be identified; the device also comprises a target identification module used for acquiring the video to be identified; determining a first frame video frame including an object to be identified in a video to be identified to obtain a reference image to be identified; sequentially acquiring video frames from a video frame next to a first frame video frame to be used as an image to be identified; respectively inputting an image to be recognized and a reference image to be recognized into an object positioning model; acquiring a first cross-correlation identification characteristic and a second cross-correlation identification characteristic between an image to be identified and a reference image to be identified based on an object positioning model; inputting the first cross-correlation identification features into a regression network of the object location model to determine a plurality of regression regions of the image to be identified; inputting the second cross-correlation identification features into a classification network of the object positioning model to determine the classification confidence of each regression region of the image to be identified; and positioning the object to be recognized from the image to be recognized according to the classification confidence of each regression region of the image to be recognized.
In one embodiment, the target identification module inputs the first cross-correlation identification feature into a regression accuracy prediction network of the object positioning model to obtain a regression accuracy prediction value of each regression region of the image to be identified; correspondingly multiplying the regression accuracy prediction value of each regression region of the image to be recognized with the classification confidence coefficient to obtain a target positioning score of each regression region of the image to be recognized; and positioning the object to be recognized from the image to be recognized according to the target positioning scores of all regression regions of the image to be recognized.
In one embodiment, as shown in fig. 12, there is provided an object positioning apparatus 1200, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, the apparatus specifically includes:
an image acquisition module 1202 for acquiring an input image;
a model obtaining module 1204, configured to obtain a trained object positioning model; the object positioning model comprises a classification network and a regression network; the object positioning model is obtained through the training of target regression loss of a regression network and target classification loss of a classification network; the target regression loss is obtained by updating the initial regression loss through the classification confidence coefficient of the regression region; the target classification loss is obtained by updating the initial classification loss through the regression accuracy of the regression region; the regression region is determined based on the image characteristics of the training sample image and a regression network of the object positioning model; the regression accuracy of the regression region is calculated based on the target object labeling region corresponding to the training sample image; the classification confidence of the regression region is determined based on the image features of the training sample images and the classification network; the initial classification loss is calculated based on the classification confidence; the initial regression loss is calculated based on the regression accuracy;
A regression region determining module 1206 for determining a plurality of regression regions of the input image based on the image features of the input image and the regression network;
a confidence determining module 1208, configured to determine a classification confidence of each regression region of the input image based on the image features of the input image and the classification network;
and the positioning module 1210 is configured to perform object positioning on the input image according to the classification confidence of each regression region of the input image, so as to obtain a region where the target object is located.
In one embodiment, the target object is an object to be identified; the image acquisition module is also used for acquiring an input image and a reference image corresponding to the input image; the reference image is an image of a target object in one frame before the input image in the image frame sequence in which the input image is positioned; the regression region determination module is further used for determining a first cross-correlation characteristic between the input image and the reference image based on the image characteristics of the input image and the image characteristics of the reference image; inputting the first cross-correlation features into a regression network to determine a plurality of regression regions of the input image; the confidence coefficient determining module is further used for determining a second cross-correlation characteristic between the input image and the reference image based on the image characteristic of the input image and the image characteristic of the reference image; the second cross-correlated features are input to a classification network to determine classification confidences for respective regression regions of the input image.
In one embodiment, the object localization model further comprises a regression accuracy prediction network; the object positioning model is obtained through the regression accuracy loss of the regression accuracy prediction network, the target regression loss of the regression network, and the target classification loss of the classification network; the regression accuracy loss is determined based on the regression accuracy prediction value of the regression region and the regression accuracy of the regression region; the regression accuracy prediction value is determined based on the image features of the training sample images and the regression accuracy prediction network. The apparatus further comprises a regression accuracy prediction module, which is used for inputting the first cross-correlation feature into the regression accuracy prediction network to determine the regression accuracy prediction value of each regression region of the input image. The positioning module is also used for correspondingly multiplying the regression accuracy prediction value of each regression region of the input image by the classification confidence to obtain the target positioning score of each regression region of the input image, and for positioning the target object from the input image according to the target positioning scores of the regression regions of the input image to obtain the region where the target object is located.
For the specific limitations of the object location model processing apparatus and the object location apparatus, reference may be made to the limitations of the object location model processing method and the object location method in the foregoing, which are not described herein again. The modules in the object location model processing device and the object location device may be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device may be used to store training sample image data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an object localization model processing method or an object localization method.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of the above-described method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application; their description is specific and detailed but should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. An object positioning model processing method, the method comprising:
acquiring a training sample image including a target object;
determining a regression region corresponding to the training sample image based on the image features of the training sample image and a regression network of an object positioning model;
calculating a regression accuracy of the regression region based on a target object labeling region corresponding to the training sample image, and calculating a regression loss based on the regression accuracy;
determining a classification confidence of the regression region based on the image features of the training sample image and a classification network of the object positioning model, and calculating a classification loss based on the classification confidence;
updating the classification loss based on the regression accuracy and updating the regression loss based on the classification confidence;
training the object positioning model according to the updated classification loss and the updated regression loss until a training stop condition is met, to obtain a trained object positioning model;
the trained object positioning model is used for positioning an object of an input image.
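For illustration only, the following is a minimal PyTorch-style sketch of the mutual loss update described in claim 1: the classification loss is re-weighted by the measured regression accuracy (IoU), and the regression loss by the classification confidence. The box format, the IoU helper, and the multiplicative weighting are assumptions, not the claimed implementation.

```python
import torch
import torch.nn.functional as F

def iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between regressed and labeled regions, boxes as (x1, y1, x2, y2)."""
    lt = torch.max(pred[:, :2], gt[:, :2])          # top-left of intersection
    rb = torch.min(pred[:, 2:], gt[:, 2:])          # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p + area_g - inter + 1e-7)

def mutual_losses(pred_boxes, gt_boxes, cls_logits, cls_labels):
    reg_acc = iou(pred_boxes, gt_boxes)             # regression accuracy per region
    reg_loss = 1.0 - reg_acc                        # IoU-style regression loss
    cls_conf = torch.sigmoid(cls_logits)            # classification confidence
    cls_loss = F.binary_cross_entropy(cls_conf, cls_labels, reduction="none")
    # mutual update: each branch's loss is weighted by the other branch's quality
    updated_cls_loss = (cls_loss * reg_acc.detach()).mean()
    updated_reg_loss = (reg_loss * cls_conf.detach()).mean()
    return updated_cls_loss, updated_reg_loss
```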
2. The method of claim 1, wherein prior to the training the object positioning model according to the updated classification loss and the updated regression loss, the method further comprises:
determining a regression accuracy prediction value for the regression region based on a regression accuracy prediction network of the object positioning model;
determining a regression accuracy loss based on the regression accuracy prediction value of the regression region and the regression accuracy of the regression region;
the training the object positioning model according to the updated classification loss and the updated regression loss comprises:
training the object positioning model according to the regression accuracy loss, the updated classification loss, and the updated regression loss.
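As a sketch of how claim 2's third term might join the objective — the L2 form of the regression accuracy loss is an assumption; the claim only requires a loss between the predicted and measured accuracy:

```python
import torch

def total_objective(reg_acc_pred: torch.Tensor,
                    reg_acc: torch.Tensor,
                    updated_cls_loss: torch.Tensor,
                    updated_reg_loss: torch.Tensor) -> torch.Tensor:
    # the accuracy-prediction head is regressed onto the measured IoU (L2 form assumed)
    reg_acc_loss = ((reg_acc_pred - reg_acc.detach()) ** 2).mean()
    return reg_acc_loss + updated_cls_loss + updated_reg_loss
```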
3. The method according to claim 2, wherein the target object is an object to be identified; and the acquiring a training sample image including a target object comprises:
acquiring a training sample image comprising a target object and a training reference image corresponding to the training sample image; the training reference image and the training sample image comprise the same target object;
the determining a regression region corresponding to the training sample image based on the image features of the training sample image and the regression network of the object positioning model comprises:
determining a first cross-correlation training feature between the training sample image and the training reference image based on image features of the training sample image and image features of the training reference image;
inputting the first cross-correlation training feature into the regression network to determine a regression region corresponding to the training sample image;
the determining a regression accuracy prediction value for the regression region based on the regression accuracy prediction network of the object positioning model comprises:
inputting the first cross-correlation training feature into the regression accuracy prediction network to determine a regression accuracy prediction value for the regression region;
the determining a classification confidence for the regression region based on the image features of the training sample image and a classification network of the object positioning model comprises:
determining a second cross-correlation training feature between the training sample image and the training reference image based on the image features of the training sample image and the image features of the training reference image;
inputting the second cross-correlation training feature into the classification network to determine the classification confidence for the regression region;
the first cross-correlation training feature and the second cross-correlation training feature are used for representing the similarity degree between the training sample image and the training reference image.
4. The method of claim 3, wherein prior to the determining a first cross-correlation training feature between the training sample image and the training reference image based on the image features of the training sample image and the image features of the training reference image, the method further comprises:
acquiring image features of the training sample image and image features of the training reference image based on the object positioning model;
The determining a first cross-correlation training feature between the training sample image and the training reference image based on the image features of the training sample image and the image features of the training reference image comprises:
performing a convolution operation on the image features of the training sample image and the image features of the training reference image, respectively, based on the object positioning model, to obtain a first training intermediate feature of the training sample image and a first reference intermediate feature of the training reference image;
and performing cross-correlation operation on the first training intermediate feature and the first reference intermediate feature based on the object positioning model to obtain a first cross-correlation training feature between the training sample image and the training reference image.
5. The method of claim 4, wherein the object positioning model comprises two feature extraction networks having the same model structure and sharing model parameters; and the obtaining image features of the training sample image and image features of the training reference image based on the object positioning model comprises:
inputting the training sample image and the training reference image into the two feature extraction networks respectively;
and outputting, in parallel through the two feature extraction networks, the image features of the training sample image and the image features of the training reference image, respectively.
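A sketch of claims 4–5 under common siamese-tracker assumptions: one backbone shared by both branches, per-branch 1×1 adjustment convolutions producing the intermediate features, and a depthwise cross-correlation implemented as a grouped convolution. The backbone layout and channel count are placeholders, not the patented architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseLocator(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # a single backbone = two feature extraction branches sharing model parameters
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # separate 1x1 convolutions yield the "intermediate features" of claim 4
        self.adj_sample = nn.Conv2d(channels, channels, 1)
        self.adj_ref = nn.Conv2d(channels, channels, 1)

    def forward(self, sample: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        f_s = self.adj_sample(self.backbone(sample))     # training sample feature
        f_r = self.adj_ref(self.backbone(reference))     # training reference feature
        # depthwise cross-correlation: slide the reference feature over the sample feature
        b, c, h, w = f_r.shape
        out = F.conv2d(
            f_s.reshape(1, b * c, f_s.shape[2], f_s.shape[3]),
            f_r.reshape(b * c, 1, h, w),
            groups=b * c,
        )
        return out.reshape(b, c, out.shape[2], out.shape[3])
```

Here the reference feature map must be no larger spatially than the sample feature map, so the resulting correlation map scores the similarity between the two images at every offset.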
6. The method of claim 1, wherein the calculating the regression accuracy of the regression region based on the target object labeling region corresponding to the training sample image comprises:
acquiring an intersection region between the target object labeling region and the regression region;
acquiring a union region between the target object labeling region and the regression region;
and determining the regression accuracy according to the area ratio between the intersection region and the union region corresponding to the regression region.
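Claim 6 is the standard intersection-over-union measure. A worked example with hypothetical coordinates, boxes given as (x1, y1, x2, y2):

```python
gt   = (10, 10, 50, 50)   # target object labeling region, area 1600
pred = (30, 10, 70, 50)   # regression region, area 1600

ix1, iy1 = max(gt[0], pred[0]), max(gt[1], pred[1])   # intersection top-left
ix2, iy2 = min(gt[2], pred[2]), min(gt[3], pred[3])   # intersection bottom-right
inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)         # 20 * 40 = 800
union = 1600 + 1600 - inter                           # 2400
regression_accuracy = inter / union                   # 800 / 2400 = 0.333...
print(regression_accuracy)
```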
7. The method of claim 1, wherein the calculating a classification loss based on the classification confidence comprises:
acquiring an intersection region between the target object labeling region and the regression region;
acquiring a union region between the target object labeling region and the regression region;
acquiring the area ratio between the intersection region and the union region corresponding to the regression region;
when the area ratio is greater than a first preset threshold, determining the classification loss according to a positive sample label corresponding to the training sample image and the classification confidence;
and when the area ratio is smaller than a second preset threshold, determining the classification loss according to a negative sample label corresponding to the training sample image and the classification confidence.
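A sketch of claim 7's two-threshold label assignment; the concrete values 0.6 and 0.3 and the binary cross-entropy form are assumptions — the claim fixes only that two preset thresholds partition the regions into positives and negatives.

```python
import torch
import torch.nn.functional as F

T_POS, T_NEG = 0.6, 0.3   # hypothetical first / second preset thresholds

def classification_loss(area_ratio: torch.Tensor, cls_conf: torch.Tensor) -> torch.Tensor:
    pos = area_ratio > T_POS   # regions paired with the positive sample label
    neg = area_ratio < T_NEG   # regions paired with the negative sample label
    terms = []
    if pos.any():
        terms.append(F.binary_cross_entropy(cls_conf[pos], torch.ones_like(cls_conf[pos])))
    if neg.any():
        terms.append(F.binary_cross_entropy(cls_conf[neg], torch.zeros_like(cls_conf[neg])))
    # regions between the two thresholds are ambiguous and contribute no classification loss
    return torch.stack(terms).mean() if terms else cls_conf.new_zeros(())
```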
8. The method according to claim 2, characterized in that the target object is an object to be identified; the method further comprises the following steps:
acquiring a video to be identified;
determining, in the video to be identified, the first video frame that includes the object to be identified, to obtain a reference image to be identified;
sequentially acquiring video frames, starting from the video frame following the first video frame, to serve as images to be identified;
respectively inputting the image to be identified and the reference image to be identified into the object positioning model;
acquiring a first cross-correlation identification feature and a second cross-correlation identification feature between the image to be identified and the reference image to be identified based on the object positioning model;
inputting the first cross-correlation identification feature into a regression network of the object positioning model to determine a plurality of regression regions of the image to be identified;
inputting the second cross-correlation identification feature into a classification network of the object positioning model to determine a classification confidence of each regression region of the image to be identified;
and positioning the object to be identified from the image to be identified according to the classification confidence of each regression region of the image to be identified.
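A minimal sketch of claim 8's per-frame loop; `contains_target` and the model's call signature `(image, reference) -> (boxes, scores)` are hypothetical, and reading frames with OpenCV is an assumption.

```python
import cv2

def track(video_path: str, model, contains_target):
    cap = cv2.VideoCapture(video_path)
    reference, located = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if reference is None:
            # the first frame containing the object becomes the reference image
            if contains_target(frame):
                reference = frame
            continue
        # each subsequent frame is localized against the fixed reference image
        boxes, scores = model(frame, reference)   # regression regions + confidences
        located.append(boxes[scores.argmax()])    # keep the highest-confidence region
    cap.release()
    return located
```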
9. An object positioning method, the method comprising:
acquiring an input image;
obtaining a trained object positioning model; the object positioning model comprises a classification network and a regression network; the object positioning model is obtained through training with a target regression loss of the regression network and a target classification loss of the classification network; the target regression loss is obtained by updating an initial regression loss with the classification confidence of a regression region; the target classification loss is obtained by updating an initial classification loss with the regression accuracy of the regression region; the regression region is determined based on image features of a training sample image and the regression network of the object positioning model; the regression accuracy of the regression region is calculated based on a target object labeling region corresponding to the training sample image; the classification confidence of the regression region is determined based on the image features of the training sample image and the classification network; the initial classification loss is calculated based on the classification confidence; and the initial regression loss is calculated based on the regression accuracy;
determining a plurality of regression regions for the input image based on image features of the input image and the regression network;
determining a classification confidence for each regression region of the input image based on image features of the input image and the classification network;
and carrying out object positioning on the input image according to the classification confidence of each regression region of the input image to obtain the region where the target object is located.
10. The method of claim 9, wherein the target object is an object to be identified; the acquiring the input image includes:
acquiring an input image and a reference image corresponding to the input image; the reference image is a frame that includes the target object, is located in the image frame sequence to which the input image belongs, and precedes the input image;
the determining a plurality of regression regions for the input image based on image features of the input image and the regression network comprises:
determining a first cross-correlation feature between the input image and the reference image based on image features of the input image and image features of the reference image;
inputting the first cross-correlation feature into the regression network to determine a plurality of regression regions of the input image;
The determining classification confidences for respective regression regions of the input image based on image features of the input image and the classification network comprises:
determining a second cross-correlation feature between the input image and the reference image based on image features of the input image and image features of the reference image;
inputting the second cross-correlation feature into the classification network to determine the classification confidence of each regression region of the input image.
11. The method of claim 10, wherein the object positioning model further comprises a regression accuracy prediction network; the object positioning model is obtained through training with a regression accuracy loss of the regression accuracy prediction network, the target regression loss of the regression network, and the target classification loss of the classification network; the regression accuracy loss is determined based on a regression accuracy prediction value for the regression region and the regression accuracy of the regression region; and the regression accuracy prediction value is determined based on the image features of the training sample image and the regression accuracy prediction network;
before the performing object positioning on the input image according to the classification confidence of each regression region of the input image, the method further comprises:
inputting the first cross-correlation feature into the regression accuracy prediction network to determine regression accuracy prediction values for the respective regression regions of the input image;
and the performing object positioning on the input image according to the classification confidence of each regression region of the input image to obtain the region where the target object is located comprises:
multiplying the regression accuracy prediction value of each regression region of the input image by the corresponding classification confidence to obtain a target positioning score for each regression region of the input image;
and carrying out object positioning on the input image according to the target positioning scores of the respective regression regions of the input image to obtain the region where the target object is located.
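Claim 11's fusion is an elementwise product of the two head outputs; a short sketch with hypothetical per-region values:

```python
import numpy as np

reg_acc_pred = np.array([0.9, 0.4, 0.7])   # regression accuracy prediction values
cls_conf     = np.array([0.8, 0.9, 0.5])   # classification confidences

target_score = reg_acc_pred * cls_conf     # [0.72, 0.36, 0.35]
best = int(target_score.argmax())          # region 0 is where the target object lies
```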
12. An object positioning model processing apparatus, characterized in that the apparatus comprises:
the training sample acquisition module is used for acquiring a training sample image comprising a target object;
the regression region determining module is used for determining a regression region corresponding to the training sample image based on the image features of the training sample image and the regression network of the object positioning model;
the regression loss calculation module is used for calculating the regression accuracy of the regression region based on the target object labeling region corresponding to the training sample image and calculating the regression loss based on the regression accuracy;
the classification loss calculation module is used for determining the classification confidence of the regression region based on the image features of the training sample image and the classification network of the object positioning model, and calculating the classification loss based on the classification confidence;
an update module to update the classification loss based on the regression accuracy and update the regression loss based on the classification confidence;
the training module is used for training the object positioning model according to the updated classification loss and the updated regression loss until a training stop condition is met, to obtain the trained object positioning model; the trained object positioning model is used for positioning an object of an input image.
13. An object positioning apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an input image;
the model acquisition module is used for acquiring a trained object positioning model; the object positioning model comprises a classification network and a regression network; the object positioning model is obtained through training with a target regression loss of the regression network and a target classification loss of the classification network; the target regression loss is obtained by updating an initial regression loss with the classification confidence of a regression region; the target classification loss is obtained by updating an initial classification loss with the regression accuracy of the regression region; the regression region is determined based on image features of a training sample image and the regression network of the object positioning model; the regression accuracy of the regression region is calculated based on a target object labeling region corresponding to the training sample image; the classification confidence of the regression region is determined based on the image features of the training sample image and the classification network; the initial classification loss is calculated based on the classification confidence; and the initial regression loss is calculated based on the regression accuracy;
the regression region determination module is used for determining a plurality of regression regions for the input image based on the image features of the input image and the regression network;
the confidence determination module is used for determining the classification confidence of each regression region of the input image based on the image features of the input image and the classification network;
and the positioning module is used for performing object positioning on the input image according to the classification confidence of each regression region of the input image to obtain the region where the target object is located.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
CN202111646817.2A 2021-01-13 2021-12-29 Object positioning model processing method, object positioning device and computer equipment Pending CN114764870A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110042956 2021-01-13
CN2021100429568 2021-01-13

Publications (1)

Publication Number Publication Date
CN114764870A (en) 2022-07-19

Family

ID=82364998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111646817.2A Pending CN114764870A (en) 2021-01-13 2021-12-29 Object positioning model processing method, object positioning device and computer equipment

Country Status (1)

Country Link
CN (1) CN114764870A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100689A (en) * 2022-08-23 2022-09-23 浙江大华技术股份有限公司 Object detection method and device, electronic equipment and storage medium
CN115100689B (en) * 2022-08-23 2022-11-01 浙江大华技术股份有限公司 Object detection method and device, electronic equipment and storage medium
CN116912631A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Target identification method, device, electronic equipment and storage medium
CN116912631B (en) * 2023-09-12 2023-12-12 深圳须弥云图空间科技有限公司 Target identification method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination