CN114462559A - Target positioning model training method, target positioning method and device - Google Patents

Target positioning model training method, target positioning method and device

Info

Publication number
CN114462559A
Authority
CN
China
Prior art keywords
foreground, feature, pixel, data, image
Legal status
Granted
Application number
CN202210387877.5A
Other languages
Chinese (zh)
Other versions
CN114462559B (en
Inventor
张天柱
张哲�
张勇东
孟梦
吴枫
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Application filed by University of Science and Technology of China (USTC)
Priority to CN202210387877.5A
Publication of CN114462559A
Application granted
Publication of CN114462559B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target positioning model training method, a target positioning method and a target positioning device, which can be applied to the technical field of artificial intelligence. The target positioning model training method comprises the following steps: acquiring a sample data set; inputting each image sample into a pixel feature extraction layer of the initial model, and outputting the pixel feature of each pixel point in the image data; inputting first feature data and second feature data into an activation map generation layer of the initial model, and outputting a foreground activation map and a background activation map; respectively inputting third feature data and fourth feature data into a perceptual feature extraction layer of the initial model, and outputting a foreground perceptual feature and a background perceptual feature; respectively inputting fifth feature data and sixth feature data into a classification layer of the initial model, and outputting a classification result; and adjusting the model parameters of the initial model according to the classification result and the image category label to obtain a trained target positioning model.

Description

Target positioning model training method, target positioning method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a target positioning model training method, a target positioning method, an apparatus, a device and a medium.
Background
Traditional target positioning methods are usually based on fully supervised learning: a model is trained with manually labeled bounding boxes and then used to locate the target. However, such annotation is expensive and time-consuming, which limits the practical applicability of target localization.
With the development of artificial intelligence technology, weakly supervised learning algorithms have gradually attracted attention. However, in the course of implementing the present invention, the inventors found that existing weakly supervised learning algorithms suffer from incomplete localization when applied to target positioning.
Disclosure of Invention
In view of the above problems, the present invention provides a target positioning model training method, a target positioning method, an apparatus, a device, and a medium.
According to an aspect of the present invention, there is provided an object localization model training method, including:
the method comprises the steps of obtaining a sample data set, wherein the sample data set comprises a plurality of image samples, each image sample comprises image data, and each image sample is provided with an image category label;
inputting each image sample into a pixel feature extraction layer of the initial model, and outputting the pixel feature of each pixel point in image data;
inputting first feature data and second feature data into an activation map generation layer of the initial model, and outputting a foreground activation map and a background activation map, wherein the first feature data comprise pixel features and preset foreground prototype data, and the second feature data comprise pixel features and preset background prototype data; the foreground activation image comprises a first similarity representing the similarity between the pixel feature of each pixel and the preset foreground prototype data, the background activation image comprises a second similarity representing the similarity between the pixel feature of each pixel and the preset background prototype data;
respectively inputting third feature data and fourth feature data into a perception feature extraction layer of the initial model, and outputting foreground perception features and background perception features, wherein the third feature data comprise pixel features and a foreground activation image, and the fourth feature data comprise pixel features and a background activation image;
inputting fifth feature data and sixth feature data into a classification layer of the initial model respectively, and outputting a classification result, wherein the fifth feature data comprise foreground perception features and preset category prototype data, the sixth feature data comprise background perception features and preset category prototype data, and the preset category prototype data represent categories of partial regions of the target object; and
and adjusting the model parameters of the initial model according to the classification result and the image category label to obtain the trained target positioning model.
According to the embodiment of the invention, the method for inputting the pixel characteristics into the activation map generation layer of the initial model and outputting the foreground activation map and the background activation map comprises the following steps:
inputting the pixel characteristics into an activation map generation layer of an initial model, and outputting an initial foreground activation map and an initial background activation map;
splicing the initial foreground activation image and the initial background activation image to obtain an initial activation matrix;
calculating a semantic loss matrix of the pixel characteristics by using an optimal transmission algorithm, wherein the semantic loss matrix represents optimal distribution results of the pixel characteristics in the foreground and the background;
and adjusting the model parameters of the activation map generation layer of the initial model according to the semantic loss matrix and the initial activation matrix so as to output the optimized foreground activation map and background activation map.
According to the embodiment of the invention, the classification layer of the initial model comprises a classification feature extraction layer and a classification matching layer, wherein the fifth feature data and the sixth feature data are respectively input into the classification layer of the initial model, and the classification result is output, and the classification result comprises the following steps:
inputting the fifth feature data and the sixth feature data into a category feature extraction layer respectively, and outputting a foreground category feature and a background category feature;
inputting the foreground category characteristics and the foreground perception characteristics into a category matching layer, and outputting a foreground classification result;
and inputting the background category characteristics and the background perception characteristics into a category matching layer, and outputting a background classification result.
According to the embodiment of the invention, the model parameters of the initial model are adjusted according to the classification result and the image class label to obtain the trained target positioning model, and the method comprises the following steps:
constructing a classification loss function according to the foreground classification result parameter and the background classification result parameter;
constructing an activation loss function according to the semantic loss matrix parameters and the activation matrix parameters;
constructing a model loss function according to the classification loss function and the activation loss function;
and adjusting the model parameters of the initial model according to the classification result and the image category label by using a model loss function to obtain the trained target positioning model.
Another aspect of the present invention provides a target positioning method, including:
inputting an image to be detected into a pixel feature extraction layer of a target positioning model, and outputting the pixel feature to be detected of each pixel point in the image to be detected, wherein the target positioning model is obtained by the training method;
inputting the pixel feature to be detected and preset foreground prototype data into an activation map generation layer of a target positioning model, and outputting a foreground activation map, wherein the foreground activation map comprises a third similarity representing the similarity between the pixel feature to be detected and the preset foreground prototype data;
determining a target positioning area according to the third similarity in the foreground activation image;
inputting the pixel characteristics to be detected and the foreground activation map into a perception characteristic extraction layer of the target positioning model, and outputting foreground perception characteristics;
and inputting the foreground perception characteristics and preset category prototype data into a classification layer, and outputting a classification result of the image to be detected, wherein the preset category prototype data represents the category of a partial region of the target object.
According to the embodiment of the invention, the determining the target positioning area according to the third similarity in the foreground activation map comprises the following steps:
extracting a plurality of target pixel points from the image to be detected according to the foreground activation image, wherein the target pixel points represent pixel points corresponding to target similarity in the foreground activation image, and the target similarity represents similarity larger than a preset threshold in third similarity;
and determining an adjacent graphic area covering a plurality of target pixel points as a target positioning area.
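By way of illustration only, the following is a minimal sketch of this localization step, assuming the foreground activation map is an h x w tensor of similarities in [0, 1]; the function name, the default threshold and the use of an axis-aligned bounding region are illustrative assumptions rather than details fixed by the invention.

import torch

def locate_target(fg_activation: torch.Tensor, threshold: float = 0.5):
    # fg_activation: (h, w) similarities between pixel features and the foreground prototype.
    # Pixels whose activation exceeds the (assumed) preset threshold are the target pixel points.
    ys, xs = torch.nonzero(fg_activation > threshold, as_tuple=True)
    if ys.numel() == 0:
        return None  # no pixel exceeds the preset threshold
    # The contiguous region covering all target pixel points, here an axis-aligned box.
    return (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())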
Another aspect of the present invention provides an object localization model training apparatus, including: the device comprises an acquisition module, a first extraction module, a first generation module, a second extraction module, a first classification module and a training module. The acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of image samples, each image sample comprises image data, and each image sample has an image category label. And the first extraction module is used for inputting each image sample into the pixel feature extraction layer of the initial model and outputting the pixel feature of each pixel point in the image data. The first generation module is used for inputting first feature data and second feature data into an activation map generation layer of the initial model and outputting a foreground activation map and a background activation map, wherein the first feature data comprise pixel features and preset foreground prototype data, and the second feature data comprise pixel features and preset background prototype data; the foreground activation image comprises a first similarity representing the similarity between the pixel feature of each pixel and the preset foreground prototype data, the background activation image comprises a second similarity representing the similarity between the pixel feature of each pixel and the preset background prototype data. And the second extraction module is used for respectively inputting third feature data and fourth feature data into the perceptual feature extraction layer of the initial model and outputting foreground perceptual features and background perceptual features, wherein the third feature data comprise pixel features and a foreground activation image, and the fourth feature data comprise pixel features and a background activation image. And the first classification module is used for respectively inputting fifth feature data and sixth feature data into a classification layer of the initial model and outputting a classification result, wherein the fifth feature data comprise foreground perception features and preset class prototype data, the sixth feature data comprise background perception features and preset class prototype data, and the preset class prototype data represent the class of a partial region of the target object. And the training module is used for adjusting the model parameters of the initial model according to the classification result and the image category label to obtain a trained target positioning model.
According to the embodiment of the invention, the first generation module comprises a first generation unit, a splicing unit, a first calculation unit and a first adjustment unit. The first generation unit is used for inputting the pixel characteristics into an activation map generation layer of an initial model and outputting an initial foreground activation map and an initial background activation map. And the splicing unit is used for splicing the initial foreground activation image and the initial background activation image to obtain an initial activation matrix. The first calculation unit calculates a semantic loss matrix of the pixel characteristics by using an optimal transmission algorithm, wherein the semantic loss matrix represents optimal distribution results of the pixel characteristics in the foreground and the background. And the first adjusting unit is used for adjusting the model parameters of the activation map generation layer of the initial model according to the semantic loss matrix and the initial activation matrix so as to output the optimized foreground activation map and background activation map.
According to an embodiment of the present invention, the first classification module includes a first extraction unit, a first classification unit, and a second classification unit. The first extraction unit is used for inputting the fifth feature data and the sixth feature data into the category feature extraction layer respectively and outputting the foreground category features and the background category features. And the first classification unit is used for inputting the foreground classification characteristic and the foreground perception characteristic into the classification matching layer and outputting a foreground classification result. And the second classification unit is used for inputting the background classification features and the background perception features into the classification matching layer and outputting a background classification result.
According to an embodiment of the invention, the training module comprises a first building element, a second building element, a third building element and a training element. The first construction unit is used for constructing a classification loss function according to the foreground classification result parameter and the background classification result parameter. And the second construction unit is used for constructing the activation loss function according to the semantic loss matrix parameters and the activation matrix parameters. And the third construction unit is used for constructing a model loss function according to the classification loss function and the activation loss function. And the training unit is used for adjusting the model parameters of the initial model according to the classification result and the image category label by using the model loss function to obtain a trained target positioning model.
Another aspect of the present invention provides an object localization apparatus comprising: the device comprises a third extraction module, a second generation module, a determination module, a fourth extraction module and a second classification module. And the third extraction module is used for inputting the image to be detected into the pixel feature extraction layer of the target positioning model and outputting the pixel feature to be detected of each pixel point in the image to be detected, wherein the target positioning model is obtained by adopting the training method. And the second generation module is used for inputting the pixel characteristics to be detected and the preset foreground prototype data into an activation map generation layer of the target positioning model and outputting a foreground activation map, wherein the foreground activation map comprises a third similarity, and the third similarity represents the similarity between the pixel characteristics to be detected and the preset foreground prototype data. And the determining module is used for determining the target positioning area according to the third similarity in the foreground activation image. And the fourth extraction module is used for inputting the pixel characteristics to be detected and the foreground activation graph into a perception characteristic extraction layer of the target positioning model and outputting the foreground perception characteristics. And the second classification module is used for inputting the foreground perception characteristics and the preset class prototype data into the classification layer and outputting the classification result of the image to be detected, wherein the preset class prototype data represents the class of a partial region of the target object.
According to an embodiment of the present invention, the determination module includes a second extraction unit and a determination unit. The second extraction unit is used for extracting a plurality of target pixel points from the image to be detected according to the foreground activation image, wherein the target pixel points represent pixel points corresponding to target similarity in the foreground activation image, and the target similarity represents similarity larger than a preset threshold in third similarity. And the determining unit is used for determining the adjacent graphic area covering the target pixel points as a target positioning area.
Another aspect of the present invention provides an electronic device, including: one or more processors; a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above-described object localization model training method or object localization method.
Another aspect of the present invention also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described object localization model training method or object localization method.
Another aspect of the invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the above object localization model training method or object localization method.
According to the embodiment of the invention, the image samples with the image category labels are obtained, each image sample is input into the pixel feature extraction layer of the initial model, and the pixel feature of each pixel point is output. And then inputting the pixel characteristics and preset foreground prototype data into an activation map generation layer of the initial model, outputting a foreground activation map, inputting the pixel characteristics and preset background prototype data into the activation map generation layer of the initial model, and outputting a background activation map. And extracting foreground perception features and background perception features through a perception feature extraction layer of the initial model, respectively inputting the foreground perception features and preset category prototype data as well as the background perception features and the preset category prototype data into a classification layer of the initial model, and outputting a classification result. And finally, adjusting parameters of the initial model according to the classification result and the image category label to obtain a trained target positioning model.
According to the embodiment of the invention, the foreground activation map and the background activation map generated by the invention are obtained according to the pixel characteristics and the preset foreground prototype data or the preset background prototype data, so that the foreground activation map comprises the similarity between all the pixel characteristics and the preset foreground prototype data, the background activation map comprises the similarity between all the pixel characteristics and the preset background prototype data, and the foreground activation map and the background activation map do not relate to the category of the image.
Drawings
The foregoing and other objects, features and advantages of the invention will be apparent from the following description of embodiments of the invention, which proceeds with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a method, apparatus, device, medium and program product for object localization model training according to embodiments of the present invention;
FIG. 2 schematically illustrates an exemplary system framework of an object localization model according to an embodiment of the present invention;
FIG. 3 schematically illustrates a flow chart of a method of training an object localization model according to an embodiment of the present invention;
FIG. 4 schematically shows a flow chart of a method of target localization according to an embodiment of the present invention;
FIG. 5 schematically illustrates a diagram for locating a target region using a foreground activation map;
FIG. 6 is a block diagram schematically illustrating the structure of an object localization model training apparatus according to an embodiment of the present invention;
FIG. 7 schematically shows a block diagram of an object locating device according to an embodiment of the present invention; and
fig. 8 schematically shows a block diagram of an electronic device adapted to implement the above described object localization model training method or object localization method according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. It is to be understood that such description is merely illustrative and not intended to limit the scope of the present invention. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
In the technical solution of the present invention, the collection, storage, use, processing, transmission, provision, disclosure, application and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations, necessary confidentiality measures have been taken, and public order and good customs are not violated.
In the technical scheme of the invention, before the personal information of the user is acquired or collected, the authorization or the consent of the user is acquired.
In the process of implementing the present invention, the inventors found that training a positioning model with a class-specific activation map, as in the related art, may cause the model to identify only the most discriminative local region of a target and to localize the target within that region alone. For example, a positioning model in the related art generally identifies and localizes only the head of a bird and regards the target positioning as completed, while the body and other parts of the bird are not localized. As a result, two different birds may be assigned essentially the same location because only their similar heads are localized, even though their bodies occupy different positions.
In view of this, an embodiment of the present invention provides a method for training a target location model, in which image samples with image category labels are obtained, each image sample is input into a pixel feature extraction layer of an initial model, and a pixel feature of each pixel point is output. And then inputting the pixel characteristics and preset foreground prototype data into an activation map generation layer of the initial model, outputting a foreground activation map, inputting the pixel characteristics and preset background prototype data into the activation map generation layer of the initial model, and outputting a background activation map. And extracting foreground perception features and background perception features through a perception feature extraction layer of the initial model, respectively inputting the foreground perception features and preset category prototype data as well as the background perception features and the preset category prototype data into a classification layer of the initial model, and outputting a classification result. And finally, adjusting parameters of the initial model according to the classification result and the image category label to obtain a trained target positioning model for completely positioning the target.
Fig. 1 schematically shows an application scenario of the target location model training method according to the embodiment of the present invention.
As shown in fig. 1, the application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the object location model training method or the object location method provided by the embodiment of the present invention may be generally executed by the server 105. Accordingly, the object location model training device or the object location device provided by the embodiment of the present invention may be generally disposed in the server 105. The target location model training method or the target location method provided by the embodiment of the present invention may also be executed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the target location model training apparatus or the target location apparatus provided in the embodiment of the present invention may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the target location model training method or the target location method provided by the embodiment of the present invention may also be executed by the terminal device 101, 102, or 103, or may also be executed by another terminal device different from the terminal device 101, 102, or 103. Accordingly, the target location model training apparatus or the target location apparatus provided in the embodiment of the present invention may also be disposed in the terminal device 101, 102, or 103, or disposed in another terminal device different from the terminal device 101, 102, or 103.
For example: the image sample data set may be obtained by any of the terminal devices 101, 102, or 103 (for example, the terminal device 101, but is not limited thereto), may be stored in any of the terminal devices 101, 102, or 103, or may be stored on an external storage device and may be imported into the terminal device 101. Then, the terminal device 101 may locally execute the target location model training method provided by the embodiment of the present invention, or send the image sample data set to another terminal device, a server, or a server cluster, and execute the target location model training method provided by the embodiment of the present invention by another terminal device, a server, or a server cluster that receives the image sample data set.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
FIG. 2 schematically shows an exemplary system architecture of an object localization model according to an embodiment of the present invention.
As shown in fig. 2, the target location model includes a four-layer architecture: pixel feature extraction layer 201, activation map generation layer 202, perceptual feature extraction layer 203, classification layer 204. The pixel feature extraction layer 201 is configured to extract a pixel feature of each pixel point in the image sample. An activation map generation layer 202 for generating a foreground activation map and a background activation map. And the perceptual feature extraction layer 203 is used for extracting foreground perceptual features and background perceptual features from the pixel features. And the classification layer 204 is used for classifying the foreground perception features and the background perception features and outputting classification results, wherein the classification results comprise foreground classification results and background classification results.
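As a rough, non-authoritative illustration of how these four layers might be composed into one forward pass, the sketch below wires placeholder sub-modules together; the class name and the sub-module interfaces are assumptions made for illustration only.

import torch.nn as nn

class TargetLocalizationModel(nn.Module):
    # Sketch of the four-layer architecture; each sub-layer is supplied by the caller.
    def __init__(self, pixel_layer, activation_layer, perceptual_layer, classification_layer):
        super().__init__()
        self.pixel_layer = pixel_layer                      # pixel feature extraction layer
        self.activation_layer = activation_layer            # activation map generation layer
        self.perceptual_layer = perceptual_layer            # perceptual feature extraction layer
        self.classification_layer = classification_layer    # classification layer

    def forward(self, image):
        pixel_feats = self.pixel_layer(image)
        fg_map, bg_map = self.activation_layer(pixel_feats)
        fg_feat, bg_feat = self.perceptual_layer(pixel_feats, fg_map, bg_map)
        return self.classification_layer(fg_feat, bg_feat)  # foreground and background results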
FIG. 3 schematically shows a flow chart of a method of training an object localization model according to an embodiment of the present invention.
As shown in FIG. 3, the method for training the object location model of the embodiment includes operations S310 to S360.
In operation S310, a sample data set is obtained, where the sample data set includes a plurality of image samples, each image sample includes image data, and the image samples have image category labels.
According to an embodiment of the present invention, the image sample may be a picture including a target object, and the image category label may be the category of the target object. For example, the image sample may be a picture of a person, a cat, a dog or another animal, or of an object such as a table or a book. Taking a picture including a cat as an example, the picture may further include grass, sky, trees, flowers, and so on. The image category labels may be category labels such as person, cat, dog, table, book, and so on.
In operation S320, each image sample is input to the pixel feature extraction layer of the initial model, and the pixel feature of each pixel point in the image data is output.
According to an embodiment of the present invention, the pixel feature extraction layer may include a backbone network and a Transformer encoder. The backbone network may be a ResNet50 network. The pixel points output by the backbone network can be expressed as the set shown in formula (one):

$Z = \{z_1, z_2, \ldots, z_{hw}\}$    (one)

where $Z$ denotes the set consisting of $hw$ pixel points, $z_{hw}$ denotes the $hw$-th pixel point, $h$ denotes the height of the image sample and $w$ denotes its width.
According to an embodiment of the present invention, three matrices in the Transformer encoder, namely a Query matrix, a Key matrix and a Value matrix, are used to linearly transform the feature of each image sample. The linear transformation is as follows:

$Q_i = W_Q^{(1)} z_i$    (two-1)

$K_j = W_K^{(1)} z_j$    (two-2)

$V_j = W_V^{(1)} z_j$    (two-3)

where $Q_i$ denotes the pixel feature obtained by linearly mapping the $i$-th pixel point with the matrix $W_Q^{(1)}$, $K_j$ denotes the pixel feature obtained by linearly mapping the $j$-th pixel point with the matrix $W_K^{(1)}$, and $V_j$ denotes the pixel feature obtained by linearly mapping the $j$-th pixel point with the matrix $W_V^{(1)}$. Here $W_Q^{(1)}$ denotes the first weight matrix of the Query, $W_K^{(1)}$ denotes the first weight matrix of the Key, and $W_V^{(1)}$ denotes the first weight matrix of the Value.
According to the embodiment of the invention, based on a self-attention mechanism and using scaled dot-product attention, the pixel feature of each pixel point is obtained by computing the similarity between that pixel point and every other pixel point in the image sample:

$\alpha_{i,j} = \dfrac{Q_i K_j^{\top}}{\sqrt{d}}$    (three)

where $\alpha_{i,j}$ denotes the similarity between the $i$-th pixel point and the $j$-th pixel point, $K_j^{\top}$ denotes the transpose of $K_j$, and $\sqrt{d}$ denotes the scaling parameter.

$z_i^{*} = \sum_{j=1}^{hw} \operatorname{softmax}(\alpha_{i,j})\, V_j$    (four)

where $z_i^{*}$ denotes the pixel feature of the $i$-th pixel point output by the pixel feature extraction layer, obtained by weighting and summing the pixel features $V_j$ of the $j$-th pixel points with the similarity between the $i$-th pixel point and the $j$-th pixel point in the image sample, $j = 1, 2, \ldots, hw$. Here $\operatorname{softmax}(\alpha_{i,j})$ denotes the result of normalizing $\alpha_{i,j}$ using formula (five), which limits the value of $\alpha_{i,j}$ to between 0 and 1:

$\operatorname{softmax}(\alpha_{i,j}) = \dfrac{\exp(\alpha_{i,j})}{\sum_{j'=1}^{hw} \exp(\alpha_{i,j'})}$    (five)
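For illustration, the minimal PyTorch-style sketch below mirrors formulas (two-1) through (five); the module name, the feature dimension d and the tensor shapes are assumptions made for this example, not details taken from the invention.

import torch
import torch.nn as nn

class PixelSelfAttention(nn.Module):
    # Sketch of the self-attention step over pixel features, formulas (two-1) to (five).
    def __init__(self, d: int):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)  # first weight matrix of the Query
        self.w_k = nn.Linear(d, d, bias=False)  # first weight matrix of the Key
        self.w_v = nn.Linear(d, d, bias=False)  # first weight matrix of the Value
        self.scale = d ** 0.5

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (hw, d) pixel features from the backbone, one row per pixel point.
        q, k, v = self.w_q(z), self.w_k(z), self.w_v(z)    # formulas (two-1) to (two-3)
        alpha = q @ k.transpose(-1, -2) / self.scale        # formula (three)
        attn = torch.softmax(alpha, dim=-1)                 # formula (five)
        return attn @ v                                     # formula (four): weighted sum

# Example usage for a 14x14 feature map with 256-dimensional features (assumed sizes):
# feats = PixelSelfAttention(d=256)(torch.randn(14 * 14, 256))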
Inputting first feature data and second feature data into an activation map generation layer of the initial model, and outputting a foreground activation map and a background activation map, wherein the first feature data includes pixel features and preset foreground prototype data, and the second feature data includes pixel features and preset background prototype data, in operation S330; the foreground activation image comprises a first similarity representing the similarity between the pixel feature of each pixel and the preset foreground prototype data, the background activation image comprises a second similarity representing the similarity between the pixel feature of each pixel and the preset background prototype data.
According to an embodiment of the present invention, the preset foreground prototype data may cover the foreground classes of all image samples in the sample data set. For example, if the foreground classes in the sample data set include human, cat and dog, the preset foreground prototype data may be prototype data of human, cat and dog. Similarly, the preset background prototype data may cover the background classes of all image samples in the sample data set. For example, if the background classes in the sample data set include sky, tree and grassland, the preset background prototype data may be prototype data of sky, tree and grassland.
According to the embodiment of the invention, the preset foreground prototype data and the preset background prototype data may both be $d$-dimensional vectors, and the foreground prototype matrix and the background prototype matrix are obtained by mapping these vectors according to formula (six) and formula (seven):

$Q_1 = W_Q^{(2)} O_f$    (six)

$Q_2 = W_Q^{(2)} O_b$    (seven)

where $Q_1$ denotes the foreground prototype matrix obtained by linearly mapping the preset foreground prototype data $O_f$ with the matrix $W_Q^{(2)}$, $Q_2$ denotes the background prototype matrix obtained by linearly mapping the preset background prototype data $O_b$ with the matrix $W_Q^{(2)}$, and $W_Q^{(2)}$ denotes the second weight matrix of the Query.

It should be noted that although $W_Q^{(2)}$ and the above $W_Q^{(1)}$ both denote weight matrices of the Query, their weights are not necessarily the same and are determined by the specific model parameters.
Similarly, the pixel features output by the pixel feature extraction layer are linearly mapped:

$K_j = W_K^{(2)} z_j^{*}$    (eight)

$V_j = W_V^{(2)} z_j^{*}$    (nine)

where $K_j$ denotes the pixel feature obtained by linearly mapping the pixel feature $z_j^{*}$ of the $j$-th pixel point with the matrix $W_K^{(2)}$, and $V_j$ denotes the pixel feature obtained by linearly mapping the pixel feature $z_j^{*}$ of the $j$-th pixel point with the matrix $W_V^{(2)}$. Here $W_K^{(2)}$ denotes the second weight matrix of the Key, and $W_V^{(2)}$ denotes the second weight matrix of the Value.
According to the embodiment of the invention, the similarity between the pixel feature of each pixel point and the preset foreground prototype data, and the similarity between the pixel feature of each pixel point and the preset background prototype data, can be calculated based on a cross-attention mechanism. The specific process is as follows:

$\gamma_{i,j} = \dfrac{Q_i K_j^{\top}}{\sqrt{d}}$    (ten)

where $i$ takes the value 1 or 2 in formula (ten). When $i = 1$, $\gamma_{1,j}$ denotes the similarity between the preset foreground prototype data and the pixel feature of the $j$-th pixel point; when $i = 2$, $\gamma_{2,j}$ denotes the similarity between the preset background prototype data and the pixel feature of the $j$-th pixel point. $\sqrt{d}$ denotes the scaling parameter and $K_j^{\top}$ denotes the transpose of $K_j$.

$m_{i,j} = \operatorname{softmax}(\gamma_{i,j})$    (eleven)

where $m_{i,j}$ denotes the value obtained by normalizing $\gamma_{i,j}$ according to formula (eleven), which limits the value of $\gamma_{i,j}$ to between 0 and 1. When $i = 1$, $m_{1,j}$ denotes the similarity at the spatial position of the $j$-th pixel in the foreground activation map, which can also be referred to as the activation value. When $i = 2$, $m_{2,j}$ denotes the similarity at the spatial position of the $j$-th pixel in the background activation map, which can also be referred to as the activation value.
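By way of illustration, the sketch below computes foreground and background activation maps from the pixel features and two prototype vectors in the spirit of formulas (six) through (eleven); the module name, the use of learnable prototype parameters and the normalization axis are assumptions made for this example.

import torch
import torch.nn as nn

class ActivationMapGenerator(nn.Module):
    # Sketch of cross-attention between fg/bg prototypes and pixel features, formulas (six) to (eleven).
    def __init__(self, d: int):
        super().__init__()
        self.w_q2 = nn.Linear(d, d, bias=False)            # second weight matrix of the Query
        self.w_k2 = nn.Linear(d, d, bias=False)            # second weight matrix of the Key
        self.w_v2 = nn.Linear(d, d, bias=False)            # second weight matrix of the Value
        self.fg_prototype = nn.Parameter(torch.randn(d))   # preset foreground prototype O_f (assumed learnable)
        self.bg_prototype = nn.Parameter(torch.randn(d))   # preset background prototype O_b (assumed learnable)
        self.scale = d ** 0.5

    def forward(self, pixel_feats: torch.Tensor):
        # pixel_feats: (hw, d) features from the pixel feature extraction layer.
        prototypes = torch.stack([self.fg_prototype, self.bg_prototype])  # (2, d)
        q = self.w_q2(prototypes)                          # formulas (six) and (seven)
        k = self.w_k2(pixel_feats)                         # formula (eight)
        v = self.w_v2(pixel_feats)                         # formula (nine)
        gamma = q @ k.transpose(-1, -2) / self.scale       # formula (ten): (2, hw) similarities
        m = torch.softmax(gamma, dim=0)                    # formula (eleven); normalizing over fg/bg is an assumption
        return m[0], m[1], v                               # foreground map, background map, mapped features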
In operation S340, third feature data and fourth feature data are respectively input to the perceptual feature extraction layer of the initial model, and a foreground perceptual feature and a background perceptual feature are output, where the third feature data includes a pixel feature and a foreground activation map, and the fourth feature data includes a pixel feature and a background activation map.
According to the embodiment of the invention, in the perceptual feature extraction layer of the initial model, the foreground perceptual feature is obtained by weighting and summing the pixel features according to the similarity between each pixel feature and the foreground prototype, as shown in formula (twelve):

$O_f^{*} = \sum_{j=1}^{hw} m_{1,j}\, V_j$    (twelve)

where $O_f^{*}$ denotes the foreground perceptual feature, $m_{1,j}$ denotes the similarity at the spatial position of the $j$-th pixel in the foreground activation map, and $V_j$ denotes the pixel feature of the $j$-th pixel point.

According to the embodiment of the invention, in the perceptual feature extraction layer of the initial model, the background perceptual feature is obtained by weighting and summing the pixel features according to the similarity between each pixel feature and the background prototype, as shown in formula (thirteen):

$O_b^{*} = \sum_{j=1}^{hw} m_{2,j}\, V_j$    (thirteen)

where $O_b^{*}$ denotes the background perceptual feature, $m_{2,j}$ denotes the similarity at the spatial position of the $j$-th pixel in the background activation map, and $V_j$ denotes the pixel feature of the $j$-th pixel point.
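A minimal sketch of formulas (twelve) and (thirteen) follows, assuming the activation values and the mapped pixel features V_j come from the previous step; the function and variable names are illustrative only.

import torch

def perceptual_features(fg_map: torch.Tensor, bg_map: torch.Tensor, v: torch.Tensor):
    # fg_map, bg_map: (hw,) activation values; v: (hw, d) mapped pixel features V_j.
    o_fg = (fg_map.unsqueeze(-1) * v).sum(dim=0)  # formula (twelve): foreground perceptual feature
    o_bg = (bg_map.unsqueeze(-1) * v).sum(dim=0)  # formula (thirteen): background perceptual feature
    return o_fg, o_bg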
In operation S350, fifth feature data and sixth feature data are respectively input into the classification layer of the initial model, and a classification result is output, where the fifth feature data includes foreground perceptual features and preset category prototype data, the sixth feature data includes background perceptual features and preset category prototype data, and the preset category prototype data represents a category of a partial region of the target object.
According to the embodiment of the invention, by taking the foreground classes in the sample training set as human, cat and dog as examples, the preset class prototype data can comprise the head of human, the head of cat and the head of dog; a human body, a cat body, a dog body, etc.
According to an embodiment of the present invention, the preset category prototype data are expressed as:

$P_1, P_2, \ldots, P_N$    (fourteen)

where $P_i$ denotes the $i$-th preset category prototype data and $N$ denotes the number of category prototypes.
According to the embodiment of the invention, the foreground perceptual feature and the preset category prototype data can each be linearly mapped:

$Q = W_Q^{(3)} O_f^{*}$    (fifteen)

$K_i = W_K^{(3)} P_i$    (sixteen)

$V_i = W_V^{(3)} P_i$    (seventeen)

where $Q$ denotes the feature obtained by linearly mapping the foreground perceptual feature $O_f^{*}$ with the matrix $W_Q^{(3)}$, $K_i$ denotes the category prototype matrix obtained by linearly mapping the $i$-th category prototype data $P_i$ with the matrix $W_K^{(3)}$, and $V_i$ denotes the category prototype matrix obtained by linearly mapping the $i$-th category prototype data $P_i$ with the matrix $W_V^{(3)}$. Here $W_Q^{(3)}$ denotes the third weight matrix of the Query, $W_K^{(3)}$ denotes the third weight matrix of the Key, and $W_V^{(3)}$ denotes the third weight matrix of the Value.
According to the embodiment of the invention, the foreground category characteristics are extracted by calculating the similarity between the foreground perception characteristics and the preset category data.
$\beta_i = \dfrac{Q K_i^{\top}}{\sqrt{d}}$    (eighteen)

where $\beta_i$ denotes the similarity between the foreground perceptual feature and the $i$-th preset category prototype data, $K_i^{\top}$ denotes the transpose of $K_i$, and $\sqrt{d}$ denotes the scaling factor.

$\operatorname{softmax}(\beta_i) = \dfrac{\exp(\beta_i)}{\sum_{i'=1}^{N} \exp(\beta_{i'})}$    (nineteen)

where $\operatorname{softmax}(\beta_i)$ denotes the value obtained by normalizing $\beta_i$, which limits the value of $\beta_i$ to between 0 and 1.

$P = \sum_{i=1}^{N} \operatorname{softmax}(\beta_i)\, V_i$    (twenty)

where $P$ denotes the foreground category feature, $N$ denotes the number of categories of the preset category prototype data, and $V_i$ denotes the category prototype matrix obtained by linearly mapping the $i$-th category prototype data $P_i$ with the matrix $W_V^{(3)}$.
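For illustration, the sketch below covers only the category feature extraction portion of the classification layer, mirroring formulas (fifteen) through (twenty); the module name, the learnable prototype parameters and the tensor shapes are assumptions made for this example.

import torch
import torch.nn as nn

class CategoryFeatureExtractor(nn.Module):
    # Sketch of attention over category prototypes, formulas (fifteen) to (twenty).
    def __init__(self, d: int, num_prototypes: int):
        super().__init__()
        self.w_q3 = nn.Linear(d, d, bias=False)  # third weight matrix of the Query
        self.w_k3 = nn.Linear(d, d, bias=False)  # third weight matrix of the Key
        self.w_v3 = nn.Linear(d, d, bias=False)  # third weight matrix of the Value
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, d))  # preset category prototypes P_i (assumed learnable)
        self.scale = d ** 0.5

    def forward(self, perceptual_feat: torch.Tensor) -> torch.Tensor:
        # perceptual_feat: (d,) foreground (or background) perceptual feature.
        q = self.w_q3(perceptual_feat)           # formula (fifteen)
        k = self.w_k3(self.prototypes)           # formula (sixteen): (N, d)
        v = self.w_v3(self.prototypes)           # formula (seventeen): (N, d)
        beta = (k @ q) / self.scale              # formula (eighteen): similarity to each prototype
        weights = torch.softmax(beta, dim=0)     # formula (nineteen)
        return weights @ v                       # formula (twenty): category feature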
In operation S360, the model parameters of the initial model are adjusted according to the classification result and the image category label, so as to obtain a trained target location model.
According to the embodiment of the present invention, the model parameters of the initial model may be adjusted according to the difference between the classification result and the image category label. For example, after the model parameters have been adjusted over many iterations, the classification result output by the model and the image category label may be input into a model loss function; when the change of the model loss function approaches zero, the training of the initial model is considered complete and the trained target positioning model is obtained.
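A hedged sketch of such a training loop is shown below; the optimizer choice, learning rate, tolerance and data-loader interface are all assumptions for illustration and are not specified by the invention.

import torch

def train(model, loader, loss_fn, lr: float = 1e-4, tol: float = 1e-6, max_epochs: int = 100):
    # Adjust model parameters until the change of the loss between epochs approaches zero.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for images, labels in loader:
            loss = loss_fn(model(images), labels)  # compares classification results with image category labels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if abs(prev_loss - total) < tol:           # change of the model loss approaches zero
            break
        prev_loss = total
    return model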
According to the embodiment of the invention, the foreground activation map and the background activation map generated by the invention are obtained according to the pixel characteristics and the preset foreground prototype data or the preset background prototype data, so that the foreground activation map comprises the similarity between all the pixel characteristics and the preset foreground prototype data, the background activation map comprises the similarity between all the pixel characteristics and the preset background prototype data, and the foreground activation map and the background activation map do not relate to the category of the image.
In order to further optimize the obtained foreground activation map and background activation map, the generation of the foreground activation map and the background activation map can be regarded as an optimal transport process, and the optimal assignment result T of the pixel features between the foreground and the background can be computed from a global perspective using the Sinkhorn algorithm. This assignment result T can then be used as a pseudo label to supervise the generation of the foreground activation map and the background activation map.
According to the embodiment of the invention, the method for inputting the pixel characteristics into the activation map generation layer of the initial model and outputting the foreground activation map and the background activation map comprises the following steps:
inputting the pixel characteristics into an activation map generation layer of an initial model, and outputting an initial foreground activation map and an initial background activation map;
splicing the initial foreground activation image and the initial background activation image to obtain an initial activation matrix;
calculating a semantic loss matrix of the pixel characteristics by using an optimal transmission algorithm, wherein the semantic loss matrix represents optimal distribution results of the pixel characteristics in the foreground and the background;
and adjusting the model parameters of the activation map generation layer of the initial model according to the semantic loss matrix and the initial activation matrix so as to output the optimized foreground activation map and background activation map.
According to the embodiment of the invention, the optimization objective obtained with the optimal transport (Sinkhorn) algorithm is expressed as formula (twenty-one):

$\max_{T}\; \operatorname{Tr}\!\left(T^{\top} M_{FB}\right) + \varepsilon H(T)$    (twenty-one)

where $T$ denotes the semantic loss matrix, i.e. the transport matrix of the process that generates the foreground activation map and the background activation map, $M_{FB}$ denotes the initial activation matrix, $H(T)$ denotes the entropy function of $T$, $\operatorname{Tr}$ denotes the trace of the matrix, and $\varepsilon$ denotes an entropy regularization coefficient.
According to the embodiment of the invention, a binary cross-entropy loss can be used so that T, the optimal assignment result, serves as a pseudo label supervising the generation of the foreground activation map and the background activation map, making the difference between the semantic loss matrix T and the initial activation matrix $M_{FB}$ tend to zero:

$\mathcal{L}_{act} = \ell_{bce}\!\left(M_{FB},\, T\right)$    (twenty-two)

where $\mathcal{L}_{act}$ denotes the activation loss function and $\ell_{bce}$ denotes the binary cross-entropy loss.
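For illustration, the sketch below runs a generic entropy-regularized Sinkhorn iteration over the initial activation matrix to produce a soft assignment that could serve as the pseudo label T; the uniform marginals, the regularization strength and the iteration count are assumptions, and this is not necessarily the exact procedure of the invention.

import torch

def sinkhorn_assignment(m_fb: torch.Tensor, eps: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    # m_fb: (2, hw) initial activation matrix (foreground row, background row).
    t = torch.exp(m_fb / eps)                       # transport kernel built from the activation scores
    rows, cols = m_fb.shape
    row_marginal = torch.full((rows,), 1.0 / rows)  # assumed uniform marginal over fg/bg
    col_marginal = torch.full((cols,), 1.0 / cols)  # assumed uniform marginal over pixels
    for _ in range(n_iters):
        t = t * (row_marginal / t.sum(dim=1)).unsqueeze(1)  # match row marginals
        t = t * (col_marginal / t.sum(dim=0)).unsqueeze(0)  # match column marginals
    return t  # soft assignment used as a pseudo label in formula (twenty-two)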
According to the embodiment of the invention, the optimal transport algorithm uses the optimal assignment result as a pseudo label to supervise the generation of the foreground activation map and the background activation map. This optimizes the activation map generation layer of the initial model and yields the optimal assignment of pixel features between the foreground and the background, thereby optimizing the foreground activation map and the background activation map and improving the positioning accuracy of the target positioning model.
According to the embodiment of the invention, the classification layer of the initial model comprises a classification feature extraction layer and a classification matching layer, wherein the fifth feature data and the sixth feature data are respectively input into the classification layer of the initial model, and the classification result is output, and the classification result comprises the following steps:
inputting the fifth feature data and the sixth feature data into a category feature extraction layer respectively, and outputting a foreground category feature and a background category feature;
inputting the foreground category characteristics and the foreground perception characteristics into a category matching layer, and outputting a foreground classification result;
and inputting the background category characteristics and the background perception characteristics into a category matching layer, and outputting a background classification result.
According to the embodiment of the invention, taking a picture that includes sky, grassland, trees, birds and puppies as an example, the foreground perceptual features may include pixel features representing the birds and the puppies, and the background perceptual features may include pixel features representing the sky, the grassland and the trees. The preset category prototype data may include the head of a puppy, the body of a puppy, the tail of a puppy, the head of a person, the body of a person, and so forth. Sky, grassland and trees can be treated as the category prototype data of a single background class without further classification.
According to the embodiment of the invention, the foreground perception feature and the preset class prototype data are input into the class feature extraction layer, and the obtained foreground class features can comprise the head of a bird, the head of a dog, the body of the bird, the body of the dog, the tail of the bird and the tail of the dog. Inputting the background perception features and the preset class prototype data into a class feature extraction layer, wherein the obtained background class features can comprise sky, grassland and trees.
According to the embodiment of the invention, the foreground classification feature and the foreground perception feature are used as global features to be input into the classification layer of the initial model, so that the foreground classification result of the picture can be obtained, such as birds and dogs. And inputting the background class characteristics and the background perception characteristics into a classification layer of the initial model to obtain a background classification result of the picture.
According to the embodiment of the invention, the foreground classification result is obtained by extracting the foreground classification characteristic with discrimination from the foreground perception characteristic and inputting the foreground classification characteristic and the foreground perception characteristic into the classification matching layer as the global characteristic, so that the target positioning module obtained by training can not only output the positioning of the target object but also output the classification of the target object.
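As a purely illustrative sketch of such category matching (an assumption, not the disclosed implementation), the perception feature can be compared with each category feature by cosine similarity and normalized with a softmax; all names below are hypothetical.

    # Illustrative sketch only: matching a global perception feature against
    # category features with cosine similarity. All names are hypothetical.
    import torch
    import torch.nn.functional as F

    def match_categories(perception_feat, category_feats, temperature=0.07):
        """perception_feat: (D,)  foreground or background perception feature.
        category_feats:  (C, D)  category features (e.g. bird head, dog head, ...).
        Returns a (C,) vector of class probabilities."""
        sim = F.cosine_similarity(perception_feat.unsqueeze(0), category_feats, dim=1)
        return torch.softmax(sim / temperature, dim=0)

    # Example: one foreground perception feature matched against 6 category features.
    probs = match_categories(torch.randn(256), torch.randn(6, 256))   # sums to 1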
According to the embodiment of the invention, the model parameters of the initial model are adjusted according to the classification result and the image category label to obtain the trained target positioning model, which comprises the following steps:
constructing a classification loss function according to the foreground classification result parameter and the background classification result parameter;
constructing an activation loss function according to the semantic loss matrix parameters and the activation matrix parameters;
constructing a model loss function according to the classification loss function and the activation loss function;
and adjusting the model parameters of the initial model according to the classification result and the image category label by using a model loss function to obtain the trained target positioning model.
According to the embodiment of the invention, a classification loss function is constructed from the foreground classification result parameter and the background classification result parameter, as shown in formulas (twenty-three) and (twenty-four), where L_cls^f denotes the foreground classification loss function, L_cls^b denotes the background classification loss function, C denotes the number of classes of the foreground category features, p_i^f denotes the i-th element of the foreground classification result, p_i^b denotes the i-th element of the background classification result, and y_i^f denotes the i-th element of the foreground classification result parameter.
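The formula images for (twenty-three) and (twenty-four) are not reproducible here; a plausible form that is consistent with the symbols defined above, written as an assumption rather than the verbatim formulas, is a binary cross-entropy over the C classes:

    % Assumed reconstruction, not the patent's verbatim formulas.
    L_{cls}^{f} = -\frac{1}{C}\sum_{i=1}^{C}\Big[ y_{i}^{f}\log p_{i}^{f} + (1-y_{i}^{f})\log(1-p_{i}^{f}) \Big]   % (twenty-three, assumed)
    L_{cls}^{b} = -\frac{1}{C}\sum_{i=1}^{C}\log p_{i}^{b}                                                         % (twenty-four, assumed; every y_{i}^{b}=1)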
According to the embodiment of the present invention, since the background is treated as a single special class and specific background contents are not distinguished, the background classification result parameter can be set to an all-ones vector, that is, each element in the background classification result parameter is 1.
According to an embodiment of the invention, the model loss function may be represented by formula (twenty-five), where L denotes the model loss function, L_cls^f denotes the foreground classification loss function, L_cls^b denotes the background classification loss function, L_act denotes the activation loss function, and α and β both denote balance coefficients.
According to the embodiment of the invention, the model loss function constructed jointly from the foreground classification loss function, the background classification loss function and the activation loss function drives the pixel features of the sample data toward the optimal foreground/background distribution result, so that more accurate foreground perceptual features and background perceptual features are extracted and the accuracy of foreground classification and background classification is improved.
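As an illustrative sketch of how the three losses might be assembled into the model loss (the placement of the balance coefficients, the shapes and all names are assumptions, not the disclosed code):

    # Illustrative sketch only: assembling the model loss from the classification
    # results and the optimal-transport pseudo labels. Names, shapes and the
    # placement of the balance coefficients are assumptions.
    import torch
    import torch.nn.functional as F

    def model_loss(fg_logits, bg_logits, fg_labels,
                   activation_maps, ot_pseudo_labels, alpha=1.0, beta=1.0):
        """fg_logits / bg_logits: (C,) foreground / background classification results.
        fg_labels:        (C,) foreground classification result parameter.
        activation_maps:  (N, 2) stacked foreground/background activations in [0, 1].
        ot_pseudo_labels: (N, 2) optimal-assignment pseudo labels in [0, 1]."""
        bg_labels = torch.ones_like(bg_logits)              # background label: all ones
        fg_cls = F.binary_cross_entropy_with_logits(fg_logits, fg_labels)
        bg_cls = F.binary_cross_entropy_with_logits(bg_logits, bg_labels)
        act = F.binary_cross_entropy(activation_maps, ot_pseudo_labels)
        return fg_cls + alpha * bg_cls + beta * act         # formula (twenty-five), assumed form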
Fig. 4 schematically shows a flow chart of a method of object localization according to an embodiment of the present invention.
As shown in fig. 4, the object locating method of this embodiment includes operations S410 to S450.
In operation S410, the image to be detected is input into the pixel feature extraction layer of the target location model, and the pixel feature to be detected of each pixel point in the image to be detected is output, wherein the target location model is obtained by training using the training method of the present invention.
According to an embodiment of the invention, for example, the image to be detected is a picture including the sky, a tree and a bird on the tree. The image to be detected is input into the pixel feature extraction layer of the target positioning model, and the pixel feature to be detected of each pixel point in the image to be detected can be output.
In operation S420, the pixel feature to be detected and the preset foreground prototype data are input into an activation map generation layer of the target positioning model, and a foreground activation map is output, where the foreground activation map includes a third similarity representing a similarity between the pixel feature to be detected and the preset foreground prototype data.
According to the embodiment of the invention, the foreground activation map includes the similarity between the feature of the pixel to be detected of each pixel point and the preset foreground prototype data, and the similarity can be any value between 0 and 1, for example: in the foreground activation map, the similarity between the pixel features of the pixels characterizing the bird area in the picture and the preset foreground prototype data is close to 1, which may be 0.6, 0.8, 0.9, and the like, and the similarity between the pixel features of the pixels characterizing the sky area and the tree area in the picture and the preset foreground prototype data is close to 0, which may be 0.1, 0.2, and the like.
In operation S430, a target location area is determined according to the third similarity in the foreground activation map.
According to the embodiment of the invention, the pixel point region in which the third similarity approaches 1 in the foreground activation map can be determined as the target positioning area. The target positioning area may be represented by a plurality of coordinate points of a contiguous graphic region covering the target pixel points.
In operation S440, the pixel feature to be detected and the foreground activation map are input to the perceptual feature extraction layer of the target location model, and the foreground perceptual feature is output.
According to the embodiment of the present invention, the foreground perceptual feature may be a foreground perceptual feature vector calculated by using formula (twelve) described above, for example, the pixel feature vector representing the bird area in the original image.
In operation S450, the foreground sensing feature and the preset category prototype data are input to the classification layer, and a classification result of the image to be detected is output, where the preset category prototype data represents a category of a partial region of the target object.
According to an embodiment of the present invention, the pixel feature vector representing the bird region in the original image and the preset category prototype data, which may include a bird head, a dog head, a human head, and the like, are input into the classification layer. The pixel feature vector representing the bird region is matched against each item of the preset category prototype data, and the matching result indicates that the category of the foreground perceptual feature is bird, so the classification result of the image to be detected is output as bird.
According to the embodiment of the invention, the image to be detected is processed by the target positioning model obtained with the training method provided by the invention. Because the foreground activation map is generated from the similarity between the pixel features and the preset prototype data, a complete positioning area of the target object can be obtained; the foreground perceptual features are then extracted according to the foreground activation map and their category is identified, which completes the positioning of the target object and solves the problem of incomplete target positioning in the prior art.
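A minimal end-to-end sketch of operations S410 to S450 is given below, assuming the target positioning model exposes layers named after those in this description; every attribute, function name and tensor shape is hypothetical.

    # Illustrative sketch of operations S410-S450. Layer names, attribute names
    # and tensor shapes are assumptions, not the patent's actual code.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def locate_and_classify(model, image, fg_prototypes, class_prototypes, thr=0.5):
        """image: (1, 3, H, W) image to be detected."""
        pixel_feats = model.pixel_feature_layer(image)                 # S410: (1, D, h, w)
        fg_map = model.activation_layer(pixel_feats, fg_prototypes)    # S420: (1, 1, h, w)
        fg_full = F.interpolate(fg_map, size=image.shape[-2:],
                                mode="bilinear", align_corners=False)  # match input size
        # S430: assumes at least one pixel exceeds the threshold.
        ys, xs = torch.nonzero(fg_full[0, 0] > thr, as_tuple=True)
        box = (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())
        fg_feat = model.perception_layer(pixel_feats, fg_map)          # S440: (1, D)
        probs = model.classification_layer(fg_feat, class_prototypes)  # S450: (1, C)
        return box, probs.argmax(dim=1)                                # location + category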
According to the embodiment of the invention, the determining the target positioning area according to the third similarity in the foreground activation map comprises the following steps:
extracting a plurality of target pixel points from the image to be detected according to the foreground activation image, wherein the target pixel points represent pixel points corresponding to target similarity in the foreground activation image, and the target similarity represents similarity larger than a preset threshold in third similarity;
and determining the adjacent graphic area covering the target pixel points as a target positioning area.
According to the embodiment of the invention, before the plurality of target pixel points are extracted from the image to be detected, bilinear interpolation may be performed on the generated foreground activation map so that the matrix dimensions of the foreground activation map and the image to be detected become the same. For example, if the foreground activation map is a 3 × 3 matrix and the image to be detected is a 5 × 5 matrix, new elements can be interpolated between neighboring elements of the foreground activation map; for instance, if the i-th element is 0.2 and the (i + 1)-th element is 0.3, the inserted element can be their average 0.25, and in this way the matrix of the foreground activation map is converted into a matrix with the same dimensions as the matrix of the image to be detected.
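For illustration, such an upsampling of a 3 × 3 foreground activation map to a 5 × 5 image can be done with off-the-shelf bilinear interpolation; the matrix values below are made up.

    # Illustrative sketch: bilinear interpolation of a 3x3 foreground activation
    # map to the 5x5 resolution of the image to be detected.
    import torch
    import torch.nn.functional as F

    fg_map = torch.tensor([[0.1, 0.2, 0.1],
                           [0.2, 0.9, 0.8],
                           [0.1, 0.8, 0.7]]).view(1, 1, 3, 3)   # (N, C, H, W)
    fg_map_5x5 = F.interpolate(fg_map, size=(5, 5), mode="bilinear",
                               align_corners=True)
    print(fg_map_5x5.shape)   # torch.Size([1, 1, 5, 5])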
Fig. 5 schematically illustrates a diagram of locating a target region using a foreground activation map.

As shown in fig. 5, the embodiment includes a to-be-detected pixel feature matrix 501 and a foreground activation map 502. For example, with a preset threshold of 0.5, the target pixel points in the foreground activation map 502 that are greater than the preset threshold are A3, B2, B3, C2 and C3.
According to the embodiment of the invention, the contiguous graphic area covering the target pixel points is determined as the target positioning area, where the contiguous graphic area may be the largest contiguous rectangle covering the target pixel points, or a contiguous area of another shape. Taking the largest contiguous rectangle as an example, the target positioning area may be represented by the coordinates of the four vertices of the rectangle, i.e., the coordinates of points M, N, P and Q.
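A small sketch consistent with the example of Fig. 5 is given below; only the above-threshold positions A3, B2, B3, C2 and C3 follow the example, while the activation values themselves are made up.

    # Illustrative sketch: deriving the target positioning area (a rectangle
    # covering all target pixel points) from a thresholded foreground activation
    # map. Values are made up; the above-threshold positions match Fig. 5.
    import numpy as np

    fg_map = np.array([[0.1, 0.2, 0.7],    # row A: A1, A2, A3
                       [0.3, 0.6, 0.8],    # row B: B1, B2, B3
                       [0.2, 0.9, 0.7]])   # row C: C1, C2, C3
    threshold = 0.5

    rows, cols = np.where(fg_map > threshold)                 # A3, B2, B3, C2, C3
    # Rectangle covering all target pixel points, (x_min, y_min, x_max, y_max);
    # its four corners play the role of points M, N, P, Q in Fig. 5.
    box = (cols.min(), rows.min(), cols.max(), rows.max())
    print(box)   # (1, 0, 2, 2)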
According to the embodiment of the invention, the positioning area of the target is determined from the similarity between the pixel features and the preset foreground prototype data through the category-agnostic foreground activation map, which ensures the completeness of target positioning.
Based on the target positioning model training method, the invention also provides a target positioning model training device. The apparatus will be described in detail below with reference to fig. 6.
FIG. 6 is a block diagram schematically illustrating an embodiment of an apparatus for training an object localization model according to the present invention.
As shown in fig. 6, the apparatus 600 for training an object location model of this embodiment includes an obtaining module 610, a first extracting module 620, a first generating module 630, a second extracting module 640, a first classifying module 650, and a training module 660.
The obtaining module 610 is configured to obtain a sample data set, where the sample data set includes a plurality of image samples, each image sample includes image data, and each image sample has an image category tag. In an embodiment, the obtaining module 610 may be configured to perform the operation S310 described above, which is not described herein again.
The first extraction module 620 is configured to input each image sample into a pixel feature extraction layer of the initial model, and output a pixel feature of each pixel point in the image data. In an embodiment, the first extracting module 620 may be configured to perform the operation S320 described above, which is not described herein again.
A first generation module 630, configured to input first feature data and second feature data into an activation map generation layer of the initial model, and output a foreground activation map and a background activation map, where the first feature data includes pixel features and preset foreground prototype data, and the second feature data includes pixel features and preset background prototype data; the foreground activation image comprises a first similarity representing the similarity between the pixel feature of each pixel and the preset foreground prototype data, the background activation image comprises a second similarity representing the similarity between the pixel feature of each pixel and the preset background prototype data. In an embodiment, the first generating module 630 may be configured to perform the operation S330 described above, which is not described herein again.
The second extraction module 640 is configured to input third feature data and fourth feature data into the perceptual feature extraction layer of the initial model respectively, and output foreground perceptual features and background perceptual features, where the third feature data includes pixel features and a foreground activation map, and the fourth feature data includes pixel features and a background activation map. In an embodiment, the second extraction module 640 may be configured to perform the operation S340 described above, which is not described herein again.
The first classification module 650 is configured to input fifth feature data and sixth feature data into a classification layer of the initial model, and output a classification result, where the fifth feature data includes a foreground perceptual feature and preset-category prototype data, the sixth feature data includes a background perceptual feature and preset-category prototype data, and the preset-category prototype data represents a category of a partial region of the target object. In an embodiment, the first classification module 650 may be configured to perform the operation S350 described above, and is not described herein again.
And the training module 660 is configured to adjust the model parameters of the initial model according to the classification result and the image category label, so as to obtain a trained target positioning model. In an embodiment, the training module 660 may be configured to perform the operation S360 described above, which is not described herein again.
According to the embodiment of the invention, the first generation module comprises a first generation unit, a splicing unit, a first calculation unit and a first adjustment unit. The first generation unit is used for inputting the pixel features into an activation map generation layer of an initial model and outputting an initial foreground activation map and an initial background activation map. The splicing unit is used for splicing the initial foreground activation map and the initial background activation map to obtain an initial activation matrix. The first calculation unit is used for calculating a semantic loss matrix of the pixel features by using an optimal transport algorithm, wherein the semantic loss matrix represents optimal distribution results of the pixel features in the foreground and the background. The first adjustment unit is used for adjusting the model parameters of the activation map generation layer of the initial model according to the semantic loss matrix and the initial activation matrix so as to output the optimized foreground activation map and background activation map.
According to an embodiment of the present invention, the first classification module includes a first extraction unit, a first classification unit, and a second classification unit. The first extraction unit is used for inputting the fifth feature data and the sixth feature data into the category feature extraction layer respectively and outputting the foreground category features and the background category features. And the first classification unit is used for inputting the foreground classification characteristic and the foreground perception characteristic into the classification matching layer and outputting a foreground classification result. And the second classification unit is used for inputting the background classification features and the background perception features into the classification matching layer and outputting a background classification result.
According to an embodiment of the invention, the training module comprises a first building element, a second building element, a third building element and a training element. The first construction unit is used for constructing a classification loss function according to the foreground classification result parameter and the background classification result parameter. And the second construction unit is used for constructing the activation loss function according to the semantic loss matrix parameters and the activation matrix parameters. And the third construction unit is used for constructing a model loss function according to the classification loss function and the activation loss function. And the training unit is used for adjusting the model parameters of the initial model according to the classification result and the image category label by using the model loss function to obtain a trained target positioning model.
Based on the target positioning method, the invention also provides a target positioning device. The apparatus will be described in detail below with reference to fig. 7.
FIG. 7 schematically shows a block diagram of an object locating device according to an embodiment of the present invention.
As shown in fig. 7, the object locating device includes: a third extraction module 710, a second generation module 720, a determination module 730, a fourth extraction module 740, and a second classification module 750.
The third extraction module 710 is configured to input the image to be detected into the pixel feature extraction layer of the target location model, and output the pixel feature to be detected of each pixel point in the image to be detected, where the target location model is obtained by using the training method of the present invention. In an embodiment, the third extracting module 710 may be configured to perform the operation S410 described above, which is not described herein again.
And a second generating module 720, configured to input the pixel feature to be detected and the preset foreground prototype data into an activation map generating layer of the target positioning model, and output a foreground activation map, where the foreground activation map includes a third similarity representing a similarity between the pixel feature to be detected and the preset foreground prototype data. In an embodiment, the second generating module 720 may be configured to perform the operation S420 described above, which is not described herein again.

And a determining module 730, configured to determine the target location area according to the third similarity in the foreground activation map. In an embodiment, the determining module 730 may be configured to perform the operation S430 described above, which is not described herein again.

And a fourth extraction module 740, configured to input the pixel feature to be detected and the foreground activation map into the perceptual feature extraction layer of the target positioning model, and output the foreground perceptual feature. In an embodiment, the fourth extraction module 740 may be configured to perform the operation S440 described above, which is not described herein again.

And a second classification module 750, configured to input the foreground perceptual features and the preset category prototype data into the classification layer and output the classification result of the image to be detected, where the preset category prototype data represents the category of a partial region of the target object. In an embodiment, the second classification module 750 may be configured to perform the operation S450 described above, which is not described herein again.
According to an embodiment of the present invention, the determination module includes a second extraction unit and a determination unit. The second extraction unit is used for extracting a plurality of target pixel points from the image to be detected according to the foreground activation image, wherein the target pixel points represent pixel points corresponding to the target similarity in the foreground activation image, and the target similarity represents the similarity of a third similarity which is larger than a preset threshold. And the determining unit is used for determining the adjacent graphic area covering the target pixel points as a target positioning area.
According to the embodiment of the present invention, any plurality of the obtaining module 610, the first extracting module 620, the first generating module 630, the second extracting module 640, the first classifying module 650, and the training module 660, or the third extracting module 710, the second generating module 720, the determining module 730, the fourth extracting module 740, and the second classifying module 750 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to the embodiment of the present invention, at least one of the obtaining module 610, the first extracting module 620, the first generating module 630, the second extracting module 640, the first classifying module 650, the training module 660, or the third extracting module 710, the second generating module 720, the determining module 730, the fourth extracting module 740, and the second classifying module 750 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementation manners of software, hardware, and firmware, or by a suitable combination of any several of them. Alternatively, at least one of the obtaining module 610, the first extracting module 620, the first generating module 630, the second extracting module 640, the first classifying module 650, the training module 660, or the third extracting module 710, the second generating module 720, the determining module 730, the fourth extracting module 740, the second classifying module 750 may be at least partially implemented as a computer program module, which when executed, may perform a corresponding function.
Fig. 8 schematically shows a block diagram of an electronic device adapted to implement the object localization model training method or the object localization method according to an embodiment of the present invention.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present invention includes a processor 801 which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., Application Specific Integrated Circuit (ASIC)), among others. The processor 801 may also include onboard memory for caching purposes. The processor 801 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present invention.
In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 802 and/or the RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and RAM 803. The processor 801 may also perform various operations of method flows according to embodiments of the present invention by executing programs stored in the one or more memories.
According to an embodiment of the invention, electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to bus 804. Electronic device 800 may also include one or more of the following components connected to I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a display such as a Cathode Ray Tube (CRT) display or a Liquid Crystal Display (LCD), and a speaker; a storage portion 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage portion 808 as necessary.
The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.
According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present invention, a computer-readable storage medium may include the ROM 802 and/or the RAM 803 described above and/or one or more memories other than the ROM 802 and the RAM 803.
Embodiments of the invention also include a computer program product comprising a computer program comprising program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to implement the object localization model training method or the object localization method provided by the embodiments of the present invention.
The computer program performs the above-described functions defined in the system/apparatus of the embodiment of the present invention when executed by the processor 801. The above described systems, devices, modules, units, etc. may be implemented by computer program modules according to embodiments of the invention.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via communication section 809, and/or installed from removable media 811. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program, when executed by the processor 801, performs the above-described functions defined in the system of the embodiment of the present invention. The above described systems, devices, apparatuses, modules, units, etc. may be implemented by computer program modules according to embodiments of the present invention.
According to embodiments of the present invention, program code for executing a computer program provided by embodiments of the present invention may be written in any combination of one or more programming languages, and in particular, the computer program may be implemented using a high level procedural and/or object oriented programming language, and/or an assembly/machine language. The programming language includes, but is not limited to, programming languages such as Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be appreciated by a person skilled in the art that various combinations and/or combinations of features described in the various embodiments and/or in the claims of the invention are possible, even if such combinations or combinations are not explicitly described in the invention. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present invention may be made without departing from the spirit or teaching of the invention. All such combinations and/or associations fall within the scope of the present invention.
The embodiments of the present invention have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the invention is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the invention, and these alternatives and modifications are intended to fall within the scope of the invention.

Claims (10)

1. An object localization model training method, comprising:
acquiring a sample data set, wherein the sample data set comprises a plurality of image samples, each image sample comprises image data, and each image sample has an image category label;
inputting each image sample into a pixel feature extraction layer of an initial model, and outputting the pixel feature of each pixel point in the image data;
inputting first feature data and second feature data into an activation map generation layer of the initial model, and outputting a foreground activation map and a background activation map, wherein the first feature data comprises the pixel features and preset foreground prototype data, and the second feature data comprises the pixel features and preset background prototype data; the foreground activation map comprises a first similarity, the first similarity characterizes the similarity between the pixel feature of each pixel point and preset foreground prototype data, the background activation map comprises a second similarity, and the second similarity characterizes the similarity between the pixel feature of each pixel point and preset background prototype data;
inputting third feature data and fourth feature data into a perceptual feature extraction layer of the initial model respectively, and outputting foreground perceptual features and background perceptual features, wherein the third feature data comprise the pixel features and the foreground activation map, and the fourth feature data comprise the pixel features and the background activation map;
inputting fifth feature data and sixth feature data into a classification layer of the initial model respectively, and outputting a classification result, wherein the fifth feature data comprises the foreground perception feature and preset category prototype data, the sixth feature data comprises the background perception feature and preset category prototype data, and the preset category prototype data represents the category of a target object partial region; and
and adjusting the model parameters of the initial model according to the classification result and the image category label to obtain a trained target positioning model.
2. The method of claim 1, wherein said inputting said pixel features into an activation map generation layer of said initial model, outputting a foreground activation map and a background activation map, comprises:
inputting the pixel characteristics into an activation map generation layer of the initial model, and outputting an initial foreground activation map and an initial background activation map;
splicing the initial foreground activation image and the initial background activation image to obtain an initial activation matrix;
calculating a semantic loss matrix of the pixel features by using an optimal transport algorithm, wherein the semantic loss matrix represents optimal distribution results of the pixel features in the foreground and the background;
and adjusting model parameters of an activation map generation layer of the initial model according to the semantic loss matrix and the initial activation matrix so as to output the optimized foreground activation map and the optimized background activation map.
3. The method according to claim 2, wherein the classification layer of the initial model comprises a class feature extraction layer and a class matching layer, and the inputting the fifth feature data and the sixth feature data into the classification layer of the initial model respectively and outputting the classification result comprises:
inputting the fifth feature data and the sixth feature data into the category feature extraction layer respectively, and outputting foreground category features and background category features;
inputting the foreground category characteristics and the foreground perception characteristics into the category matching layer and outputting a foreground classification result;
and inputting the background category characteristics and the background perception characteristics into the category matching layer, and outputting a background classification result.
4. The method of claim 3, wherein the adjusting the model parameters of the initial model according to the classification result and the image class label to obtain a trained object localization model comprises:
constructing a classification loss function according to the foreground classification result parameter and the background classification result parameter;
constructing an activation loss function according to the semantic loss matrix parameters and the activation matrix parameters;
constructing a model loss function according to the classification loss function and the activation loss function;
and adjusting the model parameters of the initial model according to the classification result and the image category label by using the model loss function to obtain a trained target positioning model.
5. A method of target localization, comprising:
inputting an image to be detected into a pixel feature extraction layer of a target positioning model, and outputting the pixel feature to be detected of each pixel point in the image to be detected, wherein the target positioning model is obtained by adopting the training method of any one of claims 1 to 4;
inputting the pixel feature to be detected and preset foreground prototype data into an activation map generation layer of the target positioning model, and outputting a foreground activation map, wherein the foreground activation map comprises a third similarity, and the third similarity represents the similarity between the pixel feature to be detected and the preset foreground prototype data;
determining a target positioning area according to the third similarity in the foreground activation map;
inputting the pixel feature to be detected and the foreground activation map into a perception feature extraction layer of the target positioning model, and outputting a foreground perception feature;
and inputting the foreground perception features and preset category prototype data into a classification layer, and outputting a classification result of the image to be detected, wherein the preset category prototype data represents the category of a partial region of the target object.
6. The method of claim 5, wherein said determining a target location area from said third similarity in said foreground activation map comprises:
extracting a plurality of target pixel points from the image to be detected according to the foreground activation image, wherein the target pixel points represent pixel points corresponding to target similarity in the foreground activation image, and the target similarity represents similarity larger than a preset threshold in the third similarity;
and determining an adjacent graphic area covering the target pixel points as the target positioning area.
7. An object localization model training apparatus comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a sample data set, the sample data set comprises a plurality of image samples, each image sample comprises image data, and the image samples are provided with image category labels;
the first extraction module is used for inputting each image sample into a pixel feature extraction layer of an initial model and outputting the pixel feature of each pixel point in the image data;
a first generation module, configured to input first feature data and second feature data into an activation map generation layer of the initial model, and output a foreground activation map and a background activation map, where the first feature data includes the pixel feature and preset foreground prototype data, and the second feature data includes the pixel feature and preset background prototype data; the foreground activation image comprises a first similarity, the first similarity characterizes the similarity between the pixel feature of each pixel point and preset foreground prototype data, the background activation image comprises a second similarity, and the second similarity characterizes the similarity between the pixel feature of each pixel point and preset background prototype data;
a second extraction module, configured to input third feature data and fourth feature data into a perceptual feature extraction layer of the initial model, respectively, and output a foreground perceptual feature and a background perceptual feature, where the third feature data includes the pixel feature and the foreground activation map, and the fourth feature data includes the pixel feature and the background activation map;
the first classification module is used for respectively inputting fifth feature data and sixth feature data into a classification layer of the initial model and outputting a classification result, wherein the fifth feature data comprises the foreground perception feature and preset class prototype data, the sixth feature data comprises the background perception feature and preset class prototype data, and the preset class prototype data represents the class of a target object partial region; and
and the training module is used for adjusting the model parameters of the initial model according to the classification result and the image category label to obtain a trained target positioning model.
8. An object locating device comprising:
a third extraction module, configured to input an image to be detected into a pixel feature extraction layer of a target positioning model, and output a to-be-detected pixel feature of each pixel point in the image to be detected, where the target positioning model is obtained by using the training method according to any one of claims 1 to 4;
the second generation module is used for inputting the pixel feature to be detected and preset foreground prototype data into an activation map generation layer of the target positioning model and outputting a foreground activation map, wherein the foreground activation map comprises a third similarity, and the third similarity represents the similarity between the pixel feature to be detected and the preset foreground prototype data;
the determining module is used for determining a target positioning area according to the third similarity in the foreground activation map;
the fourth extraction module is used for inputting the pixel characteristics to be detected and the foreground activation map into a perception characteristic extraction layer of the target positioning model and outputting foreground perception characteristics;
and the second classification module is used for inputting the foreground perception features and preset class prototype data into a classification layer and outputting a classification result of the image to be detected, wherein the preset class prototype data represents the class of a partial region of the target object.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-4 or 5-6.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 4 or 5 to 6.
CN202210387877.5A 2022-04-14 2022-04-14 Target positioning model training method, target positioning method and device Active CN114462559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210387877.5A CN114462559B (en) 2022-04-14 2022-04-14 Target positioning model training method, target positioning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210387877.5A CN114462559B (en) 2022-04-14 2022-04-14 Target positioning model training method, target positioning method and device

Publications (2)

Publication Number Publication Date
CN114462559A true CN114462559A (en) 2022-05-10
CN114462559B CN114462559B (en) 2022-07-15

Family

ID=81418690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210387877.5A Active CN114462559B (en) 2022-04-14 2022-04-14 Target positioning model training method, target positioning method and device

Country Status (1)

Country Link
CN (1) CN114462559B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180144209A1 (en) * 2016-11-22 2018-05-24 Lunit Inc. Object recognition method and apparatus based on weakly supervised learning
US20190279033A1 (en) * 2018-03-08 2019-09-12 Capital One Services, Llc Object detection using image classification models
CN109949316A (en) * 2019-03-01 2019-06-28 东南大学 A kind of Weakly supervised example dividing method of grid equipment image based on RGB-T fusion
CN110766027A (en) * 2019-10-22 2020-02-07 腾讯科技(深圳)有限公司 Image area positioning method and training method of target area positioning model
CN113361373A (en) * 2021-06-02 2021-09-07 武汉理工大学 Real-time semantic segmentation method for aerial image in agricultural scene

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵迪等: "基于Grad-CAM的探地雷达公路地下目标检测算法", 《电子测量技术》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311504A (en) * 2022-10-10 2022-11-08 之江实验室 Weak supervision positioning method and device based on attention repositioning
CN115311504B (en) * 2022-10-10 2023-01-31 之江实验室 Weak supervision positioning method and device based on attention relocation

Also Published As

Publication number Publication date
CN114462559B (en) 2022-07-15

Similar Documents

Publication Publication Date Title
Chen et al. R-CNN for small object detection
Wang et al. Change detection based on Faster R-CNN for high-resolution remote sensing images
Zhao et al. Classifying airborne LiDAR point clouds via deep features learned by a multi-scale convolutional neural network
Yi et al. ASSD: Attentive single shot multibox detector
CN110458107B (en) Method and device for image recognition
US11823443B2 (en) Segmenting objects by refining shape priors
US9349076B1 (en) Template-based target object detection in an image
Xu et al. Remote sensing image scene classification based on generative adversarial networks
CN111881777B (en) Video processing method and device
WO2024060684A1 (en) Model training method, image processing method, device, and storage medium
WO2022111387A1 (en) Data processing method and related apparatus
Glennie et al. Incorporating animal movement into distance sampling
Silver et al. In vino veritas: Estimating vineyard grape yield from images using deep learning
CN114676777A (en) Self-supervision learning fine-grained image classification method based on twin network
CN113887447A (en) Training method of object classification model, object classification prediction method and device
CN114462559B (en) Target positioning model training method, target positioning method and device
Pang et al. SCA-CDNet: A robust siamese correlation-and-attention-based change detection network for bitemporal VHR images
Jayanthi et al. Leaf disease segmentation from agricultural images via hybridization of active contour model and OFA
CN113569081A (en) Image recognition method, device, equipment and storage medium
CN116361502B (en) Image retrieval method, device, computer equipment and storage medium
CN116155628B (en) Network security detection method, training device, electronic equipment and medium
CN116524357A (en) High-voltage line bird nest detection method, model training method, device and equipment
CN115359468A (en) Target website identification method, device, equipment and medium
US20220327335A1 (en) Controlling asynchronous fusion of spatio-temporal multimodal data
Kumar et al. A study of iOS machine learning and artificial intelligence frameworks and libraries for cotton plant disease detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant