CN113112518A - Feature extractor generation method and device based on spliced image and computer equipment - Google Patents
Feature extractor generation method and device based on spliced image and computer equipment
- Publication number
- CN113112518A (application CN202110419268.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- feature extractor
- branch network
- image blocks
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T7/00—Image analysis › G06T7/10—Segmentation; Edge detection › G06T7/13—Edge detection
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation › G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/23—Clustering techniques
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T3/00—Geometric image transformations in the plane of the image › G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting › G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T2200/00—Indexing scheme for image data processing or generation, in general › G06T2200/32—Indexing scheme for image data processing or generation, in general involving image mosaicing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The application relates to a spliced-image-based feature extractor generation method and apparatus, a computer device and a storage medium. A sample image is divided into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the image blocks are all the same size. Image blocks of different sample images are randomly combined to obtain a spliced image composed of the preset number of image blocks. Based on the spliced image, a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model are jointly trained, and the trained feature extractor is taken as the spliced-image-based feature extractor. The clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image. The method and apparatus improve the utilization efficiency of sample images and improve the training effect and generation efficiency of the feature extractor.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for generating a feature extractor based on a stitched image, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, unsupervised visual representation learning has emerged. Its goal is to pre-train a feature extractor without human supervision, so that the generated feature extractor can be transferred to other tasks, shortening the completion period of those tasks.
In the prior art, each training batch and its augmented version must be processed simultaneously at every training step. The feature extractor generation process therefore consumes substantial resources, and conventional feature extractor generation is inefficient.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a computer device and a storage medium for generating a feature extractor based on a stitched image.
A method for feature extractor generation based on stitched images, the method comprising:
dividing a sample image into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the sizes of the image blocks are the same;
randomly combining the image blocks of different sample images to obtain a spliced image consisting of the image blocks of the preset number;
performing combined training on a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model based on the spliced image until the trained feature extractor meets a preset training condition, and taking the trained feature extractor as a feature extractor based on the spliced image; the clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image.
In one embodiment, the dividing the sample image into a preset number of image blocks includes:
determining the segmentation area of the image block according to the side length of the sample image;
and segmenting the sample image according to the segmentation area of the image blocks to obtain a preset number of image blocks corresponding to the sample image.
In one embodiment, the randomly combining the image blocks of different sample images to obtain a stitched image composed of the preset number of image blocks includes:
generating a spliced image template; the spliced image template comprises the preset number of vacant positions;
randomly selecting the preset number of image blocks from the image blocks as target image blocks; the image blocks in the preset number are derived from different sample images;
and filling the target image block into the vacancy of the spliced image template to obtain the spliced image.
In one embodiment, the jointly training the feature extractor, the clustering branch network, and the positioning branch network in the pre-constructed feature extractor training model includes:
inputting the spliced image into a pre-constructed feature extractor training model, and extracting the image features of the spliced image through a feature extractor in the feature extractor training model;
respectively transmitting the image features to the clustering branch network and the positioning branch network, so that the clustering branch network outputs clustering prediction results aiming at the image features, and the positioning branch network outputs positioning prediction results aiming at the image features;
and training the feature extractor based on the clustering prediction result, the positioning prediction result, the loss function corresponding to the clustering branch network and the loss function corresponding to the positioning branch network.
In one embodiment, the training the feature extractor based on the cluster prediction result, the positioning prediction result, the loss function corresponding to the cluster branch network, and the loss function corresponding to the positioning branch network includes:
constructing a target loss function corresponding to the pre-constructed feature extractor training model according to the loss function corresponding to the clustering branch network and the loss function corresponding to the positioning branch network; the target loss function is used for adjusting parameters of the feature extractor based on the clustering prediction result and the positioning prediction result in a training process.
In one embodiment, the respectively transferring the image features to the clustering branch network and the positioning branch network includes:
decoupling the image features to obtain the preset number of target image features;
and respectively transmitting the preset number of target image features to the clustering branch network and the positioning branch network.
In one embodiment, the decoupling the image features to obtain the preset number of target image features includes:
carrying out interpolation processing on the image characteristics to obtain amplified image characteristics;
and performing down-sampling processing on the amplified image features to obtain the preset number of target image features.
An apparatus for feature extractor generation based on stitched images, the apparatus comprising:
the image segmentation module is used for segmenting the sample image into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the sizes of the image blocks are the same;
the image splicing module is used for randomly combining the image blocks of different sample images to obtain a spliced image consisting of the preset number of image blocks;
the model training module is used for carrying out combined training on a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model based on the spliced image until the trained feature extractor meets a preset training condition, and taking the trained feature extractor as a feature extractor based on the spliced image; the clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
dividing a sample image into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the sizes of the image blocks are the same;
randomly combining the image blocks of different sample images to obtain a spliced image consisting of the image blocks of the preset number;
performing combined training on a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model based on the spliced image until the trained feature extractor meets a preset training condition, and taking the trained feature extractor as a feature extractor based on the spliced image; the clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
dividing a sample image into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the sizes of the image blocks are the same;
randomly combining the image blocks of different sample images to obtain a spliced image consisting of the image blocks of the preset number;
performing combined training on a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model based on the spliced image until the trained feature extractor meets a preset training condition, and taking the trained feature extractor as a feature extractor based on the spliced image; the clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image.
The spliced-image-based feature extractor generation method, apparatus, computer device and storage medium comprise: dividing a sample image into a preset number of image blocks, where adjacent image blocks in the same sample image comprise a common area and the image blocks are the same size; randomly combining image blocks of different sample images to obtain a spliced image composed of the preset number of image blocks; and performing joint training on a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model based on the spliced image until the trained feature extractor meets a preset training condition, then taking the trained feature extractor as the spliced-image-based feature extractor. The clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of those image blocks in the aggregated image. By dividing the sample image into mutually overlapping image blocks and splicing them into spliced images, the method improves the utilization efficiency of the sample images; training the pre-constructed feature extractor training model on the spliced images, with the clustering branch network and the positioning branch network, improves the training effect and generation efficiency of the feature extractor.
Drawings
FIG. 1 is a diagram of an application environment of a feature extractor generation method based on stitched images in one embodiment;
FIG. 2 is a schematic flow diagram of a method for feature extractor generation based on stitched images in one embodiment;
FIG. 3 is a flowchart illustrating a process of dividing a sample image into a predetermined number of image blocks according to an embodiment;
FIG. 4 is a flowchart illustrating a step of dividing a sample image into a predetermined number of image blocks according to an embodiment;
FIG. 5 is a flowchart illustrating a step of obtaining a stitched image composed of a predetermined number of image blocks according to an embodiment;
FIG. 6 is a diagram of a stitched image template, in one embodiment;
FIG. 7 is a schematic flow chart illustrating the joint training step performed on the network in the pre-constructed feature extractor training model in one embodiment;
FIG. 8 is a schematic diagram of a pre-constructed feature extractor training model in one embodiment;
FIG. 9 is a flow chart illustrating a method for generating a feature extractor based on a generic stitched image according to yet another embodiment;
FIG. 10 is a block diagram of an apparatus for feature extractor generation based on stitched images in one embodiment;
FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The feature extractor generation method based on the stitched image can be applied to the application environment shown in fig. 1. Wherein the terminal 11 communicates with the server 12 via a network. After the server 12 receives the sample image sent by the terminal 11, the server 12 divides the sample image into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the sizes of the image blocks are the same; the server 12 randomly combines the image blocks of different sample images to obtain a spliced image composed of a preset number of image blocks; the server 12 performs joint training on the feature extractor, the clustering branch network and the positioning branch network in the pre-constructed feature extractor training model based on the stitched image, and takes the trained feature extractor as a feature extractor based on the stitched image until the trained feature extractor meets the preset training condition; the clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image.
The terminal 11 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 12 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for generating a feature extractor based on a stitched image is provided, which is described by way of example as applied to the server 12 in fig. 1, and includes the following steps:
Step 21, dividing a sample image into a preset number of image blocks, where adjacent image blocks in the same sample image comprise a common area and the image blocks are the same size.
The sample image is an original training image used for generating the feature extractor, and may be input in the form of an image batch (batch), where one image batch X contains n sample images, X = {x_1, x_2, …, x_n}. Each sample image is divided into m × m image blocks, so one image batch contains n × m × m image blocks in total. Dividing a sample image into m × m image blocks is the preset-number division of the sample image; that is, the preset number is determined by the manner in which the sample image is divided: when m = 2 the preset number is 4, when m = 3 the preset number is 9, and so on.
The common area refers to the overlapping region that exists between image blocks after the sample image is divided. Fig. 3 is a schematic diagram of dividing a sample image into a preset number of image blocks: fig. 3-1 is the undivided sample image, fig. 3-2 shows the sample image divided into the preset number of image blocks, and fig. 3-3 is a schematic diagram of the preset number of image blocks recombined into the sample image. It can be seen that when m = 2, the preset number m × m is 4; that is, the sample image shown in fig. 3-1 is divided into 4 image blocks of the same size with a common area between adjacent blocks, where 3a, 3b, 3c and 3d are the 4 divided regions corresponding to the preset number and are at the same time the four divided image blocks. After fig. 3-1 is divided, as shown in fig. 3-2, there is a common area between 3a and 3b (the horse's mane), between 3b and 3c (the front half of the horse's body, the chest area), between 3c and 3d (the middle of the horse's body, the abdomen area), and between 3d and 3a (the back half of the horse's body, the tail area). Furthermore, owing to the area of the divided regions, the 4 image blocks also share a common region, namely the abdominal region of the horse. Comparing fig. 3-3 with fig. 3-1, fig. 3-3 is equivalent to enlarging the amount of information in each region: each image block is augmented with image information from the other image blocks, and such image blocks enable the feature extractor to learn better features during training.
Specifically, the server determines how the sample image is to be divided according to the preset number, determines the corresponding segmentation areas, and segments the sample image accordingly, so that the preset number of segmented image blocks have the same size and a common area exists between adjacent image blocks.
In this step, the server divides the sample image into a preset number of image blocks that share common areas and have the same size, which improves the utilization efficiency of the sample images and, in turn, the training effect of the feature extractor.
Step 22, randomly combining the image blocks of different sample images to obtain a spliced image consisting of a preset number of image blocks.
The random combination of the image blocks can adopt a random algorithm, so that image blocks belonging to different sample images are combined together. Further, the random algorithm may be configured, for example, to combine image blocks from a certain number of sample images: image blocks from two sample images may be randomly combined, or image blocks from four sample images may be randomly combined. The spliced image is an image formed by combining the preset number of image blocks at random positions.
Specifically, the server calls a random algorithm, shuffles the order of the image blocks of different sample images, and randomly combines them to obtain a new image batch X' = {x'_1, x'_2, …, x'_n}. In the new image batch X', each image is still composed of m × m image blocks, except that these image blocks come from different positions of different sample images.
In this step, the server shuffles the order of the divided image blocks and randomly combines them to obtain new spliced images. The spliced images are equivalent to an expansion of the sample images without increasing the total data volume, so training the model on them yields a better learning effect with fewer resources, further improving the training effect and generation efficiency of the feature extractor.
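To make the above concrete, here is a minimal Python sketch of the recombination that produces the new batch X'; the tensor layout and the helper name `recombine` are illustrative assumptions, not part of the original disclosure:

```python
import torch

def recombine(blocks: torch.Tensor) -> torch.Tensor:
    """Shuffle all n*m*m image blocks of a batch and regroup them.

    blocks: tensor of shape (n, m*m, C, s, s) holding the m*m blocks of
    each of the n sample images. Returns the same shape; each group of
    m*m blocks now generally comes from different sample images, and
    tiling each group into an m x m grid yields one spliced image.
    """
    n, mm, C, s, _ = blocks.shape
    flat = blocks.reshape(n * mm, C, s, s)   # pool all blocks of the batch
    perm = torch.randperm(n * mm)            # disorder the sequence
    return flat[perm].reshape(n, mm, C, s, s)
```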
Step 23, performing joint training on the feature extractor, the clustering branch network and the positioning branch network in the pre-constructed feature extractor training model based on the spliced image until the trained feature extractor meets a preset training condition, and taking the trained feature extractor as the spliced-image-based feature extractor.
The pre-constructed feature extractor training model comprises at least a feature extractor, a clustering branch network and a positioning branch network. The feature extractor is the backbone network (Backbone) of the model; it extracts features from the input spliced image to obtain the corresponding feature information. The clustering branch network identifies image blocks from the same sample image and aggregates them together; the positioning branch network predicts the position of each image block in the aggregated image. Finally, the outputs of the two branch networks are combined to obtain a recovered image corresponding to the sample image.
The training condition refers to the condition the model must meet to complete training. For example, when the loss value of the loss function falls below a certain value, the feature extractor can be judged to meet the preset training condition; or, when the matching accuracy between the output image of the pre-constructed feature extractor training model and the sample image reaches a certain value, the feature extractor can likewise be judged to meet the preset training condition.
Specifically, the server inputs the new image batch X' formed by the randomly combined spliced images into the pre-constructed feature extractor training model, performs joint training of the feature extractor, the clustering branch network and the positioning branch network, and continuously judges whether the feature extractor meets the preset training condition. When the preset training condition is reached, the trained feature extractor is taken as the spliced-image-based feature extractor, and generation of the feature extractor is complete.
In this step, the server inputs the spliced images into the pre-constructed feature extractor training model and trains the feature extractor it contains until the feature extractor meets the preset training condition, obtaining the spliced-image-based feature extractor. Through joint training with the clustering branch network and the positioning branch network in the model, the feature extractor acquires the ability to extract the corresponding image features, improving its feature identification and extraction efficiency.
It is emphasized that the present disclosure uses spliced images, rather than individual image blocks, as the input data for model training. If single image blocks were used directly as input, the training task could be solved purely through global information between images, which is a defect of existing contrastive learning methods; smaller image blocks would also increase the resolution gap between the pre-training task and downstream tasks, degrading overall performance; and enlarging single image blocks before input would increase the resources required by the pre-training task. Using spliced images as input avoids these disadvantages: 1. splicing only forms a new image batch of the same size as the original sample batch, so compared with the existing dual-batch method (contrastive learning must construct another batch of images with a different "view" during training and, in order to construct positive sample pairs, process each training batch together with its extended version at every step, hence the name dual-batch method), only half the training batches are consumed during training; 2. to complete the training task well, the network must learn detailed features within a spliced image, to distinguish the different image blocks in one image, and global features across spliced images, to find the image blocks belonging to the same original sample image.
According to the spliced-image-based feature extractor generation method, the sample image is divided into a preset number of image blocks, where adjacent image blocks in the same sample image comprise a common area and the image blocks are the same size; image blocks of different sample images are randomly combined to obtain a spliced image composed of the preset number of image blocks; and the feature extractor, the clustering branch network and the positioning branch network in the pre-constructed feature extractor training model are jointly trained based on the spliced image until the trained feature extractor meets the preset training condition, after which the trained feature extractor is taken as the spliced-image-based feature extractor. The clustering branch network aggregates image blocks belonging to the same sample image, and the positioning branch network determines the positions of those image blocks in the aggregated image. By dividing the sample image into mutually overlapping image blocks and splicing them, the method improves the utilization efficiency of sample images; training the pre-constructed model on spliced images with the clustering branch network and the positioning branch network improves the training effect and generation efficiency of the feature extractor.
In one embodiment, as shown in fig. 4, the step 21 of dividing the sample image into a preset number of image blocks includes:
and 42, segmenting the sample image according to the segmentation area of the image block to obtain a preset number of image blocks corresponding to the sample image.
The segmentation area of an image block can be determined according to the side length of the sample image. For example, if the sample image is a square with side length a and the segmentation side length of the image block is set to 0.6 of the sample image's side length, the segmentation side length of the image block is 0.6a and its segmentation area is 0.6a × 0.6a.
If the sample image is a rectangle with side lengths a and b, the segmentation area of the image block may be 0.6a × 0.6b. That is, the side length of the sample image and the segmentation side length of the image block are in a proportional relationship; the proportion can be set according to actual needs, and the proportions along different sides may differ.
For a square sample image divided with m = 2, a ratio greater than 0.5 ensures that a common area exists between the image blocks, which is why 0.6 is used in the above example. The shape of the sample image is not limited to a square or rectangle; any image that can be divided into image blocks may serve as a sample image.
Specifically, the server obtains the preset segmentation ratio between the sample image and the image block and determines the segmentation area of the image block according to this ratio. Once the segmentation area is determined, it is equivalent to obtaining a mask, and the preset number of segmentation extractions can be performed on the sample image; for example, the server performs four extractions from the four corners of the sample image according to the mask of the segmentation area, obtaining four image blocks, as shown in figs. 3-1 and 3-2.
In this step, the server determines the segmentation area of the image block according to the side length of the sample image and segments accordingly, obtaining image blocks that have the same area and share common areas with one another. The whole operation is simple, consumes few resources, and improves the utilization efficiency of the sample image.
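A minimal Python sketch of this segmentation, assuming a square input and the 0.6 ratio of the example above (the function name and tensor layout are assumptions):

```python
import torch

def split_into_blocks(image: torch.Tensor, m: int = 2, ratio: float = 0.6) -> torch.Tensor:
    """Divide a square image (C, a, a) into m*m equal, overlapping blocks.

    Each block has side length ratio*a and is cropped at evenly spaced
    offsets (the four corners when m = 2); ratio > 1/m guarantees a
    common area between adjacent blocks.
    """
    C, a, _ = image.shape
    s = int(ratio * a)                         # segmentation side length
    step = (a - s) // (m - 1) if m > 1 else 0  # offset between crop origins
    blocks = [image[:, r * step:r * step + s, c * step:c * step + s]
              for r in range(m) for c in range(m)]
    return torch.stack(blocks)                 # (m*m, C, s, s)
```

For instance, a 224 × 224 image with m = 2 yields four 134 × 134 crops whose adjacent pairs overlap by 44 pixels.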
In an embodiment, as shown in fig. 5, in step 22, randomly combining image blocks of different sample images to obtain a stitched image composed of a preset number of image blocks includes:
and 53, filling the target image blocks into the vacant positions of the spliced image template to obtain a spliced image.
Fig. 6 is a schematic diagram of a spliced image template containing the preset number (here 4) of vacancies. The spliced image template is generated according to the segmented image blocks, and the size and shape of each vacancy match an image block. The target image blocks are the image blocks selected from the pool of image blocks to fill the vacancies of the spliced image template; for example, if 4 image blocks are selected from 100 image blocks and filled into the spliced image template shown in fig. 6, those 4 selected image blocks are the target image blocks.
Specifically, the server generates a corresponding spliced image template according to the preset number and the size of the divided image blocks, randomly selects the preset number of image blocks one by one as target image blocks, and randomly fills them into the vacancies of the spliced image template to obtain a spliced image.
By generating the spliced image template, the server determines the carrier for the target image blocks, and the random selection of target image blocks increases the randomness of image block selection.
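For illustration, here is a sketch of the template-filling step, under the simplifying assumption that the m × m vacancies tile the template without overlap (names and layout are assumptions, not part of the original disclosure):

```python
import random
import torch

def fill_template(block_pool: list, m: int = 2) -> torch.Tensor:
    """Fill an m x m spliced-image template with randomly chosen blocks.

    block_pool: list of blocks of identical shape (C, s, s); in practice
    the m*m chosen target blocks should come from different sample
    images. Each vacancy of the template matches a block's size.
    """
    chosen = random.sample(block_pool, m * m)   # target image blocks
    C, s, _ = chosen[0].shape
    template = torch.zeros(C, m * s, m * s)     # template with m*m vacancies
    for idx, block in enumerate(chosen):
        r, c = divmod(idx, m)
        template[:, r * s:(r + 1) * s, c * s:(c + 1) * s] = block  # fill one vacancy
    return template
```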
In one embodiment, as shown in fig. 7, the step 23 of performing joint training on the feature extractor, the clustering branch network and the positioning branch network in the pre-constructed feature extractor training model includes:
and 73, training the feature extractor based on the clustering prediction result, the positioning prediction result, the loss function corresponding to the clustering branch network and the loss function corresponding to the positioning branch network.
Fig. 8 is a schematic diagram of the pre-constructed feature extractor training model. In the diagram, solid-line boxes represent neural networks, such as the feature extractor, the decoupling module, the clustering branch network and the positioning branch network, while dashed-line boxes represent features obtained after processing by these networks, such as the n extracted features and the n × m × m decoupled features. $\mathcal{L}_{clu}$ denotes the loss function corresponding to the clustering branch network, and $\mathcal{L}_{loc}$ denotes the loss function corresponding to the positioning branch network. The decoupling module decomposes each of the n features into m × m parts, yielding n × m × m features.
In particular, the feature extractor in the pre-constructed feature extractor training model may be a common network backbone (Backbone), such as ResNet. The decoupling module decomposes each of the n features extracted by the feature extractor into m × m parts, i.e. n × m × m features in total, corresponding to the different image blocks in a spliced image. The clustering branch network performs supervised clustering on the features of each image block and outputs a clustering prediction result for the image features; the positioning branch network locates each image block and obtains a positioning prediction result for the image features. Integrating the clustering prediction result and the positioning prediction result yields the predicted recovered sample image output by the model. A loss function compares the real sample image with the predicted recovered sample image and adjusts the parameters in the model, in particular the network parameters of the feature extractor, until the preset training condition is met, thereby training the feature extractor.
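To make the data flow of fig. 8 concrete, here is a minimal PyTorch sketch of the training model; the tiny stand-in backbone, layer widths and head shapes are assumptions (the disclosure only requires a common backbone such as ResNet, a decoupling step, a two-layer perceptron clustering head and a fully connected positioning head):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplicedImageModel(nn.Module):
    """Backbone -> decoupling -> clustering head + positioning head."""

    def __init__(self, m: int = 2, c: int = 128):
        super().__init__()
        self.m = m
        # Stand-in backbone; in practice a body such as ResNet is used.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, c, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Clustering branch: two-layer perceptron into a c-dim space.
        self.cluster_head = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, c))
        # Positioning branch: fully connected classifier over m*m positions.
        self.position_head = nn.Linear(c, m * m)

    def forward(self, x: torch.Tensor):
        # x: (n, 3, H, W) spliced images -> (n, c, H', W') feature maps.
        feats = self.backbone(x)
        # Simplified decoupling: pool to an m x m grid, one cell per block
        # (see the interpolation-based decoupling sketch further below).
        grid = F.adaptive_avg_pool2d(feats, self.m)                   # (n, c, m, m)
        blocks = grid.permute(0, 2, 3, 1).reshape(-1, grid.size(1))  # (n*m*m, c)
        return self.cluster_head(blocks), self.position_head(blocks)
```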
The task of the clustering branch network is a supervised clustering task; through the clustering branch network, the m × m image blocks belonging to the same sample image can be identified. The goal of the clustering branch network can be understood as pulling objects of the same class together and pushing objects of different classes apart, so cosine similarity can be used to measure the distance between objects. For two image blocks i and j from the same sample image, the loss function is:

$$\ell_{i,j} = -\log \frac{\exp(\cos(z_i, z_j)/\tau)}{\sum_{k=1}^{nm^2} \mathbb{1}_{[k \neq i]} \exp(\cos(z_i, z_k)/\tau)}$$

where log is the natural logarithm, exp(·) is the exponential function with base e, cos(·,·) is the cosine similarity, $z_i$ and $z_j$ are features of image blocks from the same sample image, τ is the temperature parameter, $\mathbb{1}_{[k \neq i]}$ is an indicator function, and the $z_k$ in the denominator range over all blocks, including image blocks from different sample images.
The final loss function combines all image blocks from the same sample image, i.e.

$$\mathcal{L}_{clu} = \sum_{i} \sum_{j,k \in C_i,\ j \neq k} \ell_{j,k}$$

where $\mathcal{L}_{clu}$ denotes the loss function of the clustering branch network and $C_i$ denotes the index set of image blocks whose cluster category is i.
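Under the reconstruction above, the clustering loss can be sketched as follows; the temperature value and the averaging over positive pairs are assumptions, since the disclosure does not fix them:

```python
import torch
import torch.nn.functional as F

def clustering_loss(z: torch.Tensor, source_ids: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Contrastive clustering loss over decoupled block features.

    z: (N, c) block features with N = n*m*m; source_ids: (N,) index of
    the sample image each block came from; tau: temperature parameter.
    Assumes each sample image contributes more than one block.
    """
    z = F.normalize(z, dim=1)                   # so dot products are cosines
    sim = z @ z.t() / tau                       # (N, N) scaled similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))        # drop the k = i term
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (source_ids.unsqueeze(0) == source_ids.unsqueeze(1)) & ~self_mask
    return -log_prob[pos].mean()                # average over same-image pairs
```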
The task of the positioning branch network is defined as a classification task, so its loss function can adopt the common cross-entropy loss, formally defined as:

$$\mathcal{L}_{loc} = \mathrm{CrossEntropy}(L, L_{gt})$$

where $\mathcal{L}_{loc}$ is the loss function of the positioning branch network, $L$ is the predicted position of an image block, and $L_{gt}$ is the true position of the image block.
In one embodiment, the training the feature extractor in step 73 based on the clustering prediction result, the positioning prediction result, the loss function corresponding to the clustering branch network, and the loss function corresponding to the positioning branch network includes: constructing a target loss function corresponding to a pre-constructed feature extractor training model according to the loss function corresponding to the clustering branch network and the loss function corresponding to the positioning branch network; the objective loss function is used to adjust parameters of the feature extractor based on the cluster prediction result and the location prediction result during the training process.
Specifically, the target loss function corresponding to the pre-constructed feature extractor training model is constructed as follows, and the goal of the training task is to optimize this target loss value:

$$\mathcal{L} = \alpha\,\mathcal{L}_{clu} + \beta\,\mathcal{L}_{loc}$$

where $\mathcal{L}$ denotes the target loss function, $\mathcal{L}_{clu}$ the loss function of the clustering branch network, $\mathcal{L}_{loc}$ the loss function of the positioning branch network, and α and β are hyper-parameters that balance the clustering branch network and the positioning branch network, respectively.
In one embodiment, the step 72 of respectively transferring the image features to the clustering branch network and the positioning branch network includes: decoupling the image features to obtain a preset number of target image features; and respectively transmitting the preset number of target image characteristics to the clustering branch network and the positioning branch network.
Further, in an embodiment, the decoupling processing is performed on the image features to obtain a preset number of target image features, and the method includes: carrying out interpolation processing on the image characteristics to obtain amplified image characteristics; and performing down-sampling processing on the amplified image features to obtain a preset number of target image features.
Specifically, as shown in fig. 8, the decoupling module first performs decoupling processing on the n image features output by the feature extractor, decouples them into n × m × m features, and transmits these to the clustering branch network and the positioning branch network respectively. The decoupling module may interpolate the features output by the feature extractor into new features whose side length is a multiple of m (m being the variable related to the preset number); performing an enlargement rather than a reduction avoids information loss.
For example, the feature extractor generates a feature map of size 7 × 7. When m = 2, bilinear interpolation is used to interpolate the feature map to a multiple of m, such as 8 × 8, thereby enlarging the feature map. Average pooling is then used to down-sample the features to n × m × m × c and decompose them into (n × m × m) c-dimensional vectors, where c denotes the feature dimension of a pixel. Each vector is mapped into a c-dimensional space through the two-layer perceptron of the clustering branch network, forming a set of vectors Z = {z_1, z_2, …, z_{n×m×m}} for the clustering branch network; and the fully connected layer of the positioning branch network, acting as a classifier, forms a set of vectors L = {l_1, l_2, …, l_{n×m×m}} for the positioning branch network.
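A sketch of the interpolation-then-pooling decoupling described in this embodiment (shapes follow the 7 × 7 to 8 × 8 example; the function name is an illustrative assumption):

```python
import math
import torch
import torch.nn.functional as F

def decouple(feats: torch.Tensor, m: int = 2) -> torch.Tensor:
    """Decouple backbone features into n*m*m per-block feature vectors.

    feats: (n, c, H, W) feature maps, e.g. 7x7 outputs. The maps are
    bilinearly enlarged to the next multiple of m (8x8 for 7x7, m = 2),
    average-pooled down to an m x m grid, and split into c-dim vectors.
    """
    n, c, H, W = feats.shape
    side = math.ceil(max(H, W) / m) * m                 # enlarge, never shrink
    up = F.interpolate(feats, size=(side, side), mode='bilinear',
                       align_corners=False)             # (n, c, side, side)
    grid = F.avg_pool2d(up, kernel_size=side // m)      # (n, c, m, m)
    return grid.permute(0, 2, 3, 1).reshape(n * m * m, c)
```

Each of the resulting n × m × m vectors is then passed through the clustering branch's two-layer perceptron to form Z and through the positioning branch's fully connected classifier to form L, as described above.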
In one embodiment, as shown in fig. 9, a flowchart of a feature extractor generation method based on a common spliced image is illustrated.
A common spliced image can be obtained as follows: divide the sample image into a preset number of image blocks, where adjacent image blocks in the same sample image do not comprise a common area and the image blocks are the same size; then randomly combine image blocks of different sample images to obtain a spliced image composed of the preset number of image blocks, i.e. a common spliced image. That is, a common spliced image contains no common areas between its image blocks.
In the schematic diagram, fig. 9-1 is the input original sample image, fig. 9-2 shows the original sample image divided into a preset number of image blocks, fig. 9-3 shows the preset number of image blocks being randomly combined, and fig. 9-4 shows the common spliced image generated by the random combination.
It can be seen that the model takes the common spliced image of fig. 9-4 as input, and the task of the model is to restore the common spliced image back to the sample image shown in fig. 9-1.
It should be understood that although the steps in the flowcharts of figs. 2, 4, 5 and 7 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 2, 4, 5 and 7 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and not necessarily in sequence; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 10, there is provided a feature extractor generating apparatus based on a stitched image, including an image segmentation module 101, an image stitching module 102, and a model training module 103, wherein:
an image segmentation module 101, configured to segment a sample image into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the sizes of the image blocks are the same;
the image splicing module 102 is configured to randomly combine image blocks of different sample images to obtain a spliced image composed of a preset number of image blocks;
the model training module 103 is used for performing combined training on a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model based on the stitched image, and taking the trained feature extractor as a feature extractor based on the stitched image when the trained feature extractor meets a preset training condition; the clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image.
In one embodiment, the image segmentation module 101 is further configured to determine a segmentation area of the image block according to a side length of the sample image; and segmenting the sample image according to the segmentation area of the image blocks to obtain a preset number of image blocks corresponding to the sample image.
In one embodiment, the image stitching module 102 is further configured to generate a stitched image template; the spliced image template comprises a preset number of vacant positions; randomly selecting a preset number of image blocks from the image blocks as target image blocks; the image blocks in the preset number are derived from different sample images; and filling the target image block into the vacancy of the spliced image template to obtain a spliced image.
In one embodiment, the model training module 103 is further configured to input the stitched image into a pre-constructed feature extractor training model, and extract image features of the stitched image through a feature extractor in the feature extractor training model; respectively transmitting the image features to a clustering branch network and a positioning branch network, so that the clustering branch network outputs clustering prediction results aiming at the image features, and the positioning branch network outputs positioning prediction results aiming at the image features; and training the feature extractor based on the clustering prediction result, the positioning prediction result, the loss function corresponding to the clustering branch network and the loss function corresponding to the positioning branch network.
In one embodiment, the model training module 103 is further configured to construct a target loss function corresponding to the pre-constructed feature extractor training model according to the loss function corresponding to the clustering branch network and the loss function corresponding to the positioning branch network; the objective loss function is used to adjust parameters of the feature extractor based on the cluster prediction result and the location prediction result during the training process.
In one embodiment, the model training module 103 is further configured to perform decoupling processing on the image features to obtain a preset number of target image features; and respectively transmitting the preset number of target image characteristics to the clustering branch network and the positioning branch network.
In one embodiment, the model training module 103 is further configured to perform interpolation processing on the image features to obtain amplified image features; and performing down-sampling processing on the amplified image features to obtain a preset number of target image features.
For specific limitations of the spliced-image-based feature extractor generation apparatus, reference may be made to the limitations of the spliced-image-based feature extractor generation method above, which are not repeated here. The modules in the apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or be independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store feature extractor generated data based on the stitched image. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for feature extractor generation based on stitched images.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method for generating a feature extractor based on a stitched image, the method comprising:
dividing a sample image into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the sizes of the image blocks are the same;
randomly combining the image blocks of different sample images to obtain a spliced image consisting of the image blocks of the preset number;
performing combined training on a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model based on the spliced image until the trained feature extractor meets a preset training condition, and taking the trained feature extractor as a feature extractor based on the spliced image; the clustering branch network is used for aggregating image blocks belonging to the same sample image; the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image.
2. The method of claim 1, wherein the dividing the sample image into a preset number of image blocks comprises:
determining the segmentation area of the image block according to the side length of the sample image;
and segmenting the sample image according to the segmentation area of the image blocks to obtain a preset number of image blocks corresponding to the sample image.
3. The method according to claim 1, wherein the randomly combining the image blocks of different sample images to obtain a stitched image composed of the preset number of image blocks comprises:
generating a spliced image template; the spliced image template comprises the preset number of vacant positions;
randomly selecting the preset number of image blocks from the image blocks as target image blocks; the image blocks in the preset number are derived from different sample images;
and filling the target image block into the vacancy of the spliced image template to obtain the spliced image.
4. The method of claim 1, wherein jointly training the feature extractors, the clustering branch networks and the positioning branch networks in the pre-constructed feature extractor training model comprises:
inputting the spliced image into a pre-constructed feature extractor training model, and extracting the image features of the spliced image through a feature extractor in the feature extractor training model;
respectively transmitting the image features to the clustering branch network and the positioning branch network, so that the clustering branch network outputs clustering prediction results aiming at the image features, and the positioning branch network outputs positioning prediction results aiming at the image features;
and training the feature extractor based on the clustering prediction result, the positioning prediction result, the loss function corresponding to the clustering branch network and the loss function corresponding to the positioning branch network.
5. The method of claim 4, wherein training the feature extractor based on the cluster prediction result, the location prediction result, the loss function corresponding to the cluster branch network, and the loss function corresponding to the location branch network comprises:
constructing a target loss function corresponding to the pre-constructed feature extractor training model according to the loss function corresponding to the clustering branch network and the loss function corresponding to the positioning branch network; the target loss function is used for adjusting parameters of the feature extractor based on the clustering prediction result and the positioning prediction result in a training process.
6. The method of claim 4, wherein said communicating said image features to said clustering branch network and said localization branch network, respectively, comprises:
decoupling the image features to obtain the preset number of target image features;
and respectively transmitting the preset number of target image features to the clustering branch network and the positioning branch network.
7. The method of claim 6, wherein the decoupling the image features to obtain the preset number of target image features comprises:
carrying out interpolation processing on the image characteristics to obtain amplified image characteristics;
and performing down-sampling processing on the amplified image features to obtain the preset number of target image features.
8. An apparatus for generating a feature extractor based on a stitched image, the apparatus comprising:
the image segmentation module is used for segmenting the sample image into a preset number of image blocks; adjacent image blocks in the same sample image comprise a common area, and the sizes of the image blocks are the same;
the image splicing module is used for randomly combining the image blocks of different sample images to obtain a spliced image consisting of the preset number of image blocks;
the model training module is used for jointly training a feature extractor, a clustering branch network and a positioning branch network in a pre-constructed feature extractor training model based on the spliced image until the trained feature extractor meets a preset training condition, and for taking the trained feature extractor as the feature extractor based on the spliced image; the clustering branch network is used for aggregating image blocks belonging to the same sample image; and the positioning branch network is used for determining the positions of the image blocks belonging to the same sample image in the aggregated image.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110419268.9A CN113112518B (en) | 2021-04-19 | 2021-04-19 | Feature extractor generation method and device based on spliced image and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113112518A true CN113112518A (en) | 2021-07-13 |
CN113112518B CN113112518B (en) | 2024-03-26 |
Family
ID=76718431
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110419268.9A Active CN113112518B (en) | 2021-04-19 | 2021-04-19 | Feature extractor generation method and device based on spliced image and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113112518B (en) |
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160328841A1 (en) * | 2015-05-08 | 2016-11-10 | Siemens Aktiengesellschaft | Region Clustering Forest for Analyzing Medical Imaging Data |
US20200402223A1 (en) * | 2019-06-24 | 2020-12-24 | Insurance Services Office, Inc. | Machine Learning Systems and Methods for Improved Localization of Image Forgery |
CN110310228A (en) * | 2019-06-28 | 2019-10-08 | 福建师范大学 | It is a kind of based on the human face super-resolution processing method expressed of closure link data and system again |
WO2021027142A1 (en) * | 2019-08-14 | 2021-02-18 | 平安科技(深圳)有限公司 | Picture classification model training method and system, and computer device |
CN110675412A (en) * | 2019-09-27 | 2020-01-10 | 腾讯科技(深圳)有限公司 | Image segmentation method, training method, device and equipment of image segmentation model |
CN110717905A (en) * | 2019-09-30 | 2020-01-21 | 上海联影智能医疗科技有限公司 | Brain image detection method, computer device, and storage medium |
CN111222559A (en) * | 2019-12-31 | 2020-06-02 | 深圳大学 | Training method of principal component analysis network for classifying small sample images |
CN111325271A (en) * | 2020-02-18 | 2020-06-23 | Oppo广东移动通信有限公司 | Image classification method and device |
CN111415318A (en) * | 2020-03-20 | 2020-07-14 | 山东大学 | Unsupervised correlation filtering target tracking method and system based on jigsaw task |
CN111738270A (en) * | 2020-08-26 | 2020-10-02 | 北京易真学思教育科技有限公司 | Model generation method, device, equipment and readable storage medium |
CN112396613A (en) * | 2020-11-17 | 2021-02-23 | 平安科技(深圳)有限公司 | Image segmentation method and device, computer equipment and storage medium |
CN112418292A (en) * | 2020-11-17 | 2021-02-26 | 平安科技(深圳)有限公司 | Image quality evaluation method and device, computer equipment and storage medium |
CN112465700A (en) * | 2020-11-26 | 2021-03-09 | 北京航空航天大学 | Image splicing positioning device and method based on depth clustering |
Non-Patent Citations (5)
Title |
---|
MEHDI NOROOZI et al.: "Unsupervised learning of visual representations by solving jigsaw puzzles", Computer Vision and Pattern Recognition, vol. 9910, pages 69-84 *
TAEG SANG CHO et al.: "A probabilistic image jigsaw puzzle solver", IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 183-190 *
WEI YAO et al.: "Application and evaluation of a hierarchical patch clustering method for remote sensing images", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 6, pages 2279-2289, XP011615625, DOI: 10.1109/JSTARS.2016.2536143 *
ZHANG DEPENG: "Research on passive forensic methods for image splicing/compositing based on noise features", China Master's Theses Full-text Database (Information Science and Technology), no. 8, pages 138-706 *
LI SIRAN: "Research on blind forensics of homologous splicing tampering in digital images", China Master's Theses Full-text Database (Information Science and Technology), no. 9, pages 138-1426 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114119363A (en) * | 2021-10-15 | 2022-03-01 | 北京百度网讯科技有限公司 | Data augmentation method, data augmentation device, electronic device, and storage medium |
CN114648814A (en) * | 2022-02-25 | 2022-06-21 | 北京百度网讯科技有限公司 | Face living body detection method, training method, device, equipment and medium of model |
CN114511448A (en) * | 2022-04-19 | 2022-05-17 | 深圳思谋信息科技有限公司 | Method, device, equipment and medium for splicing images |
CN114511448B (en) * | 2022-04-19 | 2022-07-26 | 深圳思谋信息科技有限公司 | Method, device, equipment and medium for splicing images |
CN114626991A (en) * | 2022-05-13 | 2022-06-14 | 深圳思谋信息科技有限公司 | Image stitching method, device, equipment, medium and computer program product |
Also Published As
Publication number | Publication date |
---|---|
CN113112518B (en) | 2024-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10936911B2 (en) | Logo detection | |
CN113112518B (en) | Feature extractor generation method and device based on spliced image and computer equipment | |
US12062249B2 (en) | System and method for generating image landmarks | |
CN111476806B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN111429460B (en) | Image segmentation method, image segmentation model training method, device and storage medium | |
EP3716198A1 (en) | Image reconstruction method and device | |
EP4404148A1 (en) | Image processing method and apparatus, and computer-readable storage medium | |
WO2016054779A1 (en) | Spatial pyramid pooling networks for image processing | |
JP2020513124A (en) | Image analysis apparatus and method using virtual three-dimensional deep neural network | |
CN110276411A (en) | Image classification method, device, equipment, storage medium and medical treatment electronic equipment | |
KR20210074360A (en) | Image processing method, device and apparatus, and storage medium | |
CN114092833B (en) | Remote sensing image classification method and device, computer equipment and storage medium | |
CN110023989B (en) | Sketch image generation method and device | |
CN111783506B (en) | Method, apparatus and computer readable storage medium for determining target characteristics | |
CN112801875B (en) | Super-resolution reconstruction method and device, computer equipment and storage medium | |
CN111754396A (en) | Face image processing method and device, computer equipment and storage medium | |
EP3905194A1 (en) | Pose estimation method and apparatus | |
CN110598715A (en) | Image recognition method and device, computer equipment and readable storage medium | |
CN110222718A (en) | The method and device of image procossing | |
CN111783779A (en) | Image processing method, apparatus and computer-readable storage medium | |
US20230153965A1 (en) | Image processing method and related device | |
Wang et al. | A fully 3D cascaded framework for pancreas segmentation | |
CN115861248A (en) | Medical image segmentation method, medical model training method, medical image segmentation device and storage medium | |
CN114764870A (en) | Object positioning model processing method, object positioning device and computer equipment | |
CN111914809B (en) | Target object positioning method, image processing method, device and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB03 | Change of inventor or designer information | Inventor after: Chen Pengguang; Liu Shu; Shen Xiaoyong; Lv Jiangbo. Inventor before: Chen Pengguang; Liu Shu; Jia Jiaya; Shen Xiaoyong; Lv Jiangbo. |
GR01 | Patent grant | |