CN116977678A - Image processing method, image processing apparatus, electronic device, storage medium, and program product


Info

Publication number
CN116977678A
Authority
CN
China
Prior art keywords
target
model
counting
image
density map
Prior art date
Legal status
Pending
Application number
CN202211651467.3A
Other languages
Chinese (zh)
Inventor
王昌安
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211651467.3A
Publication of CN116977678A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/7753: Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image


Abstract

Embodiments of the present application provide an image processing method, an image processing apparatus, an electronic device, a storage medium, and a program product, relate to the field of artificial intelligence, and can be applied to scenes in which a target counting task is performed. The method comprises the following steps: inputting a target image into a pre-trained target counting model to obtain a target density map corresponding to the target object, and determining a total count value of the target objects included in the target image based on the target density map. Training of the target counting model includes: after clustering training is performed on the model with unlabeled first sample images, labeled second sample images are input into the model, and the model is updated based on the prediction category determined by the model and the true category of each second sample image; the true category is determined based on the count value of the second sample image. Because the target counting model can be cluster-trained on unlabeled data, the cost of labeling large amounts of data is reduced, which helps improve the generalization ability of the counting algorithm.

Description

Image processing method, image processing apparatus, electronic device, storage medium, and program product
Technical Field
The present application relates to the field of artificial intelligence technology and image processing technology, and in particular, to an image processing method, apparatus, electronic device, storage medium, and program product.
Background
A target counting task automatically estimates the total number of targets in an image. Most existing target counting algorithms combine density map regression with deep learning for end-to-end training and inference, which handles wide target density distributions and large target scale variations well and improves counting accuracy to a certain extent. However, deep learning-based methods usually rely on a large amount of labeled data for supervised training, which makes training a target counting model for large-scale open scenes very challenging: a large amount of data must be labeled, labeling cost is high, and the advantages of the model are therefore difficult to exploit fully. In addition, for target counting annotation, targets are densely distributed and the distances between them are small, which makes labeling even more costly.
Disclosure of Invention
Embodiments of the present application provide an image processing method, apparatus, electronic device, storage medium, and program product for solving at least one technical problem described above. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides an image processing method, including:
inputting a target image into a pre-trained target counting model, obtaining a target density map corresponding to a target object through the target counting model, and determining a total counting value of the target object included in the target image based on the target density map; the target density map indicates count values of the unit pixels corresponding to the target object at corresponding positions;
The training step of the target counting model comprises the following steps: clustering training is performed on the model within a preset count interval range by using unlabeled first sample images; the preset count interval range indicates a plurality of preset categories and the proxy count value of each category;
inputting a labeled second sample image into the model, and updating the model based on the prediction category determined by the model and the true category of the second sample image; the true category is determined by a proxy count value determined based on the second sample image.
In a possible embodiment, the obtaining, by the target counting model, a target density map corresponding to the target object includes:
generating a corresponding first response graph aiming at each target center point corresponding to a target object in the target image;
adding all the first response graphs to obtain a target response graph;
and carrying out convolution operation on the target response graph through normalized Gaussian kernel to obtain a target density graph.
In a possible embodiment, the determining, based on the target density map, a total count value of the target object included in the target image includes at least one of:
Integral calculation is performed on the target density map to obtain a total count value corresponding to the target object;
determining a predicted object class corresponding to each sub-density map in the target density map, determining an agent count value of the class corresponding to the predicted object class in a preset counting interval range as a sub-count value of the sub-density map, and summing the sub-count values of all the sub-density maps to obtain a total count value of the target object in the target image; the sub-density map corresponds to an image block obtained by dividing the target image by the target counting model.
In a possible embodiment, the clustering training of the model using the unlabeled first sample image includes:
acquiring a first counting model obtained by training based on a third sample image with labels;
and executing the following iterative operation on the first counting model until a preset iterative stopping condition is reached:
clustering training is carried out on the first counting model by adopting a first sample image without labels;
and updating the network parameters of the first counting model based on the posterior probability of the clustering result to obtain a second counting model.
In a possible embodiment, the clustering training of the first count model using the unlabeled first sample image includes:
Dividing the first sample image into at least two image blocks based on a preset dividing mode, and executing the following operation for each image block in the first sample image so as to enable the clustering center to converge:
extracting characteristic information of a current image block;
determining feature embedding of the current image block based on the feature information and coordinate information of the current image block in the first sample image;
determining a cluster to which the current image block belongs based on the feature embedding;
based on the clustering result of the current image block, updating the clustering center of the cluster.
In a possible embodiment, the updating the network parameters of the first counting model based on the posterior probability of the clustering result to obtain a second counting model includes:
determining posterior probability of each image block belonging to the current cluster;
determining a log-likelihood value for each image block based on the posterior probability;
maximizing posterior probability of each image block belonging to the current cluster in a gradient descending manner based on log likelihood values of all image blocks of the current first sample image;
and updating the network parameters of the first counting model based on the maximized posterior probability to obtain a second counting model.
In a possible embodiment, the inputting the second sample image with the label into the model, and updating the model based on the predicted category label determined by the model and the real category label of the second sample image includes:
inputting a second sample image into the second counting model, obtaining a prediction density map corresponding to the second sample image through the second counting model, and a prediction category corresponding to each patch in the prediction density map, wherein the patch corresponds to an image block obtained by dividing the second sample image by the second counting model;
respectively determining the count value of each patch, and determining the real type of the corresponding patch based on the count value according to the preset count interval range;
determining a loss value based on the true category and the predicted category;
and updating network parameters of the second counting model based on the loss value to obtain a target counting model.
In a possible embodiment, the determining the count value of each patch separately, and determining the real class of the corresponding patch based on the count value according to the preset count interval range includes:
determining the average value of the real count values of patches belonging to the same category, and determining the average value as the proxy count value of the corresponding category;
For each patch, determining, as the true category of the patch, the category whose count interval contains the count value of the patch.
In a possible embodiment, the network structure of the target counting model includes at least two downsampling layers and at least two upsampling layers, a pooling layer is arranged between the downsampling layers, and a jump link is arranged between the upsampling layers and the downsampling layers;
and the target counting model outputs a target density map through the last upsampling layer.
In a second aspect, an embodiment of the present application provides an image processing apparatus including:
the counting module is used for inputting the target image into a pre-trained target counting model, obtaining a target density map corresponding to a target object through the target counting model, and determining a total counting value of the target object included in the target image based on the target density map; the target density map indicates count values of the unit pixels corresponding to the target object at corresponding positions;
the training step of the target counting model comprises the following steps: clustering training is performed on the model within a preset count interval range by using unlabeled first sample images; the preset count interval range indicates a plurality of preset categories and the proxy count value of each category;
Inputting a labeled second sample image into the model, and updating the model based on the prediction category determined by the model and the true category of the second sample image; the true category is determined by a proxy count value determined based on the second sample image.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the image processing method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image processing method provided in the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the image processing method provided in the first aspect.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
the embodiments of the present application provide an image processing method. When a target counting task is performed, an acquired target image can be input into a pre-trained target counting model, and a target density map corresponding to the target object is obtained through the model; the target density map indicates, at each position, the count value per unit pixel of the target object, so the total count value of the target objects included in the target image can be determined from it. The target counting model can be cluster-trained directly on unlabeled sample images within a preset count interval range, and then corrected based on labeled sample images. The preset count interval range indicates a plurality of preset categories and the proxy count value of each category, and the true category of a labeled sample image can be determined from its count value; on this basis, the model can be updated from the prediction category it outputs and the true category of the sample image. By decoupling model parameter learning from proxy count value acquisition, feature learning is performed on a large amount of unlabeled data so that image blocks with similar count values are grouped into the same cluster in feature space, and the proxy count value of each category is then corrected based on the count value classification of labeled sample images. This reduces the cost of labeling large amounts of data and helps improve the generalization ability of the counting algorithm.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an operation environment of an image processing method according to an embodiment of the present application;
FIG. 3 is a schematic labeling diagram of a target counting task according to an embodiment of the present application;
FIG. 4 is a schematic image of an input object count model according to an embodiment of the present application;
FIG. 5 is a true density map corresponding to FIG. 4;
FIG. 6 is a density map predicted by the target count model corresponding to FIG. 4;
FIG. 7 is a schematic diagram of a target counting model according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating robustness comparison according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and do not limit those technical solutions.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present; further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" indicates at least one of the items it defines; for example, "A and/or B" may be implemented as "A", as "B", or as "A and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Embodiments of the present application relate to artificial intelligence (AI). Artificial intelligence is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes directions such as computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
The image processing method provided by the embodiments of the present application particularly relates to machine learning (ML). Machine learning is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Specifically, the present application performs unsupervised feature learning on a large amount of unlabeled data, discovers image blocks with different target densities by means of feature clustering, and achieves accurate target counting by correcting the model with a small amount of labeled data.
The image processing method provided by the embodiment of the application can be applied to a scene for executing the target counting task.
The technical solutions of the embodiments of the present application and the technical effects they produce are described below through several exemplary embodiments. It should be noted that the following embodiments may reference or be combined with each other, and the same terms, similar features, and similar implementation steps in different embodiments are not described repeatedly.
Fig. 2 is a schematic diagram of an operation environment of an image processing method according to an embodiment of the present application, where the environment may include a terminal 20 and a server 10.
Wherein the terminal 20 may run a client or a service platform. Terminals (which may also be referred to as devices) may be, but are not limited to, smartphones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices (e.g., smart speakers), wearable electronic devices (e.g., smart watches), vehicle terminals, smart appliances (e.g., smart televisions), AR/VR devices, and the like. Alternatively, the terminal 20 may perform the image processing method provided by the embodiment of the present application.
The server 10 may be an independent physical server, a server cluster or a distributed system (such as a distributed cloud storage system) formed by a plurality of physical servers, or a cloud server that provides cloud computing services. Alternatively, the server 10 may perform the image processing method provided by the embodiment of the present application.
In a possible embodiment, the terminal 20 and the server 10 may be directly or indirectly connected through wired or wireless communication, which is not limited herein; for example, the terminal 20 uploads the target image to the server 10 via the network 30.
In a possible embodiment, the operating environment may also include a database that may be used to store data that the server 10 needs to store during image processing.
The following describes an image processing method provided in an embodiment of the present application.
Specifically, the image processing method provided by the embodiments of the present application can be applied to scenes in which a target counting task is performed. As shown in fig. 3, if the number of puppies included in the figure needs to be estimated, the target counting task can be completed by labeling each puppy (such as with the triangle markers in fig. 3).
As shown in fig. 1, the image processing method provided by the embodiment of the present application includes the following step S101:
step S101: and inputting the target image into a pre-trained target counting model, obtaining a target density map corresponding to the target object through the target counting model, and determining the total count value of the target object included in the target image based on the target density map.
The target image may be a single image input by the operation object, or a frame of a video. The target image may be acquired actively or passively. For a video content selection scene, the corresponding image frame can be actively captured from a continuously recorded video at a preset interval as the target image, so as to estimate the number of target objects in the picture at a certain time; for a biological experiment, an image observed under a monocular microscope can be used as the target image to determine how many target cells it includes.
Wherein the target density map indicates count values of the unit pixels corresponding to the target object at the corresponding positions. It will be appreciated that the density map, i.e., a density-distribution heat map, reflects the average number of targets per unit pixel at the corresponding position in an actual scene.
The target counting task is completed through the pre-trained target counting model. It can be understood that, to suit different application scenes, different sample data can be used for model training, so that the model obtains better generalization ability while improving counting accuracy. For example, for a cell counting scene, sample data of different types of cells can be used for training, so that the model can accurately estimate the number of target cells while distinguishing different types of cells.
The target counting model may be a target counting algorithm based on deep learning: for an input target image, image features are extracted through a deep convolutional network. Because the target counting task needs both context features carrying high-level semantic information and local detail information, a U-shaped network structure that downsamples before upsampling, as shown in fig. 7, can be adopted to obtain a high-resolution feature map with both; jump links are introduced to supply detail information to the upsampling path, and the network finally outputs the predicted target density map (which may also be called a heat map). The network structure adopted by the target counting model is described in detail in the following embodiments.
Optionally, the model may output the target density map together with the estimated total count value (target number) of the target objects included in the target image, or it may output only the target density map, with the target number then determined from the density map by other algorithms or software programs.
The training of the target counting model comprises the following steps A1-A2:
step A1: clustering training is carried out on the model in a preset counting interval range by adopting a first sample image without labels; the preset counting interval range indicates a plurality of preset categories and the proxy counting value of each category.
Step A2: inputting a labeled second sample image into the model, and updating the model based on the prediction category determined by the model and the true category of the second sample image; the true category is determined by a proxy count value determined based on the second sample image.
The present application first performs unsupervised training on the model with a large number of unlabeled sample images so that it acquires the ability to classify image blocks, and then acquires the proxy count value of each category based on a small amount of labeled data; decoupling these two training stages enables effective use of unlabeled data.
When performing clustering training with unlabeled sample images, the number of clusters needs to be determined first; that is, the full range of count values is divided into intervals. For example, the range can be divided into 25 count intervals, that is, 25 categories including the image background (this number of divisions is only an example and does not limit the application); the count value range can be divided according to the specific requirements of different application scenes. Specifically, the count values are split, from small to large, into contiguous intervals, and a different category is assigned to each interval into which a count value can fall; image blocks whose local count values fall into the same interval are assigned the same category.
In the present application, for any patch on the density map of a labeled sample image, the density values of its pixels can be summed to obtain the total number of targets in the patch, denoted d_i; then, according to the set count interval range, the category c_i into which d_i falls can be determined as the true category label of this patch.
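The following Python sketch illustrates this interval binning; the number of categories, the interval boundaries, and the initial proxy values are illustrative assumptions, not values fixed by the present application.

```python
import numpy as np

# Illustrative interval edges; the application fixes neither the number of
# categories nor the boundaries, only that the count range is split into
# contiguous intervals (class 0 here is the image background).
NUM_CLASSES = 25
EDGES = np.geomspace(0.5, 100.0, NUM_CLASSES - 1)   # hypothetical boundaries

def count_to_class(d_i: float) -> int:
    """Map a patch's local count d_i to the interval category c_i it falls into."""
    return int(np.searchsorted(EDGES, d_i))

# Initial proxy count value per category (interval midpoints here); these are
# later re-estimated from labeled patches as the per-category mean true count.
PROXIES = np.concatenate([[0.0], (EDGES[:-1] + EDGES[1:]) / 2, [EDGES[-1]]])
```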
Alternatively, the training operations involved in steps A1-A2 may be repeated in a loop until the accuracy of the trained model on the validation set converges.
The network structure of the object counting model adopted in the embodiment of the present application is described below.
Specifically, the network structure of the target counting model comprises at least two downsampling layers and at least two upsampling layers, a pooling layer is arranged between the downsampling layers, and jump links are arranged between the upsampling layers and the downsampling layers; and the target counting model outputs a target density map through the last upsampling layer.
In a possible example, a U-shaped network structure as shown in fig. 7 may be adopted. The downsampling side may include a plurality of convolution blocks (such as convolution block 1, convolution block 2, ..., convolution block N in fig. 7), and a front-end network such as VGG16 may be used, in which the numbers of convolution layers inside the consecutive convolution blocks are 2, 2, 3, 3, and 3, respectively; the number of convolution channels within the same block is uniform, being 64, 128, 256, 512, and 512 for the respective blocks of VGG16. In the downsampling part, a pooling layer is arranged between the convolution blocks (for example, pooling layer 1 between convolution blocks 1 and 2); pooling operations (for example, max pooling) realize spatial downsampling, which enlarges the network receptive field and provides robustness to local translations and deformations. On the upsampling side, convolution layers can be deployed to mirror the convolution blocks of the downsampling side, that is, N convolution layers can likewise be deployed. In addition, jump links are introduced between downsampling and upsampling to supply more detailed feature information to the upsampling path; for example, the feature information output by convolution block 1 can be added, via a jump link, to the feature information output by the penultimate convolution layer N-1 on the upsampling side, and the combined features serve as the input of the last convolution layer N on the upsampling side. Optionally, the upsampling side may use bilinear upsampling.
Optionally, the backbone network adopted in the target counting model of the embodiment of the present application is not limited to a U-shaped network structure, and any network structure capable of extracting a high-resolution feature map with strong semantic information may be used as the backbone network adopted in the present application, such as VGG 19.
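As a concrete illustration, the following PyTorch sketch assembles such a U-shaped counting network; the block depths and channel widths follow the VGG16 configuration described above, but the exact placement of the skip additions, the decoder channel counts, and the one-channel density head are assumptions rather than the exact network of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vgg_block(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class UShapedCounter(nn.Module):
    """VGG16-style encoder (blocks of 2/2/3/3/3 convs, channels 64..512) with
    max-pool downsampling, bilinear upsampling, and additive jump links."""
    def __init__(self):
        super().__init__()
        cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3), (512, 512, 3)]
        self.enc = nn.ModuleList([vgg_block(i, o, n) for i, o, n in cfg])
        self.pool = nn.MaxPool2d(2)
        # Decoder convs mirror the encoder; channel counts are assumed.
        self.dec = nn.ModuleList([nn.Conv2d(i, o, 3, padding=1)
                                  for i, o in [(512, 256), (256, 128), (128, 64), (64, 64)]])
        self.head = nn.Conv2d(64, 1, 1)   # 1-channel density map output

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)           # keep detail features for the jump links
                x = self.pool(x)          # spatial downsampling
        for conv, skip in zip(self.dec, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = F.relu(conv(x + skip))    # jump link: add encoder detail, then convolve
        return F.relu(self.head(x))       # non-negative predicted density map
```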
The following describes a specific process of on-line prediction of the object count model employed in the embodiment of the present application.
In a possible embodiment, the target density map corresponding to the target object is obtained in step S101 through the target counting model, including the following steps S101a-S101c:
step S101a: and generating a corresponding first response graph aiming at each target center point corresponding to the target object in the target image.
Step S101b: and adding all the first response graphs to obtain a target response graph.
Step S101c: and carrying out convolution operation on the target response graph through normalized Gaussian kernel to obtain a target density graph.
Specifically, in the process of generating the target density map, consider N target center points x_1, ..., x_N of the target object to be estimated in the target image. For each target center point x_i, a two-dimensional first response map H_i can be generated, in which the pixel value at the target center position is set to 1 and all other positions are set to 0; N first response maps are thus obtained. The first response maps H_i of all target center points are then added to obtain the target response map H of the original target image; it can be understood that the integral of H equals the total number (total count value) of target objects in the target image. Then, for each target center point, its contribution to the density of surrounding pixels can be assumed to decay as a Gaussian function, so a normalized Gaussian kernel G_σ can be convolved with the response map H to obtain the target density map D.
In a possible embodiment, determining, based on the target density map, the total count value of the target object included in the target image in step S101 includes at least one of the following steps S101d-S101e:
step S101d: and carrying out integral calculation on the target density map to obtain a total count value corresponding to the target object.
Since the Gaussian kernel is normalized, integrating the target density map D obtained after convolution yields the total count value corresponding to the target object. The task of the target counting model is to predict the target density map through the network, and the predicted density map is then integrated to obtain the predicted total count value.
Step S101e: determining a predicted object class corresponding to each sub-density map in the target density map, determining an agent count value of the class corresponding to the predicted object class in a preset counting interval range as a sub-count value of the sub-density map, and summing the sub-count values of all the sub-density maps to obtain a total count value of the target object in the target image; the sub-density map corresponds to an image block obtained by dividing the target image by the target counting model.
When the target counting model performs count prediction, as in steps S101a-S101c, the target density map may be formed by adding together a plurality of response maps; in processing, the target counting model may divide the target image and perform count estimation on each of the divided image blocks. The target density map may therefore include a plurality of sub-density maps. For each sub-density map, the target counting model trained in the embodiments of the present application predicts a category, and the proxy count value of the count interval corresponding to that category serves as the predicted target number (sub-count value) of the sub-density map; finally, the predicted target numbers of the sub-density maps are summed to obtain the total number (total count value) of target objects in the target image.
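Both ways of reading a total count out of the prediction can be sketched as follows; `proxies` is the hypothetical category-to-proxy-value mapping from the interval sketch above.

```python
import numpy as np

def total_count_by_integration(density_map: np.ndarray) -> float:
    """S101d: the Gaussian kernel is normalized, so integrating (summing)
    the predicted density map yields the total count value directly."""
    return float(density_map.sum())

def total_count_by_classification(patch_classes, proxies) -> float:
    """S101e: sum the proxy count value of the category predicted for each
    sub-density map (patch); names here are assumptions for illustration."""
    return float(sum(proxies[c] for c in patch_classes))
```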
The following describes a training process of the object counting model provided by the embodiment of the present application.
In a possible embodiment, the clustering training performed on the model in step A1 using the unlabeled first sample image includes the following steps A11-A12:
step A11: a first count model trained based on the annotated third sample image is acquired.
The initial network can be initialized and trained with a small number of labeled sample images, so that the model acquires a basic target counting capability.
Alternatively, the first counting model may also be a pre-trained network, such as a network that has been task-trained on general-purpose data.
Step A12: performing the iterative operation of the following steps B1-B2 on the first counting model until a preset iteration stop condition is reached:
step B1: and clustering training is carried out on the first counting model by adopting a first sample image without labels.
Based on the first counting model, unsupervised training can be performed with a large amount of unlabeled data within the preset count interval range.
Optionally, the unlabeled first sample images may be obtained from related datasets or from network resources. For example, for a cell counting scene, a large number of cell-related images or videos can be collected from the network as unlabeled sample images, and cell-related image data from historical experimental data can also be used as unlabeled sample images.
Optionally, in step B1, clustering training is performed on the first counting model using the unlabeled first sample image, including step B11: dividing the first sample image into at least two image blocks based on a preset dividing mode, and performing the following steps B111-B114 for each image block in the first sample image, so that the cluster centers converge:
step B111: and extracting the characteristic information of the current image block.
Step B112: and determining feature embedding of the current image block based on the feature information and coordinate information of the current image block in the first sample image.
Step B113: and determining the cluster to which the current image block belongs based on the feature embedding.
Step B114: based on the clustering result of the current image block, updating the clustering center of the cluster.
Based on the first counting model obtained in step A11, feature information φ(i) is extracted for each image block of the large-scale unlabeled data, and the feature embedding e_i of the current image block is obtained by concatenating φ(i) with the coordinates of the current image block (image blocks belonging to the same image use the same coordinate system to determine the coordinate information). All image blocks are then clustered based on the expectation-maximization (EM) algorithm. Specifically, assume there are S clusters in total, R_s denotes the set of image blocks in cluster s, and μ_s is the center of cluster s. In the expectation step, the cluster z_i to which the current image block i belongs is computed from its embedding e_i as z_i = argmax_s μ′_s e_i; in the maximization step, the center of that cluster is then updated from the embeddings of its member image blocks. The EM steps are repeated until all image blocks have been traversed and the cluster centers converge stably.
In the expectation-maximization algorithm, the expectation step (E-step) is computed first, estimating the hidden variables from the current parameter estimates; the maximization step (M-step) then maximizes the likelihood obtained in the E-step to compute new parameter values. The parameter estimates found in the M-step are used in the next E-step, and the two steps alternate until the cluster centers converge.
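A NumPy sketch of this E-step/M-step loop over image-block embeddings follows; the random initialization, the unit-normalized centers, and the mean-embedding center update are assumptions consistent with the dot-product assignment rule z_i = argmax_s μ′_s e_i.

```python
import numpy as np

def embed(feature: np.ndarray, coord: np.ndarray) -> np.ndarray:
    """Feature embedding e_i: concatenate the block feature phi(i) with the
    block's coordinates in its source image."""
    return np.concatenate([feature, coord])

def em_cluster(E: np.ndarray, num_clusters: int, iters: int = 100):
    """Cluster image-block embeddings E (one row per block) by alternating
    E-steps (z_i = argmax_s mu_s' e_i) and M-steps (recompute mu_s as the
    normalized mean of the members of cluster s) until the centers converge."""
    rng = np.random.default_rng(0)
    mu = E[rng.choice(len(E), num_clusters, replace=False)]
    mu = mu / np.linalg.norm(mu, axis=1, keepdims=True)
    for _ in range(iters):
        z = np.argmax(E @ mu.T, axis=1)                 # E-step: assignments
        new_mu = mu.copy()
        for s in range(num_clusters):                   # M-step: update centers
            members = E[z == s]
            if len(members):
                m = members.mean(axis=0)
                new_mu[s] = m / (np.linalg.norm(m) + 1e-12)
        if np.allclose(new_mu, mu):                     # stable convergence
            break
        mu = new_mu
    return z, mu
```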
Step B2: and updating the network parameters of the first counting model based on the posterior probability of the clustering result to obtain a second counting model.
After the clustering process of steps B111-B114 is completed, the prediction result of the current model (the first counting model) for each image block of each image of the unsupervised data is available. Since a model trained on a small number of samples does not yet predict well, there may be cases where the count value of some image blocks in a cluster is far from that of the other image blocks in the same cluster, or where image blocks in different clusters have similar count values. It is therefore necessary to maximize, based on the current clustering result, the posterior probability that each image block belongs to its own cluster, and to update the parameters of the first counting model accordingly, so that a large amount of unsupervised data can guide the optimization of the model features.
Optionally, updating the network parameters of the first counting model based on the posterior probability of the clustering result in step B2 to obtain a second counting model, including steps B21-B24:
step B21: and determining the posterior probability that each image block belongs to the current cluster.
Step B22: a log-likelihood value for each image block is determined based on the posterior probability.
Step B23: based on the log likelihood values of all image blocks of the current first sample image, the posterior probability of each image block belonging to the current cluster is maximized in a gradient descent mode.
Step B24: and updating the network parameters of the first counting model based on the maximized posterior probability to obtain a second counting model.
Specifically, for each image block i with feature embedding e_i, the posterior probability p that it belongs to cluster s can be calculated by the following formula (1), a softmax over the similarities to all cluster centers:

p(z_i = s | e_i) = exp(μ′_s e_i) / Σ_t exp(μ′_t e_i)    (1)

Then, on the basis of the posterior probability p, the log-likelihood of the cluster z_i to which image block i currently belongs can be calculated by the following formula (2):

ℓ_i = log p(z_i | e_i)    (2)
Finally, the log-likelihood values of all image blocks of the current sample image are summed, and the sum is maximized by gradient descent.
Through the above operations, the embodiments of the present application optimize the negative log-likelihood, and the first counting model performs feature learning on the unsupervised data; the learned features keep the count values of image blocks within the same cluster as consistent as possible, while the local count values of image blocks in different clusters differ as much as possible.
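Under the softmax posterior of formula (1), the optimization of steps B21-B24 can be sketched in PyTorch as a standard negative log-likelihood loss; treating the cluster centers as fixed targets during the gradient step is an assumption.

```python
import torch

def cluster_nll_loss(e: torch.Tensor, mu: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """e: (N, D) image-block embeddings from the first counting model;
    mu: (S, D) cluster centers; z: (N,) current cluster assignments.
    Returns the mean negative log posterior -log p(z_i | e_i); minimizing it
    by gradient descent maximizes the posterior of each block's own cluster."""
    logits = e @ mu.t()                                # mu_s' e_i for every s
    log_p = torch.log_softmax(logits, dim=1)           # formula (1), in log form
    return -log_p[torch.arange(len(e)), z].mean()      # mean of -formula (2)
```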
In one example, other factors may be combined when maximizing the posterior probability; for example, if the image features of two image blocks are very similar but the blocks belong to different clusters, the posterior probability that the current image block belongs to its own cluster, or to the cluster of the other image block, may also be maximized.
Optionally, the clustering operation and the maximization of the posterior probability may be performed repeatedly until the purity inside the clusters reaches the expectation (the preset iteration stop condition is reached), finally yielding the second counting model.
In a possible embodiment, inputting the labeled second sample image into the model in step A2 and updating the model based on the predicted category label determined by the model and the true category label of the second sample image includes the following steps A21-A24:
step A21: and inputting a second sample image into a second counting model, obtaining a prediction density map corresponding to the second sample image through the second counting model, and obtaining a prediction category corresponding to each patch in the prediction density map, wherein the patch corresponds to an image block obtained by dividing the second sample image by the second counting model.
The second counting model is the model obtained by clustering training on a large amount of unlabeled data, whose cluster centers for the categories indicated by the count interval range have been adjusted. On this basis, the proxy count value can be corrected for the category predicted for each patch by the second counting model. It can be understood that after each second sample image is processed by the second counting model, a corresponding prediction density map is obtained, and every patch in each prediction density map has a corresponding prediction category.
Step A22: and respectively determining the count value of each patch, and determining the real type of the corresponding patch based on the count value according to the preset count interval range.
For any patch of a given sample image, the density values of its pixels can be summed to obtain the total number of targets in the patch, denoted d_i; then, according to the set count interval range, the category c_i into which d_i falls can be determined as the true category label of this patch. The predicted category label of each patch is then learned using a cross-entropy loss function.
Optionally, the preset count interval range indicates a plurality of preset categories and the proxy count value of each category. Step A22, determining the count value of each patch and determining the true category of the corresponding patch based on the count value according to the preset count interval range, includes steps A221-A222:
Step A221: determining the average value of the true count values of patches belonging to the same category, and determining the average value as the proxy count value of the corresponding category.
For each image block of a labeled sample image (an image block of the sample image corresponds to a patch of the density map), the current second counting model predicts a category. On this basis, the true local count values of all image blocks predicted to be of the same category can be collected, and their average is taken as the proxy count value of that category.
Step a222: for each patch, determining the category of the agent count value in the range of the count interval corresponding to the count value of the patch as the true category of the patch.
The count interval range may indicate a plurality of preset categories and the count interval corresponding to each category; for example, if the preset count range is divided into 20 count intervals (i.e., 20 categories), different count values are distinguished into different categories. On this basis, a patch can be assigned, within the set count intervals, to the category whose interval contains the patch's own target count value; that category is taken as the true category and serves as the label of the annotated data.
Step A23: a loss value is determined based on the true category and the predicted category.
The model can learn the predicted category of each patch by calculating a loss value from the predicted category and the true category, for example through a cross-entropy loss function.
Step A24: and updating network parameters of the second counting model based on the loss value to obtain a target counting model.
In the process of correcting the proxy count values, the model learns the category of each image block in the sample image, so that both the correction of the proxy count values and the update of the network parameters are achieved; the second counting model is finally trained into the target counting model.
In the embodiments of the present application, inference is performed on the small batch of labeled data through steps A21-A24. Because the parameters of the second counting model have changed considerably, it can produce prediction results that differ markedly from the first counting model; the labeled data are therefore used to re-estimate the proxy count value of each category (count interval), subject to the constraint that the local count values of the same image still sum to the total target count of the whole image.
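A condensed sketch of this correction stage (steps A21-A24) is given below; the `patch_logits` model method, the patch size, and the reuse of the earlier interval edges are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def patch_counts(gt_density: torch.Tensor, patch: int = 32) -> torch.Tensor:
    """True count d_i of each patch: sum the density values inside the patch."""
    h, w = gt_density.shape
    g = gt_density[:h - h % patch, :w - w % patch]
    g = g.reshape(h // patch, patch, w // patch, patch)
    return g.sum(dim=(1, 3)).flatten()

def correction_step(model, image, gt_density, edges, optimizer):
    """One update on a labeled second sample image: cross-entropy on the true
    interval categories, then re-estimation of the per-category proxy values."""
    logits = model.patch_logits(image)        # (P, C) scores; hypothetical API
    d = patch_counts(gt_density)              # (P,) true count per patch
    c = torch.searchsorted(edges, d)          # true category c_i per patch
    loss = F.cross_entropy(logits, c)         # learn each patch's category label
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    pred = logits.argmax(dim=1)               # proxy value of category s = mean
    proxies = {int(s): d[pred == s].mean().item()   # true count of its patches
               for s in pred.unique().tolist()}
    return loss.item(), proxies
```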
To make use of unlabeled data, the embodiments of the present application adopt a target counting method that can be trained on large-scale unlabeled data, built on a target counting framework based on local count value classification. By decoupling model parameter learning from proxy count value acquisition, the features are optimized on large-scale unlabeled data so that image blocks with similar count values are grouped into the same cluster in feature space, and the proxy count value of each category is then corrected on a small batch of labeled data. The training method adopted by the embodiments of the present application avoids the high cost of labeling large amounts of unlabeled data, helps improve the generalization ability of the counting algorithm, and facilitates the deployment of target counting algorithms in open scenes. As shown in fig. 8, besides effectively utilizing a large amount of unlabeled data for model training, the embodiments of the present application are also more robust to outliers than the existing density map prediction mode based on mean square error (MSE) regression.
In order to better illustrate the image processing method provided by the embodiment of the present application, some possible application examples are given below.
Application example 1
Application scene: considering that the goats do not have regularity in walking on the grassland, and although the goats generally move in groups, the problem that one of the goats at two ends is lost may exist, therefore, the camera device can be installed in different goats, and the number of the goats in the different goats can be estimated through images or videos shot at regular time, so as to timely know whether the goats are lost.
Since regularly tracking the number of sheep may involve heavy or frequent data processing, the image data captured by the image capturing device may be transmitted to the server in real time, and the server performs the image processing method to estimate the number of sheep present in the sheepfold in real time.
Optionally, the image capturing device may remain in a working state for a long time, record video data of the currently visible sheepfold range in real time, and transmit the video data to the server in real time. When the server obtains the video data uploaded by the image capturing device, it can capture the corresponding image frames at fixed times, based on a preset period, for processing.
Considering that the flocks are driven back to the sheepfolds at 12:00 and 17:00 every day, the number of goats can be estimated at these two times of day. The image capturing apparatus may be set to operate at 12:00 and 17:00 respectively, capture at least one image or record a video for a preset period of time, and send the captured images to the server. Since the goats keep moving inside the sheepfold, some goats may be briefly occluded so that an accurate count cannot be obtained from a single frame; after the server acquires the images, it can therefore process each acquired frame to obtain the corresponding total count value of each frame, and output the average of these total count values as the result for the current time.
For example, at 12:00, the image capturing device is triggered by a timer to capture images, or responds to an image capturing instruction issued by the server, or to an image capturing instruction sent from a client bound to the device by the operation object; it captures three images at intervals of 10 seconds and sends them to the server. When the server receives the three images, it estimates the number of sheep in each through the target counting model; after obtaining sheep number 1, sheep number 2, and sheep number 3 for the three images by performing the image processing method, the average of the three values can be computed as the estimated number of sheep in the sheepfold at 12:00.
In the above scenario, the training data for the model may be images of goats captured on grassland, or goat-related images obtained from network resources. Considering that the same herder may raise several kinds of goats (for example goats of different colors), the training data may also cover different kinds of sheep. Taking black goats and other goats as an example: a small amount of image data of black goats can first be used to train the basic network to obtain the first counting model; a large number of unlabeled images can then be used to train the first counting model into the second counting model; finally, a small amount of black goat image data is used to update the second counting model and correct the proxy count values, yielding the final target counting model.
Application example 2
Application scene: if a cattle farm needs to count the number of bred cattle at fixed times, an image capturing device can be installed in the cowshed, and the number of cattle within the visual range can be estimated from images or videos captured at fixed times.
As shown in fig. 4, assume the number of cattle in the image needs to be estimated. In fact, fig. 5 shows the real density map corresponding to fig. 4, which includes 579 cattle, while fig. 6 shows the target density map predicted by the target counting model for fig. 4, from which 572 cattle are estimated. It will be appreciated that figs. 5 and 6 are heat maps in which denser markings represent higher target density.
In the above scene, to avoid the problem that the number of cattle cannot be estimated accurately because the cattle densely occlude one another, image capturing devices can be installed at different viewing angles to capture images or videos of the current visual range from different angles at the same time. It can be understood that the visual ranges of the different devices need to be consistent. For the region shown in fig. 4, one image capturing device may be installed in each of the four directions (east, south, west, and north), and all of them controlled to collect image data at the same time.
For example, image data may be acquired at 1-hour intervals while the cowshed is open. Assuming the cowshed is open from 10:00 to 21:00, acquisition can be performed every hour starting at 10:00. At the third acquisition, at 12:00, each camera device captures at least one image, or records at least two image frames, at the same moment and transmits the acquired data to an image processing device (a terminal or a server). When the image processing device receives image data from the different camera devices, it can identify the source of each item of data (for example, by an attached device id). On this basis, it can use the target counting model to estimate the number of cattle in each of the four sets of image data separately and take the average of these estimates as the total count value for 12:00.
To reduce the cost of image acquisition, after the scheme has run for a period of time, the per-image estimates from the different camera devices can be matched against the output total count values, and only the camera device with the highest matching degree is kept for further acquisition. For example, after the four camera devices have run for 30 days, each has acquired 240 sets of image data (960 sets in total). The image processing device outputs a count value for each set and takes the average of the four devices' count values in each session as that session's total count value, giving 240 total count values (each determined by the image data of the four devices). The 240 total count values can then be matched against the 240 count values of each camera device (for example, by computing the similarity of the value sequences); the device whose count values match best is taken as the target camera device, and only the target device is kept for image acquisition in subsequent use, reducing the cost of counting the cattle. A minimal sketch of this selection step is given below.
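By way of illustration, the matching could be implemented as follows; the array shapes, the similarity metric (negative mean absolute error), and the function name are assumptions for this sketch, not part of the embodiment.

```python
import numpy as np

def select_target_camera(per_device_counts: np.ndarray) -> int:
    """per_device_counts: shape (num_devices, num_sessions), one count
    per device per acquisition session (e.g. 4 x 240)."""
    totals = per_device_counts.mean(axis=0)      # total count value per session
    # Matching degree here is mean absolute error against the totals;
    # correlation or cosine similarity could be substituted.
    errors = np.abs(per_device_counts - totals).mean(axis=1)
    return int(np.argmin(errors))                # index of the device to keep

counts = np.random.poisson(570, size=(4, 240)).astype(float)
print("keep device", select_target_camera(counts))
```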
Application Example 3
Application scenario: in a biological laboratory, the number of cells in a batch of samples needs to be counted, and labeling cells manually is very expensive. A microscopic image of each sample can therefore be captured with an electron microscope, and the number of cells in each sample can be estimated through the image processing method of the embodiment of the present application.
The electron microscope may have a communication connection with the image processing device (a terminal or a server). When a microscopic image of the current sample is acquired through the electron microscope, it may be transmitted to the image processing device in real time, and the image processing device estimates the number of cells in the sample through the target counting model.
Optionally, the image processing device may output the target density map (a density-distribution heat map such as the one shown in FIG. 6) together with the count value determined from it.
It can be appreciated from the three application examples above that the model can be trained with training data adapted to each scenario. In one example, to reduce labeling cost, the first counting model obtained in step A11 may serve as the basic model for the subsequent clustering training across different fields or scenes; accordingly, the labeled third sample images used in step A11 may cover multiple scenes, and the first counting model obtained from them can be regarded as a pre-trained base model on which target counting tasks for different specific scenes are built.
It should be noted that, when the above embodiments of the present application are applied to specific products or technologies, the relevant data (such as the target image, the first sample image, the second sample image, the count values, etc.) require the user's permission or consent, and the collection, use, and processing of such data must comply with the applicable laws, regulations, and standards of the relevant countries and regions. That is, in the embodiments of the present application, if data related to a subject is involved, the subject's authorization and consent must be obtained, and the applicable laws, regulations, and standards must be observed.
An embodiment of the present application provides an image processing apparatus. As shown in FIG. 9, the image processing apparatus 100 may include a counting module 101.
The counting module 101 is configured to input a target image into a pre-trained target counting model, obtain a target density map corresponding to the target object through the target counting model, and determine, based on the target density map, a total count value of the target objects included in the target image; the target density map indicates the count value of the target object per unit pixel at the corresponding position.
The apparatus 100 further includes a training module configured to perform the following training operations of the target counting model: performing clustering training on the model within a preset counting interval range using unlabeled first sample images, where the preset counting interval range indicates a plurality of preset categories and a proxy count value for each category; and inputting labeled second sample images into the model and updating the model based on the predicted category determined by the model and the true category of each second sample image, where the true category is determined by a proxy count value determined based on the second sample image.
In a possible embodiment, when obtaining the target density map corresponding to the target object through the target counting model, the counting module 101 is specifically configured to:
generate a corresponding first response map for each target center point of a target object in the target image;
add all the first response maps to obtain a target response map; and
convolve the target response map with a normalized Gaussian kernel to obtain the target density map. A minimal sketch of these three steps follows.
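In this sketch, assuming annotated center coordinates and an arbitrary kernel width sigma (the function name and parameters are illustrative, not part of the embodiment), each center point contributes a unit impulse (a first response map), the impulses are summed into the target response map, and a normalized Gaussian convolution spreads each impulse so that the map's integral equals the number of targets.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(centers, height, width, sigma=4.0):
    # Target response map: one unit impulse per annotated target center,
    # i.e. the sum of all first response maps.
    response = np.zeros((height, width), dtype=np.float64)
    for y, x in centers:
        response[int(y), int(x)] += 1.0
    # Convolve with a normalized Gaussian kernel; the kernel sums to 1,
    # so the integral of the density map stays equal to the target count.
    return gaussian_filter(response, sigma=sigma, mode="constant")

dmap = density_map([(10, 12), (40, 60), (42, 61)], 64, 96)
print(dmap.sum())  # ~3.0 (up to boundary truncation), the number of targets
```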
In a possible embodiment, when determining, based on the target density map, the total count value of the target objects included in the target image, the counting module 101 is specifically configured to perform at least one of the following:
integrate over the target density map to obtain the total count value of the target objects;
determine the predicted object category corresponding to each sub-density map in the target density map, take the proxy count value of that category within the preset counting interval range as the sub-count value of the sub-density map, and sum the sub-count values of all sub-density maps to obtain the total count value of the target objects in the target image, where each sub-density map corresponds to an image block obtained by the target counting model dividing the target image. Both alternatives are sketched below.
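A sketch of the two alternatives, with assumed per-category proxy count values:

```python
import numpy as np

def count_by_integration(density_map: np.ndarray) -> float:
    # Alternative 1: integrate the density map over all pixels.
    return float(density_map.sum())

def count_by_proxy(patch_classes, proxy_values) -> float:
    # Alternative 2: sum, over all sub-density maps, the proxy count
    # value of each patch's predicted object category.
    return float(sum(proxy_values[c] for c in patch_classes))

proxy_values = [0.5, 2.0, 6.0, 15.0]                  # assumed per-class proxies
print(count_by_integration(np.full((64, 96), 1e-3)))  # ~6.14
print(count_by_proxy([0, 2, 2, 3], proxy_values))     # 0.5 + 6 + 6 + 15 = 27.5
```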
In a possible embodiment, when performing clustering training on the model using the unlabeled first sample images, the training module is specifically configured to:
acquire a first counting model trained on labeled third sample images; and
perform the following iterative operations on the first counting model until a preset iteration-stop condition is reached:
performing clustering training on the first counting model using the unlabeled first sample images; and
updating the network parameters of the first counting model based on the posterior probabilities of the clustering result to obtain a second counting model.
In a possible embodiment, when performing clustering training on the first counting model using the unlabeled first sample images, the training module is specifically configured to:
divide the first sample image into at least two image blocks in a preset dividing manner, and perform the following operations for each image block of the first sample image until the cluster centers converge:
extracting feature information of the current image block;
determining a feature embedding of the current image block based on the feature information and coordinate information of the current image block within the first sample image;
determining, based on the feature embedding, the cluster to which the current image block belongs; and
updating the center of that cluster based on the clustering result of the current image block. A minimal sketch of one such update is given below.
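The following sketch shows a single clustering update under assumed shapes (a 64-d feature vector per image block plus 2-d normalized coordinates, eight clusters); the online-k-means-style center update is one plausible realization of the step described above, not the embodiment's exact rule.

```python
import numpy as np

def cluster_step(feature, coord, centers, counts, lr_floor=1e-2):
    # Feature embedding: block features concatenated with block position.
    embedding = np.concatenate([feature, coord])
    # Assign the block to the nearest cluster center.
    k = int(np.argmin(((centers - embedding) ** 2).sum(axis=1)))
    counts[k] += 1
    lr = max(1.0 / counts[k], lr_floor)           # decaying update rate
    centers[k] += lr * (embedding - centers[k])   # move center towards the block
    return k

rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 66))                # 8 clusters, 64-d feature + 2-d coord
counts = np.zeros(8)
k = cluster_step(rng.normal(size=64), np.array([0.25, 0.5]), centers, counts)
print("image block assigned to cluster", k)
```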
In a possible embodiment, when updating the network parameters of the first counting model based on the posterior probabilities of the clustering result to obtain the second counting model, the training module is specifically configured to:
determine the posterior probability of each image block belonging to its current cluster;
determine a log-likelihood value for each image block based on the posterior probability;
maximize, by gradient descent based on the log-likelihood values of all image blocks of the current first sample image, the posterior probability of each image block belonging to its current cluster; and
update the network parameters of the first counting model based on the maximized posterior probabilities to obtain the second counting model. One reading of this step is sketched below.
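One plausible reading of this step, sketched in PyTorch: treat the posterior of each image block as a softmax over negative squared distances to the cluster centers, and minimize the negative log-likelihood by gradient descent. In the embodiment the gradient would flow into the network that produces the embeddings; here the embeddings themselves are optimized for brevity, so the shapes and the exact loss form are assumptions.

```python
import torch

def negative_log_likelihood(embeddings, centers):
    # embeddings: (N, D) block embeddings; centers: (K, D) cluster centers
    d2 = torch.cdist(embeddings, centers) ** 2    # (N, K) squared distances
    log_post = torch.log_softmax(-d2, dim=1)      # log posterior per cluster
    # Maximize each block's posterior for its current (best) cluster.
    return -log_post.max(dim=1).values.mean()

emb = torch.randn(32, 66, requires_grad=True)     # stands in for network outputs
ctr = torch.randn(8, 66)
opt = torch.optim.SGD([emb], lr=0.1)
loss = negative_log_likelihood(emb, ctr)
loss.backward()
opt.step()
print(float(loss))
```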
In a possible embodiment, when inputting the labeled second sample images into the model and updating the model based on the predicted category determined by the model and the true category of each second sample image, the training module is specifically configured to:
input a second sample image into the second counting model, and obtain, through the second counting model, a prediction density map corresponding to the second sample image and a predicted category for each patch in the prediction density map, where each patch corresponds to an image block obtained by the second counting model dividing the second sample image;
determine the count value of each patch, and determine the true category of each patch from its count value according to the preset counting interval range;
determine a loss value based on the true categories and the predicted categories; and
update the network parameters of the second counting model based on the loss value to obtain the target counting model. One such update step is sketched below.
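A sketch of one supervised correction step, with a stand-in patch classifier in place of the second counting model; the module, shapes, class count, and cross-entropy loss are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

class PatchClassifier(torch.nn.Module):           # stand-in for the second counting model
    def __init__(self, patches=16, classes=4):
        super().__init__()
        self.head = torch.nn.Linear(3 * 64 * 64, patches * classes)
        self.patches, self.classes = patches, classes
    def forward(self, x):
        # Returns per-patch category logits, shape (B, patches, classes).
        return self.head(x.flatten(1)).view(-1, self.patches, self.classes)

def finetune_step(model, optimizer, images, true_patch_classes):
    logits = model(images)
    # Loss between predicted and true patch categories.
    loss = F.cross_entropy(logits.flatten(0, 1), true_patch_classes.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

model = PatchClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
imgs = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 4, (2, 16))             # true category per patch
print(finetune_step(model, opt, imgs, labels))
```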
In a possible embodiment, when determining the count value of each patch and determining the true category of each patch from its count value according to the preset counting interval range, the training module is specifically configured to:
determine the average of the real count values of the patches belonging to the same category, and take this average as the proxy count value of that category; and
for each patch, take the category whose proxy count value lies in the counting interval containing the patch's count value as the true category of the patch. Both operations are sketched below.
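The following sketch shows both operations with assumed interval edges: searchsorted assigns each patch the category whose counting interval contains its count value, and the per-category mean of the real counts gives the proxy count value.

```python
import numpy as np

edges = np.array([0.0, 1.0, 4.0, 10.0, np.inf])   # assumed interval edges

def true_classes_and_proxies(patch_counts):
    counts = np.asarray(patch_counts, dtype=float)
    # True category: index of the interval containing each patch's count.
    classes = np.searchsorted(edges, counts, side="right") - 1
    # Proxy count value: mean real count of the patches in each category.
    proxies = np.array([counts[classes == k].mean() if np.any(classes == k)
                        else np.nan                # no patch in this interval
                        for k in range(len(edges) - 1)])
    return classes, proxies

cls, proxy = true_classes_and_proxies([0.3, 2.1, 2.9, 7.5, 12.0])
print(cls, proxy)   # per-patch categories and per-category proxy count values
```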
In a possible embodiment, the network structure of the target counting model includes at least two downsampling layers and at least two upsampling layers, with a pooling layer arranged between the downsampling layers and skip links arranged between the upsampling layers and the downsampling layers;
the target counting model outputs the target density map through the last upsampling layer. A minimal sketch of such a topology follows.
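A minimal PyTorch sketch of one topology matching this description: two downsampling stages with pooling between them, two upsampling stages, a skip link back to the first downsampling stage, and the density map produced by the last upsampling stage. Channel widths and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class CountingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                    # pooling between down layers
        self.down2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.up2 = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 1, 1))  # 1-channel density map head

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(self.pool(d1))
        u1 = self.up1(d2)
        u1 = torch.cat([u1, d1], dim=1)                # skip (jump) link
        return self.up2(u1)                            # target density map

net = CountingNet()
print(net(torch.randn(1, 3, 64, 64)).shape)            # torch.Size([1, 1, 64, 64])
```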
The apparatus of the embodiments of the present application can perform the method provided by the embodiments of the present application, and its implementation principle is similar. The actions performed by each module of the apparatus correspond to the steps of the method of the embodiments of the present application; for detailed descriptions of the functions of each module, reference may be made to the corresponding methods described above, which are not repeated here.
The modules involved in the embodiments of the present application may be implemented in software. The name of a module does not, in some cases, limit the module itself; for example, the acquisition module may also be described as "a module that acquires a target image", and so on.
The images, density maps, count values, and other data in the embodiments of the present application may be stored using blockchain technology. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. It is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, each block containing a batch of processed data used to verify the validity of its information (anti-tampering) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
An embodiment of the present application provides an electronic device including a memory, a processor, and a computer program stored on the memory; the processor executes the computer program to implement the steps of the image processing method. Compared with the related art, the following can be achieved. When a target counting task is executed, the acquired target image can be input into a pre-trained target counting model to obtain the target density map corresponding to the target object, the target density map indicating the count value of the target object per unit pixel at the corresponding position; the total count value of the target objects can then be determined based on the target density map. The target counting model can be trained by clustering directly on unlabeled sample images within a preset counting interval range and then corrected with labeled sample images; the preset counting interval range indicates a plurality of preset categories and the proxy count value of each category, and the true category of a labeled sample image can be determined from its proxy count value. On this basis, the model can be updated from the predicted category produced by the model and the true category of the sample image. By decoupling model parameter learning from the acquisition of proxy count values, features are learned from a large amount of unlabeled data so that image blocks with similar count values fall into the same cluster in feature space, and the proxy count value of each category is then corrected through count-value classification of the labeled sample images. This reduces the cost of labeling large amounts of data and improves the generalization ability of the counting algorithm.
In an alternative embodiment, an electronic device is provided. As shown in FIG. 10, the electronic device 4000 includes a processor 4001 and a memory 4003, the processor 4001 being connected to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. Note that in practical applications the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path for transferring information between the above components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 10, but this does not mean there is only one bus or one type of bus.
The memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store a computer program and can be read by a computer.
The memory 4003 is used to store the computer program that executes the embodiments of the present application, and its execution is controlled by the processor 4001. The processor 4001 executes the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Electronic devices include, but are not limited to, servers and terminals.
Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps and corresponding content of the foregoing method embodiments.
Embodiments of the present application also provide a computer program product including a computer program; when executed by a processor, the computer program implements the steps and corresponding content of the foregoing method embodiments.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that although the flowcharts of the embodiments of the present application indicate the operation steps with arrows, these steps need not be performed in the order the arrows indicate. Unless explicitly stated herein, the steps in the flowcharts may be performed in other orders as required by the implementation scenario. Moreover, some or all of the steps may include multiple sub-steps or stages, which may be performed at the same moment or at different moments; where they are performed at different moments, their execution order can be configured flexibly as required, and the embodiments of the present application impose no limit on this.
The foregoing describes only optional implementations of some of the application scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations based on the technical ideas of the present application, adopted without departing from those ideas, also fall within the protection scope of the embodiments of the present application.

Claims (13)

1. An image processing method, comprising:
inputting a target image into a pre-trained target counting model, obtaining a target density map corresponding to a target object through the target counting model, and determining, based on the target density map, a total count value of the target objects included in the target image, the target density map indicating the count value of the target object per unit pixel at the corresponding position;
wherein the training of the target counting model comprises:
performing clustering training on the model within a preset counting interval range using unlabeled first sample images, the preset counting interval range indicating a plurality of preset categories and a proxy count value for each category; and
inputting labeled second sample images into the model, and updating the model based on the predicted category determined by the model and the true category of each second sample image, the true category being determined by a proxy count value determined based on the second sample image.
2. The method according to claim 1, wherein obtaining, through the target counting model, the target density map corresponding to the target object comprises:
generating a corresponding first response map for each target center point of a target object in the target image;
adding all the first response maps to obtain a target response map; and
convolving the target response map with a normalized Gaussian kernel to obtain the target density map.
3. The method of claim 1, wherein determining, based on the target density map, the total count value of the target objects included in the target image comprises at least one of:
integrating over the target density map to obtain the total count value of the target objects;
determining the predicted object category corresponding to each sub-density map in the target density map, taking the proxy count value of that category within the preset counting interval range as the sub-count value of the sub-density map, and summing the sub-count values of all sub-density maps to obtain the total count value of the target objects in the target image, wherein each sub-density map corresponds to an image block obtained by the target counting model dividing the target image.
4. The method of claim 1, wherein performing clustering training on the model using unlabeled first sample images comprises:
acquiring a first counting model trained on labeled third sample images; and
performing the following iterative operations on the first counting model until a preset iteration-stop condition is reached:
performing clustering training on the first counting model using the unlabeled first sample images; and
updating the network parameters of the first counting model based on the posterior probabilities of the clustering result to obtain a second counting model.
5. The method of claim 4, wherein performing clustering training on the first counting model using the unlabeled first sample images comprises:
dividing the first sample image into at least two image blocks in a preset dividing manner, and performing the following operations for each image block of the first sample image until the cluster centers converge:
extracting feature information of the current image block;
determining a feature embedding of the current image block based on the feature information and coordinate information of the current image block within the first sample image;
determining, based on the feature embedding, the cluster to which the current image block belongs; and
updating the center of that cluster based on the clustering result of the current image block.
6. The method of claim 4, wherein updating the network parameters of the first counting model based on the posterior probabilities of the clustering result to obtain a second counting model comprises:
determining the posterior probability of each image block belonging to its current cluster;
determining a log-likelihood value for each image block based on the posterior probability;
maximizing, by gradient descent based on the log-likelihood values of all image blocks of the current first sample image, the posterior probability of each image block belonging to its current cluster; and
updating the network parameters of the first counting model based on the maximized posterior probabilities to obtain the second counting model.
7. The method of claim 4, wherein inputting labeled second sample images into the model and updating the model based on the predicted category determined by the model and the true category of each second sample image comprises:
inputting a second sample image into the second counting model, and obtaining, through the second counting model, a prediction density map corresponding to the second sample image and a predicted category for each patch in the prediction density map, wherein each patch corresponds to an image block obtained by the second counting model dividing the second sample image;
determining the count value of each patch, and determining the true category of each patch from its count value according to the preset counting interval range;
determining a loss value based on the true categories and the predicted categories; and
updating the network parameters of the second counting model based on the loss value to obtain the target counting model.
8. The method of claim 7, wherein determining the count value of each patch and determining the true category of each patch from its count value according to the preset counting interval range comprises:
determining the average of the real count values of the patches belonging to the same category, and taking this average as the proxy count value of that category; and
for each patch, taking the category whose proxy count value lies in the counting interval containing the patch's count value as the true category of the patch.
9. The method according to any one of claims 1-8, wherein the network structure of the target counting model comprises at least two downsampling layers and at least two upsampling layers, with a pooling layer arranged between the downsampling layers and skip links arranged between the upsampling layers and the downsampling layers;
the target counting model outputs the target density map through the last upsampling layer.
10. An image processing apparatus, comprising:
a counting module configured to input a target image into a pre-trained target counting model, obtain a target density map corresponding to a target object through the target counting model, and determine, based on the target density map, a total count value of the target objects included in the target image, the target density map indicating the count value of the target object per unit pixel at the corresponding position;
wherein the training of the target counting model comprises: performing clustering training on the model within a preset counting interval range using unlabeled first sample images, the preset counting interval range indicating a plurality of preset categories and a proxy count value for each category; and
inputting labeled second sample images into the model, and updating the model based on the predicted category determined by the model and the true category of each second sample image, the true category being determined by a proxy count value determined based on the second sample image.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-9.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-9.
13. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-9.