CN114612402A - Method, device, equipment, medium and program product for determining object quantity

Info

Publication number
CN114612402A
CN114612402A (application CN202210207245.6A)
Authority
CN
China
Prior art keywords: feature, image, activation, target, representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210207245.6A
Other languages
Chinese (zh)
Inventor
黄钟毅 (Huang Zhongyi)
高斌斌 (Gao Binbin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210207245.6A
Publication of CN114612402A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30242 - Counting objects in image


Abstract

The application discloses a method, a device, equipment, a medium and a program product for determining the number of objects, relating to the field of artificial intelligence. The method comprises: acquiring a target image that includes objects to be counted; obtaining a correlation feature representation based on the feature correlation between the target image and a labeled object; using the correlation feature representation as an activation weight to self-activate the image features of the target image, yielding an activation feature representation; performing feature regression on the activation feature representation to obtain density data of the objects to be counted in the target image; and accumulating the objects to be counted based on the density data to determine the target number corresponding to them. Because the whole image is self-activated through its correlation with the labeled object, the error of the determined target number is reduced. The embodiments can be applied to scenarios such as cloud technology, artificial intelligence, intelligent traffic and driving assistance.

Description

Method, device, equipment, medium and program product for determining object quantity
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, a medium, and a program product for determining a number of objects.
Background
In the field of artificial intelligence, small-sample learning (Few-Shot Learning) is a machine learning paradigm for small data sets, with applications in computer vision, natural language processing and other fields. In image processing, small-sample learning is applicable to the task of counting objects in an image, i.e., small-sample counting (Few-Shot Counting).
In the related art, the Few-Shot Adaptation & Matching Network (FamNet) provides a small-sample counting matching network and a test-phase adaptive update strategy to realize the small-sample counting task. The FamNet model first extracts features of the image to be recognized, then crops the support-sample features out of the image features and correlates them with the complete image features; the correlation result is fed into a regression network to obtain a density map of the objects to be counted, from which their number in the image can be determined.
However, because this model regresses the density map directly from a correlation result with low resolution and a low channel count, a large amount of original image feature information is lost, which increases the model's counting error for the objects to be counted.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device, a medium and a program product for determining the number of objects, which can reduce target counting errors in small-sample counting scenarios. The technical scheme is as follows:
in one aspect, a method for determining the number of objects is provided, where the method includes:
acquiring a target image comprising objects to be counted, wherein the objects to be counted include at least one labeled object with labeling information;
obtaining a correlation feature representation based on the feature correlation between the target image and the labeled object;
using the correlation feature representation as an activation weight, self-activating the image features of the target image to obtain an activation feature representation;
performing feature regression processing on the activation feature representation to obtain density data of the objects to be counted in the target image;
and accumulating the objects to be counted in the target image based on the density data to determine the target number corresponding to the objects to be counted.
In another aspect, an apparatus for determining the number of objects is provided, the apparatus comprising:
an acquisition module, configured to acquire a target image comprising objects to be counted, wherein the objects to be counted include at least one labeled object with labeling information;
a prediction module, configured to obtain a correlation feature representation based on the feature correlation between the target image and the labeled object;
the prediction module being further configured to self-activate the image features of the target image, using the correlation feature representation as an activation weight, to obtain an activation feature representation;
the prediction module being further configured to perform feature regression processing on the activation feature representation to obtain density data of the objects to be counted in the target image;
and a determining module, configured to accumulate the objects to be counted in the target image based on the density data and determine the target number corresponding to the objects to be counted.
In another aspect, a computer device is provided, comprising a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions that is loaded and executed by the processor to implement the method for determining the number of objects according to any of the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the program code is loaded and executed by a processor to implement the method for determining the number of objects described in any of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method for determining the number of objects according to any of the above embodiments.
The technical scheme provided by the application includes at least the following beneficial effects:
When targets in an image need to be identified and counted, a subset of the objects to be counted is labeled to obtain labeled objects. The image features of the image are self-activated by the correlation feature representation between the image and the labeled objects to obtain an activation feature representation, and feature regression on the activation feature representation determines the density of the objects to be counted, from which the target number is obtained. Because the whole image is self-activated through its correlation with the labeled objects, the correlation between the overall image features and the labeled-object features is better captured while the feature information of the original whole image is better retained; better density data are obtained, and the error of the determined target number is reduced.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram illustrating the definition of a small sample counting task provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a method for determining a number of objects provided by an exemplary embodiment of the present application;
FIG. 4 is a functional block diagram provided by an exemplary embodiment of the present application;
FIG. 5 is a flowchart of a method for feature self-activation provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a self-activation module provided by an exemplary embodiment of the present application;
FIG. 7 is a flow chart of a method for obtaining density data provided by an exemplary embodiment of the present application;
FIG. 8 is a flow chart of the determination of a target quantity provided by an exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method for training a model provided by an exemplary embodiment of the present application;
fig. 10 is a block diagram of a device for determining the number of objects according to an exemplary embodiment of the present application;
fig. 11 is a block diagram of a device for determining the number of objects according to another exemplary embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, embodiments of the present application are described in detail below with reference to the accompanying drawings.
First, terms referred to in the embodiments of the present application are briefly described:
artificial intelligence: the method is a theory, method, technology and application system for simulating, extending and expanding human intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technology. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent traffic, and the like.
Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Computer Vision technology (CV): the science of how to make machines "see"; specifically, it replaces human eyes with cameras and computers to recognize and measure targets, and further processes the result into images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, automatic driving and smart traffic, as well as common feature-recognition technologies such as face recognition and texture recognition.
Small-sample learning: a machine learning paradigm that generally requires a model, after being trained on base-class (Base Class) samples, to adapt rapidly to an unlearned new class (Novel Class) given only a small number of samples of it.
Small-sample counting: the application of small-sample learning in the field of image target counting. Schematically, the small-sample counting task is defined as follows. In the training phase, the model obtains a training image and two kinds of labels for it: (a) center-point labels for all sample targets of the base class; (b) bounding-box labels for a small subset of the sample targets of the base class. After the model is trained on the training images and the corresponding labels, in the testing phase the model obtains only a test image and bounding-box labels for a few test targets of a new class, and it must predict the total number of targets of that class in the test image.
In one example, as shown in fig. 1, which illustrates the definition of the small-sample counting task provided by an exemplary embodiment of the present application: in the training stage 110, the model learns the training targets of the class "triangle" through the training image 111; in the testing stage 120, after the test targets A 122, B 123 and C 124 in the test image 121 are labeled with bounding boxes, the model must predict the total number of "diamond" test targets (the new class) in the test image 121.
It should be emphasized that the base class used in the training phase and the new class used in the testing phase are completely different classes. For example, in fig. 1 the model learns the class "triangle" during training but must count targets of the unlearned class "diamond" during testing. The small-sample counting task therefore requires that the model adapt quickly and well to new classes.
In the related art, FamNet proposes a small-sample count matching network and a test-phase adaptive update strategy to implement the small-sample counting task. The FamNet model first extracts features of the image to be recognized, then crops the support-sample features out of the image features and correlates them with the complete image features; the resulting low-resolution, low-channel correlation map is fed into a regression network to obtain a density map of the objects to be counted, from which their number in the image can be determined. The model as a whole comprises a training phase and a testing phase. In the training phase, the model is trained on sample targets with a large data volume; in the testing phase, the trained network is further adapted to the test targets of the test image through a minimum-count loss (Min-Count Loss) and a perturbation loss (Perturbation Loss), i.e., the network parameters are iteratively fine-tuned with these two losses to obtain parameters adapted to recognizing and counting the test targets in the test image.
However, this scheme has the following problems: (1) regressing the target-counting density map directly from a low-resolution, low-channel correlation map loses original image feature information, which is not conducive to accurate prediction of the density map; (2) during test-time inference, the network parameters must be learned and updated on the test image, which consumes a large amount of test time and limits the network's practical applicability.
The method for determining the number of objects provided in the embodiments of the present application obtains, from the feature correlation between a target image requiring target counting and the labeled objects already annotated in it, a correlation feature representation between the whole image and the labeled objects; self-activates the image features of the target image with this correlation feature representation to obtain an activation feature representation; performs feature regression on the activation feature representation to obtain density data of the objects to be counted; and derives the number of objects in the image from the density data. Because the features of the whole image are self-activated through its correlation with the labeled objects, the correlation between the overall image features and the labeled-object features is better captured while the feature information of the original whole image is better retained, yielding better density data and reducing the error of the determined target number. Moreover, the model can be trained once, in the training stage, on sample data with a large sample size; during testing and application, even if the recognized targets differ from the sample targets, the model can be applied directly without adaptive retraining of its parameters, which broadens the model's range of application and improves training and testing efficiency.
Next, an application scenario of the embodiment of the present application is schematically described:
first, the method can be applied to a product intelligent quality inspection system in industrial AI. Illustratively, through the determination method of the number of objects provided by the embodiment of the application, the defects of the product can be identified and counted, so that a certain quantitative support is provided for the downstream product quality evaluation. In practical application, after a target counting model is obtained by training a defect target of a known defect type as sample data, a new defect type which is not learned by the model may appear on a production line.
Second, the method can be applied to visual AI monitoring systems for ecological environment protection. Schematically, the method for determining the number of objects provided by the embodiments of the present application can recognize images of dense targets to be monitored captured by intelligent image-acquisition equipment and predict the total number of the dense target groups, providing quantitative data support for downstream analysis in an intelligent ecological-protection monitoring system. The targets to be monitored may be flocks of birds, groups of plants, and other populations requiring monitoring and protection. During application, when the trained target counting model identifies and counts the targets to be monitored, only some of the targets in the image need to be labeled, at low cost.
Third, the method can be applied to visual AI monitoring systems for agricultural production. Schematically, with the method for determining the number of objects, images of various categories of agricultural products captured by intelligent image-acquisition equipment can be recognized and counted, and the number of agricultural products to be counted can be predicted, providing quantitative data support for the analysis of agricultural production conditions.
The method for determining the number of objects may also be applied to other scenarios that require monitoring and counting targets in images, such as traffic monitoring in intelligent transportation systems and passenger-flow heat monitoring in shopping malls or scenic spots; the application scenario is not specifically limited here.
The implementation environment of the embodiments of the present application is described with reference to the above noun explanations and descriptions of application scenarios. As shown in fig. 2, the computer system of the embodiment includes: terminal device 210, server 220, and communication network 230.
The terminal device 210 includes various types of devices such as a mobile phone, a tablet computer, a desktop computer, a laptop computer, an intelligent appliance, and a vehicle-mounted terminal.
The server 220 is configured to provide a function of counting targets in the image, that is, the server 220 may call a corresponding operation module to count the targets of the specified image to be recognized according to a request of the terminal device 210.
Illustratively, the server 220 is configured with a pre-trained target counting model. The terminal device 210 transmits the target image for which a target-quantity query is needed, together with the annotation information of several corresponding labeled objects, to the server 220; the server 220 inputs the target image and the annotation information into the target counting model, obtains the target quantity as output, and returns it to the terminal device 210.
In some embodiments, if the computing power of the terminal device 210 satisfies the application process of the target counting model, the above process of counting the targets in the image to be recognized may also be implemented by the terminal device 210 alone.
It should be noted that the server 220 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Cloud Technology is a hosting technology that unifies hardware, software, network and other resources in a wide-area or local-area network to realize the calculation, storage, processing and sharing of data. It is the general term for the network, information, integration, management-platform and application technologies applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support for technical network systems, whose background services require large amounts of computing and storage resources, such as video websites, picture websites and other portal websites. With the development of the internet industry, each article may come to have its own identification mark that must be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud computing.
In some embodiments, the server 220 may also be implemented as a node in a blockchain system.
Illustratively, the terminal device 210 and the server 220 are connected through a communication network 230, where the communication network 230 may be a wired network or a wireless network, and is not limited herein.
Referring to fig. 3, a method for determining the number of objects according to an embodiment of the present application is shown. In the embodiments of the present application, the method is schematically described as applied to the server shown in fig. 2; in other embodiments, when the computing resources of the terminal device are sufficient, the method may also be applied to the terminal device, which is not limited here. The method comprises the following steps:
step 301, a target image including an object to be counted is acquired.
Illustratively, the object to be counted includes at least one labeled object with labeled information. In some embodiments, a first number of objects to be counted are included in the target image, wherein the first number of objects to be counted includes a second number of annotation objects with annotation information, and the first number is greater than the second number. Alternatively, the second number may be system-specified or custom-set, and is not limited herein.
Optionally, the labeling information of the labeled object may be obtained by manual labeling, or by an automatic labeling module, which is a functional module trained on target samples corresponding to the objects to be counted and able to identify and label the objects to be counted in an image. In this case, the automatic labeling module has lower recognition precision than the whole counting model and is used to pre-recognize the objects to be counted before the recognition and counting process.
In some embodiments, the labeling information may be a bounding box framing the labeled object. Illustratively, the bounding box may be a regular-shaped frame (e.g., a rectangle) just large enough to enclose the labeled object, or a segmentation frame that follows the object boundary pixel by pixel, i.e., a pixel-level segmentation label whose labeled boundary coincides with the boundary between the labeled object and the image background; this is not specifically limited here.
Optionally, the target image may be uploaded by a terminal device, acquired from a database, or acquired from the internet; this is not limited here. In some embodiments, the target image may be represented by a three-dimensional matrix, e.g., the target image X ∈ R^(H×W×3), where R denotes the real numbers, H the matrix height, W the matrix width, and 3 the matrix depth (i.e., the number of channels); the elements of the matrix may be the gray values or pixel values of the pixels in the target image.
In some embodiments, after the target image is acquired, feature extraction may be performed on it to obtain image features that participate in subsequent processing: the target image is input into a feature extraction network, whose output is the image features. Optionally, the feature extraction network may be a convolutional neural network such as a Residual Network (ResNet), a Visual Geometry Group network (VGGNet) or AlexNet, or a Transformer-based structure such as the Vision Transformer (ViT), the Pyramid Vision Transformer (PVT) or the Swin Transformer; this is not limited here.
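For illustration only, the following is a minimal PyTorch sketch of such multi-layer feature extraction with a ResNet-50 backbone; the backbone choice, the tapped layers and the image size are assumptions, not part of the original disclosure:

    import torch
    import torchvision
    from torchvision.models.feature_extraction import create_feature_extractor

    # Assumed backbone; the embodiments equally allow VGGNet, AlexNet, ViT, PVT, Swin, etc.
    backbone = torchvision.models.resnet50(weights=None)
    extractor = create_feature_extractor(
        backbone, return_nodes={"layer3": "feat3", "layer4": "feat4"})

    image = torch.randn(1, 3, 512, 512)   # target image X with H = W = 512 and 3 channels
    feats = extractor(image)
    print(feats["feat3"].shape)           # torch.Size([1, 1024, 32, 32])
    print(feats["feat4"].shape)           # torch.Size([1, 2048, 16, 16])

Here the two tapped layers play the role of the "target number of layers" of image features described above.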
In some embodiments, feature extraction on the target image yields a target number of layers of image features, i.e., at least one layer of image features. Optionally, the target number of layers may be preset by the system or indicated by the terminal device; this is not limited here.
Step 302, based on the feature correlation between the target image and the labeled object, a correlation feature representation is obtained.
Illustratively, the feature correlation condition indicates the image-level semantic correlation between the target image and the labeled object, i.e., the correlation is computed over the semantics of the target image and of the labeled object respectively.
In some embodiments, the correlation feature representation between the target image and the labeled object is obtained by a correlation calculation unit, which performs a feature-level correlation calculation on the input target image and the labeled object with its labeling information and outputs the correlation feature representation.
The target feature corresponding to the labeled object is obtained from the object's labeling information, and a convolution operation between the target feature and the image features yields the correlation feature representation. Schematically, after feature extraction is performed on the target image, the target feature of the labeled object can be cropped out of the resulting image features using the object's labeling information.
In some embodiments, when the labeling information is a rectangular bounding box, the correlation feature representation is obtained directly by convolving the image features with the target feature. In other embodiments, when the labeling information is a segmentation frame, to facilitate the convolution, a rectangle just enclosing the segmentation frame is used to crop the target feature: feature values inside the rectangle and inside the segmentation frame (i.e., the part containing the labeled object) are taken from the image features, feature values inside the rectangle but outside the segmentation frame are set to 0, and the resulting feature data for the rectangle is used as the target feature and convolved with the image features.
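A sketch of this target-feature cropping, assuming PyTorch tensors; the helper name crop_target_feature and the shapes are hypothetical, not from the original disclosure:

    import torch

    def crop_target_feature(feat, box, mask=None):
        # feat: (C, h, w) image feature; box: (y1, x1, y2, x2) at feature scale;
        # mask: optional (h, w) binary segmentation mask at feature scale.
        y1, x1, y2, x2 = box
        target = feat[:, y1:y2, x1:x2].clone()
        if mask is not None:
            # zero out values inside the rectangle but outside the segmentation
            # frame, as described above
            target = target * mask[y1:y2, x1:x2].unsqueeze(0).to(target.dtype)
        return target

    feat_i = torch.randn(256, 32, 32)
    print(crop_target_feature(feat_i, (4, 6, 8, 10)).shape)  # torch.Size([256, 4, 4])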
Step 303, self-activating the image features of the target image, using the correlation feature representation as the activation weight, to obtain the activation feature representation.
In the embodiment of the present application, after the correlation feature representation is obtained from the overall features of the target image and the target feature of the labeled object, it can be treated as a weight and multiplied onto the complete semantic features of the target image, realizing the feature self-activation process and yielding the activation feature representation. Illustratively, the activation feature representation is obtained by pixel-wise multiplication of the image features of the target image with the correlation feature representation.
Illustratively, the correlation feature representation may also be preprocessed before the self-activation with the image features of the target image.
In some embodiments, the preprocessing includes making the correlation feature representation non-negative, i.e., applying a non-negativity operation to it to obtain a non-negative correlation feature representation. In other embodiments, the non-negativity processing is omitted; this can be set according to the actual data-processing requirements and is not limited here.
In other embodiments, because the correlation feature representation and the image features of the target image differ in size, before self-activation the preprocessing may further include a copy-expansion operation that aligns the correlation data with the image features in channel size, yielding an expanded feature representation, which then performs the correlation self-activation with the image features of the target image. Schematically, the correlation feature representation is copy-expanded along the channel dimension to obtain the expanded feature representation, and the expanded feature representation, used as the activation weight, self-activates the image features by semantic-feature correlation to obtain the activation feature representation.
Step 304, performing feature regression processing on the activation feature representation to obtain the density data of the objects to be counted in the target image.
After the activation feature representation is obtained, it can be input into a regression network that predicts the density data of the objects to be counted. Optionally, the regression network may be any network capable of spatial-resolution upsampling, and the regression process may include various interpolation algorithms; this is not limited here.
Alternatively, the output density data may be data in an array form, a matrix form, or data in an image form, which is not limited herein.
Step 305, accumulating the objects to be counted in the target image based on the density data, and determining the target number corresponding to the objects to be counted.
Illustratively, after the density data of the objects to be counted are obtained, the number of objects to be counted in the target image can be determined from them. In some embodiments, the target number corresponding to the objects to be counted is obtained by integrating the density data.
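On a discrete density map, that integration reduces to a summation over all positions; a minimal sketch with a stand-in map (values and sizes are assumptions):

    import torch

    density_map = torch.rand(512, 512) * 1e-3       # stand-in density map with height H, width W
    target_number = int(density_map.sum().round())  # accumulated count of the objects
    print(target_number)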
In one example, as shown in fig. 4, which illustrates a functional module 400 provided by an exemplary embodiment of the present application, the functional module 400 for determining the target number comprises a feature extraction module 410, a self-activation module 420, a regression module 430 and a statistics module 440. The feature extraction module 410 performs feature extraction on the input target image to obtain its image features; the self-activation module 420 extracts the feature correlation between the target image and the labeled object to obtain the correlation feature representation and then self-activates the image features with it to obtain the activation feature representation; the regression module 430 performs feature regression on the activation feature representation to obtain the density data; and the statistics module 440 statistically integrates the density data to output the target number corresponding to the objects to be counted.
To sum up, in the method for determining the number of objects provided by the embodiments of the present application, when targets in an image need to be identified and counted, a subset of the objects to be counted is labeled to obtain labeled objects; the image features of the image are self-activated by the correlation feature representation between the image and the labeled objects to obtain an activation feature representation; and feature regression on the activation feature representation determines the density of the objects to be counted, from which the target number is obtained. Because the features of the whole image are self-activated through its correlation with the labeled objects, the correlation between the overall image features and the labeled-object features is better captured while the feature information of the original whole image is better retained, yielding better density data and reducing the error of the determined target number.
Referring to fig. 5, a method for feature self-activation according to an embodiment of the present application is shown. The embodiment schematically describes the self-activation module within the overall functional module; the self-activation module covers both obtaining the correlation feature representation and performing the self-activation operation. The method comprises the following steps:
step 501, obtaining the image characteristics of the ith layer in the image characteristics.
Schematically, feature extraction on the target image yields a target number of layers of image features; in the embodiments of the present application, at least one of these layers is used for the feature self-activation process, from which the number of objects to be counted in the target image is finally obtained. The process is illustrated here for the i-th layer image feature feat_i among the target number of layers, where

feat_i ∈ R^(h_i × w_i × c_i),

h_i denotes the matrix height of the three-dimensional matrix of the i-th layer image feature, w_i its matrix width, and c_i its number of channels; i is a positive integer.
Step 502, mapping the labeling information corresponding to the labeling object to the image feature corresponding to the ith layer to obtain the labeling information of the feature layer.
In some embodiments, the labeling information of the labeled object is a labeling box defined on the target image. Illustratively, the labeling information carried by the labeled object corresponds to the image size of the target image, i.e., the size of the labeling information matches the size of the labeled object in the target image. After feature extraction, however, the image features differ in size from the target image, and their spatial resolution decreases as the number of feature layers increases (i.e., spatial resolution is negatively correlated with feature depth). If the labeling information were applied directly to the i-th layer image feature, the size mismatch would introduce errors; therefore, the labeling information of the labeled object must be mapped in the spatial dimension so that it corresponds to the i-th layer image feature.
Illustratively, the labeling information box_j of the j-th of the k labeled objects is mapped to the scale of the i-th layer image feature to obtain the feature-level labeling information box_ij, where k is a positive integer and j is a positive integer less than or equal to k.
In some embodiments, the mapping of the labeling information is implemented according to the scaling ratio between the i-th layer image feature and the target image: for example, first size information of the target image (e.g., the height or width of the spatial dimension) and second size information of the i-th layer image feature are obtained, and the labeling information is scaled by the ratio between the first size information and the second size information to obtain the feature-level labeling information box_ij.
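A sketch of this scale mapping under the assumption of simple proportional rounding; the helper name map_box_to_level is hypothetical:

    def map_box_to_level(box, image_hw, feat_hw):
        # box: (y1, x1, y2, x2) in target-image pixels; image_hw: (H, W) of the
        # target image; feat_hw: (h_i, w_i) of the i-th layer image feature.
        sy = feat_hw[0] / image_hw[0]   # ratio of second to first size information
        sx = feat_hw[1] / image_hw[1]
        y1, x1, y2, x2 = box
        return (round(y1 * sy), round(x1 * sx), round(y2 * sy), round(x2 * sx))

    # a box at image scale (512x512) mapped to a 32x32 feature level
    print(map_box_to_level((64, 96, 128, 160), (512, 512), (32, 32)))  # (4, 6, 8, 10)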
Step 503, obtaining a correlation characteristic representation based on the characteristic correlation between the labeling information of the characteristic layer and the image characteristics of the ith layer.
After the labeling information is mapped so that it corresponds to the i-th layer image feature, the correlation between the labeling information and the image feature is calculated on the same feature level.
In some embodiments, obtaining the correlation feature representation from the labeling information and the image feature includes:
S1: cropping the i-th layer image feature according to the feature-level labeling information to obtain the target feature of the labeled object.
Illustratively, once the labeling information box_ij and the i-th layer image feature are on the same feature level, the i-th layer image feature is cropped according to box_ij to obtain the target feature box_feat_ij of the labeled object, where box_feat_ij ∈ R^(h_ij × w_ij × c_i), i.e., a three-dimensional matrix whose height, width and number of channels are h_ij, w_ij and c_i respectively, h_ij and w_ij being the height and width of the labeling box at the feature level.
In other embodiments, the target feature of the labeled object may instead be obtained by cropping the target image itself according to the labeling information box_j to obtain a labeled-object image, and then performing feature extraction on that image with the feature extraction network until the feature level matches the i-th layer image feature.
S2: padding the i-th layer image feature according to box_ij to obtain the image padding feature of the j-th labeled object at the i-th layer.
Illustratively, a padding (pad) operation is applied to the i-th layer image feature, as shown in formula one, where padded_feat_ij is the image padding feature of the j-th labeled object at the i-th layer, obtained by padding feat_i according to box_ij; padded_feat_ij is a three-dimensional matrix whose height and width are the padded spatial dimensions and whose number of channels is c_i. The purpose of the pad operation is to prevent the spatial size from shrinking in the subsequent convolution, enabling deeper convolution while retaining more information from the original image in the computation of the correlation feature representation.

Formula one: padded_feat_ij = pad(feat_i, box_ij)
S3: convolving the target feature of the labeled object with the image padding feature of the j-th labeled object at the i-th layer to obtain the correlation feature representation.
Illustratively, the target feature box_feat_ij of the labeled object and the image padding feature padded_feat_ij of the j-th labeled object at the i-th layer are correlated through a convolution operation to obtain the correlation feature representation relat_ij, as shown in formula two, where relat_ij ∈ R^(h_i × w_i × 1), i.e., a three-dimensional matrix whose height, width and number of channels are h_i, w_i and 1 respectively, and conv() denotes the convolution operation.

Formula two: relat_ij = conv(padded_feat_ij, box_feat_ij)
In other embodiments, the padding operation may be omitted: the i-th layer image feature is convolved directly with the labeled-object feature, and the result is then restored by an interpolation algorithm to the spatial resolution the i-th layer image feature had before the convolution.
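A sketch of the pad-then-correlate computation in PyTorch, with the cropped target feature serving as the convolution kernel (F.conv2d computes cross-correlation); the padding split is an assumption chosen so that the result keeps the spatial size h_i × w_i:

    import torch
    import torch.nn.functional as F

    feat_i = torch.randn(1, 256, 32, 32)   # i-th layer image feature (c_i = 256)
    box_feat = torch.randn(1, 256, 5, 7)   # target feature box_feat_ij of one labeled object

    kh, kw = box_feat.shape[-2:]
    # formula one: pad so the correlation map keeps the original spatial size
    padded = F.pad(feat_i, (kw // 2, (kw - 1) // 2, kh // 2, (kh - 1) // 2))
    # formula two: convolve the padded feature with the target feature
    relat = F.conv2d(padded, box_feat)
    print(relat.shape)                     # torch.Size([1, 1, 32, 32]), i.e., h_i x w_i x 1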
Step 504, applying non-negativity processing to the correlation feature representation to obtain non-negative correlation data.
Schematically, the preprocessing here includes making the correlation feature representation non-negative, i.e., applying a non-negativity operation to it to obtain the non-negative correlation feature representation.
In one example, the non-negative correlation data relat'_ij is obtained from relat_ij by non-negativity processing, as shown in formula three, where relat'_ij ∈ R^(h_i × w_i × 1), i.e., a three-dimensional matrix whose height, width and number of channels are h_i, w_i and 1 respectively, and min() takes the minimum element.

Formula three: relat'_ij = relat_ij - min(relat_ij)
Step 505, copy-expanding the non-negative correlation data along the channel dimension to obtain an expanded feature representation.
Because the obtained correlation feature representation and the image features of the target image differ in data size, before self-activation the preprocessing must also copy-expand the non-negative correlation data so that its channel size aligns with that of the image features of the target image, yielding the expanded feature representation, which then performs the correlation self-activation with the image features of the target image.
In one example, the non-negative correlation data relat'_ij is copy-expanded along the channel dimension so that its number of channels becomes c_i, yielding the expanded feature representation extended_relat_ij, as shown in formula four, where extended_relat_ij ∈ R^(h_i × w_i × c_i), i.e., a three-dimensional matrix whose height, width and number of channels are h_i, w_i and c_i respectively.

Formula four: extended_relat_ij = expand(relat'_ij)
Step 506, performing self-activation of semantic feature correlation on the image features of the ith layer based on the extended feature representation to obtain an activated feature representation.
In the embodiment of the present application, the expanded feature representation is treated as a weight and multiplied onto the complete semantic features of the target image, realizing feature self-activation and yielding the activation feature representation. Illustratively, the expanded feature representation and the image feature are multiplied pixel by pixel. As shown in formula five, the expanded feature representation extended_relat_ij and the i-th layer image feature feat_i undergo a pixel-wise multiplication ⊙, realizing the self-activation of semantic-feature correlation and yielding the activation feature representation act_feat_ij, where act_feat_ij ∈ R^(h_i × w_i × c_i), i.e., a three-dimensional matrix whose height, width and number of channels are h_i, w_i and c_i respectively.

Formula five: act_feat_ij = extended_relat_ij ⊙ feat_i
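A sketch of steps 504 to 506 applied to a correlation map like the one produced above (tensor shapes are illustrative assumptions):

    import torch

    feat_i = torch.randn(1, 256, 32, 32)   # i-th layer image feature
    relat = torch.randn(1, 1, 32, 32)      # correlation feature representation relat_ij

    relat = relat - relat.amin()                            # formula three: non-negativity
    extended = relat.expand(-1, feat_i.shape[1], -1, -1)    # formula four: copy across channels
    act_feat = extended * feat_i                            # formula five: pixel-wise self-activation
    print(act_feat.shape)                                   # torch.Size([1, 256, 32, 32])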
In an example, as shown in fig. 6, which illustrates a self-activation module 600 provided by an exemplary embodiment of the present application: the labeling information 601 of the j-th labeled object is mapped to the i-th layer image feature 602 to obtain the feature-level labeling information 603; the i-th layer image feature 602 is cropped according to the feature-level labeling information 603 to obtain the target feature 604 of the j-th labeled object; the i-th layer image feature 602 is padded to obtain the i-th layer image padding feature 605; the target feature 604 is convolved with the image padding feature 605 to obtain the correlation feature representation 606; the correlation feature representation 606 is made non-negative to obtain the non-negative correlation data 607; the non-negative correlation data 607 is expanded to obtain the expanded feature representation 608; and the expanded feature representation 608 self-activates the i-th layer image feature 602, yielding the activation feature representation 609.
To sum up, the feature self-activation method provided by the embodiments of the present application performs an explicit correlation calculation between the semantic features of the target image and the semantic features of the labeled object serving as the support sample, which reduces the difficulty the network faces in learning and mining that correlation. The computed correlation feature representation, after a non-negativity operation, acts as a weight on the complete semantic features of the original target image to perform the correlation self-activation, so the self-activated features contain the feature correlation between the labeled object and the target image while retaining rich semantic feature information of the target image.
Referring to fig. 7, a method for acquiring density data according to an embodiment of the present application is shown, schematically illustrating how the target number is determined jointly from multiple layers of image features; steps 701 to 704 below are performed after step 506. The method comprises the following steps:
step 701, acquiring a first activation characteristic representation corresponding to the image characteristic of the ith layer.
Illustratively, n layers of image features are obtained after feature extraction on the target image, including the i-th layer image feature, where n is a positive integer greater than or equal to 2 and i is a positive integer. The i-th layer image feature is input into the self-activation module, which outputs the corresponding first activation feature representation; the module's processing is shown in steps 501 to 506 and is not repeated here.
Step 702, acquiring a second activation characteristic representation corresponding to the image characteristic of the (i + 1) th layer.
Illustratively, the (i+1)-th layer image feature is the layer immediately after the i-th layer image feature, and its spatial resolution is smaller than that of the i-th layer image feature. The (i+1)-th layer image feature is input into the self-activation module, which outputs the corresponding second activation feature representation.
At step 703, the second activation signature representation is upsampled to obtain a third activation signature representation aligned with the first activation signature representation.
In one example, taking n = 5 and i = 3, the second activation feature representation act_feat_4j corresponds to the image feature of the 4th layer, and the third activation feature representation act_feat'_4j is derived by formula six, where upsample() denotes the upsampling operation and act_feat'_4j ∈ R^(h_3 × w_3 × c_4), i.e., a three-dimensional matrix whose height, width and number of channels are h_3, w_3 and c_4 respectively.

Formula six: act_feat'_4j = upsample(act_feat_4j)
Step 704, performing feature regression processing based on the first activation feature representation and the third activation feature representation to obtain density data.
Illustratively, after the first activation feature representation and the third activation feature representation are obtained, feature regression processing may be performed on them to obtain the density data. In some embodiments, the first activation feature representation and the third activation feature representation are merged and connected to obtain a fourth activation feature representation, and the fourth activation feature representation is input into a regression network to obtain the density data, where the regression network is used to perform density prediction of the object to be counted on the fourth activation feature representation.
The feature merge-and-connect process is shown in formula seven, where $\mathrm{concate}(\cdot)$ denotes the concatenation operation on features, $\mathrm{act\_feat}_{3j}$ denotes the first activation feature representation, $\mathrm{act\_feat}'_{4j}$ denotes the third activation feature representation, and $\mathrm{cat\_feat}_j$ denotes the merged fourth activation feature representation, with $\mathrm{cat\_feat}_j \in \mathbb{R}^{h_3 \times w_3 \times (c_3 + c_4)}$, a three-dimensional matrix whose height, width, and number of channels are $h_3$, $w_3$, and $(c_3 + c_4)$, respectively.

Formula seven: $\mathrm{cat\_feat}_j = \mathrm{concate}(\mathrm{act\_feat}_{3j}, \mathrm{act\_feat}'_{4j})$
The feature regression process is shown in formula eight, where $\mathrm{regress}(\cdot)$ denotes the neural network used for feature regression; the fourth activation feature representation $\mathrm{cat\_feat}_j$ is input into the regression network, which outputs the density data $\mathrm{den\_map}_j$, with $\mathrm{den\_map}_j \in \mathbb{R}^{H \times W \times 1}$, a matrix whose height, width, and number of channels are $H$, $W$, and 1, respectively.

Formula eight: $\mathrm{den\_map}_j = \mathrm{regress}(\mathrm{cat\_feat}_j)$
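Formulas six to eight can be strung together in a few lines. The sketch below assumes bilinear interpolation for the upsampling and a small convolutional head for the regression network; both are illustrative choices, since the patent does not specify the architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c3, c4 = 1024, 2048   # channel counts of the 3rd and 4th feature layers

# Illustrative regression head mapping fused features to a 1-channel
# density map.
regress = nn.Sequential(
    nn.Conv2d(c3 + c4, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 1, kernel_size=1),
)

act_feat3 = torch.randn(1, c3, 32, 32)   # first activation feature
act_feat4 = torch.randn(1, c4, 16, 16)   # second activation feature

# Formula six: upsample the deeper feature to align it spatially.
act_feat4_up = F.interpolate(act_feat4, size=act_feat3.shape[2:],
                             mode="bilinear", align_corners=False)
# Formula seven: concatenate along the channel dimension.
cat_feat = torch.cat([act_feat3, act_feat4_up], dim=1)
# Formula eight: regress the fused feature to a density map.
den_map = regress(cat_feat)              # (1, 1, 32, 32)
```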
It should be noted that the image feature of the (i+1)th layer may also be replaced by the image feature of an (i+m)th layer, where i+m is less than or equal to n and m is an integer greater than 1; in that case, when the activation feature representation corresponding to the image feature of the (i+m)th layer is upsampled, the number of upsampling operations is m. That is, the method for acquiring density data provided in the embodiments of the present application can also be extended to non-adjacent image feature layers; the implementation with the ith and (i+1)th layers is described here only as an example and is not a limitation.
In some embodiments, when the object to be counted includes at least two labeled objects, the density data corresponding to each labeled object is obtained, and the density data corresponding to each labeled object is averaged to obtain the density data representing the density of the object to be counted.
In one example, k annotation objects are marked in the target image. For the annotation information $\mathrm{box}_j$ of the jth annotation object, the corresponding first density data $\mathrm{den\_map}_j$, $j \in \{1, 2, \ldots, k\}$, can be obtained, and the second density data $\mathrm{den\_map}_{pred}$ is then given by formula nine, where $\mathrm{mean}(\cdot)$ denotes the averaging operation over the data and $\mathrm{den\_map}_{pred} \in \mathbb{R}^{H \times W \times 1}$ is a matrix whose height, width, and number of channels are $H$, $W$, and 1, respectively.

Formula nine: $\mathrm{den\_map}_{pred} = \mathrm{mean}(\mathrm{den\_map}_1, \mathrm{den\_map}_2, \ldots, \mathrm{den\_map}_k)$
After the second density data $\mathrm{den\_map}_{pred}$ is obtained, an integration operation is performed on it to obtain the target number corresponding to the objects to be counted in the target image.
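Formula nine and the final integration reduce to an average followed by a sum. A minimal sketch, assuming the k per-annotation density maps have already been predicted:

```python
import torch

# k = 3 density maps den_map_1 .. den_map_3, stacked on a new leading
# dimension (shapes are illustrative).
den_maps = torch.rand(3, 1, 64, 64)

# Formula nine: average the per-annotation density maps.
den_map_pred = den_maps.mean(dim=0)          # (1, 64, 64)

# Integration: summing the density map yields the object count.
target_count = den_map_pred.sum().round().int().item()
print(target_count)
```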
Schematically, fig. 8 shows a flowchart of determining the target quantity provided in an exemplary embodiment of the present application, taking as an example a determination process implemented with the image features of the 3rd and 4th layers. A target image 801 passes through a feature extraction module 810, which extracts four layers of image features, including the image feature 811 of the 3rd layer and the image feature 812 of the 4th layer. These are input separately into a self-activation module 820, which outputs a first activation feature representation 821 corresponding to the image feature 811 of the 3rd layer and a second activation feature representation 822 corresponding to the image feature 812 of the 4th layer; both are input into an intermediate processing module 830. There, the second activation feature representation 822 is upsampled by an upsampling unit in the intermediate processing module 830 to obtain a third activation feature representation 823, and the third activation feature representation 823 and the first activation feature representation 821 are merged and connected by a connection unit 832 to obtain a fourth activation feature representation 824. The fourth activation feature representation 824 is input into a regression module 840 for regression processing to obtain density data 841, and the density data 841 is input into a statistics module 850 for integration to obtain the target quantity 851.
In summary, the embodiments of the present application provide a method for acquiring density data that obtains activation feature representations from multiple layers of image features and fuses the semantic feature representations across those layers, further reducing the error of the overall model in predicting the target quantity.
Referring to fig. 9, a method for training a model provided in an exemplary embodiment of the present application is shown, which schematically illustrates the training process of the target counting model. The method includes:
Step 901, acquiring sample data.
A first sample target in the sample data is marked with first labeling information, and the first sample target includes at least one second sample target with second labeling information.
Optionally, the labeling information is a bounding box framing the labeled object; the bounding box may be a rectangular box just large enough to enclose the labeled object, or a box whose shape follows the labeled object, which is not specifically limited herein.
In one example, the first labeling information may be labeling information marking the center point corresponding to the first sample target, and the second labeling information may be labeling information marking the boundary of the second sample target.
Optionally, the sample data may be uploaded by the terminal device, may also be read from a database, and may also be acquired from the internet, which is not limited herein.
Step 902, inputting the sample data and the second labeling information into the counting model to be trained, and obtaining, through prediction, the predicted density data corresponding to the first sample target.
The model structure in the counting model to be trained includes a feature extraction module, a self-activation module, an intermediate processing module, a regression module, and a statistics module, wherein the data processing process of each module is as described in the embodiments corresponding to fig. 3, 5, and 7, and is not described herein again.
Illustratively, before training, parameter initialization is performed on model parameters of a counting model to be trained, after parameter initialization, sample data is input into the counting model to be trained, and predicted density data corresponding to a first sample target is obtained through prediction, wherein the predicted density data is obtained based on current model parameters of the counting model to be trained.
Step 903, performing iterative training on the counting model to be trained based on the difference between the predicted density data and the first labeling information, to obtain a target counting model.
Illustratively, the standard density data corresponding to the first sample target in the sample image can be obtained from the first labeling information corresponding to the sample data; that is, the standard density data corresponding to the sample data is obtained based on the first labeling information corresponding to the first sample target. In one example, the first labeling information marks the center point of the first sample target in the sample image, and the standard density data may be obtained according to the distribution of the center points in the sample image.
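One common way to turn center-point annotations into standard density data, widely used in the counting literature (an assumption here, since the patent leaves the exact construction open), is to place a unit impulse at each center point and blur with a Gaussian so the map still sums to the object count:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def standard_density(points, h, w, sigma=4.0):
    """Build standard density data from center-point annotations.

    points: iterable of (row, col) center points of the sample targets;
    sigma is an illustrative kernel bandwidth.
    """
    density = np.zeros((h, w), dtype=np.float32)
    for r, c in points:
        density[int(r), int(c)] += 1.0
    # Gaussian blurring (reflect boundary) preserves the total mass, so
    # the density map still integrates to the number of annotated points.
    return gaussian_filter(density, sigma=sigma)

gt = standard_density([(10, 12), (40, 50), (30, 9)], h=64, w=64)
print(gt.sum())   # ~3.0, the number of annotated sample targets
```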
After the standard density data corresponding to the sample data is determined, the overall model can be trained based on the error between the predicted density data and the standard density data; that is, the counting model to be trained is iteratively trained based on the error loss value between the standard density data and the predicted density data to obtain the target counting model.
Illustratively, the error loss value is calculated by a preset loss function. Optionally, the preset loss function may be a regression loss function such as a Mean Square Error (MSE) loss function, a Mean Absolute Error (MAE) loss function, or a Quantile Loss function, which is not limited herein.
In one example, taking the preset loss function to be the mean square error loss function, the loss is shown in formula ten, where $\mathrm{den\_map}^{gt}_s$ denotes the s-th element of the standard density data of the sample image and $\mathrm{den\_map}^{pred}_s$ denotes the s-th element of the predicted density data of the sample image; both density data are two-dimensional matrices, $H$ denotes the height of the density data, and $W$ denotes the width of the density data.

Formula ten: $L_{MSE} = \dfrac{1}{H \times W} \sum_{s=1}^{H \times W} \left( \mathrm{den\_map}^{gt}_s - \mathrm{den\_map}^{pred}_s \right)^2$
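Formula ten is the standard mean squared error objective. The sketch below computes it and backpropagates once; the tensors are placeholders standing in for the counting model's prediction and the standard density data:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()   # formula ten: mean over all H x W elements

# Placeholders for the predicted and standard density data.
pred_density = torch.rand(1, 1, 64, 64, requires_grad=True)
std_density = torch.rand(1, 1, 64, 64)

loss = mse(pred_density, std_density)
loss.backward()      # gradients drive the iterative training
print(loss.item())
```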
When the error loss value meets the training requirement, the target counting model is determined. Illustratively, after model training is completed, the target counting model can be tested with test data, where the test data includes a third number of test targets, among which a fourth number of labeled objects are annotated with labeling information, the fourth number being smaller than the third number. The test data is input into the target counting model for prediction, and the corresponding test density data is output; the test number of test targets is then predicted from the test density data, and the prediction error of the model is determined according to the difference between the test number and the third number.
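Evaluating the trained model then reduces to comparing predicted counts (the sums of the test density maps) against the true counts. A minimal MAE sketch with illustrative numbers:

```python
import torch

def count_mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true object counts."""
    pred = torch.as_tensor(pred_counts, dtype=torch.float32)
    true = torch.as_tensor(true_counts, dtype=torch.float32)
    return (pred - true).abs().mean().item()

# Predicted counts come from summing each test image's density map;
# the true counts are the ground-truth numbers of test targets.
print(count_mae([12.4, 7.9, 101.2], [12, 8, 98]))   # ~1.23
```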
In some embodiments, the sample targets in the sample data belong to a first class and the test targets in the test data belong to a second class; optionally, the first class and the second class may be the same or different, which is not limited herein.
In summary, with the model training method provided in the embodiments of the present application, even when the target counting model obtained by training on sample data must identify, at test time, test targets that differ from the training sample targets, no time- and resource-consuming test-time adaptive training is required. This simplifies the testing process, saves a large amount of test time and computing resources, and makes the target counting model more convenient and of greater practical value in real applications.
In one example, when the target counting model provided by the embodiments of the present application is evaluated on the FSC-147 (Few-Shot Counting-147) dataset proposed together with FamNet, it still achieves MAE performance significantly better than FamNet on both the validation set and the test set, without requiring any additional network update in the test stage.
Referring to fig. 10, a block diagram of a structure of an apparatus for determining a number of objects according to an exemplary embodiment of the present application is shown, where the apparatus includes the following modules:
an obtaining module 1010, configured to obtain a target image including an object to be counted, where the object to be counted includes at least one annotation object having annotation information;
a prediction module 1020, configured to obtain a correlation feature representation based on a feature correlation between the target image and the annotation object;
the prediction module 1020 is further configured to perform self-activation on the image feature of the target image by using the correlation feature representation as an activation weight to obtain an activation feature representation;
the prediction module 1020 is further configured to perform feature regression processing on the activation feature representation to obtain density data of the object to be counted in the target image;
a determining module 1030, configured to accumulate the objects to be counted in the target image based on the density data, and determine a target number corresponding to the objects to be counted.
In some alternative embodiments, as shown in fig. 11, the prediction module 1020 includes an activation sub-module 1040, and the activation sub-module 1040 includes:
an expansion unit 1041, configured to perform data replication expansion on the correlation feature representation in a channel dimension to obtain an expansion feature representation;
an activating unit 1042, configured to perform self-activation of semantic feature correlation on the image feature by using the extended feature representation as the activation weight, so as to obtain the activation feature representation.
In some optional embodiments, the activation unit 1042 is further configured to multiply the extended feature representation and the image feature pixel by pixel to obtain the activation feature representation.
In some optional embodiments, the activating sub-module 1040 further includes:
a nonnegative processing unit 1043, configured to perform nonnegative processing on the correlation feature representation to obtain nonnegative correlation data;
the expansion unit 1041 is further configured to perform data replication expansion on the non-negative correlation data in a channel dimension to obtain the expansion feature representation.
In some optional embodiments, the prediction module 1020 further includes:
the feature extraction sub-module 1050 is configured to perform feature extraction on the target image to obtain image features of a target layer number, where the image features of the target layer number include image features of an ith layer, and i is a positive integer;
the activating sub-module 1040 further includes:
a mapping unit 1044, configured to map the annotation information corresponding to the annotation object to a position corresponding to the image feature of the ith layer, so as to obtain annotation information of a feature layer;
a correlation processing unit 1045, configured to obtain the correlation feature representation based on a feature correlation between the labeling information of the feature level and the image feature of the ith layer.
In some optional embodiments, the feature extraction sub-module 1050 is further configured to intercept, based on the labeling information of the feature level, the image feature of the ith layer to obtain a target feature of the labeled object;
the activating sub-module 1040 further includes:
a filling unit 1046, configured to perform a filling operation on the image feature of the ith layer to obtain an image filling feature of the ith layer;
the correlation processing unit 1045 is further configured to perform convolution operation on the target feature of the labeled object and the image filling feature of the ith layer to obtain the correlation feature representation.
In some optional embodiments, the prediction module 1020 further comprises a regression submodule 1060, the regression submodule 1060 further comprising:
an obtaining unit 1061, configured to obtain a first activation feature representation corresponding to an image feature of an ith layer;
the obtaining unit 1061 is further configured to obtain a second activation feature representation corresponding to the image feature of the i +1 th layer;
an upsampling unit 1062, configured to upsample the second activation feature representation to obtain a third activation feature representation aligned with the first activation feature representation;
a regression unit 1063, configured to perform feature regression processing based on the first activation feature representation and the third activation feature representation, and obtain the density data.
In some optional embodiments, the regression submodule 1060 further includes:
a merging unit 1064, configured to merge and connect the first activation feature representation and the third activation feature representation to obtain a fourth activation feature representation;
the regression unit 1063 is further configured to input the fourth activation feature representation to a regression network to obtain the density data, where the regression network is configured to perform density prediction on the fourth activation feature representation on the object to be counted.
In some optional embodiments, the object to be counted includes at least two labeled objects;
the obtaining unit 1061 is further configured to obtain first density data corresponding to the at least two labeling objects, respectively;
the determining module 1030 further includes:
a determination unit 1031 configured to determine a density mean value between the first density data as second density data;
and an integrating unit 1032, configured to perform an integrating operation on the second density data to obtain a target number corresponding to the object to be counted.
In some optional embodiments, the apparatus further comprises a training module 1070, the training module 1070 further comprising:
an obtaining unit 1071, configured to obtain the sample data, where a first sample target in the sample data is marked with first labeling information, and the first sample target includes at least one second sample target with second labeling information;
a predicting unit 1072, configured to input the sample data and the second label information into the counting model to be trained, and predict to obtain predicted density data corresponding to the first sample target;
a training unit 1073, configured to perform iterative training on the to-be-trained counting model based on a difference between the predicted density data and the first labeled information, to obtain the target counting model.
In some optional embodiments, the obtaining unit 1071 is further configured to obtain standard density data corresponding to the sample data based on the first label information corresponding to the first sample target;
the training unit 1073 is further configured to perform iterative training on the to-be-trained counting model based on an error loss value between the standard density data and the predicted density data, so as to obtain the target counting model.
To sum up, when objects in an image need to be identified and counted, the apparatus for determining the number of objects provided in the embodiments of the present application annotates a part of the objects to be counted to obtain annotated objects, self-activates the image features of the image through the correlation feature representation between the image and the annotated objects to obtain an activation feature representation, and determines the density of the objects to be counted in the image through feature regression on the activation feature representation, thereby obtaining the corresponding number of objects. That is, the image as a whole is self-activated through its correlation with the annotated objects, so that the correlation information between the complete image features and the features of the annotated objects is well captured while the complete feature information of the original image is retained, yielding better density data and ensuring the accuracy of the determined target number.
It should be noted that the apparatus for determining the number of objects provided in the foregoing embodiments is illustrated only by the division into the above functional modules; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for determining the number of objects and the method for determining the number of objects provided in the foregoing embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.
Fig. 12 shows a schematic structural diagram of a server according to an exemplary embodiment of the present application. Specifically, the structure includes the following.
The server 1200 includes a Central Processing Unit (CPU) 1201, a system Memory 1204 including a Random Access Memory (RAM) 1202 and a Read Only Memory (ROM) 1203, and a system bus 1205 connecting the system Memory 1204 and the CPU 1201. The server 1200 also includes a mass storage device 1206 for storing an operating system 1213, application programs 1214, and other program modules 1215.
The mass storage device 1206 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1206 and its associated computer-readable media provide non-volatile storage for the server 1200. That is, the mass storage device 1206 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1204 and mass storage device 1206 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1200 may also be run on a remote computer connected through a network, such as the Internet. That is, the server 1200 may be connected to the network 1212 through a network interface unit 1211 connected to the system bus 1205, or the network interface unit 1211 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs, which are stored in the memory and configured to be executed by the CPU.
Embodiments of the present application further provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method for determining the number of objects provided by the method embodiments. Optionally, the computer device may be a terminal or a server.
Embodiments of the present application further provide a computer-readable storage medium, on which at least one instruction, at least one program, a code set, or a set of instructions are stored, where the at least one instruction, the at least one program, the code set, or the set of instructions are loaded and executed by a processor to implement the method for determining the number of objects provided by the foregoing method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method for determining the number of objects according to any of the above embodiments.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method for determining a number of objects, the method comprising:
acquiring a target image comprising an object to be counted, wherein the object to be counted comprises at least one labeling object with labeling information;
obtaining a correlation characteristic representation based on the characteristic correlation condition between the target image and the labeled object;
taking the correlation characteristic representation as activation weight, and carrying out self-activation on the image characteristic of the target image to obtain activation characteristic representation;
performing characteristic regression processing on the activation characteristic representation to obtain density data of the object to be counted in the target image;
and accumulating the objects to be counted in the target image based on the density data, and determining the target number corresponding to the objects to be counted.
2. The method according to claim 1, wherein the self-activating the image feature of the target image with the correlation feature representation as an activation weight to obtain an activation feature representation comprises:
performing data replication expansion on the correlation characteristic representation on channel dimension to obtain an expansion characteristic representation;
and performing self-activation of semantic feature correlation on the image features by taking the extended feature representation as the activation weight to obtain the activation feature representation.
3. The method according to claim 2, wherein the self-activating of semantic feature correlation of the image features with the augmented feature representation as the activation weight, resulting in the activation feature representation, comprises:
and carrying out pixel-by-pixel multiplication operation on the extended feature representation and the image feature to obtain the activation feature representation.
4. The method of claim 2, wherein performing a data replication expansion on the correlation feature representation in a channel dimension to obtain an expanded feature representation comprises:
carrying out nonnegative processing on the correlation characteristic representation to obtain nonnegative correlation data;
and performing data replication and expansion on the non-negative correlation data on the channel dimension to obtain the expansion characteristic representation.
5. The method according to any one of claims 1 to 4, wherein the obtaining a correlation feature representation based on the feature correlation between the target image and the labeled object comprises:
performing feature extraction on the target image to obtain image features of a target layer number, wherein the image features of the target layer number comprise image features of an ith layer, and i is a positive integer;
mapping the labeling information corresponding to the labeling object to correspond to the image features of the ith layer to obtain labeling information of the feature layer;
and obtaining the relevance feature representation based on the feature correlation condition between the labeling information of the feature level and the image feature of the ith layer.
6. The method according to claim 5, wherein the obtaining the correlation feature representation based on the feature correlation between the labeling information of the feature level and the image feature of the i-th layer comprises:
intercepting the image features of the ith layer based on the labeling information of the feature layer to obtain the target features of the labeled object;
filling the image characteristics of the ith layer to obtain the image filling characteristics of the ith layer;
and performing convolution operation on the target feature of the labeling object and the image filling feature of the ith layer to obtain the relevance feature representation.
7. The method according to claim 4, wherein the performing a feature regression process on the activation feature representation to obtain density data of the object to be counted in the target image comprises:
acquiring a first activation characteristic representation corresponding to the image characteristic of the ith layer;
acquiring a second activation characteristic representation corresponding to the image characteristic of the (i + 1) th layer;
upsampling the second activation feature representation resulting in a third activation feature representation aligned with the first activation feature representation;
and performing feature regression processing based on the first activation feature representation and the third activation feature representation to obtain the density data.
8. The method of claim 5, wherein performing a feature regression process based on the first activation feature representation and the third activation feature representation to obtain the density data comprises:
merging and connecting the first activation characteristic representation and the third activation characteristic representation to obtain a fourth activation characteristic representation;
and inputting the fourth activation characteristic representation into a regression network to obtain the density data, wherein the regression network is used for performing density prediction on the fourth activation characteristic representation on the object to be counted.
9. The method according to any one of claims 1 to 4, wherein the objects to be counted include at least two labeled objects, and the determining the number of the objects to be counted by accumulating the objects to be counted in the target image based on the density data includes:
acquiring first density data corresponding to the at least two marked objects respectively;
determining a density mean between the first density data as second density data;
and performing integration operation on the second density data to obtain the target number corresponding to the object to be counted.
10. The method according to any one of claims 1 to 4, wherein the method is applied to a target counting model, the target counting model is obtained by training a counting model to be trained through sample data, and the training process of the target counting model comprises:
obtaining the sample data, wherein a first sample target in the sample data is marked with first marking information, and the first sample target comprises at least one second sample target with second marking information;
inputting the sample data and the second labeling information into the counting model to be trained, and predicting to obtain predicted density data corresponding to the first sample target;
and performing iterative training on the counting model to be trained based on the difference between the predicted density data and the first labeling information to obtain the target counting model.
11. The method according to claim 10, wherein iteratively training the counting model to be trained based on the difference between the predicted density data and the first labeled information to obtain the target counting model comprises:
acquiring standard density data corresponding to the sample data based on the first labeling information corresponding to the first sample target;
and performing iterative training on the counting model to be trained on the basis of the error loss value between the standard density data and the predicted density data to obtain the target counting model.
12. An apparatus for determining a number of objects, the apparatus comprising:
the device comprises an acquisition module, a counting module and a display module, wherein the acquisition module is used for acquiring a target image comprising an object to be counted, and the object to be counted comprises at least one marked object with marked information;
the prediction module is used for obtaining a correlation characteristic representation based on the characteristic correlation condition between the target image and the labeled object;
the prediction module is further configured to perform self-activation on the image features of the target image by using the correlation feature representation as an activation weight to obtain an activation feature representation;
the prediction module is further configured to perform feature regression processing on the activation feature representation to obtain density data of the object to be counted in the target image;
and the determining module is used for accumulating the objects to be counted in the target image based on the density data and determining the target number corresponding to the objects to be counted.
13. A computer device comprising a processor and a memory, said memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by said processor to implement a method of determining a number of objects according to any one of claims 1 to 11.
14. A computer-readable storage medium, having at least one program code stored therein, the program code being loaded and executed by a processor to implement the method for determining the number of objects according to any one of claims 1 to 11.
15. A computer program product comprising a computer program or instructions which, when executed by a processor, implements a method of determining the number of objects according to any one of claims 1 to 11.