
CN111767934A - Image identification method and device and electronic equipment

Info

Publication number: CN111767934A
Authority: CN (China)
Prior art keywords: image, sampling, magnification, under, recognized
Legal status: Granted
Application number: CN201911054684.2A
Other languages: Chinese (zh)
Other versions: CN111767934B (en)
Inventor: 石大虎
Current Assignee: Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee: Hangzhou Hikvision Digital Technology Co Ltd
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201911054684.2A
Publication of CN111767934A
Application granted
Publication of CN111767934B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention provide an image identification method and device and electronic equipment. The method comprises the following steps: acquiring a plurality of image features of an image to be recognized at a plurality of different down-sampling magnifications; for each of the plurality of different down-sampling magnifications, fusing the projections of the plurality of image features at the down-sampling magnification to obtain a fusion feature of the image to be recognized at the down-sampling magnification; and determining the recognition result of the image to be recognized according to the fusion features of the image to be recognized at all the down-sampling magnifications. By fusing the image features at different down-sampling magnifications, fusion features that contain both complete texture information and complete semantic information can be obtained, so that the fusion features are applicable to a variety of different image recognition tasks; that is, different image recognition tasks can be completed through the same flow, which simplifies the image recognition flow.

Description

Image identification method and device and electronic equipment
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image recognition method and apparatus, and an electronic device.
Background
In some application scenarios, images may need to be recognized by machine for practical purposes: for example, vehicles in an image may need to be identified automatically in intelligent traffic, and people in an image may need to be identified in applications such as store management. In a typical approach, the image features of the image to be recognized are repeatedly extracted by a plurality of serially connected convolutional layers until image features satisfying a condition are obtained, and the resulting deep image features are then mapped to a recognition result through multiple mappings by a plurality of serially connected convolutional layers or deconvolution layers.
However, if the extracted image features are deep image features, they may lack much texture information, and if they are shallow image features, they may lack much semantic information. Such image features are therefore difficult to apply across different image recognition tasks, and image features must be extracted in a different way for each image recognition task, which makes the image recognition process complex.
Disclosure of Invention
The embodiment of the invention aims to provide an image identification method, an image identification device and electronic equipment, so as to simplify the flow of image identification. The specific technical scheme is as follows:
in a first aspect of the embodiments of the present invention, there is provided an image recognition method, including:
acquiring a plurality of image characteristics of an image to be identified under a plurality of different down-sampling magnifications;
for each down-sampling magnification in the plurality of different down-sampling magnifications, fusing the projections of the plurality of image features under the down-sampling magnification to obtain the fusion features of the image to be identified under the down-sampling magnification;
and determining the recognition result of the image to be recognized according to the fusion features of the image to be recognized at all the down-sampling magnifications.
With reference to the first aspect, in a first possible implementation manner, the fusing, for each of the plurality of different down-sampling magnifications, the projections of the plurality of image features at the down-sampling magnification to obtain a fusion feature of the image to be recognized at the down-sampling magnification includes:
for each of the plurality of different down-sampling magnifications, repeatedly executing the following steps until the number of times of repeated execution reaches a preset number, wherein the preset number is not less than the number of the plurality of different down-sampling magnifications:
projecting the image features of the image to be identified at the down-sampling magnification which is adjacent to the down-sampling magnification to obtain projection features, wherein the adjacent down-sampling magnification is the down-sampling magnification which is adjacent to the down-sampling magnification when the down-sampling magnifications are sequenced from large to small or from small to large;
fusing the projection characteristics and the image characteristics of the image to be recognized under the down-sampling magnification to obtain new image characteristics of the image to be recognized under the down-sampling magnification;
and when the repeated execution times reach the preset times, taking the image features under each down-sampling magnification as the fusion features of the image to be identified under the down-sampling magnification.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, the fusing the projection feature and the image feature of the image to be recognized at the down-sampling magnification to obtain a new image feature of the image to be recognized at the down-sampling magnification includes:
if the number of repeated executions has not yet reached the preset number, fusing the projection features with the latest image feature of the image to be recognized at the down-sampling magnification to obtain a new image feature of the image to be recognized at the down-sampling magnification;
and if the number of repeated executions reaches the preset number, fusing the projection features, the latest image feature of the image to be recognized at the down-sampling magnification, and the initial image feature of the image to be recognized at the down-sampling magnification.
With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, the projecting the image feature of the image to be recognized at the down-sampling magnification that is adjacent to the down-sampling magnification to obtain a projection feature includes:
performing pooling with a stride greater than 1 on the image feature at the adjacent down-sampling magnification that is smaller than the current down-sampling magnification, to obtain a projection feature at the current down-sampling magnification;
and performing up-sampling on the image feature at the adjacent down-sampling magnification that is larger than the current down-sampling magnification, to obtain a projection feature at the current down-sampling magnification.
With reference to the first aspect, in a fourth possible implementation manner, the determining, according to the fusion features of the image to be recognized at all down-sampling magnifications, the recognition result of the image to be recognized includes:
inputting the fusion characteristics of the image to be recognized under all down-sampling magnifications into a pre-established recognition model for recognition to obtain a recognition result of the image to be recognized;
the recognition model comprises a plurality of sub-models among a target detection sub-model, a semantic segmentation sub-model, an instance segmentation sub-model, and a pose point estimation sub-model;
the recognition model is trained by:
acquiring a plurality of image characteristics of a sample image under the plurality of different down-sampling magnifications, wherein the sample image is marked with a true value aiming at each sub-model;
for each of the plurality of different down-sampling magnifications, fusing the projections of the plurality of image features at the down-sampling magnification to obtain a fused feature of the sample image at the down-sampling magnification;
inputting the fusion features of the sample image at all down-sampling magnifications into each sub-model of the recognition model to obtain the predicted values output by all sub-models of the recognition model;
and aiming at each sub-model, adjusting the model parameters of the sub-model in a preset training mode aiming at the sub-model according to the loss between the predicted value output by the sub-model and the true value of the sample image for the sub-model.
In a second aspect of embodiments of the present invention, there is provided an image recognition apparatus, comprising:
the characteristic extraction module is used for acquiring a plurality of image characteristics of the image to be identified under a plurality of different down-sampling magnifications;
the feature fusion module is used for fusing the projections of the image features under the down-sampling magnification aiming at each down-sampling magnification in the different down-sampling magnifications to obtain the fusion features of the image to be identified under the down-sampling magnification;
and the identification module is used for determining the identification result of the image to be identified according to the fusion features of the image to be identified at all the down-sampling magnifications.
With reference to the second aspect, in a first possible implementation manner, the feature fusion module is specifically configured to, for each downsampling magnification of the multiple different downsampling magnifications, repeatedly execute the following steps until the number of times of repeated execution reaches a preset number, where the preset number is not less than the number of the multiple different downsampling magnifications:
projecting the image features of the image to be identified at the down-sampling magnification which is adjacent to the down-sampling magnification to obtain projection features, wherein the adjacent down-sampling magnification is the down-sampling magnification which is adjacent to the down-sampling magnification when the down-sampling magnifications are sequenced from large to small or from small to large;
fusing the projection characteristics and the image characteristics of the image to be recognized under the down-sampling magnification to obtain new image characteristics of the image to be recognized under the down-sampling magnification;
and when the repeated execution times reach the preset times, taking the image features under each down-sampling magnification as the fusion features of the image to be identified under the down-sampling magnification.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the feature fusion module is specifically configured to, if the number of repeated executions has not yet reached the preset number, fuse the projection features with the latest image feature of the image to be recognized at the down-sampling magnification to obtain a new image feature of the image to be recognized at the down-sampling magnification;
and if the number of repeated executions reaches the preset number, fuse the projection features, the latest image feature of the image to be recognized at the down-sampling magnification, and the initial image feature of the image to be recognized at the down-sampling magnification.
With reference to the first possible implementation manner of the second aspect, in a third possible implementation manner, the feature fusion module is specifically configured to perform pooling processing with a step length greater than 1 on image features at a down-sampling magnification that is smaller than the down-sampling magnification among adjacent down-sampling magnifications, so as to obtain projection features at the down-sampling magnification;
and to perform up-sampling processing on the image feature at the adjacent down-sampling magnification that is larger than the current down-sampling magnification, to obtain a projection feature at the current down-sampling magnification.
With reference to the second aspect, in a fourth possible implementation manner, the identification module is specifically configured to input the fusion features of the image to be identified at all down-sampling magnifications to a pre-established identification model for identification, so as to obtain an identification result of the image to be identified;
the recognition model comprises a plurality of sub-models among a target detection sub-model, a semantic segmentation sub-model, an instance segmentation sub-model, and a pose point estimation sub-model;
the recognition model is trained by:
acquiring a plurality of image characteristics of a sample image under the plurality of different down-sampling magnifications, wherein the sample image is marked with a true value aiming at each sub-model;
for each of the plurality of different down-sampling magnifications, fusing the projections of the plurality of image features at the down-sampling magnification to obtain a fused feature of the sample image at the down-sampling magnification;
inputting the fusion features of the sample image at all down-sampling magnifications into each sub-model of the recognition model to obtain the predicted values output by all sub-models of the recognition model;
and aiming at each sub-model, adjusting the model parameters of the sub-model in a preset training mode aiming at the sub-model according to the loss between the predicted value output by the sub-model and the true value of the sample image for the sub-model.
In a third aspect of embodiments of the present invention, there is provided an electronic device, including:
a memory for storing a computer program;
a processor adapted to perform the method steps of any of the above first aspects when executing a program stored in the memory.
In a fourth aspect of embodiments of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, carries out the method steps of any one of the above-mentioned first aspects.
According to the image identification method and apparatus and the electronic device provided by the embodiments of the present invention, fusion features that contain both complete texture information and complete semantic information can be obtained by fusing the image features at different down-sampling magnifications, so that the fusion features are applicable to a variety of different image recognition tasks; that is, different image recognition tasks can be completed through the same flow, which simplifies the image recognition flow. Of course, not all of the advantages described above need to be achieved at the same time by any one product or method embodying the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of an image recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an image recognition framework according to an embodiment of the present invention;
FIG. 3a is a schematic structural diagram of a feature fusion framework according to an embodiment of the present invention;
FIG. 3b is a schematic structural diagram of another feature fusion framework according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a feature fusion method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an image recognition framework training method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of an image recognition method according to an embodiment of the present invention, which may include:
s101, acquiring a plurality of image characteristics of an image to be identified under a plurality of different down-sampling magnifications.
The image to be recognized may be different according to different application scenes, and is not particularly limited to a certain image. A plurality of different down-sampling magnifications may also be different according to different application scenarios, and for convenience of description, it is assumed that the plurality of different down-sampling magnifications are respectively: 2 times, 4 times, 8 times, 16 times and 32 times.
The scale of an image feature at an x-fold down-sampling magnification is 1/x that of the image to be recognized; for example, if the resolution of the image to be recognized is 800 × 600, the resolution of its image feature at 2-fold down-sampling is 400 × 300. The image features at the different down-sampling magnifications can be obtained by pooling the image to be recognized with a plurality of serially connected pooling layers. Taking the plurality of different down-sampling magnifications of 2, 4, 8, 16, and 32 times as an example, 5 serially connected pooling layers may successively apply pooling with a stride of 2 to the image to be recognized; the input of each pooling layer is the output of the previous pooling layer, the output of the first pooling layer is the image feature at 2-fold down-sampling, the output of the second pooling layer is the image feature at 4-fold down-sampling, and so on.
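Purely as an illustration (not part of the patent), a minimal PyTorch sketch of such a chain of serially connected pooling layers follows; the class name and the choice of max pooling are our assumptions:

```python
import torch
import torch.nn as nn

class PoolingPyramid(nn.Module):
    """Five serially connected stride-2 pooling layers: each layer consumes
    the previous layer's output, yielding features at the 2-, 4-, 8-, 16-
    and 32-fold down-sampling magnifications."""
    def __init__(self, num_levels: int = 5):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=2, stride=2) for _ in range(num_levels)])

    def forward(self, image: torch.Tensor) -> list[torch.Tensor]:
        features, x = [], image
        for pool in self.pools:
            x = pool(x)          # halves height and width
            features.append(x)   # features[0] is 2-fold, features[1] is 4-fold, ...
        return features

# An 800 x 600 input yields 400 x 300, 200 x 150, 100 x 75, 50 x 37 and 25 x 18 maps.
feats = PoolingPyramid()(torch.randn(1, 3, 600, 800))
```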
It can be understood that the lower the down-sampling magnification, the less semantic information is included in the image feature and the more texture information is included, and the higher the down-sampling magnification, the less texture information is included in the image feature and the more semantic information is included.
S102, aiming at each of a plurality of different down-sampling magnifications, fusing the projections of a plurality of image features under the down-sampling magnification to obtain the fusion features of the image to be recognized under the down-sampling magnification.
The projection of an image feature at a given down-sampling magnification refers to the image feature obtained by up-sampling or down-sampling that image feature to the scale corresponding to the down-sampling magnification. For example, assume the image feature of the image to be recognized at 2-fold down-sampling has a resolution of 400 × 300, and its image feature at 4-fold down-sampling has a resolution of 200 × 150. The projection, at 4-fold down-sampling, of the image feature at 2-fold down-sampling may be the 200 × 150 image feature obtained by down-sampling that feature by a factor of 2, or the 200 × 150 image feature obtained by first down-sampling it by a factor of 4 and then up-sampling by a factor of 2, which this embodiment does not limit.
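For illustration only, one way such projections could be written with common PyTorch operations (the function names and the choice of average pooling and bilinear interpolation are assumptions; the patent leaves the exact operators open):

```python
import torch
import torch.nn.functional as F

def project_down(feature: torch.Tensor, factor: int) -> torch.Tensor:
    # Project to a higher down-sampling magnification (here: average pooling).
    return F.avg_pool2d(feature, kernel_size=factor, stride=factor)

def project_up(feature: torch.Tensor, factor: int) -> torch.Tensor:
    # Project to a lower down-sampling magnification (here: bilinear interpolation).
    return F.interpolate(feature, scale_factor=factor, mode="bilinear",
                         align_corners=False)

f2 = torch.randn(1, 64, 300, 400)             # feature at 2-fold down-sampling
direct = project_down(f2, 2)                  # 200 x 150, the direct route
detour = project_up(project_down(f2, 4), 2)   # 4-fold down, then 2-fold up: also 200 x 150
```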
Each fusion feature obtained by fusion can include more complete texture information in the image feature under low sampling magnification, and can also include more complete semantic information in the image feature under high sampling magnification.
S103, determining the recognition result of the image to be recognized according to the fusion characteristics of the image to be recognized under all the down-sampling magnifications.
The fusion features of the image to be recognized at all the down-sampling magnifications can be input into a recognition model to obtain the recognition result output by the recognition model, where the recognition model is a model trained in advance to implement the mapping from features to recognition results. The model may be a neural network obtained by deep learning, or an algorithm model obtained by traditional machine learning; this embodiment does not limit this.
With this embodiment, fusion features that contain both complete texture information and complete semantic information can be obtained by fusing the image features at different down-sampling magnifications, so that the fusion features are applicable to a variety of different image recognition tasks; that is, different image recognition tasks can be completed through the same flow, which simplifies the image recognition flow.
In the related art, because image features cannot simultaneously contain complete texture information and complete semantic information, the image features obtained by feature extraction cannot be applied to multiple different image recognition tasks, so a different image recognition network has to be designed for each image recognition task, which makes the image recognition process complex. In view of this, an embodiment of the present invention provides a unified image recognition framework whose structure, shown in fig. 2, includes: a feature extraction model 210, a feature fusion model 220, and an image recognition model 230.
The feature extraction model 210 extracts image features of an input image at a plurality of different down-sampling magnifications and inputs them into the feature fusion model 220. The feature fusion model 220 fuses the projections of the input image features at the respective down-sampling magnifications to obtain fusion features at the respective down-sampling magnifications, and inputs these fusion features into the image recognition model 230.
The image recognition model 230 may include a target detection sub-model 231, a semantic segmentation sub-model 232, an instance segmentation sub-model 233, and a pose point estimation sub-model 234, and in other possible embodiments, some (one or more) but not all of these sub-models may be included in the image recognition model 230.
When the image recognition model 230 includes only one sub-model, the fused features at the respective down-sampling magnifications may be input to the sub-model, and the output of the sub-model may be used as the recognition result. When the image recognition model 230 includes a plurality of submodels, the fusion features at the respective down-sampling magnifications may be input to each submodel, and the output of all the submodels may be used as the recognition result.
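As a minimal sketch of this shared-feature, multi-head arrangement (the class name and head names are illustrative assumptions, not the patent's):

```python
import torch.nn as nn

class RecognitionModel(nn.Module):
    """Routes the same fused multi-scale features to every enabled sub-model,
    e.g. heads = {"detection": ..., "semantic_seg": ..., "instance_seg": ...,
    "pose": ...}; any subset of heads may be configured."""
    def __init__(self, heads: dict[str, nn.Module]):
        super().__init__()
        self.heads = nn.ModuleDict(heads)

    def forward(self, fused_features):
        # Each head receives the full list of fusion features (one per
        # down-sampling magnification) and emits its own prediction.
        return {name: head(fused_features) for name, head in self.heads.items()}
```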
With this embodiment, the obtained fusion features, being applicable to a variety of different image recognition tasks, can be fully reused: the sub-models implementing different image recognition tasks share the same feature extraction model and the same feature fusion model, so that a variety of different image recognition tasks can be realized within the same framework, effectively saving computation.
For more clearly explaining the image recognition method provided by the embodiment of the present invention, feature fusion in the image recognition method provided by the embodiment of the present invention will be described below with reference to the feature fusion model in the image recognition framework shown in fig. 2.
For convenience of description, take the plurality of different down-sampling magnifications of 2, 4, 8, 16, and 32 times as an example. The structure of the feature fusion model may include five rows and multiple columns of cells, as shown in fig. 3a, where a cell labeled S2 processes image features at the 2-fold down-sampling magnification, a cell labeled S4 processes image features at the 4-fold down-sampling magnification, a cell labeled S8 processes image features at the 8-fold down-sampling magnification, a cell labeled S16 processes image features at the 16-fold down-sampling magnification, and a cell labeled S32 processes image features at the 32-fold down-sampling magnification. In other possible embodiments, the feature fusion model may have other structures, which this embodiment does not limit.
The horizontal arrows indicate convolution operations, which may be convolution with a stride of 1 using a convolution kernel of any size. The obliquely upward arrows indicate up-sampling, for example nearest-neighbor interpolation or bilinear interpolation. The obliquely downward arrows indicate down-sampling, for example pooling with a stride of 2.
The first column of the structure may be regarded as an input, and the last column may be regarded as an output, that is, the cells in the first column may represent the initial image features of the image to be recognized at the corresponding down-sampling magnification, for example, the cells in the first row and the first column are the initial image features of the image to be recognized at 2 times down-sampling, the cells in the second row and the first column are the initial image features of the image to be recognized at 4 times down-sampling, and so on.
The other columns than the first column may be regarded as repeatedly performing feature fusion. For example, the unit in the first row and the second column is a new image feature of the image to be recognized after the image is subjected to the feature fusion once under 2 times downsampling, the unit in the first row and the third column is a new image feature of the image to be recognized after the image is subjected to the feature fusion twice under 2 times downsampling, and so on.
To ensure that the initial image features of the image to be recognized at every down-sampling magnification are fused into the output fusion features, so that the fusion features contain as much image information as possible, in this embodiment the number of repetitions of feature fusion should be not less than the number of different down-sampling magnifications; taking the above scenario as an example, it should be not less than 5.
Taking the cell in the second row and second column in the figure as an example, it is the new image feature of the image to be recognized at 4-fold down-sampling after one round of feature fusion, and it is obtained by the following steps:
step 1, carrying out down-sampling on units in a first row and a first column to obtain projection characteristics.
And 2, performing convolution processing on the units in the second row and the first column to obtain the image characteristics of the image to be processed at the sampling rate of 4 times.
And 3, performing up-sampling on the units in the third row and the first column to obtain projection characteristics.
And 4, fusing all the characteristics obtained in the steps 1-3 to obtain units in a second row and a second column.
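Purely as an illustration, the four steps above could be written as follows (element-wise addition as the fusion operator and equal channel counts at every magnification are our assumptions; the patent fixes neither):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_cell(finer: torch.Tensor, same: torch.Tensor, coarser: torch.Tensor,
              conv: nn.Module) -> torch.Tensor:
    """Computes the cell in the second row and second column from the three
    first-column cells above, beside and below it; `conv` is assumed to be a
    stride-1, same-padding convolution so the spatial size is preserved."""
    proj_down = F.max_pool2d(finer, kernel_size=2, stride=2)          # step 1
    convolved = conv(same)                                            # step 2
    proj_up = F.interpolate(coarser, scale_factor=2, mode="nearest")  # step 3
    return proj_down + convolved + proj_up                            # step 4
```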
For a clearer description, the principle of this structure is explained below with reference to fig. 4, which includes:
s401, projecting the image characteristics of the image to be recognized at the down-sampling magnification which is adjacent to the down-sampling magnification to obtain projection characteristics.
An adjacent down-sampling magnification is a down-sampling magnification adjacent to the current one when the plurality of down-sampling magnifications are sorted in descending or ascending order. For example, for 4-fold down-sampling, the adjacent down-sampling magnifications are 2-fold and 8-fold down-sampling; for 2-fold down-sampling, the adjacent down-sampling magnification is 4-fold down-sampling.
S402, fusing the projection characteristics and the image characteristics of the image to be recognized under the down-sampling magnification to obtain new image characteristics of the image to be recognized under the down-sampling magnification.
Reference may be made to the description of fig. 3a, which is not repeated here.
And S403, returning to execute S401 until the repeated execution times reach the preset times.
The preset number is not less than the number of different down-sampling magnifications, and corresponds to the number of columns in the structure other than the first column.
S404, taking the image features under each down-sampling magnification as the fusion features of the image to be identified under the down-sampling magnification.
With this embodiment, the fusion of image features at different down-sampling magnifications can be realized by a relatively simple, densely connected framework. However, as the number of repeated fusions increases, part of the information in the initial image features may be lost from the obtained fusion features. Therefore, an embodiment of the present invention provides another feature fusion architecture, shown in fig. 3b, where the dotted lines represent shortcut connections.
The feature fusion architecture of fig. 3b works on the same principle as that of fig. 3a and differs only in the operation rule of the last column. That is, in the final repetition, the feature fusion framework of fig. 3a still fuses only the projection features with the latest image feature of the image to be recognized at the down-sampling magnification, whereas the framework of fig. 3b fuses the projection features, the latest image feature of the image to be recognized at the down-sampling magnification, and the initial image feature of the image to be recognized at the down-sampling magnification. In other words, the initial image feature at the down-sampling magnification is additionally fused in, so that as much information as possible from the initial image features is retained in the output fusion features.
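A hedged sketch of the whole fusion procedure, covering both variants (all names are ours; equal channel counts and power-of-two spatial sizes are assumed so that the projections line up):

```python
import torch
import torch.nn.functional as F

def dense_fuse(initial, convs, num_iters, shortcut=True):
    # initial: image features ordered from the smallest (2-fold) to the largest
    # (32-fold) down-sampling magnification; convs: one stride-1, same-padding
    # convolution per magnification. shortcut=False reproduces fig. 3a;
    # shortcut=True adds the dotted connections of fig. 3b in the last column.
    feats = list(initial)
    for it in range(num_iters):  # num_iters >= number of magnifications
        new_feats = []
        for i, f in enumerate(feats):
            fused = convs[i](f)
            if i > 0:                # projection from the adjacent smaller magnification
                fused = fused + F.max_pool2d(feats[i - 1], 2, 2)
            if i < len(feats) - 1:   # projection from the adjacent larger magnification
                fused = fused + F.interpolate(feats[i + 1], scale_factor=2,
                                              mode="nearest")
            if shortcut and it == num_iters - 1:
                fused = fused + initial[i]   # re-fuse the initial image feature
            new_feats.append(fused)
        feats = new_feats
    return feats  # the fusion features, one per down-sampling magnification
```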
The following describes the training process of the image recognition framework of fig. 2, and can refer to fig. 5, which includes:
s501, acquiring a plurality of image characteristics of the sample image under a plurality of different down-sampling magnifications.
The sample image is labeled with a true value for each sub-model. Taking fig. 2 as an example, the sample image is labeled with 4 true values: a true value for target detection, a true value for semantic segmentation, a true value for instance segmentation, and a true value for pose point estimation.
And S502, fusing the projections of the image features under the down-sampling magnification aiming at each down-sampling magnification in a plurality of different down-sampling magnifications to obtain the fusion features of the sample image under the down-sampling magnification.
This step is the same as S102 except that the object is changed from the image to be recognized to the sample image. Reference may be made to the foregoing description of S102, which is not repeated herein.
S503, inputting the fusion features of the sample image at all the down-sampling magnifications into each sub-model of the recognition model to obtain the predicted values output by all the sub-models of the recognition model.
Taking fig. 2 as an example, 4 predicted values can be obtained: a predicted value for target detection, a predicted value for semantic segmentation, a predicted value for instance segmentation, and a predicted value for pose point estimation.
S504, aiming at each sub-model, according to the loss between the predicted value output by the sub-model and the true value of the sample image for the sub-model, the model parameters of the sub-model are adjusted through a preset training mode aiming at the sub-model.
For example, the target detection sub-model may be trained in a one-stage manner such as YOLO or SSD, or in a two-stage manner such as Fast-RCNN; the semantic segmentation sub-model and the instance segmentation sub-model may be trained with a cross-entropy loss; and the pose point estimation sub-model may be trained with an L2 loss. In other possible embodiments, training may also be performed in other manners, which this embodiment does not limit.
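As an illustration of this per-sub-model training (all function and key names are assumptions; e.g. the loss for the segmentation sub-models might be nn.CrossEntropyLoss() and for the pose point sub-model nn.MSELoss(), an L2 loss):

```python
def train_step(recognition_model, fused_features, truths, loss_fns, optimizers):
    # Every sub-model sees the same fusion features of the sample image; each
    # is trained against its own true value with its own loss and optimizer.
    predictions = recognition_model(fused_features)  # dict: head name -> prediction
    for name, pred in predictions.items():
        loss = loss_fns[name](pred, truths[name])
        optimizers[name].zero_grad()
        loss.backward(retain_graph=True)  # the shared feature graph feeds every head
        optimizers[name].step()
```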
Referring to fig. 6, fig. 6 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention, which may include:
the feature extraction module 601 is configured to obtain a plurality of image features of an image to be identified at a plurality of different down-sampling magnifications;
a feature fusion module 602, configured to fuse, for each down-sampling magnification of multiple different down-sampling magnifications, projections of multiple image features at the down-sampling magnification to obtain a fusion feature of the image to be identified at the down-sampling magnification;
the identifying module 603 is configured to determine an identifying result of the image to be identified according to the fusion features of the image to be identified at all down-sampling magnifications.
In a possible embodiment, the feature fusion module 602 is specifically configured to, for each of a plurality of different down-sampling magnifications, repeatedly perform the following steps until the number of times of repeated execution reaches a preset number, where the preset number is not less than the number of the plurality of different down-sampling magnifications:
projecting image features of an image to be identified at a down-sampling magnification adjacent to the down-sampling magnification to obtain projection features, wherein the adjacent down-sampling magnification is the down-sampling magnification adjacent to the down-sampling magnification when a plurality of down-sampling magnifications are sequenced from large to small or from small to large;
fusing the projection characteristics and the image characteristics of the image to be recognized under the down-sampling magnification to obtain new image characteristics of the image to be recognized under the down-sampling magnification;
and when the repeated execution times reach the preset times, taking the image features under each down-sampling magnification as the fusion features of the image to be identified under the down-sampling magnification.
In a possible embodiment, the feature fusion module 602 is specifically configured to, if the number of repeated executions has not yet reached the preset number, fuse the projection features with the latest image feature of the image to be recognized at the down-sampling magnification to obtain a new image feature of the image to be recognized at the down-sampling magnification;
and if the number of repeated executions reaches the preset number, fuse the projection features, the latest image feature of the image to be recognized at the down-sampling magnification, and the initial image feature of the image to be recognized at the down-sampling magnification.
In a possible embodiment, the feature fusion module 602 is specifically configured to perform pooling processing with a step length greater than 1 on image features at a down-sampling magnification smaller than the down-sampling magnification among adjacent down-sampling magnifications to obtain projection features at the down-sampling magnification;
and to perform up-sampling processing on the image feature at the adjacent down-sampling magnification that is larger than the current down-sampling magnification, to obtain a projection feature at the current down-sampling magnification.
In a possible embodiment, the recognition module 603 is specifically configured to input the fusion features of the image to be recognized at all down-sampling magnifications to a recognition model, so as to obtain a recognition result output by the recognition model, where the recognition model is a model trained in advance and used for implementing mapping from the features to the recognition result.
In one possible embodiment, the recognition model comprises a plurality of sub-models among a target detection sub-model, a semantic segmentation sub-model, an instance segmentation sub-model, and a pose point estimation sub-model;
the recognition model is trained by the following means:
acquiring a plurality of image characteristics of a sample image under a plurality of different down-sampling magnifications, wherein the sample image is marked with a true value aiming at each sub-model;
for each down-sampling magnification in a plurality of different down-sampling magnifications, fusing the projections of a plurality of image features under the down-sampling magnification to obtain the fusion features of the sample image under the down-sampling magnification;
inputting the fusion features of the sample image at all down-sampling magnifications into each sub-model of the recognition model to obtain the predicted values output by all sub-models of the recognition model;
and aiming at each sub-model, adjusting the model parameters of the sub-model in a preset training mode aiming at the sub-model according to the loss between the predicted value output by the sub-model and the true value of the sample image for the sub-model.
An embodiment of the present invention further provides an electronic device, as shown in fig. 7, including:
a memory 701 for storing a computer program;
the processor 702 is configured to implement the following steps when executing the program stored in the memory 701:
acquiring a plurality of image characteristics of an image to be identified under a plurality of different down-sampling magnifications;
for each down-sampling magnification in a plurality of different down-sampling magnifications, fusing the projections of a plurality of image features under the down-sampling magnification to obtain the fusion features of the image to be identified under the down-sampling magnification;
and determining the recognition result of the image to be recognized according to the fusion features of the image to be recognized at all the down-sampling magnifications.
In a possible embodiment, for each of a plurality of different down-sampling magnifications, fusing projections of a plurality of image features at the down-sampling magnification to obtain a fusion feature of the image to be recognized at the down-sampling magnification includes:
for each of a plurality of different down-sampling magnifications, repeatedly executing the following steps until the number of times of repeated execution reaches a preset number, wherein the preset number is not less than the number of the plurality of different down-sampling magnifications:
projecting image features of an image to be identified at a down-sampling magnification adjacent to the down-sampling magnification to obtain projection features, wherein the adjacent down-sampling magnification is the down-sampling magnification adjacent to the down-sampling magnification when a plurality of down-sampling magnifications are sequenced from large to small or from small to large;
fusing the projection characteristics and the image characteristics of the image to be recognized under the down-sampling magnification to obtain new image characteristics of the image to be recognized under the down-sampling magnification;
and when the repeated execution times reach the preset times, taking the image features under each down-sampling magnification as the fusion features of the image to be identified under the down-sampling magnification.
In a possible embodiment, fusing the projection feature and the image feature of the image to be recognized at the down-sampling magnification to obtain a new image feature of the image to be recognized at the down-sampling magnification, including:
if the number of repeated executions has not yet reached the preset number, fusing the projection features with the latest image feature of the image to be recognized at the down-sampling magnification to obtain a new image feature of the image to be recognized at the down-sampling magnification;
and if the number of repeated executions reaches the preset number, fusing the projection features, the latest image feature of the image to be recognized at the down-sampling magnification, and the initial image feature of the image to be recognized at the down-sampling magnification.
In a possible embodiment, projecting an image feature of an image to be recognized at a down-sampling magnification adjacent to the down-sampling magnification to obtain a projection feature includes:
performing pooling with a stride greater than 1 on the image feature at the adjacent down-sampling magnification that is smaller than the current down-sampling magnification, to obtain a projection feature at the current down-sampling magnification;
and performing up-sampling on the image feature at the adjacent down-sampling magnification that is larger than the current down-sampling magnification, to obtain a projection feature at the current down-sampling magnification.
In a possible embodiment, determining the recognition result of the image to be recognized according to the fusion features of the image to be recognized at all the down-sampling magnifications includes:
and inputting the fusion characteristics of the image to be recognized under all the down-sampling multiplying powers into a recognition model to obtain a recognition result output by the recognition model, wherein the recognition model is a model which is trained in advance and used for realizing the mapping from the characteristics to the recognition result.
In one possible embodiment, the recognition model comprises a plurality of sub-models among a target detection sub-model, a semantic segmentation sub-model, an instance segmentation sub-model, and a pose point estimation sub-model;
the recognition model is trained by the following means:
acquiring a plurality of image characteristics of a sample image under a plurality of different down-sampling magnifications, wherein the sample image is marked with a true value aiming at each sub-model;
for each down-sampling magnification in a plurality of different down-sampling magnifications, fusing the projections of a plurality of image features under the down-sampling magnification to obtain the fusion features of the sample image under the down-sampling magnification;
inputting the fusion features of the sample image at all down-sampling magnifications into each sub-model of the recognition model to obtain the predicted values output by all sub-models of the recognition model;
and aiming at each sub-model, adjusting the model parameters of the sub-model in a preset training mode aiming at the sub-model according to the loss between the predicted value output by the sub-model and the true value of the sample image for the sub-model.
The Memory mentioned in the above electronic device may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to perform any of the image recognition methods of the above embodiments.
In a further embodiment, the present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the image recognition methods of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the apparatus, the electronic device, the computer-readable storage medium, and the computer program product, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. An image recognition method, characterized in that the method comprises:
acquiring a plurality of image characteristics of an image to be identified under a plurality of different down-sampling magnifications;
for each down-sampling magnification in the plurality of different down-sampling magnifications, fusing the projections of the plurality of image features under the down-sampling magnification to obtain the fusion features of the image to be identified under the down-sampling magnification;
and determining the recognition result of the image to be recognized according to the fusion features of the image to be recognized at all the down-sampling magnifications.
2. The method according to claim 1, wherein the fusing, for each of the plurality of different down-sampling magnifications, the projections of the plurality of image features at the down-sampling magnification to obtain the fusion feature of the image to be recognized at the down-sampling magnification comprises:
for each of the plurality of different down-sampling magnifications, repeatedly executing the following steps until the number of times of repeated execution reaches a preset number, wherein the preset number is not less than the number of the plurality of different down-sampling magnifications:
projecting the image features of the image to be identified at the down-sampling magnification which is adjacent to the down-sampling magnification to obtain projection features, wherein the adjacent down-sampling magnification is the down-sampling magnification which is adjacent to the down-sampling magnification when the down-sampling magnifications are sequenced from large to small or from small to large;
fusing the projection characteristics and the image characteristics of the image to be recognized under the down-sampling magnification to obtain new image characteristics of the image to be recognized under the down-sampling magnification;
and when the repeated execution times reach the preset times, taking the image features under each down-sampling magnification as the fusion features of the image to be identified under the down-sampling magnification.
3. The method according to claim 2, wherein the fusing the projection feature with the image feature of the image to be recognized at the down-sampling magnification to obtain a new image feature of the image to be recognized at the down-sampling magnification comprises:
if the number of repeated executions has not yet reached the preset number, fusing the projection features with the latest image feature of the image to be recognized at the down-sampling magnification to obtain a new image feature of the image to be recognized at the down-sampling magnification;
and if the number of repeated executions reaches the preset number, fusing the projection features, the latest image feature of the image to be recognized at the down-sampling magnification, and the initial image feature of the image to be recognized at the down-sampling magnification.
4. The method of claim 2, wherein projecting the image feature of the image to be recognized at the down-sampling magnification adjacent to the down-sampling magnification to obtain the projected feature comprises:
performing pooling with a stride greater than 1 on the image feature at the adjacent down-sampling magnification that is smaller than the current down-sampling magnification, to obtain a projection feature at the current down-sampling magnification; or
performing up-sampling on the image feature at the adjacent down-sampling magnification that is larger than the current down-sampling magnification, to obtain a projection feature at the current down-sampling magnification.
5. The method according to claim 1, wherein the determining the recognition result of the image to be recognized according to the fusion features of the image to be recognized at all the down-sampling magnifications comprises:
inputting the fusion characteristics of the image to be recognized under all down-sampling magnifications into a pre-established recognition model for recognition to obtain a recognition result of the image to be recognized;
the recognition model comprises a plurality of sub-models among a target detection sub-model, a semantic segmentation sub-model, an instance segmentation sub-model, and a pose point estimation sub-model;
the recognition model is trained by:
acquiring a plurality of image characteristics of a sample image under the plurality of different down-sampling magnifications, wherein the sample image is marked with a true value aiming at each sub-model;
for each of the plurality of different down-sampling magnifications, fusing the projections of the plurality of image features at the down-sampling magnification to obtain a fused feature of the sample image at the down-sampling magnification;
inputting the fusion characteristics of the image to be recognized under all down-sampling multiplying powers into each sub-model of the recognition model to obtain the predicted values output by all sub-models of the recognition model;
and aiming at each sub-model, adjusting the model parameters of the sub-model in a preset training mode aiming at the sub-model according to the loss between the predicted value output by the sub-model and the true value of the sample image for the sub-model.
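The training procedure of claim 5 computes each task's loss against its own ground truth and adjusts only that sub-model's parameters with its own training configuration. A hypothetical PyTorch step (backbone, sub_models, optimizers, and ground_truths are assumed names; the update of the shared feature extractor is omitted):

    def train_step(backbone, sub_models, optimizers, sample_image, ground_truths):
        fused = backbone(sample_image)  # fused features at all magnifications
        for name, sub_model in sub_models.items():
            prediction = sub_model(fused)
            # Task-specific loss against this sub-model's ground truth.
            loss = sub_model.loss(prediction, ground_truths[name])
            optimizers[name].zero_grad()
            loss.backward(retain_graph=True)  # graph is reused by later sub-models
            optimizers[name].step()           # adjusts only this sub-model's parameters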
6. An image recognition apparatus, characterized in that the apparatus comprises:
a feature extraction module, configured to acquire a plurality of image features of an image to be recognized at a plurality of different down-sampling magnifications;
a feature fusion module, configured to, for each of the plurality of different down-sampling magnifications, fuse the projections of the plurality of image features at the down-sampling magnification to obtain the fused features of the image to be recognized at the down-sampling magnification;
and a recognition module, configured to determine the recognition result of the image to be recognized according to the fused features of the image to be recognized at all down-sampling magnifications.
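The three modules of claim 6 compose into a single pipeline. A stand-in sketch of that wiring, with hypothetical class and attribute names:

    class ImageRecognizer:
        def __init__(self, feature_extractor, feature_fuser, recognizer):
            self.feature_extractor = feature_extractor  # multi-magnification features
            self.feature_fuser = feature_fuser          # cross-scale fusion (claims 7-9)
            self.recognizer = recognizer                # recognition model (claim 10)

        def recognize(self, image):
            features = self.feature_extractor(image)    # magnification -> feature map
            fused = self.feature_fuser(features)
            return self.recognizer(fused)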
7. The apparatus according to claim 6, wherein the feature fusion module is specifically configured to, for each of the plurality of different down-sampling magnifications, repeatedly execute the following steps until the number of repetitions reaches a preset number, wherein the preset number is not less than the number of the plurality of different down-sampling magnifications:
projecting the image features of the image to be recognized at the down-sampling magnifications adjacent to the down-sampling magnification to obtain projection features, wherein an adjacent down-sampling magnification is one that neighbours the down-sampling magnification when the plurality of down-sampling magnifications are ordered from largest to smallest or from smallest to largest;
fusing the projection features with the image features of the image to be recognized at the down-sampling magnification to obtain new image features of the image to be recognized at the down-sampling magnification;
and when the number of repetitions reaches the preset number, taking the image features at each down-sampling magnification as the fused features of the image to be recognized at that down-sampling magnification.
8. The apparatus according to claim 7, wherein the feature fusion module is specifically configured to: if the current repetition is not the last of the preset number of repetitions, fuse the projection features with the latest image features of the image to be recognized at the down-sampling magnification to obtain new image features of the image to be recognized at the down-sampling magnification;
and if the current repetition is the last of the preset number of repetitions, fuse the projection features, the latest image features of the image to be recognized at the down-sampling magnification, and the initial image features of the image to be recognized at the down-sampling magnification.
9. The apparatus according to claim 7, wherein the feature fusion module is specifically configured to perform pooling with a stride greater than 1 on the image features at the adjacent down-sampling magnification that is smaller than the down-sampling magnification, to obtain the projection features at the down-sampling magnification; or,
perform up-sampling on the image features at the adjacent down-sampling magnification that is larger than the down-sampling magnification, to obtain the projection features at the down-sampling magnification.
10. The apparatus according to claim 6, wherein the recognition module is specifically configured to input the fused features of the image to be recognized at all down-sampling magnifications into a pre-established recognition model to obtain the recognition result of the image to be recognized;
wherein the recognition model comprises a plurality of sub-models from among a target detection sub-model, a semantic segmentation sub-model, an instance segmentation sub-model, and a pose keypoint estimation sub-model;
and the recognition model is trained by:
acquiring a plurality of image features of a sample image at the plurality of different down-sampling magnifications, wherein the sample image is labelled with a ground-truth value for each sub-model;
for each of the plurality of different down-sampling magnifications, fusing the projections of the plurality of image features at the down-sampling magnification to obtain the fused features of the sample image at the down-sampling magnification;
inputting the fused features of the sample image at all down-sampling magnifications into each sub-model of the recognition model to obtain the predicted values output by all sub-models of the recognition model;
and, for each sub-model, adjusting the model parameters of that sub-model, using the preset training mode for that sub-model, according to the loss between the predicted value output by that sub-model and the ground-truth value of the sample image for that sub-model.
11. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 5 when executing a program stored in the memory.
12. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1 to 5.
CN201911054684.2A 2019-10-31 2019-10-31 Image recognition method and device and electronic equipment Active CN111767934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911054684.2A CN111767934B (en) 2019-10-31 2019-10-31 Image recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111767934A true CN111767934A (en) 2020-10-13
CN111767934B (en) 2023-11-03

Family

ID=72718978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911054684.2A Active CN111767934B (en) 2019-10-31 2019-10-31 Image recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111767934B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190156144A1 (en) * 2017-02-23 2019-05-23 Beijing Sensetime Technology Development Co., Ltd Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
CN108460403A (en) * 2018-01-23 2018-08-28 上海交通大学 The object detection method and system of multi-scale feature fusion in a kind of image
CN109872364A (en) * 2019-01-28 2019-06-11 腾讯科技(深圳)有限公司 Image-region localization method, device, storage medium and medical image processing equipment
CN109948524A (en) * 2019-03-18 2019-06-28 北京航空航天大学 A kind of vehicular traffic density estimation method based on space base monitoring

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAO YANG ET AL: "Semantic segmentation via highly fused convolutional network with multiple soft cost functions", Cognitive Systems Research *
ZHAO ZHENBING; LI SHENGLI; QI YINCHENG; ZHAI YONGJIE; ZHANG KE: "A semantic segmentation method for aerial images of power transmission lines based on an improved FCN", China Sciencepaper, no. 14 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114296629A (en) * 2021-12-28 2022-04-08 五邑大学 Signal acquisition method and system

Also Published As

Publication number Publication date
CN111767934B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
JP6902611B2 (en) Object detection methods, neural network training methods, equipment and electronics
JP6755849B2 (en) Pruning based on the class of artificial neural networks
JP6384065B2 (en) Information processing apparatus, learning method, and program
CN107273936B (en) GAN image processing method and system
CN111080304B (en) Credible relationship identification method, device and equipment
CN109492674B (en) Generation method and device of SSD (solid State disk) framework for target detection
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
CN112862005B (en) Video classification method, device, electronic equipment and storage medium
CN114580263A (en) Knowledge graph-based information system fault prediction method and related equipment
CN112464809A (en) Face key point detection method and device, electronic equipment and storage medium
CN111522968B (en) Knowledge graph fusion method and device
CN114565196B (en) Multi-event trend prejudging method, device, equipment and medium based on government affair hotline
CN115062779A (en) Event prediction method and device based on dynamic knowledge graph
CN110119736B (en) License plate position identification method and device and electronic equipment
CN113825165B (en) 5G slice network congestion early warning method and device based on time diagram network
CN111767934A (en) Image identification method and device and electronic equipment
CN117215728A (en) Agent model-based simulation method and device and electronic equipment
CN115567371B (en) Abnormity detection method, device, equipment and readable storage medium
CN116524296A (en) Training method and device of equipment defect detection model and equipment defect detection method
CN114397306B (en) Power grid grading ring hypercomplex category defect multi-stage model joint detection method
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN110705889A (en) Enterprise screening method, device, equipment and storage medium
CN113239272B (en) Intention prediction method and intention prediction device of network management and control system
US11676050B2 (en) Systems and methods for neighbor frequency aggregation of parametric probability distributions with decision trees using leaf nodes
CN112801045A (en) Text region detection method, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant