WO2023060575A1

WO2023060575A1 - Image recognition method and apparatus, and electronic device and storage medium

Info

Publication number: WO2023060575A1
Application number: PCT/CN2021/124169
Authority: WO
Inventors: 许震宇; 张锲石; 程俊; 康宇航; 任子良
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2023-04-20

Abstract

An image recognition method, which is applicable to the technical field of image processing and comprises: segmenting into a preset number of block images an image to be processed, wherein the image to be processed comprises an image to be recognized and a reference image; performing feature extraction processing on each block image to obtain block feature information corresponding to each block image (S102); determining a similarity weight between the image to be recognized and the reference image according to position information of each block image and the corresponding block feature information (S103); determining a target similarity between the image to be recognized and the reference image according to each piece of block feature information and the similarity weight (S104); and determining, according to the target similarity, a recognition result of the image to be recognized (S105). The present invention can improve the robustness and accuracy of image recognition.

Description

Image recognition method, device, electronic device and storage medium

technical field

The present application belongs to the technical field of image processing, and in particular relates to an image recognition method, device, electronic equipment and storage medium.

Background technique

As an important field of artificial intelligence, image recognition is widely used in unmanned systems. Generally, the image recognition process includes: collecting an image corresponding to an area currently to be recognized; and determining an image recognition result according to the similarity between the collected image and the pre-stored image in the image library. However, due to the impact of shooting conditions, the robustness and accuracy of current image recognition are low.

technical problem

Embodiments of the present application provide an image recognition method, device, electronic device, and storage medium, so as to solve the problem of low robustness and low accuracy of image recognition in the prior art.

technical solution

In order to solve the above-mentioned technical problems, the technical solution adopted in the embodiment of the present application is:

The first aspect of the embodiments of the present application provides an image recognition method, including:

Dividing the image to be processed into a preset number of block images; wherein, the image to be processed includes an image to be identified and a reference image, the block image corresponding to the image to be identified is a block image to be identified, the The block image corresponding to the reference image is a reference block image; each of the block images to be identified has a one-to-one corresponding position information of the reference block image;

performing feature extraction processing on each of the block images respectively, to obtain block feature information corresponding to each of the block images;

determining a similarity weight between the image to be identified and the reference image according to the position information of each of the block images and the corresponding block feature information;

determining the target similarity between the image to be recognized and the reference image according to each of the block feature information and the similarity weight;

A recognition result of the image to be recognized is determined according to the target similarity.

The second aspect of the embodiments of the present application provides an image recognition device, including:

A segmentation unit, configured to divide the image to be processed into a preset number of block images; wherein, the image to be processed includes an image to be recognized and a reference image, and the block image corresponding to the image to be recognized is a block image to be recognized A block image, the block image corresponding to the reference image is a reference block image; each of the block images to be identified has a one-to-one corresponding position information of the reference block image;

A feature extraction unit, configured to perform feature extraction processing on each of the block images to obtain block feature information corresponding to each of the block images;

A similarity weight determining unit, configured to determine a similarity weight between the image to be recognized and the reference image according to the position information of each of the block images and the corresponding block feature information;

A target similarity determining unit, configured to determine the target similarity between the image to be recognized and the reference image according to each of the block feature information and the similarity weight;

The recognition result determination unit is configured to determine the recognition result of the image to be recognized according to the target similarity.

The third aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor. When the processor executes the computer program When, the electronic device is made to implement the steps of the image recognition method.

The fourth aspect of the embodiments of the present application provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the electronic device realizes the image recognition method steps.

A fifth aspect of the embodiments of the present application provides a computer program product, which, when the computer program product is run on an electronic device, causes the electronic device to execute the image recognition method described in the first aspect above.

Beneficial effect

Since the image to be processed can be segmented first to obtain the block image and then the feature extraction is performed, the detailed feature information of the image can be extracted more accurately, so that the subsequent similarity calculation can be performed more accurately according to the feature information of each block, thereby improving The accuracy of image recognition. Moreover, since the block feature information of the block images located at different positions in the image to be processed can represent the features of the image to be processed under different viewing angles, the similarity obtained based on the position information and block feature information of each block image is determined The weight can reflect the similarity information of the image to be recognized and the reference image under different viewing angles. Therefore, the target similarity obtained based on the similarity weight is the similarity obtained by considering the robustness of viewing angle changes. According to the target similarity The obtained recognition result is an accurate recognition result obtained by overcoming the influence of the angle of view change caused by the shooting angle of the camera, so that the robustness and accuracy of image recognition can be improved.

Description of drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the following briefly introduces the drawings that are required in the embodiments or the description of the prior art.

FIG. 1 is a schematic diagram of an implementation flow of an image recognition method provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of a similarity weight construction process provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a matching process between an image to be recognized and a reference image provided in an embodiment of the present application;

FIG. 4 is a schematic flow diagram of training a target model based on a triplet method provided by an embodiment of the present application;

Fig. 5 is a schematic diagram of an image recognition device provided by an embodiment of the present application;

Fig. 6 is a schematic diagram of an electronic device provided by an embodiment of the present application.

Embodiment of this application

In the following description, specific details such as specific system structures and technologies are presented for the purpose of illustration rather than limitation, so as to thoroughly understand the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

Embodiment one:

Please refer to FIG. 1 . FIG. 1 is a schematic flowchart of an image recognition method provided by an embodiment of the present application. The execution subject of the image recognition method in this embodiment is an electronic device, which includes but is not limited to smart phones, tablet computers, desktop computers, servers and other computing devices. The image recognition methods shown in Figure 1 include:

In S101, the image to be processed is divided into a preset number of block images; wherein, the image to be processed includes an image to be identified and a reference image, and the block image corresponding to the image to be identified is a block to be identified For an image, the block image corresponding to the reference image is a reference block image; each block image to be identified has a one-to-one correspondence of position information with the reference block image.

In the embodiment of the present application, the image to be recognized is an unknown image that needs to be recognized currently (that is, the entity information contained in the image is unknown), and the reference image is a known image pre-stored in the image library (that is, the entity information contained in the image is unknown). A known). For the recognition process of the current image to be recognized, it can be summarized as follows: by comparing the features of the image to be recognized with the reference image, determine whether the image to be recognized matches the reference image, and if they match, use the known entity information in the reference image as the identification information of the image to be identified. In the embodiment of the present application, an image to be recognized and a reference image may be obtained and combined into a pair of images to be processed. The image to be recognized can be received from the shooting device, and the reference image can be obtained from a preset image library.

In the embodiment of the present application, the preset number is a value set for extraction, for example, the preset number may be 4. In one embodiment, the above-mentioned division is specifically equal division. That is, for each image to be processed, it is equally divided according to a preset number to obtain a preset number of equally divided block images corresponding to the image to be processed.

In one embodiment, the image to be recognized is equally divided into a preset number of parts to obtain a preset number of block images corresponding to the image to be recognized, and the block image corresponding to the image to be recognized is called a block image to be recognized block image. The reference image is equally divided into a preset number of parts to obtain a preset number of block images corresponding to the reference image, and the block image corresponding to the reference image is called a reference block image. For each block image to be recognized in the image to be recognized, there is a reference block image with one-to-one correspondence of position information in the reference image.

In S102, feature extraction processing is performed on each of the block images to obtain block feature information corresponding to each of the block images.

In the embodiment of the present application, after dividing and obtaining each segmented image, the feature extraction process is performed on the segmented image to be identified to obtain the feature information corresponding to the segmented image to be identified, and the feature information is referred to as the segment to be identified characteristic information. Feature extraction processing is performed on a reference block image in the block image to obtain feature information corresponding to the reference block image, and the feature information is referred to as reference block feature information.

In one embodiment, the feature extraction processing of the segmented image can be realized by a pre-trained neural network model.

In one embodiment, each segmented image obtained by segmentation may be subjected to feature extraction processing one by one in sequence. In another embodiment, by enabling multi-threading, the feature extraction process can be performed on more than one (or even all the block images) at a time, thereby improving the efficiency of the feature extraction process.

In S103, according to the position information of each of the block images and the corresponding block feature information, determine the similarity weight between the image to be recognized and the reference image.

In the embodiment of the present application, the location information of the block image refers to information about the location of the block image in the original image to be processed. In one embodiment, the location information can be represented by (j,k), where j indicates that the block image is located in the jth row of the image to be processed, and k indicates that the block image is located in the kth column of the image to be processed middle. In another embodiment, the block images obtained by segmenting the image to be processed are numbered sequentially from left to right and then from top to bottom, and the number information i is used as the position information of the block images. Exemplarily, the image to be processed is divided into four divided images, and the position information of each divided image is represented by 1, 2, 3, and 4 in sequence.

For each block image obtained by segmentation, according to the position information of the block image in the original image to be processed and the block feature information corresponding to the block image, obtain the block image to be identified and the reference block image in the block image The similarity between each block image, the similarity between each block image to be recognized, and the similarity between each reference image, and these similarities are combined to obtain the similarity relationship between each block image. According to the similarity relationship, the similarity weight between the image to be recognized and the reference image can be determined, and the similarity weight can represent the similarity relation information between the block images.

In S104, according to each of the block feature information and the similarity weight, determine the target similarity between the image to be recognized and the reference image.

The block feature information includes the feature information of the block to be identified and the feature information of the reference block, and the similarity calculation between the image to be identified and the reference image is performed by combining the feature information of each block to be identified, the feature information of each reference block and the similarity weight , to get the target similarity.

In S105, a recognition result of the image to be recognized is determined according to the target similarity.

After the target similarity is obtained, the recognition result of the image to be recognized is determined according to the value of the target similarity. In one embodiment, if the target similarity is less than a preset similarity threshold, it is determined that the recognition result of the image to be recognized is a recognition failure. In another embodiment, if the target similarity is greater than or equal to the preset similarity threshold, it is determined that the recognition result of the image to be recognized is a successful recognition, and according to the known entity information in the reference image corresponding to the image to be recognized , to determine the identification information of the image to be identified. For example, if the entity information of the reference image corresponding to the image to be recognized is "puppy" (that is, it means that the reference image is an image corresponding to a puppy), then the identification information of the image to be recognized is "puppy"; The entity information of the reference image is "place A" (that is, the reference image is an image corresponding to place A), and the identification information of the image to be recognized is "place A".

In the embodiment of the present application, since the image to be processed can be segmented to obtain the block image first, and then the feature extraction is performed, the detailed feature information of the image can be extracted more accurately, so that the follow-up can be performed more accurately according to the feature information of each block. Similarity calculation, thereby improving the accuracy of image recognition. Moreover, since the block feature information of the block images located at different positions in the image to be processed can represent the features of the image to be processed under different viewing angles, the similarity obtained based on the position information and block feature information of each block image is determined The weight can reflect the similarity information of the image to be recognized and the reference image under different viewing angles. Therefore, the target similarity obtained based on the similarity weight is the similarity obtained by considering the robustness of viewing angle changes. According to the target similarity The obtained recognition result is an accurate recognition result obtained by overcoming the influence of the angle of view change caused by the shooting angle of the camera, so that the robustness and accuracy of image recognition can be improved.

Optionally, the above step S102 includes:

For each of the block images, input the block images into a trained convolutional neural network for processing to obtain initial feature information corresponding to the block images;

Perform dimensionality reduction processing on the initial feature information to obtain block feature information corresponding to the block image.

In the embodiment of the present application, feature extraction processing is performed on the segmented image through a trained convolutional neural network. Each block image is sequentially input into the trained convolutional neural network, and the initial feature information corresponding to the block image is obtained through the convolution processing of each layer of the convolutional neural network.

In one embodiment, the trained convolutional neural network may be AlexNet. First, for the block image, the image size is scaled to 3×224×224 to obtain an image I _m in RGB (Red, Green, Blue, red, green, blue) color mode, and the image _Im is input to AlexNet for feature extraction deal with. Since the feature location recognition effect of the fifth convolutional layer (Conv_5) output of AlexNet is the best, the feature with a dimension of 256×6×6 output by Conv_5 is extracted as the initial feature information to improve the accuracy of feature extraction.

Since the dimensionality of the initial feature information initially extracted by the convolutional neural network is usually large, subsequent similarity calculations and comparisons directly based on the initial feature information will consume a large amount of computing resources and storage memory. Therefore, after the initial feature information is obtained, dimensionality reduction processing is performed on the initial feature information to obtain feature information with a lower dimension as the block feature information of the block image. This dimensionality reduction process can be implemented by operations such as pooling or downsampling.

In the embodiment of the present application, since the feature of the block image can be accurately extracted through the convolutional neural network, and the block feature information of the block image can be obtained through dimensionality reduction processing, it is possible to ensure the accuracy of feature extraction while , reduce system resource consumption for subsequent calculations, and improve image recognition efficiency. Moreover, since the block feature information after dimensionality reduction processing has certain robustness to the feature information with too high dimensionality, it can still Based on this robustness, the success rate of image recognition can be improved.

Optionally, performing dimensionality reduction processing on the initial feature information to obtain block feature information corresponding to the block image.

Inputting the initial feature information into an adaptive average pooling layer and/or a trained dimensionality reduction autoencoder for dimensionality reduction processing to obtain block feature information corresponding to the block image.

Generally, the pooling operation can reduce the dimensionality of feature information through simple calculations, while ensuring the accuracy of subsequent recognition. The average pooling operation captures all the details of the entire scene of the image. Compared with the maximum pooling operation, it can better extract the features of the image scene. In the embodiment of the present application, the Adaptive Average Pooling (AAP, Adaptive Average Pooling) layer is a processing layer that adaptively performs an average pooling operation on the input feature information of a specified size. Therefore, the initial feature information is input The adaptive average pooling layer can reduce the dimensionality of the initial feature information.

The dimensionality reduction autoencoder in the embodiment of the present application is specifically an autoencoder for reducing the dimension of feature information. Autoencoder (autoencoder, AE) is a type of artificial neural network (Artificial Neural Networks, ANNs) used in semi-supervised learning and unsupervised learning. Its function is to perform representation learning on input information by using input information as a learning target. (representation learning). The self-encoder consists of two parts: an encoder (Encoder) and a decoder (Decoder). In the autoencoder, the encoder can encode the input feature information to achieve information compression and reduce the dimension of the feature information; and the decoder in the autoencoder can restore the feature information compressed by the encoder. When performing image recognition processing, specifically, the encoder part of the dimensionality reduction autoencoder is used to realize dimensionality reduction processing of feature information, so that the feature information based on the output of the encoder can achieve more robust and efficient Image Identification.

In one embodiment, the initial feature information of the block image can be input into the adaptive average pooling layer for processing, and then further input into the dimensionality reduction autoencoder for processing, so as to obtain the block feature information corresponding to the block image with a lower dimension . Exemplarily, the initial feature information with a feature dimension of 256×6×6 is input into the adaptive average pooling layer for average pooling processing, and expanded into a one-dimensional vector by the expansion function Flatten( ), and the dimension is (1× 4096) pooling features. Afterwards, the pooled feature is input into the dimensionality reduction autoencoder for processing, and the feature information with a dimension of 1×256 output by the encoder in the dimensionality reduction autoencoder is obtained as the block feature information of the block image.

In the embodiment of the present application, through the adaptive average pooling layer and/or the dimensionality reduction autoencoder, the dimension of the initial feature information can be efficiently and accurately reduced, and the efficiency and accuracy of subsequent image recognition can be improved.

Optionally, before inputting the block image into a trained convolutional neural network for processing for each block image to obtain initial feature information corresponding to the block image, the method further includes:

Obtaining preset sample feature information, and inputting the preset sample feature information into a dimensionality reduction autoencoder to be trained;

Adjusting the parameters of the dimensionality reduction autoencoder to be trained, so that the mean square error between the preset sample feature information and the decoding feature information output by the decoder of the dimensionality reduction autoencoder is less than a preset threshold, Get the trained dimensionality reduction encoder.

In the embodiment of the present application, when processing the initial feature information, it is specifically necessary to use a trained dimensionality reduction autoencoder to reduce the feature dimension. The trained dimensionality reduction autoencoder is the dimensionality reduction autoencoder to be trained. obtained after training.

Specifically, a preset number of preset sample feature information may be obtained, and the preset sample feature information is input into the dimensionality reduction autoencoder to be trained, and the training of the dimensionality reduction autoencoder to be trained is started. Wherein, the preset sample characteristic information is the characteristic information of the block sample image obtained by performing segmentation and feature extraction processing on the sample image in advance.

After inputting the feature information of preset samples into the dimensionality reduction autoencoder to be trained, the decoding feature information output by the decoder part of the dimensionality reduction autoencoder is obtained. The decoded feature information is feature information obtained by decoding and restoring the encoded feature information output by the encoder of the dimensionality-reduced self-encoder. Through the preset mean square error loss function, calculate the mean square error between the input preset sample feature information and the decoded feature information restored by the decoder, and reversely adjust the dimensionality reduction autoencoder to be trained according to the calculation result parameters, until the mean square error between the preset sample feature information and the feature information output by the decoder of the dimensionality reduction autoencoder is less than the preset threshold, stop the current training and obtain the trained dimensionality reduction autoencoder.

In the embodiment of the present application, the dimensionality reduction autoencoder is trained in advance, and an accurate trained dimensionality reduction autoencoder can be obtained, so that the subsequent dimensionality reduction processing can be accurately performed on the initial feature information according to the dimensionality reduction autoencoder, Improve the accuracy of image recognition.

Optionally, the above step S103 includes:

According to the position information of each of the block images and the corresponding block feature information, a paired similarity feature vector and an unpaired similarity feature vector are determined; wherein, the paired similarity feature vector includes position information relative The corresponding similarity information between the block image to be identified and the reference block image; the unpaired similarity feature vector includes similarity information between the block images whose position information does not correspond;

The similarity weight is determined according to the paired similarity feature vector and the non-paired similarity feature vector.

In the embodiment of the present application, since the image to be recognized and the reference image are divided into a preset number of block images according to the same segmentation method, for each block image to be recognized of the image to be recognized, there are The reference tiled image for . For example, assume that the image to be recognized I ₁ is divided into four block images to be recognized I ₁₁ , I ₁₂ , I ₁₃ , and I ₁₄ , and the reference image I ₂ is divided into four corresponding reference block images I ₂₁ , I ₂₂ _. _{_} _{_} _{_} _{_} ₂₂ , the reference block image corresponding to the position information corresponding to the block image I ₁₃ to be identified is I ₂₃ , and the reference block image corresponding to the position information corresponding to the block image I ₁₄ to be identified is I ₂₄ . Among them, the first subscript of I is used to distinguish which image to be processed the current block image belongs to; the second subscript of I is used to distinguish which block image the current block image is the image to be processed , which may reflect the location information of the block image in the image to be processed.

For each group of block images to be identified and reference block images corresponding to the same position information above, the similarity calculation is performed according to the block feature information of the two, and the corresponding block images corresponding to each group of position information can be obtained. Each pairwise similarity. Each pairwise similarity is combined to obtain a pairwise similarity feature vector.

In addition to the combination of block images corresponding to the above-mentioned position information, the similarity between block images whose positions do not correspond is called non-pairwise similarity. Specifically, the non-pairwise similarity includes: each block image to be identified is different from the reference block image whose position information is different (for example, the above second block image to be identified and the reference block image with different subscripts) The similarity between the block images to be recognized in each different position (such as the different block images to be recognized obtained by segmenting the same image to be recognized), and the reference block images in each different position (such as the same The similarity between different reference block images obtained by segmenting the reference image). These unpaired similarities can be combined to obtain unpaired similarity feature vectors.

In one embodiment, for the block feature information x and block feature information y corresponding to two different block images, the similarity calculation of the two block images can be performed by the preset formula (1), This formula (1) is as follows:

Where C(x, y) represents the normalized cosine similarity calculated from the block feature information x of the first block image and the block feature information y of the second block image. || || indicates the norm operation, * indicates the multiplication sign, and the value range of the similarity obtained by solving the above formula (1) is [0,1]. Through this similarity calculation method, the value of the similarity between block images can be normalized to a range from 0 to 1, so as to facilitate subsequent calculations.

After determining the paired similarity feature vector and the unpaired similarity feature vector, the paired similarity feature vector and the unpaired similarity feature vector are combined, and the obtained vector is called the similarity relationship vector. Since the similarity relationship vector contains the similarity between the block images of different positional relationships of the image to be processed, the similarity relationship vector can represent the similarity between the block images corresponding to the image to be recognized and the reference image relation.

After the similarity relationship vector is determined, a weighting operation is performed according to the values of each element of the relative relationship vector to obtain the similarity weight between the image to be recognized and the reference image. The similarity weight can accurately represent the impact of visual changes between the image to be recognized and the reference image on the image similarity.

In the embodiment of the present application, since the similarity relationship between the position-based block images can be accurately expressed through the paired similarity feature vector and the unpaired similarity feature vector, it can be solved based on the similarity relationship to obtain the applicable Based on the similarity weight of the image similarity calculation, the accuracy of the final similarity calculation is improved, thereby improving the accuracy of image recognition.

Optionally, the determining the similarity weight according to the paired similarity feature vector and the non-paired similarity feature vector includes:

The similarity weight is determined according to the paired similarity feature vector, the non-paired similarity feature vector, and a preset weight autoencoder.

In the embodiment of the present application, the weighted autoencoder is an autoencoder trained in advance for determining the weights of the similarity relationship vectors.

In one embodiment, after the paired similarity feature vectors and unpaired similarity feature vectors can be concatenated and combined into a similarity relationship vector V, the similarity relationship vector V is input into the preset weight self-encoder WAE for processing to obtain The weight WAE(V) corresponding to the similarity relationship vector. According to the weight WAE(V), the similarity relationship vectors are weighted and summed and then normalized to obtain a similarity weight with a value range of [0,1]. Exemplarily, the similarity weight L can be obtained by the following formula (2):

Among them, V is the similarity relationship vector obtained by splicing, WAE(V) is the weight of the similarity relationship vector V output by the weight self-encoder, e is the natural base, w and t are preset parameter values, where t is based on the above The preset number N is determined, for example, t=N ² .

In the embodiment of the present application, the current similarity weight can be accurately determined through the preset weight autoencoder, thereby improving the accuracy of subsequent image recognition.

Optionally, the unpaired similarity feature vector includes a first similarity feature vector and a second similarity feature vector, and the first similarity feature vector includes the block image to be identified and The similarity information between the reference block images, the second similarity feature vector includes similarity information between different block images in the same image to be processed;

Correspondingly, the determining the similarity weight according to the paired similarity feature vector, the non-paired similarity feature vector and the preset weight autoencoder includes:

Determine a first similarity weight according to the paired similarity feature vector, the first similarity feature vector, and a preset first weight autoencoder;

The second similarity weight is determined according to the paired similarity feature vector, the second similarity feature vector, and a preset second weight autoencoder.

In the embodiment of the present application, the aforementioned unpaired similarity feature vectors include a first similarity feature vector and a second similarity feature vector.

The first similarity feature vector includes similarity information between the block image to be identified and the reference block image whose position information does not correspond. For each block image to be recognized of the image to be recognized, any reference block image that does not correspond to the position number of the block image to be identified can be selected from the reference image to form a group of images, and the combination of Each group of obtained images is subjected to similarity calculation to obtain each first similarity, and these non-paired first similarities can be combined to obtain a first similarity feature vector. Due to the change of the shooting angle, the same actual physical position or the image position of the actual object in the image to be recognized is different from the image position in the reference image of another shooting angle. For example, the image region corresponding to building B is the upper left of the image to be recognized, and after image segmentation, the image region corresponding to building B is in the first block image _I11 to be recognized. In the reference image, the image area corresponding to the building B is at the lower right of the reference image, and after image segmentation, the image area corresponding to the building B is in the fourth reference block image I ₂₄ . However, the paired similarity eigenvectors calculated according to the position correspondence cannot reflect the similarity between images when the viewing angle changes. The similarity feature vector can represent the similarity between the block image to be recognized and the reference block image when the viewing angle changes.

In addition, for any two block images corresponding to the same image to be processed (including the same image to be recognized or the same reference image), a group of images can be combined to obtain a group of images. By performing similarity calculation, each second similarity can be obtained, and these second similarities can be combined to obtain a second similarity feature vector. For the same actual physical location or the image area occupied by the actual object in the image to be recognized may be relatively large, thus occupying different block images. For example, the image area corresponding to building D can occupy the second block image I ₁₂ to be recognized and the third block image I ₁₃ to be recognized in the image to be recognized at the same time; similarly, the image area corresponding to building D can simultaneously A second reference block image I ₂₂ and a third reference block image I ₂₃ among the reference images are occupied. In order to avoid the rigidity of image segmentation on the feature integrity of the original image, in the embodiment of the present application, the above-mentioned second similarity feature vector can represent the similarity between different image blocks in the same image to be processed, and keep different images Continuity of features between blocks.

Exemplarily, it is assumed that the feature information of the block to be identified corresponding to the block image to be identified whose number is i in the image to be identified is x _i (for example, the feature information of the block to be identified corresponding to the above block image _I to be identified is x ₁ ), the reference block feature information corresponding to the reference block image numbered j in a reference image is y _j (for example, the reference block feature information corresponding to the above-mentioned reference block image I ₂₁ is y ₁ ), then the above The pairwise similarity feature vector V _a can be expressed by the following formula (3):

V _a ={C(x _i ,y _j )}(i=j)

Among them, C( _xi ,y _j ) is the similarity between the feature information of the block to be identified and the feature information of the reference block corresponding to the position information calculated by the formula (1), {} Represents a set operation. Formula (3) indicates that the paired similarities corresponding to the paired images combined by the block image to be identified and the reference block image with the same number are combined to obtain the paired similarity feature vector V _a .

The above-mentioned first similarity feature vector V _b can be expressed by the following formula (4):

V _b ＝{C(x _i ,y _j )}(i≠j)

Among them, C( _xi , y _j ) is the similarity between the feature information of the block to be identified and the feature information of the reference block calculated by the formula (1) and the position information does not correspond (ie i≠j), {} Represents a set operation. The formula (4) indicates that the similarities corresponding to the groups of images composed of block images to be identified and reference block images at different positions are combined to obtain the first similarity feature vector V _b .

The above-mentioned second similarity feature vector V _c can be expressed by the following formula (5):

V _c ＝{C(x _i ,x _j ),C(y _i ,y _j )}(i≠j)

Among them, C(x _i , x _j ) is the similarity between the feature information x _i and x _j of the blocks to be recognized that do not correspond to the position information (that is, i≠j) in the same image to be recognized calculated by formula (1). degree, C(y _i , y _j ) is the similarity between reference block feature information y _i and y _j at different positions (ie i≠j) in the same reference image calculated by formula (1). Merge C( _xi , x _j ) and C(y _i , y _j ) to obtain the above-mentioned second similarity feature vector V _c .

In the above formula (3) to formula (5), the value range of i and j is a positive integer between 1 and N, and N represents the above-mentioned preset number (that is, the blocks into which an image to be processed is divided number of images).

After that, the paired similarity feature vector V _a and the first similarity feature vector V _b can be combined to obtain the first similarity relationship vector V ₁ , which can be expressed by the following formula (6):

V ₁ ={V _a ,V _b }

Combine the paired similarity feature vector V _a and the second similarity feature vector V _c to obtain the second similarity relationship vector V ₂ , which can be expressed by the following formula (7):

V ₂ ={V _a ,V _c }

Correspondingly, the similarity weight may include a first similarity weight and a second similarity weight. The first similarity weight Alpha is specifically determined according to the first similarity relationship vector V ₁ formed by the combination of the paired similarity feature vector V _a and the first similarity feature vector V _b , and the preset first weight autoencoder . Exemplarily, the first similarity weight Alpha can be obtained by the following formula (8):

Among them, V ₁ is the first similarity relationship vector, WAE ₁ (V ₁ ) is the weight of the first similarity relationship vector V ₁ obtained through the first weight autoencoder processing; e is the natural base, and w ₁ is the preset Values, such as w ₁ =−10; t is a preset parameter value, wherein t is determined according to the preset number N mentioned above, and t=N ² . When the preset number is 4, that is, when each image to be processed is divided into 4 block images, t=16 in formula (8). The value range of the first similarity weight Alpha obtained by formula (8) is [0,1].

The second similarity weight Beta is specifically determined according to the second similarity relationship vector V ₂ formed by combining the paired similarity feature vector V _a and the second similarity feature vector V _c , and the preset second weight autoencoder . Exemplarily, the second similarity weight Beta can be obtained by the following formula (9):

Among them, V ₂ is the second similarity relationship vector, WAE ₂ (V ₂ ) is the weight of the second similarity relationship vector V ₂ obtained through the second weight autoencoder processing; e is the natural base, and w ₂ is the preset Values, such as w ₂ =-10; t is a preset parameter value, wherein t is determined according to the preset number N mentioned above, and t=N ² . When the preset number is 4, that is, when each image to be processed is divided into 4 block images, t=16 in formula (9). The value range of the second similarity weight Beta obtained by formula (9) is [0,1].

The first similarity weight obtained by the above method can represent the similarity between the image to be recognized and the reference image under different viewing angles, and the second similarity weight can represent the continuity between the block images of the image to be processed, The subsequent target similarity calculated based on the first similarity weight and the second similarity weight can comprehensively consider the similarity obtained from the robustness of the image shooting angle, thereby improving the robustness and accuracy of image recognition.

Optionally, the above step S104 includes:

determining an initial similarity between the image to be recognized and the reference image according to each of the block feature information;

The initial similarity is multiplied by the similarity weight to obtain a target similarity between the image to be recognized and the reference image.

After determining the similarity weight, for each block image to be identified, the reference block images corresponding to the position of the block image to be identified are respectively obtained and combined into a group of images, and the to-be-recognized image in each group of images is obtained A similarity between the block feature information to be identified of the block image and the block feature information of the corresponding reference block image is identified. Next, the average calculation is performed on the similarities of the preset number of images determined based on the block images of the image to be recognized and the reference image to obtain the initial similarity between the image to be recognized and the reference image. After that, the initial similarity is multiplied by the similarity weight, and the obtained result is used as the target similarity between the image to be recognized and the reference image.

Exemplarily, the target similarity Similarity can be obtained by the following formula (10):

Among them, N is the preset number that the image is divided into, and C( _xi , y _i ) is the difference between the feature information of the block to be identified corresponding to the same position information and the feature information of the reference block calculated by formula (1). similarity,

Alpha is the above-mentioned first similarity weight, and Beta is the above-mentioned second similarity weight.

Through the above-mentioned method, it is possible to integrate the target similarity of information from different viewing angles, and improve the accuracy of image recognition.

Optionally, the image to be recognized is a scene image to be recognized, and the reference image is a reference scene image; the recognition result of the image to be recognized includes a visual position recognition result of the scene image to be recognized.

The image recognition method in the embodiment of the present application is specifically a visual position recognition method. The essence of visual location recognition is to judge whether the two images indicate the same location. This problem can be transformed into the problem of finding the similarity of the two images. When the two images are similar enough, the similarity is close to 1. The two images Images indicate the same location. Conversely, if the similarity between two images is close to -1, then the two images indicate different locations.

At present, in unmanned systems, visual position recognition has very important application value, and can be applied to various application scenarios such as positioning, remote monitoring, and vehicle navigation. Due to the influence of appearance changes caused by lighting, weather, and seasonal changes, and changes in perspective caused by camera shooting angle changes on image recognition, most current visual position recognition methods cannot accurately perform visual position recognition when unmanned systems encounter drastic environmental changes.

In the embodiment of the present application, the image of the scene to be recognized obtained by shooting the scene to be recognized is used as the image to be recognized, and the pre-stored reference scene image is used as the reference image. Through the above steps S101 to S105, the feature extraction and Feature dimensionality reduction, improve the robustness of image appearance changes, and ensure the robustness of image visual changes through similarity weights, accurately determine the reference scene image that matches the scene image to be recognized, so that the reference scene image carries The location information of the scene image to be recognized is used as the location information corresponding to the scene image to be recognized, and the visual position recognition result of the scene image to be recognized is obtained. That is, based on the above-mentioned image recognition method described in steps S101 to S105, the robustness requirements of visual position recognition under complex environment changes can be met, and the accuracy of visual position recognition can be improved.

Exemplarily, FIG. 2 shows a schematic diagram of the construction process of the above-mentioned similarity weight, and the construction process of the similarity weight corresponds to the above-mentioned step S101 to step S104. The process of forming the similarity weight is described in detail as follows:

A1: As shown in Figure 2, after obtaining the image to be recognized I ₁ and the reference image I ₂ as images to be processed, each image to be processed is divided into four corresponding block images. Specifically, the image I ₁ to be identified is divided into four block images I ₁₁ , I ₁₂ , I ₁₃ , and I ₁₄ to be identified, and the reference image I ₂ is divided into four corresponding reference block images I ₂₁ , I ₂₂ , I ₂₃ , I ₂₄ .

A2: Next, input each segmented image into AlexNet for feature extraction processing to obtain the initial feature information corresponding to each segmented image, and then input the dimensionality reduction autoencoder for dimensionality reduction processing, from the middle of the dimensionality reduction autoencoder The block feature information obtained by dimensionality reduction is obtained in the output layer of the encoder. _The block feature information _output by the encoder of the dimensionality reduction autoencoder specifically includes block feature information x ₁ _, x ₂ _, x ₃ , x ₄ , and reference block feature information y ₁ , y 2 , y ₃ _, y ₄ corresponding to the reference block images I ₂₁ , I ₂₂ , I ₂₃ , I ₂₄ respectively.

A3: After that, determine the similarity relationship between each block image according to the feature information of each block. Specifically, for the upper part of the similarity relationship shown in Figure 2, it means: for each block image to be identified, calculate the similarity between the block image to be identified and the reference block image with the same corresponding position degree information, combining these similarity information to obtain a paired similarity feature vector V _a ; and for each block image to be identified, calculating the similarity information between the block image to be identified and the reference block image whose position is different , combine these similarity information to obtain the first similarity feature vector V _b ; after that, the paired similarity feature vector V _a and the first similarity feature vector V _b are expanded into the first Similarity relationship vector V ₁ . For the lower part of the similarity relationship shown in Figure 2, it means that for each block image to be identified, the similarity information between the block image to be identified and the reference block image with the same corresponding position is calculated, and The similarity information is combined to obtain a paired similarity feature vector V _a ; and the similarity information between each two block images to be identified and the similarity information between each two block images are calculated respectively, and these similarities The information is combined to obtain the second similarity feature vector V _c ; after that, the paired similarity feature vector V _a and the second similarity feature vector V _c are combined and expanded into the second similarity relationship vector V by the expansion function Flatten( ) ₂ .

A4: Input the first similarity relationship vector V ₁ into the first weight autoencoder in the weight autoencoder, after obtaining the weight WAE ₁ (V ₁ ) corresponding to the first similarity relationship vector, according to the above formula (8 ) to obtain the first similarity weight Alpha. Input the second similarity relationship vector V ₂ into the second weight autoencoder in the weight autoencoder, and after obtaining the weight WAE ₂ (V ₂ ) corresponding to the second similarity relationship vector, calculate according to the above formula (9) Obtain the second similarity weight Beta.

Exemplarily, FIG. 3 shows a schematic diagram of a matching process between an image to be recognized and a reference image in image recognition. For each image to be recognized Q _i , when performing image recognition, the image to be recognized Q _i and each reference image R _i are sequentially formed into a pair of images to be processed, and the image to be processed is passed through the above steps S101 to S105 The process of determining the target similarity between the image to be recognized and the reference image in the image to be processed; for example, when the preset number of block images segmented from the image to be processed is 4, the target similarity

For n images to be recognized and n reference images, n*n images to be processed can be composed, and n*n target similarities corresponding to n*n images to be processed can be combined into a similarity matrix as shown in FIG. 3 . In the target similarity of a row corresponding to each image to be recognized Q _i in the similarity matrix, the reference image corresponding to the column where the maximum target similarity is located is the best matching item of the image to be recognized, and the best matching item The corresponding entity information may be used as identification information of the image to be identified.

In one embodiment, the above-mentioned image recognition process can be realized by processing an object model. The target model can include the above-mentioned AlexNet, dimensionality reduction autoencoder and weight autoencoder. Before the above step S101, the AlexNet, the dimensionality reduction autoencoder and the weight autoencoder in the target model can be jointly trained based on the sample images to obtain the trained target model. Afterwards, after obtaining a pair of images to be processed including the image to be recognized and the reference image and performing image segmentation to obtain block images, the block images are input into the trained target model, and the initial feature information can be extracted through the trained AlexNet. And the dimensionality reduction processing is performed on the initial feature information through the trained dimensionality reduction encoder to obtain the block feature information corresponding to the block image. Afterwards, the position information and block feature information of each block image, as well as the trained weight encoder, determine the similarity weight between the image to be recognized and the reference image. Based on the similarity weight, the target similarity between the image to be recognized and the reference image can be determined, and then the recognition result of the image to be recognized can be determined.

In one embodiment, the above-mentioned target model can be trained based on triplets. The triplet-based training method specifically uses triplet sample images as training sample images input to the target model. The triplet sample image includes an anchor point image (anchor), a positive sample image (pos) and a negative sample image (neg). Among them, the anchor image is the reference target image in the image similarity calculation process; the positive sample image refers to the entity information represented by the anchor image is the same (for example, an image corresponding to the same location) but the environmental conditions (including lighting, etc.) images with different appearance conditions, shooting angles, etc.); negative sample images refer to images with different entity information represented by the anchor image (for example, images corresponding to different locations). The schematic diagram of the training process of the target model is shown in Figure 4, and the details are as follows:

B1: Screening of triplet sample images. Firstly, suitable triplet sample images are screened from the image library as training samples for the target model. Exemplarily, it is assumed that the similarity between the positive sample image and the anchor point image (referred to as the positive sample similarity for short) is S _pos , and the similarity between the negative sample image and the anchor point image (referred to as the negative sample similarity for short) is S _neg , in the filtered triplet sample images, the similarity of positive samples and the similarity of negative samples need to meet the two constraints shown in formula (11) and formula (12):

Formula (11): S _pos >S _neg

Formula (12): S _pos -margin<S _neg

Formula (11) indicates that among the triplet sample images, the similarity of the positive samples corresponding to the positive sample images with the same entity information represented by the anchor image should be greater than the similarity of the negative samples corresponding to the negative sample images. In formula (12), margin is a preset value in advance. Formula (12) indicates that in triplet sample images, the difference between positive sample similarity and negative sample similarity should not be too large. Through this constraint, we can Avoid the target model learning a simple triplet image sample (such as a positive sample similarity is 1 or almost 1, a negative sample similarity is 0 or almost 0 triplet image samples), resulting in the target model The training effect is poor or even overfitting and model collapse occur.

B2: Segment the filtered triplet sample images and input them into the target model, and perform feature extraction processing through AlexNet of the target model to obtain the initial feature information corresponding to the block images of each sample image.

B2: Each initial feature information is sequentially input into the adaptive average pooling layer AAP of the target model for average pooling processing and expanded by the expansion function Flatten( ) to obtain the pooled feature f.

B3: Each pooling feature f is input to the dimensionality reduction autoencoder for dimensionality reduction processing, the feature information output by the encoder part of the dimensionality reduction autoencoder is used as the block feature information w, and the dimensionality reduction autoencoder The feature information output by the decoder part is used as the decoding feature information z.

B4: Based on the block feature information of each block image corresponding to the positive sample image and the anchor point image and the weight autoencoder of the target model, determine the similarity weight between the positive sample image and the anchor point image, and based on the similarity degree weight, and calculate the positive sample similarity S _pos between the positive sample image and the anchor image.

B5: Based on the block feature information of each block image corresponding to the negative sample image and the anchor point image and the weight autoencoder of the target model, determine the similarity weight between the negative sample image and the anchor point image, and based on the similarity The degree weight is calculated to obtain the negative sample similarity S _neg between the negative sample image and the anchor image.

B6: Based on the pooling feature f of the input dimensionality reduction autoencoder in step B3 and the decoding feature information z output by the decoder part of the dimensionality reduction autoencoder, the dimensionality reduction encoding is obtained by calculating the preset formula (13) as a loss function The mean square error value L _Mse corresponding to the device. This formula (13) is as follows:

Among them, f _an and z _an represent the pooling feature and decoding feature information corresponding to the anchor image respectively; f _pos and z _pos represent the pooling feature and decoding feature information corresponding to the positive sample image respectively; f _neg and z _neg represent the negative The pooling feature and decoding feature information corresponding to the sample image; || || ₂ indicates the two-norm operator. When the mean square error value L _Mse calculated by formula (13) is smaller, it means that the decoding feature information output by the decoder part of the dimensionality reduction autoencoder is closer to the input pooling feature, indicating that the dimensionality reduction autoencoder's encoding The block feature information obtained by encoder encoding can more accurately represent the features of the image.

B7: Based on the positive sample similarity S _pos calculated in step B4 and the negative sample similarity S _neg calculated in step B5, the triplet network loss value L _Triplet is calculated by using the preset formula (14) as a loss function. This formula (14) is as follows:

In formula (14), M is the number of triplet sample images; ln is the natural logarithm, e is the natural base, and margin and temper are hyperparameters set in advance. When the triplet network loss value L _Triplet calculated by the formula (14) is smaller, it means that the similarity of the current positive sample is closer to 1, and the similarity of the negative sample is closer to 0, so as to realize the effective distinction between positive and negative samples.

B8: According to the mean square error value L _Mse obtained in step B6 and the triplet network loss value L _Triplet obtained in step B7, calculate the total loss function value L _total of the target model by formula (15). This formula (15) is as follows:

L _total ＝λ ₁ L _Mse +λ ₂ L _Triplet

In formula (15), λ ₁ and λ ₂ are used as hyperparameters of the target model, and their actual values are set in advance according to experience.

B9: By calculating the total loss function value L _total of the target model, iteratively update the network parameters of each neural network such as AlexNet, dimensionality reduction autoencoder, and weight autoencoder in the target model, and continue to target the target through the above steps B2 to B9 The model is trained until the final calculated total loss function value L _total is less than the preset loss value, then the training is stopped to obtain the trained target model.

Through the above training steps, the training of the target model can be accurately completed, and the trained target model can be obtained, so that the subsequent training can be based on the trained AlexNet, dimensionality reduction encoder and weight encoder in the trained target model. Realize image recognition and improve the accuracy of image recognition.

It should be understood that the sequence numbers of the steps in the above embodiments do not mean the order of execution, and the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.

Embodiment two:

Figure 5 shows a schematic structural diagram of an image recognition device provided by the embodiment of the present application. For the convenience of description, only the parts related to the embodiment of the present application are shown:

The image recognition device includes: a segmentation unit 51 , a feature extraction unit 52 , a similarity weight determination unit 53 , a target similarity determination unit 54 and a recognition result determination unit 55 . in:

A segmentation unit 51, configured to divide the image to be processed into a preset number of block images; wherein, the image to be processed includes an image to be identified and a reference image, and the block image corresponding to the image to be identified is a block image to be identified For a block image, the block image corresponding to the reference image is a reference block image; each of the block images to be identified has a one-to-one correspondence of position information with the reference block image.

The feature extraction unit 52 is configured to perform feature extraction processing on each of the block images to obtain block feature information corresponding to each of the block images.

The similarity weight determination unit 53 is configured to determine the similarity weight between the image to be recognized and the reference image according to the position information of each of the block images and the corresponding block feature information.

The target similarity determination unit 54 is configured to determine the target similarity between the image to be recognized and the reference image according to each of the block feature information and the similarity weight.

The recognition result determination unit 55 is configured to determine the recognition result of the image to be recognized according to the target similarity.

Optionally, the feature extraction unit includes:

The initial feature information determination module is used for, for each of the block images, inputting the block images into a trained convolutional neural network for processing to obtain initial feature information corresponding to the block images;

A dimensionality reduction module, configured to perform dimensionality reduction processing on the initial feature information to obtain block feature information corresponding to the block image.

Optionally, the dimensionality reduction module is specifically configured to input the initial feature information into the adaptive average pooling layer and/or the trained dimensionality reduction autoencoder for dimensionality reduction processing, to obtain the corresponding Block feature information.

Optionally, the image recognition device also includes:

A dimensionality reduction autoencoder training unit, configured to obtain preset sample feature information, and input the preset sample feature information into the dimensionality reduction autoencoder to be trained; adjust the parameters of the dimensionality reduction autoencoder to be trained, In order to make the mean square error between the preset sample feature information and the decoding feature information output by the decoder of the dimensionality reduction autoencoder smaller than a preset threshold, a trained dimensionality reduction autoencoder is obtained.

Optionally, the similarity weight determination unit includes:

A similarity feature vector determination module, configured to determine a paired similarity feature vector and an unpaired similarity feature vector according to the position information of each of the block images and the corresponding block feature information; wherein, the paired A pair of similarity feature vectors includes similarity information between the block image to be identified corresponding to the position information and the reference block image; the unpaired similarity feature vector includes the block image not corresponding to the position information Similarity information between block images;

A similarity weight determining module, configured to determine the similarity weight according to the paired similarity feature vector and the non-paired similarity feature vector.

Optionally, the similarity weight determination module is specifically configured to determine the similarity weight according to the paired similarity feature vector, the non-paired similarity feature vector, and a preset weight autoencoder.

Correspondingly, the similarity weight includes a first similarity weight and a second similarity weight, and the similarity weight determination module is specifically configured to and the preset first weight autoencoder to determine the first similarity weight; according to the paired similarity feature vector, the second similarity feature vector and the preset second weight autoencoder, determine the second similarity weight.

Optionally, the target similarity determination unit is specifically configured to determine an initial similarity between the image to be recognized and the reference image according to each of the block feature information; combine the initial similarity with the The similarity weight is multiplied to obtain the target similarity between the image to be recognized and the reference image.

It should be noted that the information interaction and execution process between the above-mentioned devices/units are based on the same concept as the method embodiment of the present application, and its specific functions and technical effects can be found in the method embodiment section. I won't repeat them here.

Embodiment three:

Fig. 6 is a schematic diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 6 , the electronic device 6 of this embodiment includes: a processor 60 , a memory 61 , and a computer program 62 stored in the memory 61 and operable on the processor 60 , such as an image recognition program. When the processor 60 executes the computer program 62, the steps in the above-mentioned various image method embodiments are implemented, for example, steps S101 to S105 shown in FIG. 1 . Alternatively, when the processor 60 executes the computer program 62, it realizes the functions of each module/unit in the above-mentioned device embodiments, such as the functions of the segmentation unit 51 to the recognition result determination unit 55 shown in FIG. 5 .

Exemplarily, the computer program 62 can be divided into one or more modules/units, and the one or more modules/units are stored in the memory 61 and executed by the processor 60 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 62 in the electronic device 6 .

The electronic device 6 may be computing devices such as desktop computers, notebooks, palmtop computers, and cloud servers. The electronic device may include, but not limited to, a processor 60 and a memory 61 . Those skilled in the art can understand that FIG. 6 is only an example of the electronic device 6, and does not constitute a limitation to the electronic device 6. It may include more or less components than those shown in the illustration, or combine some components, or different components. , for example, the electronic device may also include an input and output device, a network access device, a bus, and the like.

The so-called processor 60 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The storage 61 may be an internal storage unit of the electronic device 6 , such as a hard disk or memory of the electronic device 6 . The memory 61 can also be an external storage device of the electronic device 6, such as a plug-in hard disk equipped on the electronic device 6, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc. Further, the memory 61 may also include both an internal storage unit of the electronic device 6 and an external storage device. The memory 61 is used to store the computer program and other programs and data required by the electronic device. The memory 61 can also be used to temporarily store data that has been output or will be output.

Those skilled in the art can clearly understand that for the convenience and brevity of description, only the division of the above-mentioned functional units and modules is used for illustration. In practical applications, the above-mentioned functions can be assigned to different functional units, Completion of modules means that the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above. Each functional unit and module in the embodiment may be integrated into one processing unit, or each unit may exist separately physically, or two or more units may be integrated into one unit, and the above-mentioned integrated units may adopt hardware It can also be implemented in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing each other, and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, and details will not be repeated here.

The above-described embodiments are only used to illustrate the technical solutions of the present application, rather than to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: it can still implement the foregoing embodiments Modifications to the technical solutions described in the examples, or equivalent replacements for some of the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the various embodiments of the application, and should be included in the Within the protection scope of this application.

Claims

An image recognition method, characterized in that, comprising:

Dividing the image to be processed into a preset number of block images; wherein, the image to be processed includes an image to be identified and a reference image, the block image corresponding to the image to be identified is a block image to be identified, the The block image corresponding to the reference image is a reference block image; each of the block images to be identified has a one-to-one corresponding position information of the reference block image;

performing feature extraction processing on each of the block images respectively, to obtain block feature information corresponding to each of the block images;

determining a similarity weight between the image to be identified and the reference image according to the position information of each of the block images and the corresponding block feature information;

determining the target similarity between the image to be recognized and the reference image according to each of the block feature information and the similarity weight;

A recognition result of the image to be recognized is determined according to the target similarity.
The image recognition method according to claim 1, wherein the feature extraction process is performed on each of the block images to obtain block feature information corresponding to each of the block images, including:

For each of the block images, input the block images into a trained convolutional neural network for processing to obtain initial feature information corresponding to the block images;

Perform dimensionality reduction processing on the initial feature information to obtain block feature information corresponding to the block image.
The image recognition method according to claim 2, wherein the dimensionality reduction processing of the initial feature information to obtain the block feature information corresponding to the block image includes:

Inputting the initial feature information into an adaptive average pooling layer and/or a trained dimensionality reduction autoencoder for dimensionality reduction processing to obtain block feature information corresponding to the block image.
The image recognition method according to claim 3, wherein, for each of the block images, the block images are input into a trained convolutional neural network for processing to obtain the block images Before the corresponding initial feature information, it also includes:

Obtaining preset sample feature information, and inputting the preset sample feature information into a dimensionality reduction autoencoder to be trained;

Adjusting the parameters of the dimensionality reduction autoencoder to be trained, so that the mean square error between the preset sample feature information and the decoding feature information output by the decoder of the dimensionality reduction autoencoder is less than a preset threshold, Get the trained dimensionality-reduced autoencoder.
The image recognition method according to claim 1, wherein the distance between the image to be recognized and the reference image is determined according to the position information of each of the block images and the corresponding block feature information. The similarity weight of , including:

According to the position information of each of the block images and the corresponding block feature information, a paired similarity feature vector and an unpaired similarity feature vector are determined; wherein, the paired similarity feature vector includes position information relative The corresponding similarity information between the block image to be identified and the reference block image; the unpaired similarity feature vector includes similarity information between the block images whose position information does not correspond;

The similarity weight is determined according to the paired similarity feature vector and the non-paired similarity feature vector.
The image recognition method according to claim 5, wherein said determining said similarity weight according to said paired similarity feature vector and said non-paired similarity feature vector comprises:

The similarity weight is determined according to the paired similarity feature vector, the non-paired similarity feature vector, and a preset weight autoencoder.
The image recognition method according to claim 6, wherein the non-paired similarity feature vectors include a first similarity feature vector and a second similarity feature vector, and the first similarity feature vector contains position information The similarity information between the block image to be identified and the reference block image that does not correspond, the second similarity feature vector includes similarity information between different block images in the same image to be processed;

Correspondingly, the determining the similarity weight according to the paired similarity feature vector, the non-paired similarity feature vector and the preset weight autoencoder includes:

Determine a first similarity weight according to the paired similarity feature vector, the first similarity feature vector, and a preset first weight autoencoder;

The second similarity weight is determined according to the paired similarity feature vector, the second similarity feature vector, and a preset second weight autoencoder.
The image recognition method according to claim 1, wherein the target similarity between the image to be recognized and the reference image is determined according to each of the block feature information and the similarity weight, include:

determining an initial similarity between the image to be recognized and the reference image according to each of the block feature information;

The initial similarity is multiplied by the similarity weight to obtain a target similarity between the image to be recognized and the reference image.
The image recognition method according to any one of claims 1 to 8, wherein the image to be recognized is a scene image to be recognized, and the reference image is a reference scene image; the recognition result of the image to be recognized includes the Describe the visual position recognition results of the scene image to be recognized.
An image recognition device, characterized in that it comprises:

A segmentation unit, configured to divide the image to be processed into a preset number of block images; wherein, the image to be processed includes an image to be recognized and a reference image, and the block image corresponding to the image to be recognized is a block image to be recognized A block image, the block image corresponding to the reference image is a reference block image; each of the block images to be identified has a one-to-one corresponding position information of the reference block image;

A feature extraction unit, configured to perform feature extraction processing on each of the block images to obtain block feature information corresponding to each of the block images;

A similarity weight determining unit, configured to determine a similarity weight between the image to be recognized and the reference image according to the position information of each of the block images and the corresponding block feature information;

A target similarity determining unit, configured to determine the target similarity between the image to be recognized and the reference image according to each of the block feature information and the similarity weight;

The recognition result determination unit is configured to determine the recognition result of the image to be recognized according to the target similarity.
The image recognition device according to claim 10, wherein the feature extraction unit comprises:

The initial feature information determination module is used for, for each of the block images, inputting the block images into a trained convolutional neural network for processing to obtain initial feature information corresponding to the block images;

A dimensionality reduction module, configured to perform dimensionality reduction processing on the initial feature information to obtain block feature information corresponding to the block image.
The image recognition device according to claim 11, wherein the dimensionality reduction module is specifically configured to input the initial feature information into an adaptive average pooling layer and/or a trained dimensionality reduction autoencoder for reducing Dimensional processing to obtain block feature information corresponding to the block image.
The image recognition device according to claim 12, wherein the image recognition device further comprises:

A dimensionality reduction autoencoder training unit, configured to obtain preset sample feature information, and input the preset sample feature information into the dimensionality reduction autoencoder to be trained; adjust the parameters of the dimensionality reduction autoencoder to be trained, In order to make the mean square error between the preset sample feature information and the decoding feature information output by the decoder of the dimensionality reduction autoencoder smaller than a preset threshold, a trained dimensionality reduction autoencoder is obtained.
The image recognition device according to claim 10, wherein the similarity weight determining unit comprises:

A similarity feature vector determination module, configured to determine a paired similarity feature vector and an unpaired similarity feature vector according to the position information of each of the block images and the corresponding block feature information; wherein, the paired A pair of similarity feature vectors includes similarity information between the block image to be identified corresponding to the position information and the reference block image; the unpaired similarity feature vector includes the block image not corresponding to the position information Similarity information between block images;

A similarity relationship vector determining module is used to determine the similarity weight according to the paired similarity feature vector and the non-paired similarity feature vector.
The image recognition device according to claim 14, wherein the similarity weight determination module is specifically configured to, according to the paired similarity feature vector, the non-paired similarity feature vector and the preset weight An autoencoder that determines the similarity weights.
The image recognition device according to claim 15, wherein the non-paired similarity feature vectors include a first similarity feature vector and a second similarity feature vector, and the first similarity feature vector contains position information The similarity information between the block image to be identified and the reference block image that does not correspond, the second similarity feature vector includes similarity information between different block images in the same image to be processed; corresponding land:

Correspondingly, the similarity weight includes a first similarity weight and a second similarity weight, and the similarity weight determination module is specifically configured to and the preset first weight autoencoder to determine the first similarity weight; according to the paired similarity feature vector, the second similarity feature vector and the preset second weight autoencoder, determine the second similarity weight.
The image recognition device according to claim 10, wherein the target similarity determination unit is specifically configured to determine the initial distance between the image to be recognized and the reference image according to each of the block feature information. Similarity: multiplying the initial similarity by the similarity weight to obtain a target similarity between the image to be recognized and the reference image.
The image recognition device according to any one of claims 10 to 17, wherein the image to be recognized is a scene image to be recognized, and the reference image is a reference scene image; the recognition result of the image to be recognized includes the Describe the visual position recognition results of the scene image to be recognized.
An electronic device, comprising a memory, a processor, and a computer program stored in the memory and operable on the processor, wherein when the processor executes the computer program, the electronic device realizes The steps of the method according to any one of claims 1 to 9.
A computer-readable storage medium, the computer-readable storage medium stores a computer program, characterized in that, when the computer program is executed by a processor, the electronic device realizes any one of claims 1 to 9. method steps.