WO2024077785A1 - Image recognition method and apparatus based on convolutional neural network model, and terminal device - Google Patents
- Publication number
- WO2024077785A1 (PCT/CN2022/142412)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
Definitions
- the present application belongs to the field of image recognition technology, and in particular, relates to an image recognition method, apparatus, terminal device, and computer-readable storage medium based on a convolutional neural network model.
- image features include global features and local features: global features describe the overall properties of an image, while local features are extracted from local regions of the image.
- convolutional neural networks are widely used to extract features of images because convolution operations have good hardware support.
- however, a convolutional neural network cannot capture global information at once; multiple convolutional layers need to be stacked to increase the receptive field, which increases the number of model parameters and the amount of calculation.
- the embodiments of the present application provide an image recognition method, apparatus, and terminal device based on a convolutional neural network model, which can reduce the amount of computation required when the convolutional neural network model performs global feature extraction, thereby improving model efficiency.
- an embodiment of the present application provides an image recognition method based on a convolutional neural network model, wherein the convolutional neural network model performs frequency domain global convolution on an image to be recognized based on a fast Fourier transform, and the image recognition method comprises:
- the image to be recognized is input into the trained convolutional neural network model, which sequentially extracts features from and recognizes the image to obtain a recognition result.
- an embodiment of the present application provides a convolutional neural network model training method, comprising:
- the above convolutional neural network performs frequency domain global convolution on the sample image based on fast Fourier transform.
- an image recognition device including:
- an input module and a trained convolutional neural network model, wherein the convolutional neural network model performs frequency-domain global convolution on the image to be recognized based on the fast Fourier transform;
- the input module is used to input the image to be recognized into the convolutional neural network model;
- the convolutional neural network model is used to sequentially extract features from and recognize the image to be recognized to obtain a recognition result.
- an embodiment of the present application provides a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the steps of the image recognition method based on the convolutional neural network model described in the first aspect or the convolutional neural network model training method described in the second aspect are implemented.
- an embodiment of the present application provides a computer-readable storage medium, which stores a computer program.
- when the computer program is executed by a processor, it implements the steps of the image recognition method based on the convolutional neural network model described in the first aspect or the convolutional neural network model training method described in the second aspect.
- an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the image recognition method based on the convolutional neural network model described in any one of the first aspect or the convolutional neural network model training method described in the second aspect.
- the image to be recognized is input into the trained convolutional neural network model, which sequentially extracts features from and recognizes the image to obtain the recognition result. Since the frequency-domain global convolution is performed on the image based on the fast Fourier transform, the convolution operation in the spatial domain is converted into a multiplication operation in the frequency domain, thus reducing the amount of calculation when the convolutional neural network extracts global features, improving the recognition efficiency of the convolutional neural network model, and facilitating deployment on devices with lower computing power.
- FIG1 is a schematic diagram of a flow chart of an image recognition method based on a convolutional neural network model provided by an embodiment of the present application;
- FIG2 is a schematic diagram of the structure of a convolutional neural network model provided in an embodiment of the present application.
- FIG3 is a schematic diagram of the structure of a second convolution module provided in an embodiment of the present application.
- FIG4 is a flow chart of a convolutional neural network model training method provided in an embodiment of the present application.
- FIG5 is a schematic diagram of the structure of an image recognition device provided in an embodiment of the present application.
- FIG6 is a schematic diagram of the structure of a convolutional neural network model training device provided in an embodiment of the present application.
- FIG7 is a schematic diagram of the structure of a terminal device provided in an embodiment of the present application.
- references to "one embodiment" or "some embodiments" etc. in the specification of the present application mean that the specific features, structures or characteristics described in conjunction with the embodiment are included in one or more embodiments of the present application. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", etc. appearing in different places in the specification do not necessarily refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized.
- Embodiment 1:
- FIG1 shows a schematic flow chart of an image recognition method based on a convolutional neural network model provided by an embodiment of the present application, which is described in detail as follows:
- the image to be recognized is input into the trained convolutional neural network model, and the convolutional neural network model sequentially extracts features from and recognizes the image to obtain a recognition result.
- the above convolutional neural network model performs frequency domain global convolution processing on the image to be recognized based on fast Fourier transform.
- when extracting global features of an image, a convolutional neural network usually needs to stack multiple convolutional layers to increase the receptive field, so that global features can be extracted through a large receptive field.
- as more convolutional layers are stacked, the number of parameters and the amount of calculation of the convolutional neural network model increase accordingly, making its computational complexity too large.
- in particular, when the size of the image to be recognized is large (such as 112×112), the computational complexity of the convolutional neural network model quickly exceeds that of a 7×7 convolution, which is inconvenient for practical application.
- when the convolutional neural network model extracts features from the input image to be recognized, the image is processed by the fast Fourier transform and converted into its frequency-domain representation, so that the convolution operation in the spatial domain is converted into a multiplication operation in the frequency domain, thereby reducing the amount of calculation in the global feature extraction process.
- the image to be recognized is input into the trained convolutional neural network model, which sequentially extracts features from and recognizes the image to obtain a recognition result. Since the frequency-domain global convolution is performed on the image based on the fast Fourier transform, the convolution operation in the spatial domain is converted into a multiplication operation in the frequency domain. Therefore, when extracting global features of the image to be recognized, the amount of calculation involved in extracting global features from large images can be effectively reduced, improving the recognition efficiency of the convolutional neural network model and facilitating deployment on devices with lower computing power.
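The equivalence this relies on can be checked numerically. The following is an illustrative NumPy sketch (not the patent's implementation): by the convolution theorem, circular convolution in the spatial domain equals point-wise multiplication in the frequency domain.

```python
import numpy as np

# Illustrative sketch (not the patent's implementation): the convolution
# theorem says circular convolution in the spatial domain equals
# point-wise multiplication in the frequency domain.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)   # one row of an "image"
w = rng.standard_normal(8)   # a global (full-length) convolution kernel

# Direct circular convolution: y[n] = sum_k x[k] * w[(n - k) mod N]
N = len(x)
y_direct = np.array([sum(x[k] * w[(n - k) % N] for k in range(N))
                     for n in range(N)])

# FFT route: transform both operands, multiply point by point, invert
y_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)).real

assert np.allclose(y_direct, y_fft)
```

For a full-length kernel the direct route costs O(N²) multiplications, while the FFT route costs O(N log N), which is the source of the saving described above.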
- the above-mentioned image recognition method based on the convolutional neural network model further includes:
- the image to be identified may be an image captured by a camera device, or may be an image frame in a video stream captured by a camera device.
- in different application fields, the camera equipment adopted and the rules for collecting images may differ; therefore, the images to be recognized are acquired according to the acquisition methods and rules of the corresponding application field.
- for example, a face recognition task requires collecting a face image as the image to be recognized and recognizing the facial features in it.
- the convolutional neural network model includes a feature extraction module and a recognition module.
- the steps of extracting features and recognizing the image to be recognized in sequence through the convolutional neural network model to obtain a recognition result include:
- the extracted features are recognized to obtain recognition results.
- the feature extraction module is used to extract features of the input image to be recognized, and the extracted features are used as inputs of the recognition module. According to the image recognition task, corresponding recognition is performed to obtain a recognition result.
- the recognition module may include one or more recognition units, with different recognition units performing different recognition tasks. For example, it may include a pedestrian detection unit and a target detection unit: the extracted feature maps are input into the pedestrian detection unit and the target detection unit to perform the pedestrian detection and target detection tasks respectively.
- features are extracted from the image to be recognized by the feature extraction module, and the extracted features are then recognized by the recognition module to obtain the corresponding recognition results, improving the recognition efficiency of each image recognition task.
- the feature extraction module includes a first convolution module and a second convolution module
- step A1 includes:
- the convolutional neural network model can be constructed based on an existing convolutional neural network, with the shallow convolutional layers serving as the first convolution module and the deep convolutional layers or self-attention layers replaced by the second convolution module. The first convolution module applies ordinary convolution to the image to be recognized, extracts its local features, and outputs a local feature map, which serves as the input of the second convolution module. The second convolution module applies the fast Fourier transform to the local feature map to obtain a frequency-domain local feature map, and extracts global features from it to obtain a global feature map.
- for example, in the structure shown in FIG2, the first three convolutional layers form the first convolution module (ordinary convolution), and the three deeper convolutional layers form the second convolution module, which is connected to the recognition module.
- the image to be recognized is used as the input of the first convolution module for local feature extraction, the output features are input to the second convolution module for global feature extraction, and the global feature map output by the second convolution module is used as the input of the recognition module, which outputs the corresponding recognition result.
- the first convolution module and the second convolution module in the convolutional neural network model can also be arranged in an alternating structure, that is, a first convolution module is connected to a second convolution module, whose output is connected to another first convolution module, and so on.
- the structure shown in FIG2 can also be stacked multiple times, so that the feature map output by the feature extraction module is the global feature map extracted by the second convolution module (that is, the recognition module performs recognition based on the global features extracted by the second convolution module). The specific structures of the first convolution module (ordinary convolutional layers) and the second convolution module are not limited in the embodiments of the present application.
- the local features of the image to be recognized are extracted by the first convolution module of the convolutional neural network model, and the resulting local feature map is used as the input of the second convolution module, which applies the fast Fourier transform and then extracts global features from the frequency-domain local feature map. This reduces the amount of calculation in the global feature extraction process, and the resulting features contain both local and global information, improving the recognition accuracy of the convolutional neural network model.
- the second convolution module includes a first branch and a second branch
- the step A12 includes:
- the above-mentioned local feature map is split along the channel direction to obtain a first local feature map and a second local feature map, and the above-mentioned first local feature map and the above-mentioned second local feature map are respectively input into the above-mentioned first branch and the above-mentioned second branch.
- the local feature map is first split along the channel direction (such as evenly split into two parts along the channel direction) to obtain a first local feature map and a second local feature map, and the first local feature map is input into the first branch, and the second local feature map is input into the second branch, so as to extract global features from the first local feature map and the second local feature map, respectively.
- the first branch and the second branch respectively use fast Fourier transform to perform frequency domain global convolution on the input local feature map to obtain a first global feature map and a second global feature map.
- the first branch performs fast Fourier transform processing on the first local feature map to obtain a first local feature map in the frequency domain, and then performs global convolution processing on the first local feature map in the frequency domain to obtain a first global feature map;
- the second branch performs fast Fourier transform processing on the second local feature map to obtain a second local feature map in the frequency domain, and then performs global convolution processing on the second local feature map in the frequency domain to obtain a second global feature map.
- the first global feature map and the second global feature map are concatenated along the channel direction to obtain a global feature map.
- the first local feature map and the second local feature map are obtained after the local feature map is split along the channel direction
- the first branch and the second branch respectively extract the features of the first local feature map and the second local feature map
- the first global feature map and the second global feature map are concatenated along the channel direction to obtain a complete global feature map, so that subsequent processing can be performed on the complete global feature map of the input image to be recognized.
- since the local feature map is split into two parts along the channel direction and input into the first branch and the second branch, the number of channels processed by each branch is halved, and during global feature extraction, the local feature map is globally convolved in the frequency domain based on the fast Fourier transform. Extracting global features from half-channel feature maps via the fast Fourier transform reduces the computational complexity of global feature extraction, lowering the computing-power requirements on the device and facilitating deployment on devices with lower computing power.
- the method when the first branch and the second branch perform frequency domain global convolution according to the input local feature map, the method includes:
- the first branch performs fast Fourier transform processing on the first local feature map and the corresponding weight matrix based on the column dimension to obtain the first local feature map and the weight matrix in the frequency domain;
- the first local feature map and its weight matrix are fast Fourier transformed along the column dimension to convert them into a frequency-domain first local feature map and weight matrix. When the weight matrix is used to perform global convolution on the first local feature map, the frequency-domain feature map and weight matrix are multiplied point by point; that is, the convolution operation in the spatial domain is converted into a multiplication operation in the frequency domain, yielding a first frequency-domain feature map (global features). The first frequency-domain feature map is then inverse fast Fourier transformed to obtain a first global feature map in the spatial domain, so that the amount of calculation is effectively reduced when a large convolution kernel is used to extract global features.
- the second branch performs fast Fourier transform processing on the second local feature map and the corresponding weight matrix based on the row dimension to obtain the second local feature map and the weight matrix in the frequency domain;
- the second frequency-domain feature map is processed by the inverse fast Fourier transform to obtain a second global feature map.
- the second local feature map and its weight matrix are fast Fourier transformed along the row dimension to convert them into a frequency-domain second local feature map and weight matrix. When the weight matrix is used to perform global convolution on the second local feature map, the frequency-domain feature map and weight matrix are multiplied point by point; that is, the convolution operation in the spatial domain is converted into a multiplication operation in the frequency domain, yielding a second frequency-domain feature map (global features). The second frequency-domain feature map is then inverse fast Fourier transformed to obtain a second global feature map in the spatial domain, so that the amount of calculation is effectively reduced when a large convolution kernel is used to extract global features.
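A rough NumPy sketch of one such branch follows; the (C, H, W) shapes and variable names are assumptions for illustration, not taken from the patent. The feature map and a same-shaped weight matrix are transformed along the row (W) dimension, multiplied point by point, and transformed back.

```python
import numpy as np

# Hedged sketch of one branch (shapes and names are assumptions): a
# (C, H, W) local feature map and a same-shaped weight matrix are
# transformed along the row (W) dimension, multiplied point by point,
# and transformed back to the spatial domain.
rng = np.random.default_rng(1)
fmap = rng.standard_normal((4, 8, 8))    # (C, H, W) local feature map
weight = rng.standard_normal((4, 8, 8))  # learnable weights, same shape

fmap_f = np.fft.fft(fmap, axis=-1)       # 1-D FFT along the row dimension
weight_f = np.fft.fft(weight, axis=-1)

# Point-wise multiplication in the frequency domain = circular
# convolution of each row in the spatial domain
freq_fmap = fmap_f * weight_f

# Inverse FFT returns the global feature map in the spatial domain
global_fmap = np.fft.ifft(freq_fmap, axis=-1).real
assert global_fmap.shape == fmap.shape
```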
- the computational complexity of the second convolution module in the process of extracting global features is analyzed as follows:
- a one-dimensional fast Fourier transform is performed on the first local feature map and its weight matrix to obtain the first local feature map and its weight matrix in one-dimensional (array) form.
- for example, if the first local feature map is represented as (C, H, W), performing a one-dimensional fast Fourier transform along the column dimension yields, for each column, an array of H elements.
- the convolution operation in the spatial domain is converted into a simple multiplication operation, thereby reducing the computational complexity of the convolutional neural network model.
- since the local feature map is fast Fourier transformed along different dimensions and global features are extracted from each, global features of different dimensions are obtained; the resulting global features are then concatenated to obtain global features covering both the width and height directions. This reduces the computational complexity of extracting the global features of the local feature map and facilitates the deployment of the convolutional neural network model on devices with lower computing power.
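A back-of-the-envelope count illustrates the saving. The figures below are illustrative and per channel, with FFT operations counted loosely; they are not the patent's own complexity analysis.

```python
import math

# Back-of-the-envelope multiply count (illustrative, per channel) for a
# 112x112 feature map: direct 1-D global convolution along each column
# versus the FFT route, with FFT operations counted loosely.
H = W = 112

# Direct route: H multiplies per output element, H outputs per column,
# W columns.
direct_1d = H * H * W

# FFT route: two forward transforms and one inverse (~(3/2)*H*log2(H)
# complex multiplies per column, loosely) plus H point-wise products.
fft_1d = int(W * (3 * H * math.log2(H) / 2 + H))

assert fft_1d < direct_1d
```

Even with generous accounting, the FFT route is roughly an order of magnitude cheaper at this resolution.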
- the second convolution module further includes a position embedding module, and the above steps further include:
- the position embedding module is used to perform feature extraction on the local feature map to obtain a position feature map, and the position feature map is added to the local feature map to obtain a local feature map containing position features.
- convolution processing is performed on the local feature map through a position embedding module to extract the position features in the local feature map, generate a position feature map, and add the position feature map to the local feature map according to the pixel position to obtain a local feature map embedded with the position features.
- the position embedding module is a two-dimensional convolution module, which performs convolution processing on the input local feature map to generate a two-dimensional position feature map whose size is consistent with the resolution of the input local feature map, so that the position feature map can be directly added to the local feature map to obtain a local feature map containing position features.
- the position embedding module adopts a two-layer lightweight convolutional structure, that is, a simple "convolution + normalization + activation function + convolution" structure. Since a two-dimensional position feature map needs to be generated, the convolutional layers can use 3×3 depthwise separable convolutions to process the local feature map and generate the position feature map.
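A minimal sketch of such a two-layer path is shown below; the weights are random placeholders, the depthwise convolution is a naive loop implementation, and all names and sizes are assumptions rather than the patent's trained module.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Naive 3x3 depthwise convolution with zero padding.
    x: (C, H, W) feature map; kernels: (C, 3, 3), one kernel per channel."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(padded[ch, i:i + 3, j:j + 3] * kernels[ch])
    return out

# Hedged sketch of the "convolution + normalization + activation +
# convolution" path; weights are random placeholders, not trained values.
rng = np.random.default_rng(2)
local_fmap = rng.standard_normal((4, 7, 7))
k1 = rng.standard_normal((4, 3, 3)) * 0.1
k2 = rng.standard_normal((4, 3, 3)) * 0.1

h = depthwise_conv3x3(local_fmap, k1)
h = (h - h.mean()) / (h.std() + 1e-5)   # normalization
h = np.maximum(h, 0.0)                  # activation (ReLU)
pos_fmap = depthwise_conv3x3(h, k2)     # same resolution as the input

# The position feature map is added to the local feature map pixel by pixel
fmap_with_pos = local_fmap + pos_fmap
assert fmap_with_pos.shape == local_fmap.shape
```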
- the position features of an image can enhance the ability to describe and distinguish the image content
- the position features of the local feature map are extracted based on the weight matrix of the local feature map, and the position features are embedded into the local feature map so that the local feature map contains the position features. Therefore, the recognition accuracy can be improved when image recognition is subsequently performed based on the global feature map containing the position features.
- the network structure of the second convolution module is shown in FIG3, and may include a position embedding module, a first branch, and a second branch. After the position feature of the local feature map is embedded in the local feature map by the position embedding module, the local feature map embedded with the position feature is split to obtain the first local feature map and the second local feature map and input them into the first branch and the second branch respectively.
- the first branch and the second branch perform a one-dimensional fast Fourier transform on the input local feature map and its corresponding weight matrix along the column dimension and the row dimension respectively, obtaining one-dimensional frequency-domain local feature maps and weight matrices. Each frequency-domain local feature map is then multiplied point by point with its corresponding weight matrix to obtain the first frequency-domain feature map and the second frequency-domain feature map, which are converted back to the spatial domain by the inverse fast Fourier transform to obtain the first global feature map and the second global feature map. Finally, the two global feature maps are concatenated to obtain the global feature map of the image to be recognized.
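The flow above can be sketched end to end with assumed shapes and random placeholder weights (the position embedding step is omitted for brevity); here the first branch is taken to transform along the column (H) dimension and the second along the row (W) dimension.

```python
import numpy as np

# End-to-end sketch (assumed shapes, random placeholder weights):
# channel split -> per-branch 1-D FFT global convolution
# (column / row dimensions) -> inverse FFT -> channel concatenation.
rng = np.random.default_rng(3)
local_fmap = rng.standard_normal((8, 16, 16))   # (C, H, W)

x1, x2 = np.split(local_fmap, 2, axis=0)        # split along channels
w1 = rng.standard_normal(x1.shape)              # branch weight matrices
w2 = rng.standard_normal(x2.shape)

def fft_global_conv(x, w, axis):
    """Frequency-domain global convolution along one axis: FFT both
    operands, multiply point by point, inverse-FFT back to space."""
    return np.fft.ifft(np.fft.fft(x, axis=axis) * np.fft.fft(w, axis=axis),
                       axis=axis).real

g1 = fft_global_conv(x1, w1, axis=1)  # first branch: column (H) dimension
g2 = fft_global_conv(x2, w2, axis=2)  # second branch: row (W) dimension

global_fmap = np.concatenate([g1, g2], axis=0)  # concat along channels
assert global_fmap.shape == local_fmap.shape
```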
- the solid arrows indicate that the data flow (feature data) is in real-number form;
- the dotted arrows indicate that the data flow is in complex-number form, that is, feature data in the frequency domain.
- FIG4 shows a flow chart of a convolutional neural network model training method provided in an embodiment of the present application, which is described in detail as follows:
- the constructed convolutional neural network model is obtained, and the sample image is input into the convolutional neural network for training until the convolutional neural network meets the preset requirements to obtain the convolutional neural network model.
- the above convolutional neural network performs global convolution on the sample image in the frequency domain based on fast Fourier transform.
- a convolutional neural network is pre-built according to user needs, that is, the network structure of the convolutional neural network is set according to the needs of the user's image recognition task (such as being built based on the existing ResNet and VGGNet networks) to achieve the corresponding image recognition task.
- the user needs to train a convolutional neural network model for target detection.
- a convolutional neural network can be built based on the SSD (Single Shot MultiBox Detector) network structure to detect targets of different scales by detecting at different feature scales.
- a pre-constructed convolutional neural network is obtained, and the corresponding sample images are used as the input of the convolutional neural network for training until the network meets the preset requirements (for example, its recognition accuracy reaches a preset threshold such as 0.99), at which point training stops and the trained convolutional neural network model is obtained.
- the sample image and the weight matrix are converted into the frequency domain by fast Fourier transform, so that the global features of the sample image can be extracted in the frequency domain using the weight matrix; that is, the input image in the spatial domain is converted into frequency domain form and multiplied with the transformed weight matrix. This realizes rapid extraction of global features: when the resolution of the image is large, the amount of calculation required to extract its global features is effectively reduced.
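The equivalence relied on here is the convolution theorem: circular convolution in the spatial domain equals point-by-point multiplication in the frequency domain. A small NumPy check (illustrative, not taken from the patent):

```python
import numpy as np

x = np.random.randn(16)  # a 1-D signal (e.g. one row of a feature map)
w = np.random.randn(16)  # a weight vector of the same length

# Direct circular convolution: O(N^2) multiplications.
direct = np.array([sum(x[j] * w[(i - j) % 16] for j in range(16))
                   for i in range(16)])

# FFT route: O(N log N) — transform, multiply point by point, transform back.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)).real

print(np.allclose(direct, via_fft))  # True: the two routes agree
```

The larger the signal (or image), the larger the gap between the quadratic cost of the direct route and the N log N cost of the FFT route, which is exactly the saving the text describes for high-resolution inputs.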
- the sample images are pre-labeled images corresponding to the image recognition task to be performed, so they can be used directly for training without further annotation.
- part of the sample images can be used as the training set, and the remainder as validation and test sets, in order to tune the convolutional neural network and obtain a well-performing model.
- the Market1501 dataset can be used as a training set for training.
- Market1501 contains 32,217 images of 1,501 pedestrians captured by 6 cameras. Each pedestrian is captured by at least 2 cameras and may appear in multiple images from a single camera; the images are divided into a training set and a test set.
- the sample image input for a single training can be one or more, such as 100.
- the number of sample images input at a time is referred to as the batch size (Batch Size).
- the feature shape of the sample image extracted by the convolutional neural network can be represented in a four-dimensional format (B, H, W, C), where B is the batch size, H the height, W the width, and C the number of channels.
- a convolutional neural network is pre-constructed according to the user's needs, and the labeled sample images are used as its input for training until it meets the preset requirements, yielding a trained convolutional neural network model. Because the input image is converted from the spatial domain to the frequency domain by the fast Fourier transform, the convolution operation in the spatial domain becomes a multiplication operation in the frequency domain. This effectively reduces the amount of calculation during global feature extraction, lowers the computational complexity of the model for large input images, facilitates deployment on devices with limited computing power, and improves the model's computing speed.
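The training procedure above can be sketched as a simple loop with the stopping rule from the text. Everything except the stopping rule is a placeholder: `train_one_epoch` and `evaluate_accuracy` stand in for the user's own routines and are hypothetical names, not from the patent.

```python
ACCURACY_THRESHOLD = 0.99  # preset requirement from the text (e.g. 0.99)
MAX_EPOCHS = 100           # safety cap, an assumption for this sketch

def train_until_converged(model, train_set, val_set,
                          train_one_epoch, evaluate_accuracy):
    """Train until the network meets the preset requirement, then stop."""
    for epoch in range(MAX_EPOCHS):
        train_one_epoch(model, train_set)        # one pass over the sample images
        acc = evaluate_accuracy(model, val_set)  # check the preset requirement
        if acc >= ACCURACY_THRESHOLD:            # requirement met: stop training
            break
    return model
```

A validation set (as described above for Market1501-style splits) would drive `evaluate_accuracy`, so the stopping decision is made on held-out data rather than the training set.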
- the image recognition method based on the convolutional neural network model is introduced below based on some application scenarios.
- High-resolution remote sensing images are characterized by rich information content and complex natural scenes.
- a remote sensing image often contains a large number of buildings, sites, vegetation, farmland and other types of ground objects and geomorphic elements.
- Target detection of remote sensing images has always been a hot research topic.
- Most existing remote sensing target detection models have deep structures and complex connection paths, and remote sensing images are both larger and more numerous than natural images.
- extracting global features from remote sensing images by ordinary convolution with large kernels is computationally expensive and yields low detection efficiency, which limits the deployment of such models in scenarios with limited computing resources.
- the image recognition method based on the convolutional neural network model provided in this application is aimed precisely at the large amount of computation required to extract global features of large images by convolution.
- by converting the convolution operation in the spatial domain into a multiplication operation in the frequency domain, the global features of the image to be detected are extracted, which effectively reduces the computation of global feature extraction and thereby improves the detection efficiency of the convolutional neural network model.
- the collected image to be detected is input into the convolutional neural network model (the target detection model). The first convolution module extracts local features such as edges, corners, and lines to produce a local feature map, which is then used as the input of the second convolution module.
- the position features in the image to be detected are embedded into the local feature map through the position embedding module so that it contains more position information.
- the local feature map is then split along the channel direction to obtain the first local feature map and the second local feature map, which are input into the first branch and the second branch respectively.
- the first local feature map and the second local feature map are processed by a one-dimensional fast Fourier transform along the column dimension and the row dimension respectively, to obtain the first local feature map and the second local feature map in the frequency domain.
- the weight matrices corresponding to the first and second local feature maps are likewise processed by a one-dimensional fast Fourier transform; each weight matrix in the frequency domain is then multiplied point by point with the corresponding local feature map in the frequency domain to obtain the first and second frequency domain feature maps, which are inverse fast Fourier transformed and spliced to obtain the complete global feature map.
- the global feature map is detected by the detection (recognition) module to output the corresponding detection result.
- splitting the local feature map reduces the amount of calculation, and performing a separate one-dimensional fast Fourier transform on each half converts the convolution operation in the spatial domain into a multiplication operation in the frequency domain, which greatly reduces the computation for global feature extraction and effectively improves target detection efficiency on remote sensing images.
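A back-of-the-envelope count shows where the saving comes from. The formulas below are order-of-magnitude estimates with constants omitted, written for this explanation rather than taken from the patent:

```python
import math

def direct_global_conv_mults(h, w):
    # An image-sized spatial kernel needs H*W multiplications for each of
    # the H*W output positions, per channel: O((H*W)^2).
    return (h * w) ** 2

def fft_branch_mults(h, w):
    # One 1-D FFT per column (length H) costs on the order of H*log2(H)
    # multiplies, applied to W columns for both the forward and inverse
    # transforms, plus an H*W point-by-point product in between.
    return 2 * w * h * math.log2(h) + h * w

for n in (64, 256, 1024):
    speedup = direct_global_conv_mults(n, n) / fft_branch_mults(n, n)
    print(f"{n}x{n}: roughly {speedup:,.0f}x fewer multiplications via FFT")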
- Face recognition technology is currently widely used in smart access control, security monitoring and other fields. Since face recognition extracts facial features from images, images with high clarity, that is, high resolution, are usually collected. For example, face recognition used in smart access control requires more accurate recognition to improve security, so a higher camera resolution is required; however, the computing resources of a smart access control system are limited, and extracting global features from high-resolution images for recognition is inefficient, which is not conducive to practical application.
- the image recognition method based on the convolutional neural network model provided in this application can be deployed in the smart access control system, effectively reducing the amount of computation required to extract global features from high-resolution images and improving face recognition efficiency.
- Embodiment 2:
- Figure 5 shows a structural block diagram of the image recognition device based on the convolutional neural network model provided in the embodiment of the present application. For the sake of convenience of explanation, only the parts related to the embodiment of the present application are shown.
- the device includes: an input module 51 and a convolutional neural network model 52.
- the convolutional neural network model performs frequency domain global convolution on the image to be identified based on fast Fourier transform.
- An input module 51 used to input the image to be recognized into the above-mentioned convolutional neural network model
- the convolutional neural network model 52 is used to extract features and recognize the above-mentioned images to be recognized in sequence to obtain recognition results.
- the image to be identified is input into the trained convolutional neural network model, which sequentially performs feature extraction and recognition on it to obtain the recognition result. Because the global convolution in the frequency domain is performed on the image based on the fast Fourier transform, the convolution operation in the spatial domain is converted into a multiplication operation in the frequency domain. The amount of calculation for extracting global features of large images is therefore effectively reduced, improving the recognition efficiency of the convolutional neural network model and facilitating deployment on devices with limited computing power.
- the image recognition device further includes:
- the module for acquiring the image to be identified is used to acquire the image to be identified.
- the convolutional neural network model 52 includes:
- a feature extraction unit used to extract features from the above-mentioned image to be identified
- the recognition unit is used to recognize the extracted features and obtain recognition results.
- the feature extraction unit includes:
- a first convolution unit is used to extract local features of the image to be identified to obtain a local feature map
- the second convolution unit is used to perform frequency domain global convolution on the local feature map by using fast Fourier transform to obtain a global feature map.
- the second convolution unit includes:
- a splitting unit used for splitting the local feature map along the channel direction to obtain a first local feature map and a second local feature map
- a first branch unit is used to perform frequency domain global convolution on the first local feature map by using a fast Fourier transform to obtain a first global feature map;
- the second branch unit is used to perform frequency domain global convolution on the second local feature map by using fast Fourier transform to obtain a second global feature map;
- the splicing unit is used to splice the first global feature map and the second global feature map along the channel direction to obtain a global feature map.
- the first branch unit comprises:
- a first transformation unit is used to perform fast Fourier transform processing on the first local feature map and the corresponding weight matrix based on the column dimension to obtain the first local feature map and the weight matrix in the frequency domain;
- a first convolution unit is used to multiply the first local feature map in the frequency domain and the weight matrix point by point to obtain a first frequency domain feature map
- the first inverse transform unit is used to perform an inverse fast Fourier transform on the first frequency domain feature map to obtain the first global feature map.
- the second branch unit comprises:
- a second transformation unit is used to perform fast Fourier transform processing on the second local feature map and the corresponding weight matrix based on the row dimension to obtain the second local feature map and the weight matrix in the frequency domain;
- a second convolution unit is used to multiply the second local feature map in the frequency domain and the weight matrix point by point to obtain a second frequency domain feature map
- the second inverse transform unit is used to perform an inverse fast Fourier transform on the second frequency domain feature map to obtain the second global feature map.
- the second convolution unit further includes:
- the position embedding unit is used to extract features from the local feature map through the position embedding module to obtain a position feature map, and add the position feature map to the local feature map to obtain a local feature map containing position features.
- FIG6 shows a structural block diagram of a convolutional neural network model training device provided in an embodiment of the present application.
- the device includes:
- the training module 61 is used to obtain the constructed convolutional neural network model and input the sample image into the convolutional neural network for training until the convolutional neural network meets the preset requirements to obtain the convolutional neural network model, wherein the convolutional neural network performs frequency domain global convolution on the sample image based on fast Fourier transform.
- FIG7 is a schematic diagram of the structure of a terminal device provided in an embodiment of the present application.
- the terminal device 7 of this embodiment includes: at least one processor 70 (only one processor is shown in FIG7 ), a memory 71, and a computer program 72 stored in the memory 71 and executable on the at least one processor 70, and when the processor 70 executes the computer program 72, the steps in any of the above-mentioned method embodiments are implemented.
- the computer program 72 can be divided into an input module 51 and a convolutional neural network model 52.
- the convolutional neural network model performs frequency domain global convolution on the image to be identified based on fast Fourier transform.
- the specific functions of each module are as follows:
- An input module 51 used to input the image to be recognized into the above-mentioned convolutional neural network model
- the convolutional neural network model 52 is used to extract features and recognize the above-mentioned images to be recognized in sequence to obtain recognition results.
- the computer program 72 may be divided into training modules 61, and the specific functions of the modules are as follows:
- the training module 61 is used to obtain the constructed convolutional neural network model and input the sample image into the convolutional neural network for training until the convolutional neural network meets the preset requirements to obtain the convolutional neural network model, wherein the convolutional neural network performs frequency domain global convolution on the sample image based on fast Fourier transform.
- the terminal device 7 may be a computing device such as a desktop computer, a notebook, a PDA, a cloud server, etc.
- the terminal device may include, but not limited to, a processor 70 and a memory 71.
- FIG. 7 is merely an example of the terminal device 7 and does not constitute a limitation on the terminal device 7.
- the terminal device 7 may include more or fewer components than shown in the figure, or may combine certain components, or different components, and may also include, for example, input and output devices, network access devices, etc.
- the processor 70 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
- a general-purpose processor may be a microprocessor or any conventional processor, etc.
- the memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or memory of the terminal device 7. In other embodiments, the memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card (Flash Card), etc. equipped on the terminal device 7. Further, the memory 71 may also include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used to store an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 71 may also be used to temporarily store data that has been output or is to be output.
- those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the above-mentioned functional units and modules is merely used as an example for illustration.
- the above-mentioned function allocation can be completed by different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
- the functional units and modules in the embodiment can be integrated in a processing unit, or each unit can exist physically separately, or two or more units can be integrated in one unit.
- the above-mentioned integrated unit can be implemented in the form of hardware or in the form of software functional units.
- An embodiment of the present application also provides a network device, which includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, wherein the processor implements the steps in any of the above-mentioned method embodiments when executing the computer program.
- An embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the above-mentioned method embodiments can be implemented.
- An embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the above-mentioned method embodiments.
- the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- all or part of the processes in the methods of the above embodiments of the present application can be completed by instructing the relevant hardware through a computer program.
- the computer program can be stored in a computer-readable storage medium.
- the computer program is executed by the processor, the steps of the above-mentioned various method embodiments can be implemented.
- the computer program includes computer program code, which can be in source code form, object code form, executable file or some intermediate form.
- the computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the camera/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a mobile hard disk, a magnetic disk, or an optical disk.
- in some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electric carrier signals and telecommunication signals.
- the disclosed devices/network equipment and methods can be implemented in other ways.
- the device/network equipment embodiments described above are merely schematic.
- the division of the modules or units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
- in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Abstract
The present application is applicable to the technical field of image recognition. Provided are an image recognition method and apparatus based on a convolutional neural network model, and a terminal device. The convolutional neural network model performs, on the basis of fast Fourier transform, frequency-domain global convolution on an image to be subjected to recognition. The image recognition method comprises: inputting an image to be subjected to recognition into a trained convolutional neural network model, and sequentially performing feature extraction and recognition on said image by means of the convolutional neural network model, so as to obtain a recognition result. The present application can reduce the calculation amount of the convolutional neural network model during global feature extraction, thereby improving the model efficiency.
Description
This application claims priority to the Chinese patent application filed with the China Patent Office on October 13, 2022, with application number 202211255315.1 and invention name "Image recognition method, device and terminal device based on convolutional neural network model", the entire contents of which are incorporated herein by reference.

The present application belongs to the field of image recognition technology, and in particular relates to an image recognition method, apparatus, terminal device, and computer-readable storage medium based on a convolutional neural network model.

Feature extraction and matching is an important task in many computer vision applications and is widely used in image recognition tasks such as image retrieval and target detection. When extracting features from an image, image features include global features and local features: global features refer to the overall properties of the image, while local features are extracted from local areas of the image.

In the prior art, convolutional neural networks are widely used to extract global features of images because convolution operations have good hardware support. However, a convolutional neural network cannot capture global information in one pass; multiple convolutional layers must be stacked to enlarge the receptive field, which increases the number of model parameters and the amount of calculation.
Summary of the Invention
The embodiments of the present application provide an image recognition method, apparatus, and terminal device based on a convolutional neural network model, which can reduce the amount of computation required when the convolutional neural network model performs global feature extraction, thereby improving model efficiency.

In a first aspect, an embodiment of the present application provides an image recognition method based on a convolutional neural network model, wherein the convolutional neural network model performs frequency domain global convolution on an image to be recognized based on a fast Fourier transform, and the image recognition method comprises:

inputting the image to be identified into the trained convolutional neural network model, and sequentially performing feature extraction and recognition on the image through the convolutional neural network model to obtain a recognition result.

In a second aspect, an embodiment of the present application provides a convolutional neural network model training method, comprising:

obtaining a constructed convolutional neural network, and inputting sample images into the convolutional neural network for training until the convolutional neural network meets preset requirements, thereby obtaining a convolutional neural network model;

wherein the convolutional neural network performs frequency domain global convolution on the sample images based on a fast Fourier transform.
In a third aspect, an embodiment of the present application provides an image recognition device, comprising:

an input module and a trained convolutional neural network model, wherein the convolutional neural network model performs frequency domain global convolution on the image to be identified based on a fast Fourier transform;

the input module is used to input the image to be recognized into the convolutional neural network model;

the convolutional neural network model is used to sequentially perform feature extraction and recognition on the image to be recognized to obtain a recognition result.

In a fourth aspect, an embodiment of the present application provides a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the steps of the image recognition method based on the convolutional neural network model described in the first aspect or the convolutional neural network model training method described in the second aspect are implemented.

In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the image recognition method based on the convolutional neural network model described in the first aspect or the convolutional neural network model training method described in the second aspect.

In a sixth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the image recognition method based on the convolutional neural network model described in any one of the first aspect or the convolutional neural network model training method described in the second aspect.
Compared with the prior art, the embodiments of the present application have the following beneficial effects:

In the embodiments of the present application, the image to be identified is input into the trained convolutional neural network model, which sequentially performs feature extraction and recognition on it to obtain the recognition result. Because the global convolution in the frequency domain is performed on the image based on the fast Fourier transform, the convolution operation in the spatial domain is converted into a multiplication operation in the frequency domain, which reduces the amount of calculation when the convolutional neural network extracts global features, improves the recognition efficiency of the convolutional neural network model, and facilitates deployment on devices with limited computing power.
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required in the description of the embodiments or the prior art are briefly introduced below.

FIG. 1 is a schematic flow chart of an image recognition method based on a convolutional neural network model provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of the structure of the convolutional neural network model provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of the structure of the second convolution module provided in an embodiment of the present application;

FIG. 4 is a schematic flow chart of the convolutional neural network model training method provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of the structure of the image recognition device provided in an embodiment of the present application;

FIG. 6 is a schematic diagram of the structure of the convolutional neural network model training device provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of the structure of the terminal device provided in an embodiment of the present application.
In the following description, specific details such as particular system structures and technologies are provided for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application may also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary details do not obstruct the description of the present application.

It should be understood that when used in the present specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or combinations thereof.

It should also be understood that the term "and/or" used in the specification and appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

In addition, in the description of the present specification and the appended claims, the terms "first", "second", "third", etc. are only used to distinguish the descriptions and cannot be understood as indicating or implying relative importance.

References to "one embodiment" or "some embodiments" in the specification of the present application mean that a specific feature, structure or characteristic described in conjunction with that embodiment is included in one or more embodiments of the present application. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in other embodiments", etc. appearing in different places in the specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized.
Embodiment 1:
FIG. 1 shows a schematic flowchart of an image recognition method based on a convolutional neural network model provided in an embodiment of the present application, detailed as follows:
The image to be recognized is input into the trained convolutional neural network model, which performs feature extraction and recognition on it in sequence to obtain a recognition result.
Here, the convolutional neural network model performs frequency-domain global convolution on the image to be recognized based on the fast Fourier transform (FFT).
Specifically, when a convolutional neural network extracts global features from an image, multiple convolutional layers usually need to be stacked to enlarge the receptive field so that global features can be captured. However, the parameter count and the amount of computation of the model grow accordingly, making its computational complexity excessive. Moreover, as the size of the image to be recognized increases (e.g., to 112×112), the computational complexity of such a stacked model quickly exceeds that of a 7×7 convolution, which is inconvenient for practical applications. Therefore, when the convolutional neural network model extracts features from the input image, the image is first processed with a fast Fourier transform and converted into its frequency-domain representation, so that the convolution operation in the spatial domain becomes a multiplication operation in the frequency domain, thereby reducing the amount of computation in the global feature extraction process.
In an embodiment of the present application, the image to be recognized is input into the trained convolutional neural network model, which performs feature extraction and recognition in sequence to obtain a recognition result. Because the global convolution is carried out in the frequency domain based on the fast Fourier transform, the spatial-domain convolution is converted into a frequency-domain multiplication. Consequently, when global features are extracted from the image to be recognized, the amount of computation required by the model for large images is effectively reduced, which improves the recognition efficiency of the convolutional neural network model and facilitates deployment on devices with limited computing power.
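The spatial-to-frequency conversion described above rests on the convolution theorem: circular convolution in the spatial domain equals point-wise multiplication in the frequency domain. A minimal NumPy sketch of the equivalence (the signal and kernel values are made-up illustrations, not part of the claimed model):

```python
import numpy as np

# Signal and same-length kernel (think: one row of a feature map and a
# "global" filter spanning the whole row).
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 0.25, 0.0])

# Direct circular convolution in the spatial domain: O(n^2) multiply-adds.
n = len(x)
direct = np.array([sum(x[(i - j) % n] * w[j] for j in range(n)) for i in range(n)])

# FFT route: transform, multiply point by point, inverse-transform: O(n log n).
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)).real

assert np.allclose(direct, via_fft)
```

The same identity applies per row or per column of a 2-D feature map, which is what makes the frequency-domain multiplication a drop-in replacement for a convolution kernel spanning the whole spatial extent.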
In some embodiments, the image recognition method based on the convolutional neural network model further includes:
Acquiring the image to be recognized.
Optionally, the image to be recognized may be a single image captured by a camera device, or an image frame in a video stream captured by a camera device.
Optionally, since different image recognition tasks may require different images to be recognized, the camera equipment used and the rules for collecting the images may also differ. The corresponding image to be recognized is therefore acquired according to the acquisition method and acquisition rules of each application field. For example, a face recognition task requires a face image to be collected as the image to be recognized, so that the facial features in it can be recognized.
In an embodiment of the present application, according to the images required by the image recognition task in each application field, the corresponding acquisition method and acquisition rules are used to obtain an image to be recognized that meets the requirements of that task, so that the image recognition task can be performed.
In some embodiments, the convolutional neural network model includes a feature extraction module and a recognition module, and performing feature extraction and recognition on the image to be recognized in sequence through the convolutional neural network model to obtain a recognition result includes:
A1. Extracting features from the image to be recognized through the feature extraction module;
A2. Recognizing the extracted features based on the recognition module to obtain a recognition result.
Optionally, image recognition covers different tasks such as image classification and object detection, and different tasks apply different recognition methods to the same features. Therefore, the feature extraction module extracts features from the input image to be recognized, and the extracted features serve as the input of the recognition module, which performs the recognition corresponding to the task and outputs a recognition result. The recognition module may include one or more recognition units, each performing a different recognition task; for example, it may include a pedestrian detection unit and an object detection unit, so that the pedestrian detection unit performs pedestrian detection on the extracted features, or the extracted feature maps are input into both the pedestrian detection unit and the object detection unit to perform pedestrian detection and object detection.
In an embodiment of the present application, the feature extraction module extracts features from the image to be recognized, and the recognition module receives the extracted features and performs the corresponding recognition to obtain the corresponding result, improving the recognition efficiency of each image recognition task.
In some embodiments, the feature extraction module includes a first convolution module and a second convolution module, and step A1 includes:
A11. Extracting local features from the image to be recognized based on the first convolution module to obtain a local feature map;
A12. Performing frequency-domain global convolution on the local feature map using the fast Fourier transform based on the second convolution module to obtain a global feature map.
Optionally, the convolutional neural network model may be built on an existing convolutional neural network, with the shallow convolutional layers serving as the first convolution module and the deep convolutional layers or self-attention replaced by the second convolution module. The first convolution module applies ordinary convolution to the image to be recognized, extracts its local features, and outputs a local feature map, which is then used as the input of the second convolution module. The second convolution module applies a fast Fourier transform to the local feature map to obtain its frequency-domain representation and extracts global features from it to obtain a global feature map.
For example, in the convolutional neural network model shown in FIG. 2, the first three convolutional layers form the first convolution module of ordinary convolutions, the three deeper layers form the second convolution module, and the second convolution module is connected to the recognition module. The image to be recognized is fed into the first convolution module for local feature extraction, its output is fed into the second convolution module for global feature extraction, and the global feature map output by the second convolution module is fed into the recognition module, which outputs the corresponding recognition result.
It should be noted that the first and second convolution modules in the convolutional neural network model may also be arranged in an alternating structure, i.e., a first convolution module connected to a second convolution module whose output is connected to another first convolution module, or the structure shown in FIG. 2 may be stacked multiple times. It suffices that the feature map output by the feature extraction module is the global feature map extracted by the second convolution module (i.e., that the recognition module performs recognition based on the global features extracted by the second convolution module); the specific arrangement of the first convolution module (ordinary convolutional layers) and the second convolution module provided in the embodiments of the present application is not limited.
In an embodiment of the present application, the first convolution module of the convolutional neural network model extracts the local features of the image to be recognized, and the resulting local feature map is used as the input of the second convolution module, which applies a fast Fourier transform to it and then extracts global features from the resulting frequency-domain local feature map. This reduces the amount of computation in the global feature extraction process, and the resulting global features contain both local and global information, improving the recognition accuracy of the convolutional neural network model.
In some embodiments, the second convolution module includes a first branch and a second branch, and step A12 includes:
Splitting the local feature map along the channel direction to obtain a first local feature map and a second local feature map, and inputting the first local feature map and the second local feature map into the first branch and the second branch, respectively.
Optionally, when the second convolution module performs global feature extraction on the input local feature map, the local feature map is first split along the channel direction (e.g., evenly into two halves) to obtain a first local feature map and a second local feature map, which are then input into the first branch and the second branch, respectively, so that global features are extracted from each of them.
The first branch and the second branch each use the fast Fourier transform to perform frequency-domain global convolution on their input local feature map, obtaining a first global feature map and a second global feature map.
Optionally, the first branch applies a fast Fourier transform to the first local feature map to obtain its frequency-domain representation and then performs global convolution on it to obtain the first global feature map; the second branch applies a fast Fourier transform to the second local feature map to obtain its frequency-domain representation and then performs global convolution on it to obtain the second global feature map.
The first global feature map and the second global feature map are concatenated along the channel direction to obtain the global feature map.
Optionally, since the first and second local feature maps are obtained by splitting the local feature map along the channel direction, after the first and second branches extract their respective features, the resulting first and second global feature maps are concatenated along the channel direction to obtain the complete global feature map of the input image to be recognized, on which subsequent processing is based.
In an embodiment of the present application, because the local feature map is split along the channel direction into two halves fed into the first and second branches, the number of channels processed by each branch is halved, and during global feature extraction the local feature map undergoes FFT-based frequency-domain global convolution. Extracting global features from half-channel feature maps via the fast Fourier transform thus lowers the computational complexity of global feature extraction, reducing the demands on device computing power and facilitating deployment on devices with limited computing capability.
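The split/process/concatenate flow can be sketched as follows, assuming a channel-first (C, H, W) layout; the identity "branches" are placeholders standing in for the FFT-based branches, used only to show that shapes survive the round trip:

```python
import numpy as np

def split_conv_concat(x, first_branch, second_branch):
    """Halve a (C, H, W) feature map along the channel axis, run each half
    through its own branch, and concatenate the results along channels."""
    c = x.shape[0]
    first_half, second_half = x[: c // 2], x[c // 2 :]  # even split along channels
    return np.concatenate(
        [first_branch(first_half), second_branch(second_half)], axis=0
    )

# Placeholder identity branches: the output keeps the input's shape.
x = np.arange(8 * 4 * 4, dtype=float).reshape(8, 4, 4)
y = split_conv_concat(x, lambda t: t, lambda t: t)
assert y.shape == (8, 4, 4)
assert np.allclose(y, x)
```

Each real branch then only touches C/2 channels, which is where the halving of the per-branch cost comes from.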
In some embodiments, the frequency-domain global convolution performed by the first branch and the second branch on their input local feature maps includes:
The first branch performs a fast Fourier transform on the first local feature map and the corresponding weight matrix along the column dimension to obtain the frequency-domain first local feature map and weight matrix;
the frequency-domain first local feature map and weight matrix are multiplied point by point to obtain a first frequency-domain feature map;
an inverse fast Fourier transform is applied to the first frequency-domain feature map to obtain the first global feature map.
Optionally, in the process in which the first branch extracts global features from the first local feature map, the first local feature map and its weight matrix are fast-Fourier-transformed along the column dimension and converted into their frequency-domain forms, so that when the weight matrix is used to perform global convolution on the first local feature map, the frequency-domain feature map and weight matrix are multiplied point by point; that is, the spatial-domain convolution is converted into a frequency-domain multiplication, yielding the first frequency-domain feature map (the global features). An inverse fast Fourier transform is then applied to the first frequency-domain feature map to obtain the first global feature map in the spatial domain, effectively reducing the amount of computation when a large convolution kernel is used to extract global features.
The second branch performs a fast Fourier transform on the second local feature map and the corresponding weight matrix along the row dimension to obtain the frequency-domain second local feature map and weight matrix;
the frequency-domain second local feature map and weight matrix are multiplied point by point to obtain a second frequency-domain feature map;
an inverse fast Fourier transform is applied to the second frequency-domain feature map to obtain the second global feature map.
Optionally, in the process in which the second branch extracts global features from the second local feature map, the second local feature map and its weight matrix are fast-Fourier-transformed along the row dimension and converted into their frequency-domain forms, so that when the weight matrix is used to perform global convolution on the second local feature map, the frequency-domain feature map and weight matrix are multiplied point by point; that is, the spatial-domain convolution is converted into a frequency-domain multiplication, yielding the second frequency-domain feature map (the global features). An inverse fast Fourier transform is then applied to the second frequency-domain feature map to obtain the second global feature map in the spatial domain, effectively reducing the amount of computation when a large convolution kernel is used to extract global features.
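A hedged NumPy sketch of one such branch, assuming circular (periodic) boundary handling and a weight tensor of the same shape as the feature map (the text does not fix these details); the first branch transforms along the column/height axis and the second along the row/width axis:

```python
import numpy as np

def branch_global_conv(x, w, axis):
    """FFT-based global circular convolution of feature map x with weights w
    along one spatial axis: transform, multiply point by point, inverse-transform."""
    return np.fft.ifft(np.fft.fft(x, axis=axis) * np.fft.fft(w, axis=axis),
                       axis=axis).real

c, h, wd = 2, 4, 4
feat = np.random.default_rng(0).standard_normal((c, h, wd))
weight = np.random.default_rng(1).standard_normal((c, h, wd))  # assumed shape

g1 = branch_global_conv(feat, weight, axis=1)  # first branch: column (height) dim
g2 = branch_global_conv(feat, weight, axis=2)  # second branch: row (width) dim
assert g1.shape == feat.shape and g2.shape == feat.shape
```

For each fixed channel and column index, `g1` equals the direct circular convolution of that column with the corresponding weight column, which is exactly the point-wise-multiply-then-inverse-transform step described above.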
In the process of extracting global features, the computational complexity of the second convolution module can be expressed as O(CHW(log₂H + log₂W)); that is, the computational complexity of the first branch and the second branch each performing FFT-based global convolution on the local feature map is O(CHW(log₂H + log₂W)).
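As a rough sanity check of this expression against an ordinary k×k spatial convolution, whose cost is on the order of CHW·k² multiply-adds (a standard back-of-the-envelope count; the numbers below are illustrative, not measurements):

```python
import math

def fft_cost(c, h, w):
    # O(C*H*W*(log2 H + log2 W)) for the FFT-based global convolution.
    return c * h * w * (math.log2(h) + math.log2(w))

def conv_cost(c, h, w, k):
    # O(C*H*W*k^2) multiply-adds for an ordinary k x k convolution.
    return c * h * w * k * k

# At the 112x112 resolution cited above, log2(112) + log2(112) ~ 13.6 < 49 = 7^2,
# so in this rough count the FFT route is already cheaper than a 7x7 convolution.
assert fft_cost(64, 112, 112) < conv_cost(64, 112, 112, 7)
```

The gap widens with resolution, since the FFT term grows only logarithmically in H and W while a kernel large enough to stay "global" would have to grow with the image.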
Optionally, in order to reduce the computational complexity of the convolutional neural network model, when the fast Fourier transform is applied to the first local feature map and its weight matrix along the column dimension, a one-dimensional fast Fourier transform is used, yielding the first local feature map and its weight matrix in one-dimensional form (e.g., as arrays). For example, if the first local feature map is denoted (C, H, W), a one-dimensional fast Fourier transform along the column dimension yields a transformed first local feature map in numerical form with H elements per column.
In an embodiment of the present application, the spatial-domain local feature map and its weight matrix are converted into frequency-domain form via the fast Fourier transform for global feature extraction, so that the spatial-domain convolution becomes a simple multiplication, reducing the amount of computation of the convolutional neural network model. At the same time, because the local feature maps are fast-Fourier-transformed and their global features extracted along different dimensions, global features of different dimensions are obtained and then concatenated into global features covering both the width and height directions. This reduces the amount of computation when extracting the global features of the local feature map and facilitates deploying and applying the convolutional neural network model on devices with limited computing power.
In some embodiments, the second convolution module further includes a position embedding module, and before the local feature map is split along the channel direction, the method further includes:
Performing feature extraction on the local feature map through the position embedding module to obtain a position feature map, and adding the position feature map to the local feature map to obtain a local feature map containing position features.
Optionally, before global feature extraction is performed on the local feature map, the position embedding module performs convolution on the local feature map to extract its position features and generate a position feature map, which is then added to the local feature map pixel-position-wise to obtain a local feature map with the position features embedded.
Optionally, the position embedding module is a two-dimensional convolution module that convolves its input (the local feature map) to generate a two-dimensional position feature map, i.e., a position feature map whose size matches the resolution of the input local feature map, so that the position feature map can be added directly to the local feature map to obtain a local feature map containing position features. For example, the position embedding module may adopt a two-layer lightweight convolutional structure, i.e., the simple arrangement "convolution + normalization + activation function + convolution"; since a two-dimensional position feature map needs to be generated, the convolutional layers may use 3×3 depthwise separable convolutions to convolve the local feature map and generate the position feature map.
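A hedged sketch of such a two-layer "convolution + normalization + activation + convolution" position embedding using 3×3 depthwise convolutions; the specific normalization (global standardization) and activation (ReLU) are assumptions, since the text does not fix them:

```python
import numpy as np

def depthwise3x3(x, k):
    """3x3 depthwise convolution with zero padding: x is (C, H, W), k is (C, 3, 3),
    each channel convolved with its own 3x3 kernel."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(padded[ch, i:i + 3, j:j + 3] * k[ch])
    return out

def position_embed(x, k1, k2, eps=1e-5):
    """conv -> normalization -> activation -> conv; the output keeps the
    (C, H, W) shape of x, so it can be added to x pixel-position-wise."""
    t = depthwise3x3(x, k1)
    t = (t - t.mean()) / (t.std() + eps)  # placeholder normalization (assumed)
    t = np.maximum(t, 0.0)                # ReLU activation (assumed)
    pos = depthwise3x3(t, k2)             # two-dimensional position feature map
    return x + pos                        # embed position features by addition

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
y = position_embed(x, rng.standard_normal((4, 3, 3)), rng.standard_normal((4, 3, 3)))
assert y.shape == x.shape
```

Because the position feature map has the same resolution as the input, the embedding is a plain element-wise addition, which is what lets the subsequent split and FFT branches proceed unchanged.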
In an embodiment of the present application, since the position features of an image strengthen the ability to describe and distinguish the image content, before global features are extracted from the local feature map, the position features of the local feature map are extracted and embedded into it so that the local feature map contains position features. Subsequent image recognition based on a global feature map containing position features can therefore achieve higher recognition accuracy.
In some embodiments, the network structure of the second convolution module is as shown in FIG. 3 and may include the position embedding module, the first branch, and the second branch. After the position embedding module embeds the position features into the local feature map, the local feature map with embedded position features is split to obtain the first and second local feature maps, which are input into the first and second branches, respectively. Based on the column dimension and the row dimension respectively, the two branches apply a one-dimensional fast Fourier transform to their input local feature maps and the corresponding weight matrices, obtaining frequency-domain local feature maps and weight matrices in one-dimensional form; each frequency-domain local feature map is then multiplied point by point with its corresponding weight matrix to obtain the first and second frequency-domain feature maps, which are converted back to the spatial domain by inverse fast Fourier transform to obtain the first and second global feature maps. Finally, the first and second global feature maps are concatenated to obtain the global feature map of the image to be recognized. In the figure, solid arrows indicate that the data flow (feature data) is in real-valued form, and dashed arrows indicate that the data flow is in complex-valued form, i.e., frequency-domain feature data.
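The data flow just described can be strung together as one end-to-end sketch (NumPy; the circular boundary behavior, the weight shapes, and a precomputed position feature map are all assumptions made for illustration):

```python
import numpy as np

def fft_branch(x, w, axis):
    # 1-D FFT along one spatial axis, point-wise multiply, inverse FFT.
    return np.fft.ifft(np.fft.fft(x, axis=axis) * np.fft.fft(w, axis=axis),
                       axis=axis).real

def second_conv_module(x, pos, w_first, w_second):
    """End-to-end sketch of the FIG. 3 data flow: embed position features,
    split along channels, run the two FFT branches, concatenate."""
    x = x + pos                                # position embedding (precomputed here)
    c = x.shape[0]
    first, second = x[: c // 2], x[c // 2 :]   # split along the channel direction
    g1 = fft_branch(first, w_first, axis=1)    # first branch: column/height dimension
    g2 = fft_branch(second, w_second, axis=2)  # second branch: row/width dimension
    return np.concatenate([g1, g2], axis=0)    # global feature map

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
out = second_conv_module(
    x,
    rng.standard_normal((4, 8, 8)),            # position feature map (stand-in)
    rng.standard_normal((2, 8, 8)),            # first-branch weights (assumed shape)
    rng.standard_normal((2, 8, 8)),            # second-branch weights (assumed shape)
)
assert out.shape == x.shape
```

Note that only the branch interiors carry complex values (the dashed arrows in FIG. 3); the module's input and output are both real-valued.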
对应于上述基于卷积神经网络模型的图像识别方法,图4示出了本申请实施例提供的一种卷积神经网络模型训练方法的流程示意图,详述如下:Corresponding to the above-mentioned image recognition method based on the convolutional neural network model, FIG4 shows a flow chart of a convolutional neural network model training method provided in an embodiment of the present application, which is described in detail as follows:
获取构建的卷积神经网络模型,并将样本图像输入到上述卷积神经网络进行训练,直至上述卷积神经网络满足预设要求,得到卷积神经网络模型。The constructed convolutional neural network model is obtained, and the sample image is input into the convolutional neural network for training until the convolutional neural network meets the preset requirements to obtain the convolutional neural network model.
其中,上述卷积神经网络基于快速傅里叶变换对样本图像进行频域的全局卷积。Among them, the above convolutional neural network performs global convolution on the sample image in the frequency domain based on fast Fourier transform.
可选地,在训练卷积神经网络之前,预先根据用户需求构建卷积神经网络,即根据用户图像识别任务的需求设置卷积神经网络的网络结构(如可以基于现有的ResNet、VGGNet网络进行构建),以实现相应的图像识别任务。 例如,用户需要训练用于进行目标检测的卷积神经网络模型,为了实现对图像中不同大小的目标的检测,达到较好的检测效果,可以基于SSD(Single Shot MultiBox Detector,单次多目标检测器)网络结构构建卷积神经网络,通过在不同特征尺度上检测实现对不同尺度的目标的检测。Optionally, before training the convolutional neural network, a convolutional neural network is pre-built according to user needs, that is, the network structure of the convolutional neural network is set according to the needs of the user's image recognition task (such as being built based on the existing ResNet and VGGNet networks) to achieve the corresponding image recognition task. For example, the user needs to train a convolutional neural network model for target detection. In order to detect targets of different sizes in an image and achieve better detection results, a convolutional neural network can be built based on the SSD (Single Shot MultiBox Detector) network structure to detect targets of different scales by detecting at different feature scales.
具体地,获取预先构建的卷积神经网络,将相应的样本图像作为上述卷积神经网络的输入进行训练,直至上述卷积神经网络满足预设的要求(如卷积神经网络的识别准确度到达预设阈值,如0.99),则停止训练卷积神经网络,得到训练完成的卷积神经网络模型。其中,在进行卷积神经网络的训练时,对样本图像以及权重矩阵进行快速傅里叶变换处理,使其转换为频域形式的样本图像和权重矩阵,从而从频域采用权重矩阵对该样本图像进行全局特征提取,即基于快速傅里叶变换将空间域的输入图像转换为频域的形式,再对转换后的频域的样本图像进行乘法运算,从而实现快速的全局特征的提取,使得在提取全局特征的过程中,以及图像的分辨率较大时,可以有效减小提取图像的全局特征的计算量。Specifically, a pre-constructed convolutional neural network is obtained, and the corresponding sample image is used as the input of the convolutional neural network for training until the convolutional neural network meets the preset requirements (such as the recognition accuracy of the convolutional neural network reaches a preset threshold, such as 0.99), then the training of the convolutional neural network is stopped to obtain a trained convolutional neural network model. Wherein, when training the convolutional neural network, the sample image and the weight matrix are processed by fast Fourier transform to convert them into sample images and weight matrices in the frequency domain, so as to extract global features of the sample image using the weight matrix from the frequency domain, that is, the input image in the spatial domain is converted into the form of the frequency domain based on the fast Fourier transform, and then the converted sample image in the frequency domain is multiplied, so as to realize the rapid extraction of global features, so that in the process of extracting global features, and when the resolution of the image is large, the amount of calculation of extracting the global features of the image can be effectively reduced.
Optionally, the sample images are labeled sample images corresponding to the user's image recognition task, so that they can be used for training directly, without further annotation. During training, part of the sample images can be used as the training set, and the remainder as the validation set and test set, so as to tune the convolutional neural network and obtain a good model. For example, when training a convolutional neural network model for pedestrian re-identification, the Market1501 dataset can be used as the training set. Market1501 contains 32,217 images of 1,501 pedestrians captured by 6 cameras; each pedestrian is captured by at least 2 cameras, a single camera may contribute multiple images of the same pedestrian, and the dataset is divided into a training set and a test set.
Optionally, when sample images are fed to the convolutional neural network for training, a single training step may take one image or several, e.g. 100. When sample images are input in batches, the number of images per input is the batch size (Batch Size); correspondingly, the shape of the features the network extracts from the sample images can be expressed in the four-dimensional format (B, H, W, C), where B is the batch size, H the height, W the width, and C the number of channels.
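For illustration only, the (B, H, W, C) layout just described can be sketched as follows, assuming a batch of 100 RGB images at an arbitrary 224×224 resolution (the resolution is our example, not the patent's):

```python
import numpy as np

# A batch of 100 RGB sample images of size 224x224 in (B, H, W, C) layout.
batch_size, height, width, channels = 100, 224, 224, 3
batch = np.zeros((batch_size, height, width, channels), dtype=np.float32)
print(batch.shape)  # (100, 224, 224, 3)
```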
In the embodiments of the present application, a convolutional neural network is constructed in advance according to the user's needs, and labeled sample images are fed to it as input for training until it meets the preset requirement, yielding the trained convolutional neural network model. Because the input image is converted from the spatial domain to the frequency domain by the fast Fourier transform, the spatial-domain convolution becomes a frequency-domain multiplication. This effectively reduces the amount of computation during global feature extraction, lowers the model's computational complexity on large input images, makes it easier to deploy and run the model on devices with limited computing power, and also increases the model's running speed.
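As a rough back-of-the-envelope illustration of the saving claimed above (constant factors ignored; the operation counts and function names are our own approximation, not figures from the patent): a direct spatial-domain global convolution multiplies every output pixel against an image-sized kernel, while the FFT route costs on the order of n·log n for an image of n pixels.

```python
import math

def direct_global_conv_mults(h, w):
    # every one of the h*w output pixels multiplies against an h*w kernel
    return (h * w) ** 2

def fft_global_conv_mults(h, w):
    n = h * w
    # forward FFT + pointwise product + inverse FFT, roughly 2*n*log2(n) + n
    return int(2 * n * math.log2(n) + n)

for size in (64, 256, 1024):
    d = direct_global_conv_mults(size, size)
    f = fft_global_conv_mults(size, size)
    print(size, d // f)  # the advantage grows quickly with image size
```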
对应于上述基于卷积神经网络模型的图像识别方法或卷积神经网络模型训练方法,以下基于部分应用场景对基于卷积神经网络模型的图像识别方法进行介绍。Corresponding to the above-mentioned image recognition method based on the convolutional neural network model or the convolutional neural network model training method, the image recognition method based on the convolutional neural network model is introduced below based on some application scenarios.
(1) Remote sensing detection
High-resolution remote sensing images carry a large amount of information and depict complex natural scenes: a single remote sensing image often contains many categories of ground objects and geomorphic elements such as buildings, sites, vegetation, and farmland, and target detection in remote sensing images has long been a hot research topic. Most existing remote sensing target detection models have deep structures and complex connection channels, and remote sensing image data is both more plentiful and larger in extent than natural images. Extracting global features from remote sensing images with ordinary convolutions and large kernels is too computationally complex and its detection efficiency is low, which also limits deployment and use of such models in many scenarios with limited computing resources. The image recognition method based on a convolutional neural network model provided in this application targets exactly this problem, the heavy cost of extracting global features from large images by convolution: by converting the spatial-domain convolution into a frequency-domain multiplication, the global features of the image to be detected are extracted with a much smaller amount of computation, thereby improving the detection efficiency of the convolutional neural network model.
First, the collected image to be detected is input into the convolutional neural network model (the target detection model). The first convolution module extracts local features of the image, capturing edge, corner, line, and similar feature information, and produces a local feature map, which then serves as the input of the second convolution module. In the second convolution module, the position embedding module embeds the position features of the image to be detected into the local feature map so that it contains more position information. The local feature map is then split along the channel direction into a first local feature map and a second local feature map, which are input into the first branch and the second branch. A one-dimensional fast Fourier transform is applied to the first local feature map along the column dimension and to the second local feature map along the row dimension, yielding their frequency-domain forms. At the same time, the weight matrices corresponding to the first and second local feature maps are processed by the same one-dimensional fast Fourier transform, and each frequency-domain weight matrix is multiplied pointwise with its corresponding frequency-domain local feature map to obtain the first frequency-domain feature map and the second frequency-domain feature map. These are inverse fast Fourier transformed and concatenated to obtain the complete global feature map; finally, the detection (recognition) module processes the global feature map and outputs the corresponding detection result. Because the local feature map is split before computation, reducing the amount of work, and each half undergoes only a one-dimensional fast Fourier transform, the spatial-domain convolution is converted into a frequency-domain multiplication, which greatly reduces the computation needed for global feature extraction and effectively improves target detection efficiency on remote sensing images.
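The split/transform/multiply/merge pipeline just described can be sketched in numpy as follows. This is a simplified single-image sketch under our own assumptions (real-valued frequency-domain weights of matching shape, a full complex FFT rather than any optimized variant, and no position embedding); it is not the patent's implementation.

```python
import numpy as np

def branch(feat, weight, axis):
    """1D FFT along `axis`, pointwise multiply in the frequency domain, inverse FFT."""
    f = np.fft.fft(feat, axis=axis)
    return np.real(np.fft.ifft(f * weight, axis=axis))

def second_conv_module(local_feat, w_col, w_row):
    """Split (H, W, C) along channels, run the two branches, concatenate back."""
    c = local_feat.shape[-1] // 2
    first, second = local_feat[..., :c], local_feat[..., c:]
    g1 = branch(first, w_col, axis=0)   # first branch: column dimension
    g2 = branch(second, w_row, axis=1)  # second branch: row dimension
    return np.concatenate([g1, g2], axis=-1)

h, w, c = 16, 16, 8
feat = np.random.rand(h, w, c)
w_col = np.random.rand(h, w, c // 2)  # assumed frequency-domain weights
w_row = np.random.rand(h, w, c // 2)
out = second_conv_module(feat, w_col, w_row)
assert out.shape == (h, w, c)  # global feature map keeps the input shape
```

Note that with all-ones weights each branch reduces to the identity (FFT followed by its inverse), a quick sanity check on the transform pair.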
(2) Face recognition
Face recognition technology is now widely used in smart access control, security monitoring, and similar fields. Because face recognition must extract facial features from an image, the images collected for recognition are usually of high clarity, i.e. high resolution. For example, face recognition for smart access control must be reasonably accurate to ensure security, which requires a high-resolution camera; yet the computing resources of a smart access control system are limited, so extracting global features from high-resolution images for recognition is inefficient and ill-suited to practical use. The image recognition method based on a convolutional neural network model provided in this application can be deployed in a smart access control system to effectively reduce the computation needed to extract global features from high-resolution images and thereby improve face recognition efficiency.
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the serial numbers of the steps in the above embodiments does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
实施例二:Embodiment 2:
对应于上文实施例所述的基于卷积神经网络模型的图像识别方法,图5示出了本申请实施例提供的基于卷积神经网络模型的图像识别装置的结构框图,为了便于说明,仅示出了与本申请实施例相关的部分。Corresponding to the image recognition method based on the convolutional neural network model described in the above embodiment, Figure 5 shows a structural block diagram of the image recognition device based on the convolutional neural network model provided in the embodiment of the present application. For the sake of convenience of explanation, only the parts related to the embodiment of the present application are shown.
Referring to FIG. 5, the device includes: an input module 51 and a convolutional neural network model 52, the convolutional neural network model performing frequency-domain global convolution on the image to be recognized based on the fast Fourier transform. Wherein:
输入模块51,用于将待识别图像输入到上述卷积神经网络模型;An input module 51, used to input the image to be recognized into the above-mentioned convolutional neural network model;
卷积神经网络模型52,用于对上述待识别图像依次进行特征提取和识别,得到识别结果。The convolutional neural network model 52 is used to extract features and recognize the above-mentioned images to be recognized in sequence to obtain recognition results.
In the embodiments of the present application, the image to be recognized is input into the trained convolutional neural network model, which sequentially performs feature extraction and recognition on it to obtain the recognition result. Because frequency-domain global convolution is performed on the image to be recognized based on the fast Fourier transform, the spatial-domain convolution is converted into a frequency-domain multiplication. Therefore, when global features of the image to be recognized are extracted, the computation the convolutional neural network model spends extracting global features from large images can be effectively reduced, which improves the model's recognition efficiency and also facilitates deployment on devices with limited computing power.
在一些实施例中,上述图像识别装置还包括:In some embodiments, the image recognition device further includes:
待识别图像获取模块,用于获取待识别图像。The module for acquiring the image to be identified is used to acquire the image to be identified.
在一些实施例中,上述卷积神经网络模型52包括:In some embodiments, the convolutional neural network model 52 includes:
特征提取单元,用于对上述待识别图像进行特征提取;A feature extraction unit, used to extract features from the above-mentioned image to be identified;
识别单元,用于对提取到的特征进行识别,得到识别结果。The recognition unit is used to recognize the extracted features and obtain recognition results.
在一些实施例中,上述特征提取单元包括:In some embodiments, the feature extraction unit includes:
第一卷积单元,用于对上述待识别图像进行局部特征提取,得到局部特征图;A first convolution unit is used to extract local features of the image to be identified to obtain a local feature map;
第二卷积单元,用于采用快速傅里叶变换对上述局部特征图进行频域全局卷积,得到全局特征图。The second convolution unit is used to perform frequency domain global convolution on the local feature map by using fast Fourier transform to obtain a global feature map.
在一些实施例中,上述第二卷积单元包括:In some embodiments, the second convolution unit includes:
拆分单元,用于将上述局部特征图沿通道方向拆分,得到第一局部特征图和第二局部特征图;A splitting unit, used for splitting the local feature map along the channel direction to obtain a first local feature map and a second local feature map;
第一分支单元,用于对采用快速傅里叶变换对第一局部特征图进行频域全局卷积,得到第一全局特征图;A first branch unit is used to perform frequency domain global convolution on the first local feature map by using a fast Fourier transform to obtain a first global feature map;
第二分支单元,用于对采用快速傅里叶变换对第二局部特征图进行频域全局卷积,得到第二全局特征图;The second branch unit is used to perform frequency domain global convolution on the second local feature map by using fast Fourier transform to obtain a second global feature map;
拼接单元,用于将上述第一全局特征图和上述第二全局特征图沿通道方向进行拼接,得到全局特征图。The splicing unit is used to splice the first global feature map and the second global feature map along the channel direction to obtain a global feature map.
在一些实施例中,上述第一分支单元包括:In some embodiments, the first branch unit comprises:
第一变换单元,用于基于列维度对上述第一局部特征图和相应的权重矩阵分别进行快速傅里叶变换处理,得到频域的第一局部特征图和权重矩阵;A first transformation unit is used to perform fast Fourier transform processing on the first local feature map and the corresponding weight matrix based on the column dimension to obtain the first local feature map and the weight matrix in the frequency domain;
第一卷积单元,用于将上述频域的第一局部特征图和权重矩阵逐点相乘,得到第一频域特征图;A first convolution unit is used to multiply the first local feature map in the frequency domain and the weight matrix point by point to obtain a first frequency domain feature map;
第一逆变换单元,用于对上述频域的第一特征图进行快速傅里叶逆变换 处理,得到第一全局特征图。The first inverse transform unit is used to perform a fast Fourier inverse transform on the first feature map in the frequency domain to obtain a first global feature map.
上述第二分支单元包括:The second branch unit comprises:
第二变换单元,用于基于行维度对上述第二局部特征图和相应的权重矩阵分别进行快速傅里叶变换处理,得到频域的第二局部特征图和权重矩阵;A second transformation unit is used to perform fast Fourier transform processing on the second local feature map and the corresponding weight matrix based on the row dimension to obtain the second local feature map and the weight matrix in the frequency domain;
第二卷积单元,用于将上述频域的第二局部特征图和权重矩阵逐点相乘,得到第二频域特征图;A second convolution unit is used to multiply the second local feature map in the frequency domain and the weight matrix point by point to obtain a second frequency domain feature map;
第二逆变换单元,用于对上述频域的第二特征图进行快速傅里叶逆变换处理,得到第二全局特征图。The second inverse transform unit is used to perform inverse fast Fourier transform processing on the second feature map in the frequency domain to obtain a second global feature map.
在一些实施例中,上述第二卷积单元还包括:In some embodiments, the second convolution unit further includes:
位置嵌入单元,用于通过上述位置嵌入模块对上述局部特征图进行特征提取,得到位置特征图,并将上述位置特征图与上述局部特征图相加,得到包含位置特征的局部特征图。The position embedding unit is used to extract features from the local feature map through the position embedding module to obtain a position feature map, and add the position feature map to the local feature map to obtain a local feature map containing position features.
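As a loose sketch of the position embedding unit described above: a position feature map is extracted from the local feature map and added back elementwise. The 3×3 circular neighborhood average below is only a placeholder for whatever extraction layers the module actually uses, which the text does not specify.

```python
import numpy as np

def extract_position_features(feat):
    """Placeholder 'extraction': a 3x3 circular neighborhood average over (H, W)."""
    acc = np.zeros_like(feat)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            acc += np.roll(np.roll(feat, dy, axis=0), dx, axis=1)
    return acc / 9.0

def add_position_features(local_feat):
    """Add the position feature map to the local feature map, keeping its shape."""
    return local_feat + extract_position_features(local_feat)

feat = np.full((4, 4, 2), 3.0)
out = add_position_features(feat)
assert out.shape == feat.shape
```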
对应于上文实施例所述的卷积神经网络模型的训练方法,图6示出了本申请实施例提供的一种卷积神经网络模型训练装置的结构框图,参照图6,该装置包括:Corresponding to the training method of the convolutional neural network model described in the above embodiment, FIG6 shows a structural block diagram of a convolutional neural network model training device provided in an embodiment of the present application. Referring to FIG6 , the device includes:
训练模块61,用于获取构建的卷积神经网络模型,并将样本图像输入到上述卷积神经网络进行训练,直至上述卷积神经网络满足预设要求,得到卷积神经网络模型,其中,上述卷积神经网络基于快速傅里叶变换对样本图像进行频域全局卷积。The training module 61 is used to obtain the constructed convolutional neural network model and input the sample image into the convolutional neural network for training until the convolutional neural network meets the preset requirements to obtain the convolutional neural network model, wherein the convolutional neural network performs frequency domain global convolution on the sample image based on fast Fourier transform.
需要说明的是,上述装置/单元之间的信息交互、执行过程等内容,由于与本申请方法实施例基于同一构思,其具体功能及带来的技术效果,具体可参见方法实施例部分,此处不再赘述。It should be noted that the information interaction, execution process, etc. between the above-mentioned devices/units are based on the same concept as the method embodiment of the present application. Their specific functions and technical effects can be found in the method embodiment part and will not be repeated here.
实施例三:Embodiment three:
图7为本申请一实施例提供的终端设备的结构示意图。如图7所示,该实施例的终端设备7包括:至少一个处理器70(图7中仅示出一个处理器)、存储器71以及存储在所述存储器71中并可在所述至少一个处理器70上运行的计算机程序72,所述处理器70执行所述计算机程序72时实现上述任意各个方法实施例中的步骤。FIG7 is a schematic diagram of the structure of a terminal device provided in an embodiment of the present application. As shown in FIG7 , the terminal device 7 of this embodiment includes: at least one processor 70 (only one processor is shown in FIG7 ), a memory 71, and a computer program 72 stored in the memory 71 and executable on the at least one processor 70, and when the processor 70 executes the computer program 72, the steps in any of the above-mentioned method embodiments are implemented.
例如,所述计算机程序72可以被分割成输入模块51和卷积神经网络模型52,上述卷积神经网络模型基于快速傅里叶变换对待识别图像进行频域全局卷积,各模块之间具体功能如下:For example, the computer program 72 can be divided into an input module 51 and a convolutional neural network model 52. The convolutional neural network model performs frequency domain global convolution on the image to be identified based on fast Fourier transform. The specific functions of each module are as follows:
输入模块51,用于将待识别图像输入到上述卷积神经网络模型;An input module 51, used to input the image to be recognized into the above-mentioned convolutional neural network model;
卷积神经网络模型52,用于对上述待识别图像依次进行特征提取和识别,得到识别结果。The convolutional neural network model 52 is used to extract features and recognize the above-mentioned images to be recognized in sequence to obtain recognition results.
或者,上述计算机程序72可以被分割成训练模块61,模块具体功能如下:Alternatively, the computer program 72 may be divided into training modules 61, and the specific functions of the modules are as follows:
训练模块61,用于获取构建的卷积神经网络模型,并将样本图像输入到上述卷积神经网络进行训练,直至上述卷积神经网络满足预设要求,得到卷积神经网络模型,其中,上述卷积神经网络基于快速傅里叶变换对样本图像进行频域全局卷积。The training module 61 is used to obtain the constructed convolutional neural network model and input the sample image into the convolutional neural network for training until the convolutional neural network meets the preset requirements to obtain the convolutional neural network model, wherein the convolutional neural network performs frequency domain global convolution on the sample image based on fast Fourier transform.
所述终端设备7可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。该终端设备可包括,但不仅限于,处理器70、存储器71。本领域技术人员可以理解,图7仅仅是终端设备7的举例,并不构成对终端设备7的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如还可以包括输入输出设备、网络接入设备等。The terminal device 7 may be a computing device such as a desktop computer, a notebook, a PDA, a cloud server, etc. The terminal device may include, but not limited to, a processor 70 and a memory 71. Those skilled in the art will appreciate that FIG. 7 is merely an example of the terminal device 7 and does not constitute a limitation on the terminal device 7. The terminal device 7 may include more or fewer components than shown in the figure, or may combine certain components, or different components, and may also include, for example, input and output devices, network access devices, etc.
所称处理器70可以是中央处理单元(Central Processing Unit,CPU),该处理器70还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The processor 70 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor, etc.
所述存储器71在一些实施例中可以是所述终端设备7的内部存储单元,例如终端设备7的硬盘或内存。所述存储器71在另一些实施例中也可以是所述终端设备7的外部存储设备,例如所述终端设备7上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述存储器71还可以既包括所述终端设备7的内部存储单元也包括外部存储设备。所述存储器71用于存储操作系统、应用程序、引导装载程序(BootLoader)、数据以及其他程序等,例如所述 计算机程序的程序代码等。所述存储器71还可以用于暂时地存储已经输出或者将要输出的数据。In some embodiments, the memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or memory of the terminal device 7. In other embodiments, the memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card (Flash Card), etc. equipped on the terminal device 7. Further, the memory 71 may also include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used to store an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 71 may also be used to temporarily store data that has been output or is to be output.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。The technicians in the relevant field can clearly understand that for the convenience and simplicity of description, only the division of the above-mentioned functional units and modules is used as an example for illustration. In practical applications, the above-mentioned function allocation can be completed by different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiment can be integrated in a processing unit, or each unit can exist physically separately, or two or more units can be integrated in one unit. The above-mentioned integrated unit can be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing each other, and are not used to limit the scope of protection of this application. The specific working process of the units and modules in the above-mentioned system can refer to the corresponding process in the aforementioned method embodiment, which will not be repeated here.
本申请实施例还提供了一种网络设备,该网络设备包括:至少一个处理器、存储器以及存储在所述存储器中并可在所述至少一个处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现上述任意各个方法实施例中的步骤。An embodiment of the present application also provides a network device, which includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, wherein the processor implements the steps in any of the above-mentioned method embodiments when executing the computer program.
本申请实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现可实现上述各个方法实施例中的步骤。An embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the above-mentioned method embodiments can be implemented.
本申请实施例提供了一种计算机程序产品,当计算机程序产品在终端设备上运行时,使得终端设备执行时实现可实现上述各个方法实施例中的步骤。An embodiment of the present application provides a computer program product. When the computer program product is run on a terminal device, the terminal device can implement the steps in the above-mentioned method embodiments when executing the computer program product.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质至少可以包括:能够将计算机程序代码携带到拍照装置/终端设备的任何实体或 装置、记录介质、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质。例如U盘、移动硬盘、磁碟或者光盘等。在某些司法管辖区,根据立法和专利实践,计算机可读介质不可以是电载波信号和电信信号。If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the present application implements all or part of the processes in the above-mentioned embodiment method, which can be completed by instructing the relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium. When the computer program is executed by the processor, the steps of the above-mentioned various method embodiments can be implemented. Among them, the computer program includes computer program code, which can be in source code form, object code form, executable file or some intermediate form. The computer-readable medium may at least include: any entity or device that can carry the computer program code to the camera/terminal device, recording medium, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electric carrier signal, telecommunication signal and software distribution medium. For example, USB flash drive, mobile hard disk, disk or optical disk. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electric carrier signals and telecommunication signals.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described or recorded in detail in a certain embodiment, reference can be made to the relevant descriptions of other embodiments.
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Those of ordinary skill in the art will appreciate that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professional and technical personnel can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.
在本申请所提供的实施例中,应该理解到,所揭露的装置/网络设备和方法,可以通过其它的方式实现。例如,以上所描述的装置/网络设备实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。In the embodiments provided in the present application, it should be understood that the disclosed devices/network equipment and methods can be implemented in other ways. For example, the device/network equipment embodiments described above are merely schematic. For example, the division of the modules or units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。The embodiments described above are only used to illustrate the technical solutions of the present application, rather than to limit them. Although the present application has been described in detail with reference to the aforementioned embodiments, a person skilled in the art should understand that the technical solutions described in the aforementioned embodiments may still be modified, or some of the technical features may be replaced by equivalents. Such modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included in the protection scope of the present application.
Claims (10)
- 一种基于卷积神经网络模型的图像识别方法,其特征在于,所述卷积神经网络模型基于快速傅里叶变换对待识别图像进行频域全局卷积;An image recognition method based on a convolutional neural network model, characterized in that the convolutional neural network model performs frequency domain global convolution on the image to be recognized based on fast Fourier transform;所述图像识别方法包括:The image recognition method comprises:将待识别图像输入到经训练的所述卷积神经网络模型,通过所述卷积神经网络模型对所述待识别图像依次进行特征提取和识别,得到识别结果。The image to be identified is input into the trained convolutional neural network model, and the convolutional neural network model is used to extract features and identify the image to be identified in turn to obtain a recognition result.
- 如权利要求1所述的图像识别方法,其特征在于,所述卷积神经网络模型包括特征提取模块和识别模块,所述通过所述卷积神经网络模型对所述待识别图像依次进行特征提取和识别,得到识别结果,包括:The image recognition method according to claim 1, characterized in that the convolutional neural network model includes a feature extraction module and a recognition module, and the convolutional neural network model is used to extract features and recognize the image to be recognized in sequence to obtain a recognition result, comprising:通过所述特征提取模块对所述待识别图像进行特征提取;Extracting features of the image to be identified by the feature extraction module;基于所述识别模块对提取到的特征进行识别,得到识别结果。The extracted features are identified based on the identification module to obtain an identification result.
- 如权利要求2所述的图像识别方法,其特征在于,所述特征提取模块包括第一卷积模块和第二卷积模块,所述第二卷积模块基于快速傅里叶变换对所述待识别图像进行频域全局卷积,所述通过所述特征提取模块对所述待识别图像进行特征提取,包括:The image recognition method according to claim 2, characterized in that the feature extraction module includes a first convolution module and a second convolution module, the second convolution module performs frequency domain global convolution on the image to be recognized based on fast Fourier transform, and the feature extraction of the image to be recognized by the feature extraction module includes:基于所述第一卷积模块对所述待识别图像进行局部特征提取,得到局部特征图;Extracting local features of the image to be identified based on the first convolution module to obtain a local feature map;基于所述第二卷积模块采用快速傅里叶变换对所述局部特征图进行频域全局卷积,得到全局特征图。Based on the second convolution module, a fast Fourier transform is used to perform frequency domain global convolution on the local feature map to obtain a global feature map.
- 4. The image recognition method according to claim 3, wherein the second convolution module comprises a first branch and a second branch, and performing frequency-domain global convolution on the local feature map through the second convolution module using the fast Fourier transform to obtain a global feature map comprises: splitting the local feature map along the channel direction to obtain a first local feature map and a second local feature map, and inputting the first local feature map and the second local feature map into the first branch and the second branch respectively; performing frequency-domain global convolution on the respective input local feature maps in the first branch and the second branch using the fast Fourier transform to obtain a first global feature map and a second global feature map; and concatenating the first global feature map and the second global feature map along the channel direction to obtain the global feature map.
- 5. The image recognition method according to claim 4, wherein obtaining the first global feature map and the second global feature map comprises: in the first branch, performing the fast Fourier transform along the column dimension on the first local feature map and its corresponding weight matrix to obtain a frequency-domain first local feature map and weight matrix; multiplying the frequency-domain first local feature map and weight matrix point by point to obtain a first frequency-domain feature map; performing the inverse fast Fourier transform on the first frequency-domain feature map to obtain the first global feature map; in the second branch, performing the fast Fourier transform along the row dimension on the second local feature map and its corresponding weight matrix to obtain a frequency-domain second local feature map and weight matrix; multiplying the frequency-domain second local feature map and weight matrix point by point to obtain a second frequency-domain feature map; and performing the inverse fast Fourier transform on the second frequency-domain feature map to obtain the second global feature map.
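Claims 4 and 5 together describe a two-branch frequency-domain global convolution. As a rough illustration only (the function and variable names below are hypothetical, and the learned weight matrices of the claims are stood in for by plain arrays), the steps can be sketched with NumPy's FFT, relying on the convolution theorem: point-by-point multiplication in the frequency domain followed by an inverse FFT is a circular global convolution in the spatial domain.

```python
import numpy as np

def fft_global_conv(x, w, axis):
    """Global (circular) convolution along one axis: FFT both the
    feature map and the weight matrix, multiply point by point,
    then apply the inverse FFT."""
    n = x.shape[axis]
    xf = np.fft.rfft(x, axis=axis)
    wf = np.fft.rfft(w, axis=axis)
    return np.fft.irfft(xf * wf, n=n, axis=axis)

def two_branch_global_conv(feat, w_col, w_row):
    """feat: (C, H, W) local feature map. Split along the channel
    direction; the first branch convolves along the column
    dimension (H), the second along the row dimension (W); the
    results are concatenated back along the channel direction."""
    c = feat.shape[0] // 2
    first, second = feat[:c], feat[c:]
    g1 = fft_global_conv(first, w_col, axis=1)   # column dimension
    g2 = fft_global_conv(second, w_row, axis=2)  # row dimension
    return np.concatenate([g1, g2], axis=0)
```

Here `w_col` and `w_row` stand in for the per-branch weight matrices (learned parameters in the patent's model); the output global feature map has the same shape as the input local feature map.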
- 6. The image recognition method according to claim 4, wherein the second convolution module further comprises a position embedding module, and before splitting the local feature map along the channel direction, the method further comprises: extracting features of the local feature map through the position embedding module to obtain a position feature map, and adding the position feature map to the local feature map to obtain a local feature map containing position features.
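Read minimally, claim 6's position embedding step is a residual addition: a small convolution over the local feature map produces a position feature map, which is added back to the input. A sketch under that assumption follows; the fixed averaging kernel is a hypothetical placeholder for whatever learned filter the position embedding module actually uses.

```python
import numpy as np

def add_position_features(feat, k=3):
    """feat: (C, H, W). Extract a position feature map with a
    per-channel circular k-by-k convolution (fixed averaging
    kernel as a placeholder), then add it to the input so the
    result contains position features."""
    C, H, W = feat.shape
    kernel = np.full((k, k), 1.0 / (k * k))
    kf = np.fft.rfft2(kernel, s=(H, W))  # shared frequency response
    pos = np.stack([np.fft.irfft2(np.fft.rfft2(ch) * kf, s=(H, W))
                    for ch in feat])
    return feat + pos                    # residual addition
```

Because the placeholder kernel sums to one, a constant input simply doubles, which makes the residual structure easy to verify.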
- 7. A convolutional neural network model training method, comprising: acquiring a constructed convolutional neural network, and inputting sample images into the convolutional neural network for training until the convolutional neural network meets a preset requirement, thereby obtaining a convolutional neural network model; wherein the convolutional neural network performs frequency-domain global convolution on the sample images based on the fast Fourier transform.
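Claim 7's "train until a preset requirement is met" is the standard iterative scheme. A generic sketch with plain gradient descent, where "loss below a tolerance" stands in as one possible reading of the claim's preset requirement (the helper names and the toy least-squares problem are illustrative, not from the patent):

```python
import numpy as np

def train_until(params, loss_fn, grad_fn, data, labels,
                lr=0.1, tol=1e-6, max_steps=10_000):
    """Update parameters by gradient descent until the loss on the
    sample data meets the preset requirement (loss < tol) or the
    step budget runs out; return the trained parameters."""
    for _ in range(max_steps):
        if loss_fn(params, data, labels) < tol:
            break
        params = params - lr * grad_fn(params, data, labels)
    return params

# Tiny least-squares problem standing in for the network's loss:
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
mse = lambda w, X, y: np.mean((X @ w - y) ** 2)
mse_grad = lambda w, X, y: 2.0 * X.T @ (X @ w - y) / len(y)
w = train_until(np.zeros(2), mse, mse_grad, X, y)
```

In the patent's setting the parameters would be the convolutional network's weights (including the frequency-domain weight matrices) and the gradients would come from backpropagation; the stopping logic is the same.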
- 8. An image recognition apparatus, comprising: an input module and a trained convolutional neural network model, the convolutional neural network model performing frequency-domain global convolution on an image to be recognized based on the fast Fourier transform; the input module being configured to input the image to be recognized into the convolutional neural network model; and the convolutional neural network model being configured to perform feature extraction and then recognition on the image to be recognized to obtain a recognition result.
- 9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the image recognition method according to any one of claims 1 to 6 or the convolutional neural network model training method according to claim 7 is implemented.
- 10. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the image recognition method according to any one of claims 1 to 6 or the convolutional neural network model training method according to claim 7 is implemented.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211255315.1 | 2022-10-13 | ||
CN202211255315.1A CN115690488A (en) | 2022-10-13 | 2022-10-13 | Image identification method and device based on convolutional neural network model and terminal equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024077785A1 true WO2024077785A1 (en) | 2024-04-18 |
Family
ID=85065349
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/142412 WO2024077785A1 (en) | 2022-10-13 | 2022-12-27 | Image recognition method and apparatus based on convolutional neural network model, and terminal device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115690488A (en) |
WO (1) | WO2024077785A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118587706A (en) * | 2024-08-01 | 2024-09-03 | 苏州宝丽迪材料科技股份有限公司 | Fiber color master batch aggregation and dispersion ultrastructural detection method |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116070695B (en) * | 2023-04-03 | 2023-07-18 | 中国科学技术大学 | Training method of image detection model, image detection method and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150036946A1 (en) * | 2013-07-30 | 2015-02-05 | Hewlett-Packard Indigo B.V. | Metrics to identify image smoothness |
CN113627472A (en) * | 2021-07-05 | 2021-11-09 | 南京邮电大学 | Intelligent garden defoliating pest identification method based on layered deep learning model |
CN113869330A (en) * | 2021-10-12 | 2021-12-31 | 大连智慧渔业科技有限公司 | Underwater fish target detection method and device and storage medium |
CN115100301A (en) * | 2022-07-19 | 2022-09-23 | 重庆七腾科技有限公司 | Image compression sensing method and system based on fast Fourier convolution and convolution filtering flow |
- 2022-10-13: Chinese application CN202211255315.1A filed (published as CN115690488A, status: pending)
- 2022-12-27: PCT application PCT/CN2022/142412 filed (published as WO2024077785A1, status: unknown)
Also Published As
Publication number | Publication date |
---|---|
CN115690488A (en) | 2023-02-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024077785A1 (en) | Image recognition method and apparatus based on convolutional neural network model, and terminal device | |
WO2024001123A1 (en) | Image recognition method and apparatus based on neural network model, and terminal device | |
US10534957B2 (en) | Eyeball movement analysis method and device, and storage medium | |
US9208608B2 (en) | Systems and methods for feature tracking | |
US8792722B2 (en) | Hand gesture detection | |
US8750573B2 (en) | Hand gesture detection | |
CN110503076B (en) | Video classification method, device, equipment and medium based on artificial intelligence | |
US9996755B2 (en) | Method and image processing apparatus for image-based object feature description | |
WO2024077781A1 (en) | Convolutional neural network model-based image recognition method and apparatus, and terminal device | |
CN114612987B (en) | Expression recognition method and device | |
CN109815823B (en) | Data processing method and related product | |
CN110852311A (en) | Three-dimensional human hand key point positioning method and device | |
CN113158773B (en) | Training method and training device for living body detection model | |
CN107330387B (en) | Pedestrian detection method based on image data | |
WO2022166258A1 (en) | Behavior recognition method and apparatus, terminal device, and computer-readable storage medium | |
CN112507897A (en) | Cross-modal face recognition method, device, equipment and storage medium | |
WO2017156864A1 (en) | Method, apparatus, and device for image recognition, and nonvolatile computer storage medium | |
CN113673584A (en) | Image detection method and related device | |
CN114330565A (en) | Face recognition method and device | |
CN112580480A (en) | Hyperspectral remote sensing image classification method and device | |
Prasad et al. | Mobile plant species classification: a low computational approach | |
Ma et al. | ApLeafis: an android-based plant leaf identification system | |
Deng et al. | Attention-aware dual-stream network for multimodal face anti-spoofing | |
Chandra et al. | A novel method for cnn training using existing color datasets for classifying hand postures in bayer images | |
CN108764289B (en) | Method and system for classifying UI (user interface) abnormal pictures based on convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22961963 Country of ref document: EP Kind code of ref document: A1 |