WO2024077781A1 - Image recognition method and apparatus based on convolutional neural network model, and terminal device - Google Patents

Image recognition method and apparatus based on convolutional neural network model, and terminal device

Info

Publication number
WO2024077781A1
WO2024077781A1 (PCT application PCT/CN2022/142241)
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
image
neural network
convolutional neural
local feature
Application number
PCT/CN2022/142241
Other languages
English (en)
French (fr)
Inventor
张号逵
杨涛
胡文泽
王孝宇
Original Assignee
深圳云天励飞技术股份有限公司
Application filed by 深圳云天励飞技术股份有限公司
Publication of WO2024077781A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application belongs to the field of image recognition technology, and in particular, relates to an image recognition method, apparatus, terminal device, and computer-readable storage medium based on a convolutional neural network model.
  • A convolutional neural network is a special kind of artificial neural network. It has become one of the most commonly used techniques in speech analysis and image recognition, and is therefore also a hot topic in artificial neural network research.
  • A convolutional neural network is usually a multi-layer neural network: each layer consists of multiple feature maps, and each feature map consists of multiple independent neurons. Neurons in the same feature map share weights (i.e., the convolution kernel).
  • By sharing weights, a convolutional neural network reduces the number of connections between network layers while also reducing the risk of overfitting.
  • the embodiments of the present application provide an image recognition method, an apparatus, and a terminal device based on a convolutional neural network model, which can improve the image recognition effect of the convolutional neural network model on images of different sizes.
  • an embodiment of the present application provides an image recognition method based on a convolutional neural network model, wherein the convolutional neural network model performs global feature extraction on the image to be recognized based on a weight matrix of the image to be recognized, and the image recognition method comprises:
  • the image to be recognized is input into the trained convolutional neural network model, and the convolutional neural network model sequentially performs feature extraction and recognition on the image to be recognized to obtain a recognition result.
  • an embodiment of the present application provides a convolutional neural network model training method, comprising:
  • the convolutional neural network extracts global features of the sample image based on the weight matrix of the sample image.
  • an image recognition device including:
  • An input module and a trained convolutional neural network model wherein the convolutional neural network model performs global feature extraction on the image to be identified based on a weight matrix of the image to be identified;
  • the input module is used to input the image to be recognized into the convolutional neural network model
  • the convolutional neural network model is used to extract and recognize features of the image to be recognized in sequence to obtain a recognition result.
  • an embodiment of the present application provides a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the steps of the image recognition method based on the convolutional neural network model described in the first aspect or the convolutional neural network model training method described in the second aspect are implemented.
  • an embodiment of the present application provides a computer-readable storage medium, which stores a computer program.
  • when the computer program is executed by a processor, it implements the steps of the image recognition method based on the convolutional neural network model described in the first aspect or the convolutional neural network model training method described in the second aspect.
  • an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the image recognition method based on the convolutional neural network model described in any one of the first aspect or the convolutional neural network model training method described in the second aspect.
  • the image to be recognized is input into the trained convolutional neural network model, and the convolutional neural network model sequentially extracts features and recognizes the image to be recognized to obtain a recognition result. Since the convolutional neural network model extracts global features of the image to be recognized based on the weight matrix of the image to be recognized, when performing image recognition on images to be recognized of different sizes, the weight matrix can be dynamically adjusted according to each image to be recognized to extract global features of the image to be recognized, so that the convolutional neural network model has a good recognition effect on input images of different sizes.
  • FIG1 is a schematic diagram of a flow chart of an image recognition method based on a convolutional neural network model provided by an embodiment of the present application;
  • FIG2 is a schematic diagram of the structure of a convolutional neural network model provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of the structure of a convolution module for dynamically extracting a weight matrix provided in an embodiment of the present application
  • FIG4 is a schematic diagram of the structure of a second convolution module provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of a flow chart of a convolutional neural network model training method provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an image recognition device based on a convolutional neural network model provided in an embodiment of the present application;
  • FIG7 is a schematic diagram of the structure of a convolutional neural network model training device provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of a terminal device provided in an embodiment of the present application.
  • references to "one embodiment” or “some embodiments” etc. described in the specification of this application mean that one or more embodiments of the present application include specific features, structures or characteristics described in conjunction with the embodiment. Therefore, the phrases “in one embodiment”, “in some embodiments”, “in some other embodiments”, “in some other embodiments”, etc. appearing in different places in this specification do not necessarily refer to the same embodiment, but mean “one or more but not all embodiments", unless otherwise specifically emphasized in other ways.
  • Embodiment 1:
  • FIG1 shows a schematic flow chart of an image recognition method based on a convolutional neural network model provided by an embodiment of the present invention, which is described in detail as follows:
  • the image to be recognized is input into the trained convolutional neural network model, and the convolutional neural network model sequentially performs feature extraction and recognition on the image to be recognized to obtain a recognition result.
  • the above-mentioned convolutional neural network model extracts global features of the image to be identified based on the weight matrix of the image to be identified to perform image recognition, that is, extracts global features of the image to be identified according to the weight matrix of the currently input image to be identified.
  • the weight matrix mentioned above refers to the convolution kernel of the convolutional neural network.
  • in image processing, each pixel of the output image is obtained by taking a weighted average of a small region of pixels in the input image; the weights are defined by a function, and this function is the convolution kernel.
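  • As an illustration of this weighted-average view, below is a minimal NumPy sketch (the function and values are hypothetical, chosen here for illustration and not taken from the patent):

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Each output pixel is the kernel-weighted sum of a small neighborhood
    of input pixels ('valid' padding). Strictly this is cross-correlation,
    which CNN literature conventionally calls convolution."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.full((3, 3), 1 / 9)      # uniform weights: a local averaging kernel
print(conv2d_valid(image, kernel))   # 3x3 output of neighborhood averages
```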
  • for different input images, the weight matrix (convolution kernel) of a conventional convolutional neural network is fixed.
  • since the images to be recognized in an image recognition task vary in size, a single fixed weight matrix cannot be adjusted dynamically to the input image; moreover, the traditional convolution operation has only a local receptive field and cannot extract the global features of the image well, which affects the accuracy of image recognition. Therefore, when the image to be recognized is input into the above-mentioned convolutional neural network model for feature extraction, the global features of the image are extracted according to the weight matrix of that image, and the image is recognized according to the obtained features to produce the image recognition result. The above-mentioned convolutional neural network model thus has both a global receptive field and dynamic weights, which improves its recognition accuracy.
  • for example, for an image A to be recognized, the global features of image A are extracted according to its weight matrix a, and image A is recognized according to the obtained global features to obtain its recognition result; when an image B is to be recognized, its global features are extracted and recognized according to its weight matrix b.
  • the image to be identified is input into a trained convolutional neural network model, and the convolutional neural network model sequentially extracts features and identifies the image to be identified to obtain an identification result. Since the convolutional neural network model extracts global features of the image to be identified according to the weight matrix of the image to be identified during the feature extraction process, the convolutional neural network model has both a global receptive field and dynamic weights, and therefore, it can better extract features from images to be identified of different sizes, thereby improving the recognition accuracy of the convolutional neural network model.
  • the above-mentioned image recognition method based on the convolutional neural network model further includes:
  • the image to be identified may be an image captured by a camera device, or may be an image frame in a video stream captured by a camera device.
  • since the camera equipment adopted and the rules for collecting the images to be detected differ across application fields, the corresponding images to be recognized are acquired according to the collection methods and collection rules of each application field.
  • the convolutional neural network model includes a feature extraction module and a recognition module.
  • the steps of extracting features and recognizing the image to be recognized in sequence through the convolutional neural network model to obtain a recognition result include:
  • the extracted features are recognized to obtain recognition results.
  • the feature extraction module is used to extract features from the input image to be recognized, and the extracted features are used as the input of the recognition module, which performs the corresponding recognition according to the image recognition task to obtain a recognition result.
  • the recognition module may include one or more recognition units, and different recognition units perform different recognition tasks. For example, it may include a face recognition unit and a target detection unit: the face recognition unit performs the face recognition task on the extracted features, or the extracted features are respectively input into the face recognition unit and the target detection unit to perform the face recognition and target detection tasks.
  • features are extracted from the image to be recognized by the feature extraction module, and the recognition module performs the corresponding recognition on the extracted features to obtain the corresponding recognition results, which improves the recognition efficiency of each image recognition task.
  • the feature extraction module includes a first convolution module and a second convolution module
  • step A1 includes:
  • the above-mentioned convolutional neural network model can be constructed on the basis of an existing convolutional neural network, with the shallow convolution layers serving as the first convolution module and the deep convolution layers (or self-attention layers) replaced by the second convolution module. The first convolution module applies ordinary convolution to the image to be recognized, extracts its local features, and outputs a local feature map of the image, which is used as the input of the second convolution module. The second convolution module obtains the weight matrix of the local feature map by dynamic convolution and extracts the global features of the local feature map according to that weight matrix to obtain a global feature map.
  • for example, in the convolutional neural network model shown in FIG2, the first two layers form the first convolution module of ordinary convolution, the four deeper convolution layers form the second convolution module, and the second convolution module is connected to the recognition module. The image to be recognized is used as the input of the first convolution module for local feature extraction, the output features are input into the second convolution module for global feature extraction, and the global feature map output by the second convolution module is used as the input of the recognition module for recognition, so that the corresponding recognition result is output (a code sketch of this layout is given below).
  • the first convolution module and the second convolution module in the above-mentioned convolutional neural network model can also be arranged in an alternating structure, that is, a first convolution module is connected to a second convolution module, and the output of that second convolution module is connected to another first convolution module.
  • the structure shown in FIG2 can also be stacked multiple times, as long as the feature map output by the feature extraction module is the global feature map extracted by the second convolution module (that is, the recognition module performs recognition based on the global features extracted by the second convolution module); the embodiment of the present application does not limit the specific structure of the first convolution module (ordinary convolution layers) and the second convolution module in the convolutional neural network model.
  • local features of the image to be recognized are extracted by the first convolution module of the convolutional neural network model, the obtained local feature map is used as the input of the second convolution module, and global features are extracted from the local feature map according to the weight matrix of the local feature map. The resulting features therefore contain both local and global information, which improves the recognition accuracy of the convolutional neural network model.
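  • The following PyTorch sketch shows one possible reading of this layout: shallow ordinary convolutions, then dynamic global-convolution blocks, then a recognition head. All names, widths, and layer counts here are illustrative assumptions, not the patent's specification; `dynamic_block` stands in for the second convolution module sketched further below.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """FIG2-style sketch: first convolution module (ordinary convolutions),
    then four dynamic blocks (second convolution module), then a head."""
    def __init__(self, dynamic_block, in_ch=3, width=64, num_classes=10):
        super().__init__()
        self.local = nn.Sequential(          # first convolution module
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
        )
        self.global_ = nn.Sequential(        # second convolution module
            *[dynamic_block(width) for _ in range(4)]
        )
        self.head = nn.Sequential(           # recognition module
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, num_classes)
        )

    def forward(self, x):
        return self.head(self.global_(self.local(x)))
```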
  • the second convolution module includes a first branch and a second branch
  • the step A12 includes:
  • the local feature map is split along the channel direction to obtain a first local feature map and a second local feature map, and the first local feature map and the second local feature map are input into the first branch and the second branch respectively.
  • the local feature map is first split along the channel direction (for example, evenly into two halves) to obtain a first local feature map and a second local feature map; the first local feature map is input into the first branch and the second local feature map into the second branch, so that global features are extracted from each of them respectively.
  • the first branch and the second branch respectively extract features from the corresponding local feature maps according to the weight matrix of the input local feature maps to obtain a first global feature map and a second global feature map.
  • the first branch generates a weight matrix of the first local feature map according to the first local feature map, and extracts global features of the first local feature map according to the weight matrix to obtain a first global feature map
  • the second branch generates a weight matrix of the second local feature map according to the second local feature map, and extracts global features of the second local feature map according to the weight matrix to obtain a second global feature map.
  • the first global feature map and the second global feature map are concatenated along the channel direction to obtain a global feature map.
  • since the first local feature map and the second local feature map are obtained by splitting the local feature map along the channel direction, after the first branch and the second branch respectively extract the features of the first local feature map and the second local feature map, the obtained first global feature map and second global feature map are spliced along the channel direction to obtain a complete global feature map, so that subsequent processing is performed according to the complete global feature map of the input image to be recognized.
  • the local feature map is split into two parts along the channel direction and input into the first branch and the second branch, the number of channels of the local feature map is halved. Therefore, global features are extracted based on the local feature map with halved channel number, which reduces the computational complexity of global feature extraction, thereby reducing the requirements on the computing power of the device and facilitating the deployment of applications on devices with lower computing power.
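  • A minimal sketch of this split-process-splice pattern (the branch modules are placeholders supplied by the caller; names are assumptions):

```python
import torch

def split_branches_concat(x, branch_h, branch_w):
    """Halve the channels, run each half through one branch, and splice the
    two resulting global feature maps back together along the channel axis."""
    x1, x2 = torch.chunk(x, 2, dim=1)              # each (B, C/2, H, W)
    return torch.cat([branch_h(x1), branch_w(x2)], dim=1)
```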
  • when the first branch and the second branch extract global features from the input local feature maps, the process includes:
  • a column-dimension weight matrix is generated from the first local feature map to obtain a first weight matrix, and convolution is performed on the first local feature map along the column dimension according to the first weight matrix to obtain a first global feature map.
  • specifically, when the first branch performs feature extraction on the first local feature map, convolution is performed along the column dimension (i.e., the H dimension) of the first local feature map to dynamically generate its weight matrix, yielding the first weight matrix; global feature extraction is then performed on the first local feature map according to the first weight matrix to obtain the first global feature map.
  • the first local feature map is subjected to one-dimensional convolution processing along the column dimension, that is, a convolution operation is performed on each row of pixels in the height direction.
  • for example, the first local feature map can be represented as (B, C, H, W)
  • a lightweight one-dimensional convolution module is used to perform convolution processing along the column dimension on the first local feature map to generate the first weight matrix (B, C, H, 1) of the first local feature map. Since the first weight matrix is obtained by performing a convolution operation on each row of pixels in the column dimension, the first weight matrix has a global receptive field.
  • the first local feature map is subjected to circular convolution along the column dimension according to the first weight matrix to extract the global features of the first local feature map and obtain the first global feature map.
  • a depthwise separable convolution is used for this processing to reduce the number of parameters and the amount of computation.
  • the one-dimensional convolution module adopts a lightweight network structure with two convolution layers: average pooling is performed on the first local feature map (B, C, H, W) along the column dimension to reduce its dimension, giving a first local feature map of shape (B, C, H, 1).
  • a depthwise separable convolution with a 1×3 convolution kernel is then applied to the above first local feature map, followed by a HardSwish activation function and batch normalization, and a second 1×3 depthwise separable convolution outputs the first weight matrix, wherein the HardSwish activation function is an improvement on the Swish nonlinear activation function.
  • the Swish activation function can improve the accuracy of a convolutional neural network to a certain extent, but its computational cost is high and it is not suitable for use on embedded mobile devices.
  • the HardSwish activation function can be implemented as a piecewise function to reduce the number of memory accesses, which improves the accuracy of the convolutional neural network and makes it convenient to deploy on embedded mobile devices.
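  • A hedged PyTorch sketch of this weight-generation module for the first branch follows. Two points are assumptions rather than the patent's specification: plain depthwise convolutions are used for brevity (a full depthwise-separable block would add 1×1 pointwise convolutions), and since the pooled tensor has spatial shape (H, 1), the sliding direction of the "1×3" kernel is taken to be the H axis, i.e. kernel_size=(3, 1).

```python
import torch
import torch.nn as nn

class ColumnWeightGenerator(nn.Module):
    """Generate a dynamic weight matrix of shape (B, C, H, 1) from a feature
    map (B, C, H, W): average-pool the W axis away, then apply two depthwise
    1-D convolutions along H with HardSwish and batch normalization between."""
    def __init__(self, channels: int):
        super().__init__()
        self.dw1 = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                             padding=(1, 0), groups=channels)
        self.act = nn.Hardswish()            # piecewise approximation of Swish
        self.bn = nn.BatchNorm2d(channels)
        self.dw2 = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                             padding=(1, 0), groups=channels)

    def forward(self, x):                    # x: (B, C, H, W)
        w = x.mean(dim=3, keepdim=True)      # average pooling -> (B, C, H, 1)
        return self.dw2(self.bn(self.act(self.dw1(w))))
```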
  • in the second branch, a row-dimension weight matrix is generated from the second local feature map to obtain a second weight matrix, and convolution is performed on the second local feature map along the row dimension according to the second weight matrix to obtain a second global feature map.
  • when the second branch performs feature extraction on the second local feature map, the weight matrix of the second local feature map is dynamically generated from the second local feature map to obtain the second weight matrix, and convolution is then performed on the second local feature map with the second weight matrix to obtain the second global feature map.
  • one-dimensional convolution processing is performed on the second local feature map along the row dimension (i.e., the horizontal dimension), that is, a convolution operation is performed on each column of pixels in the width direction.
  • the second local feature map is represented as (B, C, H, W)
  • a lightweight one-dimensional convolution module is used to perform convolution processing on the second local feature map along the W dimension to generate the second weight matrix (B, C, 1, W) of the second local feature map.
  • the second local feature map is circularly convolved along the row dimension according to the second weight matrix to extract the global features of the second local feature map to obtain the second global feature map.
  • as in the first branch, depthwise separable convolution can be used for this processing to reduce the number of parameters and the amount of computation.
  • the one-dimensional convolution module adopts a lightweight network structure similar to that shown in FIG3, except that average pooling is performed along the row dimension to obtain a second weight matrix of shape (B, C, 1, W), and the 1×3 depthwise separable convolution is replaced by a 3×1 depthwise separable convolution.
  • in the convolutional neural network model, a one-dimensional weight matrix is used to perform circular convolution on the local feature map along different dimensions to obtain global features of different dimensions, and the obtained global features are spliced into global features covering both the width and the height direction, which reduces the amount of computation required to extract the global features of the local feature map.
  • the weight matrices of the first local feature map and the second local feature map are dynamically extracted respectively, and then the global features of the corresponding local feature maps are extracted according to their weight matrices, so that the convolutional neural network model has good adaptability to images to be identified of different scales, thereby improving the image recognition effect of the convolutional neural network model.
  • the second convolution module further includes a position embedding module, and the above steps further include:
  • the position embedding module extracts features from the local feature map according to the weight matrix of the local feature map to obtain a position feature map, and adds the position feature map to the local feature map to obtain a local feature map containing position features.
  • convolution is performed on the local feature map by the position embedding module: the weight matrix of the local feature map is dynamically generated, local feature extraction is performed on the local feature map according to that weight matrix to generate a position feature map, and the position feature map is added to the local feature map to obtain a local feature map with the position features embedded.
  • the position embedding module is a two-dimensional convolution module which performs convolution on the input local feature map to generate a two-dimensional position feature map; that is, the size of the position feature map is kept consistent with the resolution of the input local feature map, so that the position feature map can be directly added to the local feature map to obtain a local feature map containing position features.
  • the position embedding module adopts a two-layer lightweight convolutional network structure, that is, a simple "convolution + normalization + activation function + convolution" structure. Since a two-dimensional position feature map needs to be generated, the convolution layers can use a 3×3 depthwise separable convolution to process the local feature map and generate the position feature map.
  • the position features of an image can enhance the ability to describe and distinguish the image content
  • the position features of the local feature map are extracted based on the weight matrix of the local feature map, and the position features are embedded into the local feature map so that the local feature map contains the position features. Therefore, the recognition accuracy can be improved when image recognition is subsequently performed based on the global feature map containing the position features.
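  • A minimal sketch of such a position embedding module, assuming the "convolution + normalization + activation + convolution" layout above with 3×3 depthwise convolutions (the exact layer choices are assumptions, not taken from the patent drawings):

```python
import torch
import torch.nn as nn

class PositionEmbedding(nn.Module):
    """Two-layer lightweight convolutional position embedding. The position
    feature map keeps the input resolution, so it can be added directly to
    the local feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.Hardswish(),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
        )

    def forward(self, x):           # x: (B, C, H, W) local feature map
        return x + self.net(x)      # embed position features by addition
```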
  • the network structure of the second convolution module is shown in FIG4 , and may include a position embedding module, a first branch, and a second branch.
  • the local feature map embedded with the position feature is split to obtain the first local feature map and the second local feature map and input them into the first branch and the second branch, respectively.
  • the first branch and the second branch extract global features along the column dimension and the row dimension, respectively, according to the weight matrices of the first local feature map and the second local feature map, to obtain the first global feature map and the second global feature map, which are then spliced to obtain the global feature map.
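  • Putting the pieces together, below is a hedged end-to-end sketch of the second convolution module, reusing the PositionEmbedding and ColumnWeightGenerator classes sketched above. The row branch reuses the column generator on a transposed view, and the circular convolution uses an FFT helper (circular convolution in space equals pointwise multiplication in frequency). This is an assumed assembly for illustration, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

def circular_conv(x, w, dim):
    """Circular convolution along `dim` (2 = H, 3 = W) via the FFT route."""
    n = x.shape[dim]
    return torch.fft.irfft(torch.fft.rfft(x, n=n, dim=dim) *
                           torch.fft.rfft(w, n=n, dim=dim), n=n, dim=dim)

class DynamicGlobalConv(nn.Module):
    """Sketch of the second convolution module: embed position features,
    split channels, run column-wise (H) and row-wise (W) dynamic circular
    convolutions, then splice the two global feature maps back together."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.pos = PositionEmbedding(channels)     # sketched earlier
        self.gen_h = ColumnWeightGenerator(half)   # weights (B, C/2, H, 1)
        self.gen_w = ColumnWeightGenerator(half)   # reused on a transposed view

    def forward(self, x):
        x = self.pos(x)
        x1, x2 = torch.chunk(x, 2, dim=1)
        y1 = circular_conv(x1, self.gen_h(x1), dim=2)          # column branch
        w2 = self.gen_w(x2.transpose(2, 3)).transpose(2, 3)    # (B, C/2, 1, W)
        y2 = circular_conv(x2, w2, dim=3)                      # row branch
        return torch.cat([y1, y2], dim=1)

# e.g. DynamicGlobalConv(64)(torch.randn(2, 64, 32, 40)).shape == (2, 64, 32, 40)
```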
  • specifically, the first branch performs a circular convolution on the first local feature map along the column dimension according to the first weight matrix, which can be written as y_h = Σ_{k=0}^{H−1} w_k · x^P_{((h−k))_H}, h = 0, 1, ..., H−1, where: H refers to the number of rows (i.e., the height) of the first local feature map; w_k is the k-th weight (convolution kernel) in the first weight matrix, and if the length of the first weight matrix is not H, the first weight matrix is interpolated to make its length H; ((·))_H denotes the periodic extension of the first local feature map to H points; x^P refers to the first local feature map with the position feature added, i.e., x^P = x + p, where p (namely pe) represents the position feature corresponding to the first local feature map.
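  • As a sketch of how this formula can be evaluated, the column-dimension case of the FFT helper used in the assembly sketch above is annotated against the formula below; the FFT route is an assumed implementation choice, not one stated in the patent:

```python
import torch

def circular_conv_h(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """y_h = sum_k w_k * x^P_{((h-k))_H} along dim 2, for every column.
    x: (B, C, H, W) local feature map with position features already added;
    w: (B, C, H, 1) dynamic first weight matrix (interpolated to length H).
    Circular convolution in space equals pointwise product in frequency."""
    H = x.shape[2]
    X = torch.fft.rfft(x, n=H, dim=2)
    W = torch.fft.rfft(w, n=H, dim=2)   # broadcasts over the W axis
    return torch.fft.irfft(X * W, n=H, dim=2)
```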
  • FIG5 shows a flow chart of a convolutional neural network model training method provided in an embodiment of the present application, which is described in detail as follows:
  • the constructed convolutional neural network is obtained, and sample images are input into the convolutional neural network for training until the convolutional neural network meets the preset requirements, yielding the trained convolutional neural network model.
  • a convolutional neural network is constructed according to user needs (for example, if a face recognition task is required, the convolutional neural network needs to set a classification head to classify and recognize face images), that is, the network structure of the convolutional neural network is set according to user needs (for example, it can be constructed based on the existing VGGNet and AlexNet network structures), and the constructed convolutional neural network is initialized.
  • the convolutional neural network is constructed based on the network structure of the residual network ResNet18, wherein the residual network ResNet18 includes 17 convolutional layers and 1 fully connected layer.
  • a pre-constructed convolutional neural network is obtained, and the corresponding sample image is used as the input of the convolutional neural network for training until the convolutional neural network meets the preset requirements (such as the recognition accuracy of the convolutional neural network reaches a preset threshold, such as 0.988), then the training of the convolutional neural network is stopped to obtain a trained convolutional neural network model.
  • the convolutional neural network extracts global features of the input image according to the weight matrix of the input image, and the input image is the input sample image.
  • when the above-mentioned convolutional neural network extracts the global features of the input image according to the weight matrix of the input image, it generates the weight matrix of the input image through convolution and then extracts the global features of the input image according to that weight matrix. It should be pointed out that, since the above-mentioned convolutional neural network adopts dynamic weights instead of a fixed weight matrix, when the parameters of the convolutional neural network are updated according to the loss function, updating the weight matrix amounts to updating the parameters of the convolution module that generates the weight matrix of the input image.
  • the sample images are images that have already been labeled according to the image recognition task to be performed, so that they can be used directly for training without additional labeling.
  • the CelebA face attribute data set is used as the training set.
  • CelebA contains 202,599 face images of 10,177 identities. Each image is annotated with attribute labels, and the images can be used as the input of the convolutional neural network to train it.
  • part of the sample images can be used as training sets, and part of the sample images can be used as verification sets and test sets to adjust the convolutional neural network and obtain a good convolutional neural network model.
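  • If CelebA is used as described, one plausible loading route is torchvision's built-in CelebA dataset class, sketched below with placeholder paths and an assumed 128×128 resize (none of these choices come from the patent):

```python
import torch
from torchvision import datasets, transforms

# CelebA: 202,599 face images of 10,177 identities, each annotated with
# attribute labels; torchvision exposes ready-made train/valid/test splits.
transform = transforms.Compose([transforms.Resize((128, 128)),
                                transforms.ToTensor()])
train_set = datasets.CelebA(root="data", split="train",
                            target_type="attr",   # per-image attribute labels
                            transform=transform, download=True)
loader = torch.utils.data.DataLoader(train_set, batch_size=100, shuffle=True)
images, attrs = next(iter(loader))   # images: (100, 3, 128, 128) = (B, C, H, W)
```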
  • the sample image input for a single training can be one or more, such as 100.
  • the number of input sample images is represented as the batch size Batch Size.
  • the characteristic shape of the sample image extracted by the convolutional neural network can be represented in a four-dimensional format (B, H, W, C), where B represents the batch size, H represents the height, W represents the width, and C represents the channel.
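  • A hedged sketch of such a training loop with synthetic multi-scale batches, reusing the Backbone and DynamicGlobalConv sketches above (the model, loss, optimizer, and sizes are placeholder choices, not the patent's):

```python
import torch
import torch.nn as nn

model = Backbone(DynamicGlobalConv, in_ch=3, width=64, num_classes=10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step in range(100):                     # "until preset requirements are met"
    side = [96, 128, 160][step % 3]         # multi-scale: input size varies per batch
    images = torch.randn(8, 3, side, side)  # (B, C, H, W); B is the batch size
    labels = torch.randint(0, 10, (8,))
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()       # gradients also flow into the weight-generating convs
    optimizer.step()
```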
  • a convolutional neural network is constructed according to user needs, and labeled sample images are used as inputs of the convolutional neural network for training until the convolutional neural network meets preset requirements, thereby obtaining a trained convolutional neural network model.
  • the convolutional neural network includes a dynamic convolution module for performing global feature extraction on the input image according to the weight matrix of the input image
  • features are extracted from each sample image according to the weight matrix corresponding to that sample image, and the sample images do not need to share the same static weight matrix, so the convolutional neural network can adapt well to sample images of different scales during training, which reduces the difficulty of multi-scale training and gives the convolutional neural network model a good recognition effect on multi-scale images.
  • the image recognition method based on the convolutional neural network model is introduced below based on some application scenarios.
  • Pedestrian detection has always been a hot topic and difficulty in computer vision research.
  • the problem to be solved in pedestrian detection is to find all pedestrians in an image or video frame, including their positions and sizes, which are generally represented by rectangular bounding boxes.
  • Pedestrian detection can be combined with technologies such as pedestrian tracking and pedestrian re-identification, and applied to fields such as unmanned driving and intelligent transportation. Therefore, pedestrian detection technology has a strong application value. Since the images to be identified by pedestrian detection are of different sizes, the image recognition method based on the convolutional neural network model provided in this application is precisely aimed at the problem of different image sizes. By dynamically obtaining the weight matrix of each image to be identified, the global features of the image to be identified are extracted, thereby performing pedestrian detection and improving the accuracy of pedestrian detection.
  • the collected image to be identified is input into the convolutional neural network model (pedestrian detection model), and the local features of the image to be identified are extracted by the first convolution module to obtain feature information such as edges, corners, and lines of the image to be identified, and a local feature map is obtained, and the local feature map is used as the input of the second convolution module.
  • since pedestrian detection detects pedestrians in the image, there is a strong spatial relationship between the features of a pedestrian (such as the body and the head).
  • the local feature map is convolved by the position embedding module to obtain the position feature map of the local feature map, which is added to the local feature map so that it contains position information. The local feature map is then split along the channel direction to obtain the first local feature map and the second local feature map, which are input into the first branch and the second branch; convolution is performed along the column dimension and the row dimension respectively to obtain the weight matrices of the image to be recognized, and the corresponding local feature maps are then convolved according to these weight matrices to obtain the first global feature map and the second global feature map, which are spliced along the channel direction to obtain a complete global feature map.
  • the recognition module obtains the global feature map output by the second convolution module, detects the existing pedestrian features based on the global feature map, and outputs the corresponding detection results.
  • in detection tasks, the sizes of the images to be detected and of the targets to be detected are often not fixed.
  • detection tasks related to autonomous driving may require the detection of large trucks and animals at the same time.
  • Medical lesion detection tasks may require the simultaneous detection of lesions of different sizes.
  • targets of different sizes are usually detected by acquiring images of different sizes to be identified.
  • it is usually difficult for a model with a fixed weight matrix to adapt well to detection targets with large size differences.
  • the weight matrix can be dynamically adjusted according to the input image to be identified, and the image to be identified can be detected according to its own weight matrix, so as to achieve better detection effect.
  • Embodiment 2:
  • Figure 6 shows a structural block diagram of the image recognition device provided in the embodiment of the present application. For the sake of convenience of explanation, only the parts related to the embodiment of the present application are shown.
  • the device includes: an input module 61 and a convolutional neural network model 62, wherein the convolutional neural network model extracts global features of the image to be identified based on the weight matrix of the image to be identified.
  • An input module 61 used to input the image to be recognized into the above-mentioned convolutional neural network model
  • the convolutional neural network model 62 is used to extract features and recognize the above-mentioned images to be recognized in sequence to obtain recognition results.
  • the image to be identified is input into a trained convolutional neural network model, and the convolutional neural network model sequentially extracts features and identifies the image to be identified to obtain an identification result. Since the convolutional neural network model extracts global features of the image to be identified according to the weight matrix of the image to be identified during the feature extraction process, the convolutional neural network model has both a global receptive field and dynamic weights, and therefore, it can better extract features from images to be identified of different sizes, thereby improving the recognition accuracy of the convolutional neural network model.
  • the image recognition device further includes:
  • the module for acquiring the image to be identified is used to acquire the image to be identified.
  • the convolutional neural network model 62 includes:
  • a feature extraction unit used to extract features from the above-mentioned image to be identified
  • the recognition unit is used to recognize the extracted features and obtain recognition results.
  • the feature extraction unit includes:
  • a first convolution unit is used to extract local features of the image to be identified to obtain a local feature map
  • the second convolution unit is used to extract global features from the local feature map according to the weight matrix of the local feature map to obtain a global feature map.
  • the second convolution unit includes:
  • a splitting unit used for splitting the local feature map along the channel direction to obtain a first local feature map and a second local feature map
  • a first branch unit is used to extract global features from the first local feature map according to the weight matrix of the first local feature map to obtain a first global feature map;
  • a second branch unit is used to extract global features from the second local feature map according to the weight matrix of the second local feature map to obtain a second global feature map;
  • the splicing unit is used to splice the first global feature map and the second global feature map along the channel direction to obtain a global feature map.
  • the first branch unit comprises:
  • a first weight unit used for performing convolution processing on the first local feature image to generate a first weight matrix
  • a global feature extraction unit used for performing global feature extraction on the first local feature map according to the first weight matrix to obtain a first global feature map
  • the second branch unit comprises:
  • a second weight unit used for performing convolution processing on the second local feature image to generate a second weight matrix
  • the global feature extraction unit is used to perform global feature extraction on the second local feature map according to the second weight matrix to obtain a second global feature map.
  • the second convolution unit further includes:
  • the position embedding unit is used to extract features from the local feature map according to the weight matrix of the local feature map to obtain a position feature map, and add the position feature map to the local feature map to obtain a local feature map containing position features.
  • FIG. 7 shows a structural block diagram of a convolutional neural network model training device provided in an embodiment of the present application.
  • the device includes:
  • the training module 71 is used to obtain the constructed convolutional neural network model and input the sample image into the convolutional neural network for training until the convolutional neural network meets the preset requirements to obtain the convolutional neural network model, wherein the convolutional neural network performs global feature extraction on the sample image based on the weight matrix of the sample image.
  • a convolutional neural network is constructed according to user needs, and labeled sample images are used as inputs of the convolutional neural network for training until the convolutional neural network meets preset requirements, thereby obtaining a trained convolutional neural network model.
  • the convolutional neural network includes a dynamic convolution module for performing global feature extraction on the input image according to the weight matrix of the input image
  • features are extracted from each sample image according to the weight matrix corresponding to each sample image, and each sample image does not need to use the same static weight matrix, so that the convolutional neural network can adapt well to sample images of different scales during the training process, thereby reducing the difficulty of multi-scale training of the convolutional neural network and enabling the convolutional neural network model to have a good recognition effect on multi-scale images.
  • FIG8 is a schematic diagram of the structure of a terminal device provided in an embodiment of the present application.
  • the terminal device 8 of this embodiment includes: at least one processor 80 (only one processor is shown in FIG8 ), a memory 81, and a computer program 82 stored in the memory 81 and executable on the at least one processor 80, and when the processor 80 executes the computer program 82, the steps in any of the above-mentioned method embodiments are implemented.
  • the computer program 82 may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 to complete the present application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, which are used to describe the execution process of the computer program 82 in the terminal device 8.
  • the computer program 82 may be divided into an input module 61 and a convolutional neural network model 62, which performs global feature extraction on the image to be identified based on the weight matrix of the image to be identified.
  • the specific functions of each module are as follows:
  • An input module 61 used to input the image to be recognized into the above-mentioned convolutional neural network model
  • the convolutional neural network model 62 is used to extract features and recognize the above-mentioned images to be recognized in sequence to obtain recognition results.
  • the computer program 82 may be divided into a training module 71, the specific functions of which are as follows:
  • the training module 71 is used to obtain the constructed convolutional neural network model and input the sample image into the convolutional neural network for training until the convolutional neural network meets the preset requirements to obtain the convolutional neural network model, wherein the convolutional neural network performs global feature extraction on the sample image based on the weight matrix of the sample image.
  • the terminal device 8 may be a computing device such as a desktop computer, a notebook, a PDA, a cloud server, etc.
  • the terminal device may include, but not limited to, a processor 80 and a memory 81.
  • FIG8 is merely an example of the terminal device 8 and does not constitute a limitation on the terminal device 8.
  • the terminal device 8 may include more or fewer components than shown in the figure, or may combine certain components, or different components, and may also include, for example, input and output devices, network access devices, etc.
  • the processor 80 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or memory of the terminal device 8. In other embodiments, the memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a smart media card (SMC), a secure digital (SD) card, a flash card (Flash Card), etc. Further, the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device. The memory 81 is used to store an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 81 may also be used to temporarily store data that has been output or is to be output.
  • those skilled in the relevant art can clearly understand that, for the convenience and simplicity of description, the division into the above-mentioned functional units and modules is merely used as an example for illustration. In practical applications, the above-mentioned functions can be allocated to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
  • the functional units and modules in the embodiment can be integrated in a processing unit, or each unit can exist physically separately, or two or more units can be integrated in one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or in the form of software functional units.
  • An embodiment of the present application also provides a network device, which includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, wherein the processor implements the steps in any of the above-mentioned method embodiments when executing the computer program.
  • An embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the above-mentioned method embodiments can be implemented.
  • An embodiment of the present application provides a computer program product.
  • the terminal device can implement the steps in the above-mentioned method embodiments when executing the computer program product.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the present application implements all or part of the processes in the above-mentioned method embodiments, which can be accomplished by a computer program instructing the relevant hardware.
  • the computer program can be stored in a computer-readable storage medium.
  • the computer program is executed by the processor, the steps of the above-mentioned method embodiments can be implemented.
  • the computer program includes computer program code, which can be in source code form, object code form, executable file or some intermediate form.
  • the computer-readable medium may at least include: any entity or device that can carry the computer program code to the camera/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and a software distribution medium, for example, a USB flash drive, a mobile hard disk, a magnetic disk, or an optical disk. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electric carrier signals and telecommunication signals.
  • the disclosed devices/network equipment and methods can be implemented in other ways.
  • the device/network equipment embodiments described above are merely illustrative.
  • the division of the modules or units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • another point is that the mutual coupling, direct coupling, or communication connections shown or discussed may be implemented through some interfaces, or as indirect couplings or communication connections between devices or units, and may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Abstract

This application is applicable to the field of image recognition technology and provides an image recognition method and apparatus based on a convolutional neural network model, and a terminal device, wherein the convolutional neural network performs global feature extraction on an image to be recognized based on a weight matrix of the image to be recognized, and the image recognition method includes: inputting the image to be recognized into the trained convolutional neural network model, and sequentially performing feature extraction and recognition on the image to be recognized by means of the convolutional neural network model to obtain a recognition result. The present application can improve the image recognition effect of a convolutional neural network on images of different sizes.

Description

Image recognition method and apparatus based on convolutional neural network model, and terminal device
This application claims priority to Chinese patent application No. 202211254852.4, entitled "基于卷积神经网络模型的图像识别方法、装置及终端设备" (Image recognition method and apparatus based on convolutional neural network model, and terminal device), filed with the China Patent Office on October 13, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of image recognition technology, and in particular relates to an image recognition method, apparatus, terminal device, and computer-readable storage medium based on a convolutional neural network model.
Background
A convolutional neural network is a special kind of artificial neural network. It has become one of the most commonly used techniques in speech analysis and image recognition, and is therefore also a hot topic in artificial neural network research. A convolutional neural network is usually a multi-layer neural network: each layer consists of multiple feature maps, and each feature map consists of multiple independent neurons. Neurons in the same feature map share weights (i.e., the convolution kernel). It is precisely by sharing weights that a convolutional neural network reduces the number of connections between network layers while also reducing the risk of overfitting.
When performing feature extraction, existing convolutional neural networks usually convolve a large feature map with a small kernel matrix (i.e., a weight matrix) to save parameters, but because they share one static weight matrix, they cannot adapt well to input images of different sizes.
Summary
The embodiments of the present application provide an image recognition method and apparatus based on a convolutional neural network model, and a terminal device, which can improve the image recognition effect of a convolutional neural network model on images of different sizes.
In a first aspect, an embodiment of the present application provides an image recognition method based on a convolutional neural network model, wherein the convolutional neural network model performs global feature extraction on an image to be recognized based on a weight matrix of the image to be recognized, and the image recognition method includes:
inputting the image to be recognized into the trained convolutional neural network model, and sequentially performing feature extraction and recognition on the image to be recognized by means of the convolutional neural network model to obtain a recognition result.
In a second aspect, an embodiment of the present application provides a convolutional neural network model training method, including:
obtaining a constructed convolutional neural network, and inputting sample images into the convolutional neural network for training until the convolutional neural network meets preset requirements, so as to obtain a convolutional neural network model;
wherein the convolutional neural network performs global feature extraction on the sample images based on the weight matrices of the sample images.
In a third aspect, an embodiment of the present application provides an image recognition device, including:
an input module and a trained convolutional neural network model, wherein the convolutional neural network model performs global feature extraction on the image to be recognized based on a weight matrix of the image to be recognized;
the input module is used to input the image to be recognized into the convolutional neural network model;
the convolutional neural network model is used to sequentially perform feature extraction and recognition on the image to be recognized to obtain a recognition result.
In a fourth aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the steps of the image recognition method based on the convolutional neural network model described in the first aspect or the convolutional neural network model training method described in the second aspect are implemented.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the steps of the image recognition method based on the convolutional neural network model described in the first aspect or the convolutional neural network model training method described in the second aspect are implemented.
In a sixth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the image recognition method based on the convolutional neural network model described in any one of the first aspect or the convolutional neural network model training method described in the second aspect.
Compared with the prior art, the embodiments of the present application have the following beneficial effects:
In the embodiments of the present application, the image to be recognized is input into the trained convolutional neural network model, and the convolutional neural network model sequentially performs feature extraction and recognition on the image to be recognized to obtain a recognition result. Since the convolutional neural network model performs global feature extraction on the image to be recognized based on the weight matrix of the image to be recognized, when performing image recognition on images of different sizes, the weight matrix can be dynamically adjusted according to each image to extract its global features, so that the convolutional neural network model has a good recognition effect on input images of different sizes.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below.
FIG1 is a schematic flow chart of an image recognition method based on a convolutional neural network model provided by an embodiment of the present application;
FIG2 is a schematic structural diagram of a convolutional neural network model provided by an embodiment of the present application;
FIG3 is a schematic structural diagram of a convolution module for dynamically extracting a weight matrix provided by an embodiment of the present application;
FIG4 is a schematic structural diagram of a second convolution module provided by an embodiment of the present application;
FIG5 is a schematic flow chart of a convolutional neural network model training method provided by an embodiment of the present application;
FIG6 is a schematic structural diagram of an image recognition device based on a convolutional neural network model provided by an embodiment of the present application;
FIG7 is a schematic structural diagram of a convolutional neural network model training device provided by an embodiment of the present application;
FIG8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Detailed Description of the Embodiments
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary details do not obscure the description of the present application.
It should be understood that, when used in the specification and the appended claims of this application, the term "comprising" indicates the presence of the described features, wholes, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the term "and/or" used in the specification and the appended claims of this application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
In addition, in the description of the specification and the appended claims of this application, the terms "first", "second", "third", etc. are only used to distinguish the descriptions and cannot be understood as indicating or implying relative importance.
References to "one embodiment" or "some embodiments" etc. described in the specification of this application mean that one or more embodiments of the present application include the specific features, structures or characteristics described in conjunction with that embodiment. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", etc. appearing in different places in this specification do not necessarily refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized.
Embodiment 1:
FIG1 shows a schematic flow chart of an image recognition method based on a convolutional neural network model provided by an embodiment of the present invention, which is described in detail as follows:
The image to be recognized is input into the trained convolutional neural network model, and the convolutional neural network model sequentially performs feature extraction and recognition on the image to be recognized to obtain a recognition result.
The above convolutional neural network model performs global feature extraction on the image to be recognized based on the weight matrix of the image to be recognized in order to perform image recognition, that is, the global features of the image to be recognized are extracted according to the weight matrix of the currently input image to be recognized.
The above weight matrix refers to the convolution kernel of the convolutional neural network. In image processing, each pixel of the output image is obtained as a weighted average of a small region of pixels in the input image, where the weights are defined by a function, and this function is the convolution kernel; for different input images, the weight matrix (convolution kernel) of a conventional convolutional neural network is fixed.
Specifically, since the images to be recognized in an image recognition task vary in size, a single fixed weight matrix cannot be adjusted dynamically to the input image; moreover, the traditional convolution operation has only a local receptive field and cannot extract the global features of the image well, which affects the accuracy of image recognition. Therefore, when the image to be recognized is input into the above convolutional neural network model for feature extraction, the global features of the image are extracted according to the weight matrix of that image, and recognition is performed according to the obtained features to produce the image recognition result, so that the above convolutional neural network model has both a global receptive field and dynamic weights, which improves its recognition accuracy. For example, for an image A to be recognized, the global features of image A are extracted according to its weight matrix a and recognition is performed on the obtained global features to obtain the recognition result of image A; when an image B is recognized, its global features are extracted and recognized according to its weight matrix b.
In the embodiment of the present application, the image to be recognized is input into the trained convolutional neural network model, and the convolutional neural network model sequentially performs feature extraction and recognition on the image to be recognized to obtain a recognition result. Since the convolutional neural network model extracts the global features of the image to be recognized according to the weight matrix of that image during feature extraction, the model has both a global receptive field and dynamic weights; it can therefore extract features well from images of different sizes, which improves the recognition accuracy of the convolutional neural network model.
In some embodiments, the above image recognition method based on the convolutional neural network model further includes:
acquiring the image to be recognized.
Optionally, the above image to be recognized may be an image captured by a camera device, or may be an image frame in a video stream captured by a camera device.
Optionally, since different image recognition tasks may require different images to be recognized, the camera equipment adopted and the rules for collecting the images to be detected also differ; therefore, the corresponding images to be recognized are acquired according to the collection methods and collection rules of each application field. For example, for a pedestrian re-identification task, the images collected by multiple installed cameras need to be acquired as the images to be recognized in order to perform the pedestrian re-identification task.
In the embodiment of the present application, according to the images required by the image recognition task in each application field, the corresponding collection methods and collection rules are adopted to acquire images to be recognized that meet the requirements of the image recognition task, so as to carry out the image recognition task.
In some embodiments, the convolutional neural network model includes a feature extraction module and a recognition module, and the step of performing feature extraction and recognition on the image in sequence through the convolutional neural network model to obtain a recognition result includes:
A1. performing feature extraction on the image to be recognized through the feature extraction module;
A2. recognizing the extracted features based on the recognition module to obtain a recognition result.
Optionally, since image recognition covers different tasks such as image classification and object detection (e.g., face recognition, pedestrian detection), and different recognition tasks apply different recognition methods to the same features, feature extraction is performed on the input image through the feature extraction module, and the extracted features are used as the input of the recognition module for task-appropriate recognition, yielding the recognition result. The recognition module may include one or more recognition units, each performing a different task; for example, it may include a face recognition unit and an object detection unit, where the face recognition unit performs face recognition on the extracted features, or the extracted features are respectively input into the face recognition unit and the object detection unit for face recognition and object detection.
In the embodiments of the present application, feature extraction is performed on the image to be recognized through the feature extraction module, and the recognition module performs the corresponding recognition on the extracted features to obtain the corresponding recognition result, improving the efficiency of each image recognition task.
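As a hedged illustration of this module split (PyTorch assumed; the pooling and linear head below are placeholders rather than the patent's design), a shared feature extractor can feed one or more task-specific recognition units like this:

```python
import torch.nn as nn

class Recognizer(nn.Module):
    """Shared feature extractor feeding a task-specific head (illustrative sketch)."""
    def __init__(self, extractor: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.extractor = extractor
        self.pool = nn.AdaptiveAvgPool2d(1)                # collapse H x W to 1 x 1
        self.face_head = nn.Linear(feat_dim, num_classes)  # e.g. a face-recognition unit

    def forward(self, x):                                  # x: (B, 3, H, W)
        f = self.pool(self.extractor(x)).flatten(1)        # (B, feat_dim)
        return self.face_head(f)                           # per-class logits
```

Additional heads (e.g. a detection unit) could be attached to the same extracted features in the same way.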
In some embodiments, the feature extraction module includes a first convolution module and a second convolution module, and step A1 includes:
A11. performing local feature extraction on the image to be recognized based on the first convolution module to obtain a local feature map;
A12. performing global feature extraction on the local feature map based on the second convolution module according to a weight matrix of the local feature map to obtain a global feature map.
Optionally, the convolutional neural network model may be constructed from an existing convolutional neural network, with the shallow convolutional layers serving as the first convolution module and the deep convolutional layers or self-attention replaced by the second convolution module. The first convolution module applies ordinary convolution to the image to be recognized to extract its local features and outputs a local feature map, which serves as the input of the second convolution module. The second convolution module, by means of dynamic convolution, obtains the weight matrix of the local feature map and extracts global features from the local feature map according to that weight matrix, yielding a global feature map. For example, in the convolutional neural network model shown in Fig. 2, the first two layers form the first convolution module with ordinary convolution, the four deeper convolutional layers form the second convolution module, and the second convolution module is connected to the recognition module. The image to be recognized is input to the first convolution module for local feature extraction, the output features are input to the second convolution module for global feature extraction, and the global feature map output by the second convolution module is input to the recognition module for recognition, which outputs the corresponding result.
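As a hedged illustration of the Fig. 2 layout (PyTorch assumed; channel counts, strides, and the factory make_second_module are placeholders, not the patent's exact configuration):

```python
import torch.nn as nn

def build_backbone(make_second_module, channels: int = 64) -> nn.Sequential:
    """Two ordinary conv layers (the first convolution module) followed by four
    dynamic second-convolution blocks, mirroring the Fig. 2 layout (illustrative)."""
    first = nn.Sequential(
        nn.Conv2d(3, channels, 3, stride=2, padding=1),
        nn.BatchNorm2d(channels), nn.ReLU(),
        nn.Conv2d(channels, channels, 3, stride=2, padding=1),
        nn.BatchNorm2d(channels), nn.ReLU(),
    )
    # four deeper layers replaced by the dynamic global-feature module
    second = nn.Sequential(*[make_second_module(channels) for _ in range(4)])
    return nn.Sequential(first, second)
```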
It should be noted that the first convolution module and the second convolution module may also be interleaved, i.e., a first convolution module is connected to a second convolution module whose output is connected to another first convolution module; or the structure shown in Fig. 2 may be stacked multiple times. It suffices that the feature map output by the feature extraction module is the global feature map extracted by the second convolution module (i.e., that the recognition module performs recognition based on the global features extracted by the second convolution module); the specific arrangement of the first convolution module (ordinary convolutional layers) and the second convolution module provided by the embodiments of the present application within the convolutional neural network model is not limited.
In the embodiments of the present application, because the first convolution module of the convolutional neural network model extracts the local features of the image to be recognized, and the resulting local feature map is used as the input of the second convolution module, which performs global feature extraction on the local feature map according to its weight matrix, the obtained global features contain both local and global information, improving the recognition accuracy of the model.
In some embodiments, the second convolution module includes a first branch and a second branch, and step A12 includes:
splitting the local feature map along the channel dimension to obtain a first local feature map and a second local feature map, and inputting the first local feature map and the second local feature map into the first branch and the second branch, respectively.
Optionally, when the second convolution module performs global feature extraction on the input local feature map, the local feature map is first split along the channel dimension (e.g., evenly into two halves) to obtain the first local feature map and the second local feature map; the first local feature map is then input into the first branch and the second local feature map into the second branch, so that global features are extracted from each of them separately.
The first branch and the second branch each perform feature extraction on their respective local feature map according to a weight matrix of that input, yielding a first global feature map and a second global feature map.
Optionally, the first branch generates the weight matrix of the first local feature map from the first local feature map and extracts global features from the first local feature map according to that weight matrix to obtain the first global feature map; the second branch generates the weight matrix of the second local feature map from the second local feature map and extracts global features accordingly to obtain the second global feature map.
The first global feature map and the second global feature map are concatenated along the channel dimension to obtain the global feature map.
Optionally, since the first and second local feature maps are obtained by splitting the local feature map along the channel dimension, after the first and second branches have extracted the features of the first and second local feature maps, the resulting first and second global feature maps are concatenated along the channel dimension to obtain the complete global feature map, so that subsequent processing is based on the complete global feature map of the input image.
In the embodiments of the present application, because the local feature map is split into two halves along the channel dimension and fed into the first and second branches, its channel count is halved; extracting global features from the half-channel local feature maps reduces the computational complexity of global feature extraction, thereby lowering the computing-power requirements on the device and facilitating deployment on devices with limited compute.
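A minimal sketch of this split-process-concatenate flow, assuming PyTorch; branch_h and branch_w are placeholders for the two branch modules described next:

```python
import torch

def split_branch_concat(x, branch_h, branch_w):
    """Split the local feature map evenly along channels, run each half through
    its branch, and concatenate the two outputs back (illustrative)."""
    x1, x2 = torch.chunk(x, 2, dim=1)   # channel count halved per branch
    y1 = branch_h(x1)                   # column-dimension (H) branch
    y2 = branch_w(x2)                   # row-dimension (W) branch
    return torch.cat([y1, y2], dim=1)   # complete global feature map
```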
In some embodiments, when the first branch and the second branch extract global features from the input local feature maps, the process includes:
generating a column-dimension weight matrix from the first local feature map to obtain a first weight matrix, and convolving the first local feature map along the column dimension according to the first weight matrix to obtain the first global feature map.
Specifically, when the first branch performs feature extraction on the first local feature map, it convolves the first local feature map along the column dimension (i.e., the H dimension) to dynamically generate the weight matrix of the first local feature map, obtaining the first weight matrix; it then performs global feature extraction on the first local feature map according to the first weight matrix to obtain the first global feature map.
Optionally, to reduce the computational complexity of the convolutional neural network model, when dynamically generating the column-dimension first weight matrix from the first local feature map, a one-dimensional convolution is applied to the first local feature map along the column dimension, i.e., each row of pixels is convolved along the height direction. For example, with the first local feature map denoted (B, C, H, W), a lightweight one-dimensional convolution module convolves it along the column dimension to generate the first weight matrix of the first local feature map, of shape (B, C, H, 1). Because the first weight matrix is obtained by convolving every row of pixels along the column dimension, it has a global receptive field; therefore, after obtaining the first weight matrix of the first local feature map, circular convolution is performed on the first local feature map along the column dimension according to the first weight matrix to extract its global features, yielding the first global feature map. For example, suppose the height of the first local feature map is 4 and the batch size is 1. Since the first weight matrix is obtained by convolving the first local feature map along the column dimension, the first local feature map can be written as x = (x0, x1, x2, x3) and the first weight matrix (acting as the convolution kernel) as w = (w0, w1, w2, w3); that is, each row of pixels of the first local feature map is treated as a single element, so that the first local feature map and the first weight matrix have the same resolution (e.g., size (C, H, 1)).
Optionally, when the one-dimensional convolution is applied to the first local feature map to obtain the first weight matrix, depthwise separable convolution is used to reduce the number of parameters and the amount of computation. For example, as shown in Fig. 3, the one-dimensional convolution module adopts a lightweight two-layer structure: average pooling is applied to the first local feature map (B, C, H, W) along the column dimension to reduce it to a first local feature map of shape (B, C, H, 1); a depthwise separable convolution with a 1×3 kernel is then applied, followed by a HardSwish activation function and batch normalization, and then another 1×3 depthwise separable convolution, which outputs the first weight matrix. Here, HardSwish is an improvement on the Swish nonlinear activation function: Swish can improve the accuracy of convolutional neural networks to some extent, but its computational cost is high, making it unsuitable for embedded mobile devices, whereas HardSwish can be implemented as a piecewise function to reduce memory accesses, improving accuracy while remaining easy to deploy on embedded mobile devices.
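As a rough sketch of the Fig. 3 module (PyTorch assumed; only the depthwise part of the depthwise separable convolutions is shown, and note that on a (B, C, H, 1) tensor a length-3 kernel has to run along the H axis, so the code uses kernel_size=(3, 1) for what the text calls a 1×3 kernel):

```python
import torch.nn as nn

class ColumnWeightGenerator(nn.Module):
    """Dynamically generates a (B, C, H, 1) first weight matrix from a
    (B, C, H, W) local feature map (illustrative sketch of the Fig. 3 module)."""
    def __init__(self, channels: int):
        super().__init__()
        # length-3 depthwise kernels along H
        self.dw1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), groups=channels)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.Hardswish()           # cheap piecewise variant of Swish
        self.dw2 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), groups=channels)

    def forward(self, x):                   # x: (B, C, H, W)
        w = x.mean(dim=3, keepdim=True)     # average over the W axis -> (B, C, H, 1)
        w = self.act(self.bn(self.dw1(w)))
        return self.dw2(w)                  # first weight matrix (B, C, H, 1)
```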
A row-dimension weight matrix is generated from the local feature map of the second branch to obtain a second weight matrix, and the second local feature map is convolved along the row dimension according to the second weight matrix to obtain the second global feature map.
Optionally, when the second branch performs feature extraction on the second local feature map, it dynamically generates the weight matrix of the second local feature map from the second local feature map itself, obtaining the second weight matrix; it then performs convolution with the second weight matrix and the second local feature map to obtain the second global feature map.
Optionally, when generating the second weight matrix from the second local feature map, a one-dimensional convolution is applied to the second local feature map along the row dimension (i.e., the horizontal dimension), i.e., each column of pixels is convolved along the width direction. For example, with the second local feature map denoted (B, C, H, W), a lightweight one-dimensional convolution module convolves it along the W dimension to generate the second weight matrix of the second local feature map, of shape (B, C, 1, W). After obtaining the second weight matrix, circular convolution is performed on the second local feature map along the row dimension according to the second weight matrix to extract its global features, yielding the second global feature map.
Optionally, as in the first branch, depthwise separable convolution may be used when applying the one-dimensional convolution to the second local feature map to obtain the second weight matrix, reducing the number of parameters and the amount of computation. For example, the one-dimensional convolution module adopts the lightweight structure shown in the figure, except that the average pooling is performed along the row dimension, yielding a second weight matrix of shape (B, C, 1, W), and the 1×3 depthwise separable convolutions are replaced with 3×1 depthwise separable convolutions.
In the embodiments of the present application, one-dimensional weight matrices are used to perform circular convolution on the local feature maps along different dimensions, producing global features in different dimensions, which are then concatenated into global features covering both the width and height directions; this reduces the computation required to extract the global features of the local feature map. Meanwhile, the weight matrices of the first and second local feature maps are each extracted dynamically, and global features are then extracted from the corresponding local feature map according to its own weight matrix, so the convolutional neural network model adapts well to images of different scales, improving its image recognition performance.
In some embodiments, the second convolution module further includes a position embedding module, and before splitting the local feature map along the channel dimension, the method further includes:
performing feature extraction on the local feature map through the position embedding module according to the weight matrix of the local feature map to obtain a position feature map, and adding the position feature map to the local feature map to obtain a local feature map containing position features.
Optionally, before global features are extracted from the local feature map according to its weight matrix, the position embedding module convolves the local feature map, dynamically generates the weight matrix of the local feature map, and extracts local features from it according to that weight matrix to generate a position feature map; the position feature map is then added to the local feature map, yielding a local feature map with embedded position features.
Optionally, the position embedding module is a two-dimensional convolution module that convolves the input local feature map to generate a two-dimensional position feature map, i.e., the size of the position feature map matches the resolution of the input local feature map, so that the position feature map can be directly added to the local feature map to obtain a local feature map containing position features. For example, the position embedding module adopts a lightweight two-layer convolutional structure, i.e., a simple "convolution + normalization + activation + convolution" arrangement; since a two-dimensional position feature map needs to be generated, the convolutional layers may use 3×3 depthwise separable convolutions to convolve the local feature map and produce the position feature map.
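A minimal sketch of such a two-layer position embedding block, assuming PyTorch; the HardSwish activation is carried over from the Fig. 3 module as an assumption, since the text does not name the activation here:

```python
import torch.nn as nn

class PositionEmbedding(nn.Module):
    """'Conv + normalization + activation + conv' block that produces a 2-D
    position feature map at the input resolution and adds it back (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # 3x3 depthwise
            nn.BatchNorm2d(channels),
            nn.Hardswish(),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # 3x3 depthwise
        )

    def forward(self, x):           # x: (B, C, H, W) local feature map
        return x + self.block(x)    # local feature map with embedded position features
```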
In the embodiments of the present application, since the position features of an image can strengthen the ability to describe and distinguish image content, before global features are extracted from the local feature map, the position features of the local feature map are extracted according to its weight matrix and embedded into the local feature map, so that the local feature map contains position features; subsequent image recognition based on a global feature map containing position features can therefore achieve higher accuracy.
In some embodiments, the network structure of the second convolution module is shown in Fig. 4 and may include the position embedding module, the first branch, and the second branch. After the position embedding module embeds the position features into the local feature map, the position-embedded local feature map is split to obtain the first and second local feature maps, which are input into the first branch and the second branch respectively. The two branches, based on the column dimension and the row dimension respectively, perform global feature extraction on the first and second local feature maps according to their weight matrices to obtain the first and second global feature maps, which are then concatenated to obtain the global feature map. In this case, when the first branch performs circular convolution on the first local feature map along the column dimension according to the first weight matrix to extract the global features of the first local feature map, the circular convolution operation can be expressed as:

$$y_i = \sum_{k=0}^{H-1} w_H^{(k)} \, x^{P}_{((k+i))_H}$$

$$w_H = f(w, H)$$

$$x^{P} = x + f(pe, H)$$

where $y_i$ denotes the $i$-th element of the output first global feature; $H$ is the number of rows (i.e., the height) of the first local feature map; $w_H^{(k)}$ denotes the $k$-th weight (convolution kernel entry) of the first weight matrix, and $w_H$ means that if the length of the first weight matrix is not $H$, it is interpolated to length $H$; $((\cdot))_H$ denotes the $H$-point periodic extension of the first local feature map; $x^{P}_{((k+i))_H}$ denotes taking the $(k+i)$-th element of the input $x$ (the first local feature map); $x^{P}$ denotes the first local feature map with the position features added; and $pe$ (written $p$ in places) denotes the position feature corresponding to the first local feature map.
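Read literally, this is a periodic (circular) convolution of the position-embedded feature map with the dynamically generated kernel. The following is a minimal sketch, assuming PyTorch and using F.interpolate as a stand-in for the unspecified resizing function f(·, H); it illustrates the formula rather than reproducing the patent's implementation:

```python
import torch
import torch.nn.functional as F

def circular_conv_columns(x_p, w):
    """Circular convolution along the column (H) dimension.
    x_p: (B, C, H, W) local feature map with position features added;
    w:   (B, C, K, 1) dynamic first weight matrix, interpolated to length H if K != H.
    Returns y with y_i = sum_k w_k * x_p[(k + i) mod H] (illustrative)."""
    B, C, H, W = x_p.shape
    if w.shape[2] != H:                                # w_H = f(w, H)
        w = F.interpolate(w.squeeze(-1), size=H, mode="linear",
                          align_corners=False).unsqueeze(-1)
    r = torch.arange(H, device=x_p.device)
    idx = (r.unsqueeze(1) + r.unsqueeze(0)) % H        # idx[i, k] = (k + i) mod H
    x_ext = x_p[:, :, idx, :]                          # (B, C, H, H, W): H-point periodic extension
    y = (x_ext * w.reshape(B, C, 1, H, 1)).sum(dim=3)  # y_i = sum over k
    return y                                           # (B, C, H, W)
```

Note that this sketch materializes an O(H²) intermediate tensor for clarity; since circular convolution is pointwise multiplication in the Fourier domain, an FFT-based formulation would scale better in practice.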
Corresponding to the above image recognition method based on a convolutional neural network model, Fig. 5 shows a schematic flowchart of a convolutional neural network model training method provided by an embodiment of the present application, detailed as follows:
A constructed convolutional neural network is obtained, and sample images are input into the convolutional neural network for training until the convolutional neural network meets a preset requirement, yielding the convolutional neural network model.
Optionally, the convolutional neural network is constructed according to user needs (e.g., for a face recognition task the network needs a classification head to classify face images); that is, the network structure of the convolutional neural network is set according to user needs (e.g., it may be built on existing architectures such as VGGNet or AlexNet), and the constructed network is initialized. For example, to alleviate the vanishing- and exploding-gradient problems that arise when a convolutional neural network is very deep, the network may be built on the ResNet18 residual architecture, which comprises 17 convolutional layers and 1 fully connected layer.
Specifically, the pre-constructed convolutional neural network is obtained, and the corresponding sample images are used as its input for training; when the network meets the preset requirement (e.g., its recognition accuracy reaches a preset threshold such as 0.988), training stops and the trained convolutional neural network model is obtained. During training, the convolutional neural network performs global feature extraction on each input image according to the weight matrix of that input image, the input images here being the sample images.
Optionally, before the convolutional neural network extracts global features from an input image according to its weight matrix, the weight matrix of the input image is generated by convolution, so that global features are then extracted according to that weight matrix. It should be pointed out that, because the network uses dynamic weights rather than a fixed weight matrix, when the network parameters are updated according to the loss function during training, updating the weight matrix means updating the parameters of the convolution module that generates the weight matrix of the input image.
Optionally, the sample images are labeled images corresponding to the user's image recognition task, so they can be used for training directly without further annotation. For example, when training a convolutional neural network model for face recognition, the CelebA face-attribute dataset may be used as the training set; CelebA contains 202,599 face images of 10,177 identities, each annotated with features, and can be used as the network input for training. When training with sample images, some may serve as the training set while others serve as the validation and test sets, so that the convolutional neural network can be tuned to obtain a good model.
Optionally, when sample images are input for training, a single training step may use one image or multiple images, e.g., 100. When images are input in batches, the number of input images is denoted by the batch size (Batch Size); correspondingly, the shape of the sample-image features extracted by the network can be expressed in the four-dimensional format (B, H, W, C), where B is the batch size, H the height, W the width, and C the channel count.
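A minimal hedged sketch of such a training loop, assuming PyTorch; the batch size of 100 and the 0.988 accuracy threshold are taken from the examples above, while the optimizer, hyper-parameters, and dataset interface are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, epochs=10, target_acc=0.988, device="cpu"):
    """Train until the preset accuracy requirement is met (illustrative sketch)."""
    loader = DataLoader(train_set, batch_size=100, shuffle=True)  # Batch Size B = 100
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        correct = total = 0
        for images, labels in loader:                # images: (B, C, H, W)
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            loss = loss_fn(logits, labels)
            opt.zero_grad()
            loss.backward()     # gradients also reach the weight-generating conv modules
            opt.step()
            correct += (logits.argmax(1) == labels).sum().item()
            total += labels.numel()
        if total and correct / total >= target_acc:  # preset requirement reached
            break
    return model
```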
In the embodiments of the present application, the convolutional neural network is constructed according to user needs and trained with labeled sample images as its input until it meets the preset requirement, yielding the trained convolutional neural network model. Because the network includes a dynamic convolution module that performs global feature extraction on the input image according to that image's weight matrix, features are extracted from each sample image according to its own corresponding weight matrix during training, and the sample images need not share a single static weight matrix; the network therefore adapts well to sample images of different scales during training, reducing the difficulty of multi-scale training and giving the model good recognition performance on multi-scale images.
Corresponding to the above image recognition method based on a convolutional neural network model or the convolutional neural network model training method, the image recognition method is introduced below with several application scenarios.
(1) Pedestrian detection
Pedestrian detection has long been a hot and difficult topic in computer vision research. The problem it addresses is to find all pedestrians in an image or video frame, including their positions and sizes, generally represented by rectangular boxes. Pedestrian detection can be combined with techniques such as pedestrian tracking and person re-identification, and applied in fields such as autonomous driving and intelligent transportation, so it has strong application value. Since the images processed in pedestrian detection vary in size, the image recognition method based on a convolutional neural network model provided by the present application addresses exactly this problem: the weight matrix of each image to be recognized is obtained dynamically and used to extract its global features for pedestrian detection, improving detection accuracy.
First, the captured image to be recognized is input into the convolutional neural network model (the pedestrian detection model); the first convolution module extracts local features, capturing feature information such as edges, corners, and lines, to obtain a local feature map, which is then used as the input of the second convolution module. Since pedestrian detection detects the pedestrians in an image, and pedestrian features (such as body and head) have strong spatial relationships, the position embedding module convolves the local feature map to obtain its position feature map and adds it to the local feature map so that the latter contains position information. The local feature map is then split along the channel dimension into the first and second local feature maps, which are input into the first and second branches; convolution along the column and row dimensions yields the weight matrices of the image to be recognized, and convolving the corresponding local feature maps with these weight matrices yields the first and second global feature maps, which are concatenated along the channel dimension into the complete global feature map. The recognition module takes the global feature map output by the second convolution module, detects the pedestrian features present in it, and outputs the corresponding detection result.
(2) Object detection
In object detection tasks, the sizes of the images and of the detected objects are often not fixed; for example, detection tasks related to autonomous driving may need to detect both large trucks and animals, and medical lesion detection may need to detect lesions of different sizes. When the detected objects differ greatly in scale, images of different sizes are usually acquired to detect objects of different sizes, and models often struggle to adapt to detections with large size gaps. In this case, the convolutional neural network model provided by the present application can dynamically adjust the weight matrix according to the input image and detect each image to be recognized according to its own weight matrix, achieving good detection performance.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Embodiment 2:
Corresponding to the image recognition method based on a convolutional neural network model described in the above embodiments, Fig. 6 shows a structural block diagram of the image recognition apparatus provided by an embodiment of the present application; for ease of description, only the parts relevant to the embodiments of the present application are shown.
Referring to Fig. 6, the apparatus includes: an input module 61 and a convolutional neural network model 62, where the convolutional neural network model performs global feature extraction on the image to be recognized based on a weight matrix of that image. Here,
the input module 61 is configured to input the image to be recognized into the convolutional neural network model;
the convolutional neural network model 62 is configured to perform feature extraction and recognition on the image in sequence to obtain a recognition result.
In the embodiments of the present application, the image to be recognized is input into the trained convolutional neural network model, which performs feature extraction and recognition in sequence to obtain a recognition result. Because the model extracts global features according to a weight matrix derived from the image during feature extraction, it has both a global receptive field and dynamic weights, so it can extract features well from images of different sizes, thereby improving recognition accuracy.
In some embodiments, the image recognition apparatus further includes:
an image acquisition module, configured to acquire the image to be recognized.
In some embodiments, the convolutional neural network model 62 includes:
a feature extraction unit, configured to perform feature extraction on the image to be recognized;
a recognition unit, configured to recognize the extracted features to obtain a recognition result.
In some embodiments, the feature extraction unit includes:
a first convolution unit, configured to perform local feature extraction on the image to be recognized to obtain a local feature map;
a second convolution unit, configured to perform global feature extraction on the local feature map according to the weight matrix of the local feature map to obtain a global feature map.
In some embodiments, the second convolution unit includes:
a splitting unit, configured to split the local feature map along the channel dimension to obtain a first local feature map and a second local feature map;
a first branch unit, configured to perform global feature extraction on the first local feature map according to the weight matrix of the first local feature map to obtain a first global feature map;
a second branch unit, configured to perform global feature extraction on the second local feature map according to the weight matrix of the second local feature map to obtain a second global feature map;
a concatenation unit, configured to concatenate the first global feature map and the second global feature map along the channel dimension to obtain the global feature map.
In some embodiments, the first branch unit includes:
a first weight unit, configured to convolve the first local feature map to generate a first weight matrix;
a global feature extraction unit, configured to perform global feature extraction on the first local feature map according to the first weight matrix to obtain the first global feature map;
and the second branch unit includes:
a second weight unit, configured to convolve the second local feature map to generate a second weight matrix;
a global feature extraction unit, configured to perform global feature extraction on the second local feature map according to the second weight matrix to obtain the second global feature map.
In some embodiments, the second convolution unit further includes:
a position embedding unit, configured to perform feature extraction on the local feature map according to the weight matrix of the local feature map to obtain a position feature map, and to add the position feature map to the local feature map to obtain a local feature map containing position features.
Corresponding to the convolutional neural network model training method described in the above embodiments, Fig. 7 shows a structural block diagram of the convolutional neural network model training apparatus provided by an embodiment of the present application. Referring to Fig. 7, the apparatus includes:
a training module 71, configured to obtain a constructed convolutional neural network, and to input sample images into the convolutional neural network for training until the convolutional neural network meets the preset requirement, thereby obtaining a convolutional neural network model, where the convolutional neural network performs global feature extraction on each sample image based on the weight matrix of that sample image.
In the embodiments of the present application, the convolutional neural network is constructed according to user needs and trained with labeled sample images as its input until it meets the preset requirement, yielding the trained convolutional neural network model. Because the network includes a dynamic convolution module that performs global feature extraction on the input image according to that image's weight matrix, features are extracted from each sample image according to its own corresponding weight matrix during training, and the sample images need not share a single static weight matrix; the network therefore adapts well to sample images of different scales during training, reducing the difficulty of multi-scale training and giving the model good recognition performance on multi-scale images.
It should be noted that, since the information exchange between and the execution processes of the above apparatuses/units are based on the same concept as the method embodiments of the present application, their specific functions and the technical effects they bring can be found in the method embodiment section and are not repeated here.
Embodiment 3:
Fig. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in Fig. 8, the terminal device 8 of this embodiment includes: at least one processor 80 (only one is shown in Fig. 8), a memory 81, and a computer program 82 stored in the memory 81 and executable on the at least one processor 80; when the processor 80 executes the computer program 82, the steps in any of the above method embodiments are implemented.
Exemplarily, the computer program 82 may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program 82 in the terminal device 8. For example, the computer program 82 may be divided into an input module 61 and a convolutional neural network model 62, the convolutional neural network model performing global feature extraction on the image to be recognized based on a weight matrix of that image, with the specific functions of each module as follows:
the input module 61 is configured to input the image to be recognized into the convolutional neural network model;
the convolutional neural network model 62 is configured to perform feature extraction and recognition on the image in sequence to obtain a recognition result.
Alternatively, the computer program 82 may be divided into a training module 71, with the specific function of the module as follows:
the training module 71 is configured to obtain a constructed convolutional neural network, and to input sample images into the convolutional neural network for training until the convolutional neural network meets the preset requirement, thereby obtaining a convolutional neural network model, where the convolutional neural network performs global feature extraction on each sample image based on the weight matrix of that sample image.
The terminal device 8 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 80 and the memory 81. Those skilled in the art will understand that Fig. 8 is merely an example of the terminal device 8 and does not constitute a limitation on it; the terminal device may include more or fewer components than shown, or combine certain components, or have different components, and may, for example, also include input/output devices, network access devices, etc.
The so-called processor 80 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 may, in some embodiments, be an internal storage unit of the terminal device 8, such as a hard disk or main memory of the terminal device 8. In other embodiments, the memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used to store the operating system, application programs, the boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 81 may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division of the above functional units and modules is used only as an example; in practical applications, the above functions may be allocated to different functional units or modules as needed, i.e., the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other and are not used to limit the scope of protection of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
An embodiment of the present application also provides a network device, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor; when the processor executes the computer program, the steps in any of the above method embodiments are implemented.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the steps in each of the above method embodiments are implemented.
An embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in each of the above method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application implements all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the steps of each of the above method embodiments can be implemented. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may at least include: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, e.g., a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, under legislation and patent practice, computer-readable media may not be electrical carrier signals or telecommunications signals.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed or recorded in one embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods for each particular application to implement the described functions, but such implementation should not be considered beyond the scope of the present application.
In the embodiments provided by the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network-device embodiments described above are merely illustrative; for example, the division of the modules or units is only a logical functional division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connections shown or discussed may be indirect coupling or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent substitutions for some of the technical features therein; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the scope of protection of the present application.

Claims (10)

  1. An image recognition method based on a convolutional neural network model, characterized in that the convolutional neural network model performs global feature extraction on an image to be recognized based on a weight matrix of the image to be recognized;
    the image recognition method comprising:
    inputting the image to be recognized into the trained convolutional neural network model, and performing feature extraction and recognition on the image to be recognized in sequence through the convolutional neural network model to obtain a recognition result.
  2. The image recognition method according to claim 1, characterized in that the convolutional neural network model comprises a feature extraction module and a recognition module, and the performing feature extraction and recognition on the image to be recognized in sequence through the convolutional neural network model to obtain a recognition result comprises:
    performing feature extraction on the image to be recognized through the feature extraction module;
    recognizing the extracted features based on the recognition module to obtain a recognition result.
  3. The image recognition method according to claim 2, characterized in that the feature extraction module comprises a first convolution module and a second convolution module, the second convolution module performing global feature extraction on the image to be recognized based on the weight matrix of the image to be recognized, and the performing feature extraction on the image to be recognized through the feature extraction module comprises:
    performing local feature extraction on the image to be recognized based on the first convolution module to obtain a local feature map;
    performing global feature extraction on the local feature map based on the second convolution module according to a weight matrix of the local feature map to obtain a global feature map.
  4. The image recognition method according to claim 3, characterized in that the second convolution module comprises a first branch and a second branch, and the performing global feature extraction on the local feature map based on the second convolution module according to the weight matrix of the local feature map to obtain a global feature map comprises:
    splitting the local feature map along the channel dimension to obtain a first local feature map and a second local feature map, and inputting the first local feature map and the second local feature map into the first branch and the second branch, respectively;
    the first branch and the second branch each performing feature extraction on the corresponding local feature map according to a weight matrix of the input local feature map to obtain a first global feature map and a second global feature map;
    concatenating the first global feature map and the second global feature map along the channel dimension to obtain the global feature map.
  5. The image recognition method according to claim 4, characterized in that the first branch and the second branch each performing feature extraction on the corresponding local feature map according to the weight matrix of the input local feature map to obtain a first global feature map and a second global feature map comprises:
    generating a column-dimension weight matrix from the first local feature map to obtain a first weight matrix, and convolving the first local feature map along the column dimension according to the first weight matrix to obtain the first global feature map;
    generating a row-dimension weight matrix from the input of the second branch to obtain a second weight matrix, and convolving the second local feature map along the row dimension according to the second weight matrix to obtain the second global feature map.
  6. The image recognition method according to claim 4, characterized in that the second convolution module further comprises a position embedding module, and before the splitting of the local feature map along the channel dimension, the method further comprises:
    performing feature extraction on the local feature map through the position embedding module according to the weight matrix of the local feature map to obtain a position feature map, and adding the position feature map to the local feature map to obtain a local feature map containing position features.
  7. A convolutional neural network model training method, characterized by comprising:
    obtaining a constructed convolutional neural network, and inputting sample images into the convolutional neural network for training until the convolutional neural network meets a preset requirement, thereby obtaining a convolutional neural network model;
    wherein the convolutional neural network performs global feature extraction on each sample image based on a weight matrix of the sample image.
  8. An image recognition apparatus, characterized by comprising:
    an input module and a trained convolutional neural network model, the convolutional neural network model performing global feature extraction on an image to be recognized based on a weight matrix of the image to be recognized;
    the input module being configured to input the image to be recognized into the convolutional neural network model;
    the convolutional neural network model being configured to perform feature extraction and recognition on the image to be recognized in sequence to obtain a recognition result.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, the image recognition method according to any one of claims 1 to 6 or the convolutional neural network model training method according to claim 7 is implemented.
  10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the image recognition method according to any one of claims 1 to 6 or the convolutional neural network model training method according to claim 7 is implemented.