WO2018120740A1 - Picture classification method, device and robot - Google Patents

Picture classification method, device and robot Download PDF

Info

Publication number
WO2018120740A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
convolution
convolution kernel
error value
neural network
Prior art date
Application number
PCT/CN2017/092044
Other languages
French (fr)
Chinese (zh)
Inventor
刘若鹏
徐磊
欧阳一村
Original Assignee
深圳光启合众科技有限公司
深圳光启创新技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳光启合众科技有限公司, 深圳光启创新技术有限公司
Publication of WO2018120740A1 publication Critical patent/WO2018120740A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology

Definitions

  • the present invention relates to the field of image processing, and in particular to a picture classification method and apparatus, and a robot.
  • the fully connected layer is a very important component of neural networks.
  • in deep belief networks and automatic encoders, all layers of the network are fully connected layers; convolutional neural networks add several fully connected layers to obtain better classification accuracy.
  • the main problem of the fully connected layer is that too many parameters lead to high performance requirements of the network on the terminal.
  • the embodiments of the present invention provide a picture classification method and apparatus, and a robot, to solve at least the technical problem that the fully connected layer in the prior art has too many parameters, resulting in high performance requirements of the network on the terminal.
  • a picture classification method is provided, including: inputting a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network includes at least one convolution layer and one pooling layer; performing a convolution operation on a first matrix according to a preset first convolution kernel to obtain a first vector, wherein the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network; performing a convolution operation on the first vector according to a preset second convolution kernel to obtain a second vector, wherein the second vector is a one-dimensional vector; and classifying the target picture according to the second vector.
  • performing a convolution operation on the first matrix according to the preset first convolution kernel to obtain the first vector includes: performing a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix; and rearranging all elements of the second matrix in a predetermined order to obtain the first vector.
  • before performing a convolution operation on the first matrix according to the preset first convolution kernel, the method further includes: acquiring a training sample, wherein the training sample includes a plurality of pictures pre-divided into categories, the category being used to characterize the kind of thing indicated by the training sample; and training the convolutional neural network according to the training sample to obtain the first convolution kernel.
  • training the convolutional neural network according to the training sample to obtain the first convolution kernel includes: inputting the training sample into a convolution layer of the convolutional neural network; performing a convolution operation on the first matrix according to a convolution kernel in an initial state to obtain a first vector; performing a convolution operation on the first vector according to the second convolution kernel to obtain a second vector; determining a classification result of the training sample according to the value of a target element in the second vector, wherein the value of the target element indicates the probability that the category of the second vector is the same as the category corresponding to the target element, and the target element is any element of the second vector; comparing the classification result with the category of each picture to obtain a classification error value; determining whether the classification error value is greater than a preset error value; if the classification error value is greater than the preset error value, adjusting the weight values of the convolution kernel in the initial state until the classification error value is less than or equal to the preset error value; and if the classification error value is less than or equal to the preset error value, ending the training and using the current convolution kernel as the first convolution kernel.
  • the first convolution kernel satisfies the following formula: n = n_oc × (m − n_conv) / stride, where m is the input vector dimension in the convolutional neural network, n is the output vector dimension, stride is the step size of the first convolution kernel, n_oc is the number of output channels, and n_conv is the first convolution kernel size.
  • a picture classification apparatus is provided, comprising: an input unit, configured to input a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network includes at least one convolution layer and one pooling layer; a first operation unit, configured to perform a convolution operation on the first matrix according to a preset first convolution kernel to obtain a first vector, where the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network; a second operation unit, configured to perform a convolution operation on the first vector according to a preset second convolution kernel to obtain a second vector, where the second vector is a one-dimensional vector; and a classification unit, configured to classify the target picture according to the second vector.
  • the first operation unit includes: a first operation subunit, configured to perform a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix; and arrange a subunit for All elements of the second matrix are rearranged in a predetermined order to obtain the first vector.
  • the device further includes: an acquiring unit, configured to acquire a training sample before the first operation unit performs a convolution operation on the first matrix according to the preset first convolution kernel, wherein the training sample includes a plurality of pictures pre-divided into categories, the category being used to represent the kind of thing indicated by the training sample; and a training unit, configured to train the convolutional neural network according to the training sample to obtain the first convolution kernel.
  • the acquiring unit is configured to acquire a training sample before the first operation unit performs a convolution operation on the first matrix according to the preset first convolution kernel, wherein the training sample includes a plurality of pictures pre-divided into categories, the category being used to represent the kind of thing indicated by the training sample.
  • the training unit is configured to train the convolutional neural network according to the training sample to obtain the first convolution kernel.
  • the training unit includes: an input subunit, configured to input the training sample into the convolution layer of the convolutional neural network; a second operation subunit, configured to perform a convolution operation on the first matrix according to a convolution kernel in an initial state to obtain a first vector; and a third operation subunit, configured to perform a convolution operation on the first vector according to the second convolution kernel to obtain a second vector.
  • the training unit further includes: a first determining subunit, configured to determine a classification result of the training sample according to the value of the target element in the second vector, where the value of the target element indicates the probability that the category of the second vector is the same as the category corresponding to the target element, and the target element is any element of the second vector; a comparison subunit, configured to compare the classification result with the category of each picture to obtain a classification error value; a judging subunit, configured to determine whether the classification error value is greater than a preset error value; an adjustment subunit, configured to adjust the weight values of the convolution kernel in the initial state, if the classification error value is greater than the preset error value, until the classification error value is less than or equal to the preset error value; and a second determining subunit, configured to end the training and use the current convolution kernel as the first convolution kernel if the classification error value is less than or equal to the preset error value.
  • the first convolution kernel satisfies the following formula: n = n_oc × (m − n_conv) / stride, where m is the input vector dimension in the convolutional neural network, n is the output vector dimension, stride is the step size of the first convolution kernel, n_oc is the number of output channels, and n_conv is the first convolution kernel size.
  • a robot comprising: the above picture classification device.
  • the target picture to be classified is input into the convolution layer of the convolutional neural network; the last pooling layer of the convolutional neural network outputs the first matrix; the first matrix is convolved according to the first convolution kernel to obtain the first vector; and a convolution operation is performed on the first vector according to the second convolution kernel to obtain the second vector.
  • the value of each element of the second vector indicates the probability that the target picture belongs to a certain category, so the target picture can be classified by the second vector. This reduces the number of parameters relative to the fully connected layer and lowers the performance required of the terminal, so that the network can be deployed on a mobile phone or other embedded system.
  • this achieves the technical effect of reducing the network's requirements on terminal performance, thereby solving the technical problem in the prior art that too many fully connected layer parameters result in high performance requirements of the network on the terminal.
  • FIG. 1 is a flowchart of a picture classification method according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a picture classification device according to an embodiment of the present invention.
  • an embodiment of a picture classification method is provided. It should be noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order than the one described herein.
  • FIG. 1 is a flowchart of a picture classification method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S102: input a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network includes at least one convolution layer and one pooling layer.
  • Step S104: perform a convolution operation on the first matrix according to the preset first convolution kernel to obtain a first vector, where the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network.
  • Step S106: perform a convolution operation on the first vector according to the preset second convolution kernel to obtain a second vector, where the second vector is a one-dimensional vector.
  • Step S108: classify the target picture according to the second vector.
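Steps S102 to S108 can be sketched in plain Python. The shapes below are illustrative assumptions, not values from the patent (a 6*6 first matrix, a 3*3 first kernel, a length-4 second kernel with stride 4), and the trained kernel weights are replaced by random numbers:

```python
import random

def conv2d_valid(x, k):
    """Valid 2-D convolution (no padding, stride 1)."""
    kh, kw = len(k), len(k[0])
    h, w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
             for j in range(w)] for i in range(h)]

def conv1d_valid(v, k, stride=1):
    """Valid 1-D convolution, used here in place of a fully connected layer."""
    n = (len(v) - len(k)) // stride + 1
    return [sum(v[i * stride + t] * k[t] for t in range(len(k))) for i in range(n)]

random.seed(0)
# Assumed shapes: the last pooling layer outputs a 6x6 "first matrix";
# both kernels are random stand-ins for trained weights.
first_matrix = [[random.gauss(0, 1) for _ in range(6)] for _ in range(6)]
first_kernel = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]

second_matrix = conv2d_valid(first_matrix, first_kernel)   # step S104, part 1
first_vector = [v for row in second_matrix for v in row]   # rearrange into 1-D
second_kernel = [random.gauss(0, 1) for _ in range(4)]
second_vector = conv1d_valid(first_vector, second_kernel, stride=4)  # step S106
predicted_class = second_vector.index(max(second_vector))  # step S108
```

With these assumed sizes the second matrix is 4*4, the first vector has 16 elements, and the second vector has 4 elements, one per candidate category.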
  • Image classification takes a picture as input and outputs the category that the picture belongs to (dog, cat, boat, bird), or the category it most likely belongs to.
  • each number in the input array is between 0 and 255, representing the pixel value at that point. The network returns the possible classification probabilities for this array (for example, dog 0.01, cat 0.04, boat 0.94, bird 0.02).
  • the first layer in a convolutional neural network is always a convolutional layer.
  • the input to the convolutional layer is an array full of pixel values, for example a 28*28*3 array (the 3 corresponding to the RGB channels).
  • the convolutional layer can be pictured as a beam of light that shines on a picture. This beam is called a filter (convolution kernel), and the place where the beam shines is called the receptive field. Assume the range illuminated by this beam is a 5*5 square area, and let the beam sweep from left to right and from top to bottom across every area of the picture. When all the moves are complete, a 24*24 array is obtained for each filter. Call this array a feature image.
  • this filter is an array of numeric type (the numbers inside are weight values).
  • the depth of the filter is the same as the depth of the input, so the dimension of the filter is 5*5*3. Using one 5*5*3 filter yields one 24*24 feature image; using more filters yields more feature images.
  • at each position, the weight values in the filter are multiplied by the corresponding pixel values in the real picture, and all the results are summed to obtain a single value. This process is repeated to scan the entire input image (the filter moves one unit to the right at each step, and down one row at the end of each row, and so on), with each step producing one value.
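The feature-image size quoted above (a 5*5 filter over a 28*28 input giving 24*24) follows from the standard size formula for a convolution without padding, sketched here:

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Spatial size, along one axis, of a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1

size = conv_output_size(28, 5)  # a 5*5 filter over a 28*28 input -> 24*24
```

The same formula applies per axis, so a larger stride shrinks the feature image proportionally.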
  • the pooling layer is usually used after the convolutional layer.
  • the role of the pooling layer is to simplify the information output in the convolutional layer, reduce the data dimension, reduce the computational overhead, and control the overfitting.
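As a concrete illustration of how pooling simplifies the convolutional layer's output, here is a minimal 2*2 max-pooling sketch; max pooling is one common choice, and the window size here is an assumption, since the patent does not specify the pooling type:

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the largest value in each window,
    halving each spatial dimension (and the downstream computation)."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]) - 1, 2)]
            for i in range(0, len(x) - 1, 2)]

pooled = max_pool_2x2([[1, 3, 2, 0],
                       [4, 2, 1, 5],
                       [0, 1, 3, 2],
                       [2, 6, 0, 1]])
# pooled == [[4, 5], [6, 3]]
```

A 4*4 feature image shrinks to 2*2, keeping only the strongest response per window.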
  • the output of the last pooling layer is connected to a plurality of fully connected layers, and the parameters of the plurality of fully connected layers are very large.
  • the fully connected layer maps one vector space to another vector space. If the rank of the mapping matrix W equals the input vector dimension, no information is lost.
  • the convolution operation completes the same mapping, and in the manner generally used by convolutional neural networks, the rank of the mapping matrix W generated by the convolution operation is generally min{m,n}, where m is the input vector dimension and n is the output vector dimension. From a vector-space perspective, this means the information loss of the convolution operation is related only to the output vector dimension.
  • the number of parameters required for the fully connected layer is m*n
  • the number of parameters required for the convolutional layer is n_oc*n_conv, where n_oc is the number of output channels and n_conv is the convolution kernel size. These parameters must also satisfy the equation n = n_oc × (m − n_conv) / stride, where stride is the distance the convolution kernel moves each step.
  • substituting this constraint, the number of parameters required for the convolution is n_oc*n_conv = m*n_oc − n*stride, which is less than the m*n parameters of the fully connected layer.
  • n_oc is generally only a small fraction of n.
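Under this reading of the constraint (n = n_oc × (m − n_conv) / stride), the two parameter counts can be compared directly. The sizes below are made-up examples, not figures from the patent:

```python
def fc_params(m, n):
    """Parameter count of a fully connected layer: one weight per input-output pair."""
    return m * n

def conv1d_params(m, n, n_oc, stride):
    """Parameter count n_oc * n_conv of the replacing 1-D convolution, with
    n_conv fixed by the size relation n = n_oc * (m - n_conv) / stride."""
    n_conv = m - n * stride // n_oc
    return n_oc * n_conv  # algebraically equal to m*n_oc - n*stride

# Made-up sizes: a 1024-dim pooled vector mapped to a 64-dim output.
full = fc_params(1024, 64)                         # 65536 weights
conv = conv1d_params(1024, 64, n_oc=4, stride=16)  # 3072 weights
```

Here the convolution needs roughly 5% of the fully connected layer's weights, illustrating the claimed parameter reduction.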
  • Han et al. used parametric pruning to reduce the number of fully connected layer parameters.
  • the fully connected layer parameters can be compressed to 20%, which means that in the fully connected parameter matrix, there are only 20% non-zero elements; This experiment also demonstrates the rationality of using one-dimensional convolution.
  • the output of the last pooling layer is convolved with the first convolution kernel, and a further convolution operation with the second convolution kernel is used to obtain the second vector.
  • the value of each element of the second vector indicates the probability that the target picture belongs to a certain category, i.e. it represents the possible classification probability of the picture.
  • for example, if the second vector is [0.01, 0.04, 0.94, 0.02], a higher value indicates that the extracted feature images are closer to that category.
  • 0.94 represents a 94% probability that the picture is a boat: the picture produces a high activation with the filter, capturing many high-level features such as sails and oars.
  • 0.02 means the probability of the picture being a bird is 2%: the picture produces very low activation with the filter, and high-level features such as wings and beaks are not captured.
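Reading off the classification from the second vector amounts to taking its largest element. Using the example values and category names from the text:

```python
second_vector = [0.01, 0.04, 0.94, 0.02]   # example values from the text
categories = ["dog", "cat", "boat", "bird"]
# The index of the largest value names the predicted category.
best = categories[second_vector.index(max(second_vector))]  # "boat"
```

The 0.94 entry wins, so the picture is classified as a boat.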
  • the target picture can thus be classified by the second vector; the number of parameters relative to the fully connected layer is reduced, the performance required of the terminal is reduced, and the network can be deployed on a mobile phone or other embedded system, thereby solving the technical problem in the prior art that too many fully connected layer parameters impose high performance requirements on the terminal, and achieving the technical effect of reducing the deployed network's requirements on terminal performance.
  • performing a convolution operation on the first matrix according to the preset first convolution kernel to obtain the first vector includes: performing a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix; and rearranging all elements of the second matrix in a preset order to obtain the first vector.
  • compressed sensing is a very popular field because of its good data-recovery capability; from the perspective of coding, it is a new lossy compression coding method.
  • the idea of compressed sensing is to use a common encoding method and a special decoding method. In theory, to compress a vector x obeying a distribution p(x), one possible compressed-sensing encoding is to randomly sample this vector; decoding then uses the L1 norm as the cost function to restore the original vector. It can be proved that, if the distribution meets certain conditions, the error between the recovered vector and the original vector can be small.
  • the method further includes: acquiring the training samples, wherein the training samples include a plurality of pictures pre-divided into categories, the categories being used to characterize the kind of thing indicated by the training sample; and training the convolutional neural network according to the training samples to obtain the first convolution kernel.
  • the first convolution kernel is trained.
  • the specific training process may be as follows: input the training sample into the convolution layer of the convolutional neural network; convolve the first matrix according to the convolution kernel in its initial state to obtain the first vector; perform a convolution operation on the first vector according to the second convolution kernel to obtain a second vector; determine the classification result of the training sample according to the value of the target element in the second vector, wherein the value of the target element indicates the probability that the category of the second vector is the same as the category corresponding to the target element, and the target element is any element of the second vector; compare the classification result with the category of each picture to obtain a classification error value; determine whether the classification error value is greater than a preset error value; if the classification error value is greater than the preset error value, adjust the weight values of the convolution kernel in the initial state until the classification error value is less than or equal to the preset error value; if the classification error value is less than or equal to the preset error value, end the training and use the current convolution kernel as the first convolution kernel.
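The adjust-until-below-threshold loop described above can be sketched with a toy one-weight "network". Everything here is a hypothetical stand-in: the samples, the 0.5 decision threshold, and the crude fixed-step weight adjustment replace the real forward pass and backpropagation:

```python
def classify(w, x):
    """Toy stand-in for the network's forward pass: score, then threshold."""
    return 1 if w * x > 0.5 else 0

def error_rate(w, samples):
    """Classification error value: fraction of misclassified samples."""
    return sum(classify(w, x) != label for x, label in samples) / len(samples)

# Toy "training samples": inputs with pre-divided categories.
samples = [(0.2, 0), (0.4, 0), (0.9, 1), (1.2, 1)]
w = 0.1                  # weight of the convolution kernel in its initial state
preset_error = 0.25      # preset error value
while error_rate(w, samples) > preset_error:
    w += 0.1             # crude weight adjustment (a real net uses gradients)
first_kernel = w         # training ends; the current kernel is the result
```

The loop terminates exactly when the classification error value falls to or below the preset error value, mirroring the stopping rule in the text.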
  • the loss function helps the network update the weight values to find the desired feature image.
  • MSE (mean squared error)
  • the loss value is obtained by substituting the true classification value of the picture and the classification value produced by the network into the mean squared error formula. This loss value may be high when the network has just started training, because the weight values are randomly initialized. The ultimate goal is to make the predicted value as close as possible to the real value; to achieve this, the loss value must be minimized. The smaller the loss value, the closer the prediction. In this process the weight values must be adjusted constantly to find out which weight values reduce the network's loss; the gradient descent algorithm can be used to find these weight values.
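A worked one-weight illustration of this idea, assuming a toy model prediction = w*x and the squared-error loss, whose gradient is d(loss)/dw = 2*x*(w*x − y):

```python
def mse(pred, true):
    """Mean squared error between predicted and true values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

# Toy sample obeying y = 3*x; gradient descent should drive w toward 3.
x, y = 2.0, 6.0
w, lr = 0.0, 0.05        # initial weight and learning rate (assumed values)
for _ in range(100):
    grad = 2 * x * (w * x - y)   # derivative of (w*x - y)^2 with respect to w
    w -= lr * grad               # step against the gradient to reduce the loss
# w is now close to 3.0, and the loss mse([w * x], [y]) is near zero
```

Each step moves the weight against the gradient, so the loss shrinks monotonically here; real networks apply the same update to every weight via backpropagation.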
  • performing a convolution operation on the first vector according to the second convolution kernel, and obtaining the second vector includes: acquiring a preset second convolution kernel, where the second convolution kernel is a one-dimensional vector; The second convolution kernel performs a convolution operation on the first vector to obtain a second vector.
  • the output of the previous layer is reshaped into a one-dimensional vector.
  • as noted above, the fully connected layer parameters can be compressed to 20%, meaning that the fully connected parameter matrix has only 20% non-zero elements. Most of the parameters are 0, indicating the sparsity of the parameters, and the parameter sharing of convolution greatly reduces the number of parameters; this experiment also demonstrates the rationality of using one-dimensional convolution.
  • the commonly used classifier is softmax, and there is also a fully connected layer before the classifier. This fully connected layer differs from the others in that its output dimension is fixed: it must equal the number of categories (the output dimensions of the other fully connected layers are generally chosen freely from experience, without strict requirements).
  • in this process there is the problem that the output dimension after convolution may not match the number of categories, so one fully connected layer is kept here.
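For reference, softmax (the classifier named above) maps the final layer's raw scores to a probability distribution over the categories; a minimal sketch:

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # largest score gets the largest probability
```

The ordering of the scores is preserved, so taking the largest probability gives the same classification as taking the largest raw score.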
  • MNIST is a handwritten digit dataset containing images of the digits 0-9 and their labels.
  • the accuracy rate is the probability that the network correctly recognizes the handwritten digits.
  • a picture classification device is also provided, which can perform the picture classification method described above; conversely, the picture classification method may be implemented by this picture classification device.
  • the apparatus includes an input unit 10, a first arithmetic unit 20, a second arithmetic unit 30, and a sorting unit 40.
  • the input unit 10 is configured to input a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network includes at least one convolution layer and one pooling layer.
  • the first operation unit 20 is configured to perform a convolution operation on the first matrix according to the preset first convolution kernel to obtain a first vector, where the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network.
  • the second operation unit 30 is configured to perform a convolution operation on the first vector according to the second convolution kernel set in advance to obtain a second vector, where the second vector is a one-dimensional vector.
  • the classification unit 40 is configured to classify the target picture according to the second vector.
  • the first operation unit 20 includes: a first operation subunit, and an arrangement subunit.
  • the first operation subunit is configured to perform a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix.
  • the arrangement subunit is configured to rearrange all elements of the second matrix in a preset order to obtain the first vector.
  • the device further includes: an acquiring unit and a training unit.
  • An obtaining unit configured to acquire a training sample before the first operation unit 20 performs a convolution operation on the first matrix according to the first convolution kernel, wherein the training sample includes a plurality of pictures pre-divided into categories, and the category is used to represent the training sample The type of thing indicated.
  • a training unit is configured to train the convolutional neural network according to the training sample to obtain a first convolution kernel.
  • the training unit includes: an input subunit, a second operation subunit, a third operation subunit, a first determination subunit, a comparison subunit, a judgment subunit, an adjustment subunit, and a second determination subunit.
  • An input subunit for inputting training samples into the convolutional layer of the convolutional neural network.
  • a second operation subunit configured to perform a convolution operation on the first matrix according to the convolution kernel of the initial state to obtain a first vector.
  • a third operation subunit configured to perform a convolution operation on the first vector according to the second convolution kernel to obtain a second vector.
  • a first determining subunit configured to determine, according to a value of the target element in the second vector, a classification result of the training sample, where the value of the target element in the second vector indicates a probability that the category of the second vector is the same as the category corresponding to the target element , wherein the target element is any one of the second vectors.
  • the comparison subunit is used to compare the classification result with the category of each picture to obtain a classification error value.
  • the determining subunit is configured to determine whether the classification error value is greater than a preset error value.
  • the adjustment subunit is configured to adjust the weight values of the convolution kernel in the initial state, if the classification error value is greater than the preset error value, until the classification error value is less than or equal to the preset error value.
  • the second determining subunit is configured to end the training if the classification error value is less than or equal to the preset error value, and use the current convolution kernel as the first convolution kernel.
  • the first convolution kernel satisfies the following formula: n = n_oc × (m − n_conv) / stride, where m is the input vector dimension in the convolutional neural network, n is the output vector dimension, stride is the step size of the first convolution kernel, n_oc is the number of output channels, and n_conv is the size of the first convolution kernel.
  • the disclosed technical contents may be implemented in other manners.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division; in actual implementation there may be another division manner. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the goal of the solution of this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium.
  • the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • a number of instructions are included to cause a computer device (which may be a personal computer, server or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.
  • the foregoing storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and the like.


Abstract

Disclosed by the present invention are a picture classification method, device and robot. The method comprises: inputting a target picture to be classified into a convolution layer of a convolutional neural network, the convolutional neural network comprising at least a convolution layer and a pooling layer; performing a convolution operation on a first matrix according to a preset first convolution kernel so as to obtain a first vector, the first vector being a one-dimensional vector, and the first matrix being the output of the last pooling layer of the convolutional neural network; performing a convolution operation on the first vector according to a preset second convolution kernel so as to obtain a second vector, the second vector being a one-dimensional vector; and classifying the target picture according to the second vector. The present invention solves the technical problem in the prior art that the large number of parameters in a fully connected layer imposes high performance requirements on the terminal running the network.

Description

Picture classification method and device, and robot

Technical field
The present invention relates to the field of image processing, and in particular to a picture classification method and apparatus, and a robot.
Background art
Researchers currently invest a great deal of effort in neural networks, including deep belief networks, deep Boltzmann machines, autoencoders, denoising autoencoders, and convolutional neural networks. In these networks, the fully connected layer is a very important component: for example, in networks such as deep belief networks and autoencoders, all layers are fully connected layers, and a convolutional neural network adds several fully connected layers to obtain better classification accuracy. The main problem of the fully connected layer is that its large number of parameters imposes high performance requirements on the terminal running the network.
In response to the above problems, no effective solution has yet been proposed.
Summary of the invention
The embodiments of the present invention provide a picture classification method and apparatus, and a robot, so as to solve at least the technical problem in the prior art that the large number of parameters in the fully connected layer imposes high performance requirements on the terminal running the network.
According to an aspect of the embodiments of the present invention, a picture classification method is provided, comprising: inputting a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network comprises at least one convolution layer and one pooling layer; performing a convolution operation on a first matrix according to a preset first convolution kernel to obtain a first vector, wherein the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network; performing a convolution operation on the first vector according to a preset second convolution kernel to obtain a second vector, wherein the second vector is a one-dimensional vector; and classifying the target picture according to the second vector.
Further, performing a convolution operation on the first matrix according to the preset first convolution kernel to obtain the first vector comprises: performing a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix; and rearranging all elements of the second matrix in a preset order to obtain the first vector.
Further, before the convolution operation is performed on the first matrix according to the preset first convolution kernel, the method further comprises: acquiring training samples, wherein the training samples comprise a plurality of pictures whose categories are divided in advance, the categories being used to characterize the kinds of things indicated by the training samples; and training the convolutional neural network according to the training samples to obtain the first convolution kernel.
Further, training the convolutional neural network according to the training samples to obtain the first convolution kernel comprises: inputting the training samples into the convolution layer of the convolutional neural network; performing a convolution operation on the first matrix according to a convolution kernel in an initial state to obtain the first vector; performing a convolution operation on the first vector according to the second convolution kernel to obtain the second vector; determining a classification result of the training samples according to the value of a target element in the second vector, wherein the value of the target element in the second vector indicates the probability that the category of the second vector is the same as the category corresponding to the target element, and the target element is any element of the second vector; comparing the classification result with the category of each picture to obtain a classification error value; judging whether the classification error value is greater than a preset error value; if the classification error value is greater than the preset error value, adjusting the weight values of the convolution kernel in the initial state until the classification error value is less than or equal to the preset error value; and if the classification error value is less than or equal to the preset error value, ending the training and using the current convolution kernel as the first convolution kernel.
Further, the first convolution kernel satisfies the following formula:

n = n_oc*(m − n_conv)/stride

where m is the input vector dimension in the convolutional neural network, n is the output vector dimension, stride is the step size of the first convolution kernel, n_oc is the number of output channels, and n_conv is the size of the first convolution kernel.
According to another aspect of the embodiments of the present invention, a picture classification apparatus is further provided, comprising: an input unit configured to input a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network comprises at least one convolution layer and one pooling layer; a first operation unit configured to perform a convolution operation on a first matrix according to a preset first convolution kernel to obtain a first vector, wherein the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network; a second operation unit configured to perform a convolution operation on the first vector according to a preset second convolution kernel to obtain a second vector, wherein the second vector is a one-dimensional vector; and a classification unit configured to classify the target picture according to the second vector.
Further, the first operation unit comprises: a first operation subunit configured to perform a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix; and an arrangement subunit configured to rearrange all elements of the second matrix in a preset order to obtain the first vector.
Further, the apparatus further comprises: an acquisition unit configured to acquire training samples before the first operation unit performs the convolution operation on the first matrix according to the preset first convolution kernel, wherein the training samples comprise a plurality of pictures whose categories are divided in advance, the categories being used to characterize the kinds of things indicated by the training samples; and a training unit configured to train the convolutional neural network according to the training samples to obtain the first convolution kernel.
Further, the training unit comprises: an input subunit configured to input the training samples into the convolution layer of the convolutional neural network; a second operation subunit configured to perform a convolution operation on the first matrix according to a convolution kernel in an initial state to obtain the first vector; a third operation subunit configured to perform a convolution operation on the first vector according to the second convolution kernel to obtain the second vector; a first determining subunit configured to determine a classification result of the training samples according to the value of a target element in the second vector, wherein the value of the target element in the second vector indicates the probability that the category of the second vector is the same as the category corresponding to the target element, and the target element is any element of the second vector; a comparison subunit configured to compare the classification result with the category of each picture to obtain a classification error value; a judging subunit configured to judge whether the classification error value is greater than a preset error value; an adjustment subunit configured to adjust the weight values of the convolution kernel in the initial state, if the classification error value is greater than the preset error value, until the classification error value is less than or equal to the preset error value; and a second determining subunit configured to end the training and use the current convolution kernel as the first convolution kernel if the classification error value is less than or equal to the preset error value.
Further, the first convolution kernel satisfies the following formula:

n = n_oc*(m − n_conv)/stride

where m is the input vector dimension in the convolutional neural network, n is the output vector dimension, stride is the step size of the first convolution kernel, n_oc is the number of output channels, and n_conv is the size of the first convolution kernel.
According to another aspect of the embodiments of the present invention, a robot is further provided, comprising the above picture classification apparatus.
In the embodiments of the present invention, the target picture is a picture to be classified. The target picture is input into the convolution layer of the convolutional neural network, and the last pooling layer of the convolutional neural network outputs the first matrix. A convolution operation is performed on the first matrix according to the first convolution kernel to obtain the first vector, and a convolution operation is performed on the first vector according to the second convolution kernel to obtain the second vector. The value of each element of the second vector can indicate the probability that the target picture belongs to a certain category, so the target picture can be classified by the second vector. This reduces the number of parameters compared with the fully connected layer and lowers the performance requirements on the terminal, so that the network can be deployed on a mobile phone or other embedded systems, achieving the technical effect of reducing the performance requirements on the terminal deploying the network and thereby solving the technical problem in the prior art that the large number of parameters in the fully connected layer imposes high performance requirements on the terminal.
Brief description of the drawings
The drawings described herein are intended to provide a further understanding of the invention and constitute a part of the invention. The illustrative embodiments of the invention and their description are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
Fig. 1 is a flowchart of a picture classification method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a picture classification apparatus according to an embodiment of the present invention.
Detailed description
In order to enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented in an order other than those illustrated or described here. Moreover, the terms "comprise" and "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such a process, method, product, or device.
Example 1
According to an embodiment of the present invention, an embodiment of a picture classification method is provided. It should be noted that the steps illustrated in the flowchart of the drawing may be performed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one described here.
Fig. 1 is a flowchart of a picture classification method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: input a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network comprises at least one convolution layer and one pooling layer.
Step S104: perform a convolution operation on a first matrix according to a preset first convolution kernel to obtain a first vector, wherein the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network.
Step S106: perform a convolution operation on the first vector according to a preset second convolution kernel to obtain a second vector, wherein the second vector is a one-dimensional vector.
Step S108: classify the target picture according to the second vector.
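Steps S102 to S108 can be sketched with a minimal numpy implementation. The pooled matrix, kernel values, and sizes below are hypothetical placeholders chosen only for illustration, not values prescribed by the invention:

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid 1-D convolution of vector x with a kernel."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i*stride:i*stride + k], kernel)
                     for i in range(out_len)])

def classify(first_matrix, first_kernel, second_kernel):
    # Step S104: convolve the (flattened) output of the last pooling layer
    first_vector = conv1d(first_matrix.ravel(), first_kernel)
    # Step S106: convolve the first vector to get one score per category
    second_vector = conv1d(first_vector, second_kernel)
    # Step S108: the index of the largest element gives the predicted category
    return int(np.argmax(second_vector))

# Toy pooled output and kernels (hypothetical values)
pooled = np.arange(6, dtype=float).reshape(2, 3)
label = classify(pooled, np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

In a trained network the kernels would come from the training procedure described below; here they are fixed so the pipeline can be traced by hand.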
Picture classification means inputting a picture and outputting the category the picture corresponds to (dog, cat, boat, bird), or outputting which category the picture most likely belongs to.
To have a computer classify a picture is to input into the computer an array full of pixel values; each number in the array is in the range 0-255 and represents the pixel value at that point. The computer then returns the possible classification probabilities corresponding to this array (for example, dog 0.01, cat 0.04, boat 0.94, bird 0.02).
Humans may distinguish a picture of a boat by features such as the boat's edges and lines in the picture. Similarly, a computer distinguishes a picture of a boat by judging from these low-level features, such as the image edges and image contours in the picture, and then builds more abstract concepts through the convolutional neural network.
The first layer in a convolutional neural network is always a convolutional layer. As mentioned above, the input to the convolutional layer is an array full of pixel values, say a 28*28*3 array (3 for the RGB values). The convolutional layer can be imagined as a beam of light shining on a picture. This beam is called a filter (convolution kernel), and the area illuminated by the beam is called the receptive field. Suppose the area illuminated by this beam is a 5*5 square region. Now let this beam sweep over every region of the picture, from left to right and from top to bottom. When all the movement is complete, a 24*24*3 array is obtained. This array is called a feature map.
This filter is an array of numbers (the numbers are weight values). The depth of the filter is the same as the depth of the input, so the dimensions of the filter are 5*5*3. Using a 5*5*3 filter, an output array of 24*24*3 can be obtained. Using more filters yields more feature maps.
As the filter sweeps, or convolves, over the whole picture, the weight values in the filter are multiplied by the corresponding pixel values in the real picture, and all the results are summed to obtain a single value. This process is then repeated to scan the entire input picture (next, the filter is moved one unit to the right, then another step to the right, and so on), and each step yields one value.
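The sliding weighted-sum operation described above can be sketched as follows. This is a minimal, unoptimized single-channel example; the 28*28 input and 5*5 kernel are the sizes used in the text, and the all-ones values are placeholders:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each position multiply the
    overlapping values element-wise and sum them into one output value."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

feature_map = conv2d(np.ones((28, 28)), np.ones((5, 5)))
print(feature_map.shape)  # (24, 24): a 5*5 filter over a 28*28 input
```

The 24*24 output size matches the feature-map size given in the text for one filter; stacking several filters gives the extra depth dimension.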
During convolution, if there is a shape in the picture similar to the shape represented by the filter, it produces a strong activation with the filter, and the sum of the products will be a large number. Other filters can be added to detect the edges and colors of the picture, and so on. The more filters there are, the more feature maps there are, and the richer the information extracted from the input data.
The pooling layer is usually used after the convolutional layer. Its role is to simplify the information output by the convolutional layer, reduce the data dimensions, lower the computational overhead, and control overfitting.
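As an illustration of the simplification the pooling layer performs, here is a 2×2 max-pooling sketch; the window size is a common choice, not one fixed by the invention:

```python
import numpy as np

def max_pool(x, size=2):
    """Keep only the largest value in each size x size block,
    quartering the data volume for size=2."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]          # drop ragged edges
    return x.reshape(x.shape[0] // size, size,
                     x.shape[1] // size, size).max(axis=(1, 3))

pooled = max_pool(np.arange(16.0).reshape(4, 4))  # 4x4 input -> 2x2 output
```

Each output value summarizes a whole block of the convolutional layer's output, which is exactly the dimension reduction described above.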
In the prior art, the output of the last pooling layer is connected to multiple fully connected layers, and the parameters of these fully connected layers are very numerous.
From the perspective of linear algebra, the fully connected layer maps one vector space to another vector space; if the rank of the mapping matrix W is greater than the input vector dimension, no information is lost. The convolution operation, again from the perspective of linear algebra, completes the same mapping operation, and, in the general manner used by convolutional neural networks, the rank of the mapping matrix W generated by the convolution operation is generally min{m, n}, where m is the input vector dimension and n is the output vector dimension. From the vector-space point of view, this means that the information loss of the convolution operation is related only to the output vector dimension. As for the number of parameters of the two, the fully connected layer requires m*n parameters, whereas the convolutional layer requires n_oc*n_conv parameters, where n_oc is the number of output channels and n_conv is the convolution kernel size. In fact, these parameters also satisfy the following equation, where stride is the distance the convolution moves each time:
n = n_oc*(m − n_conv)/stride
That is, the number of parameters required by the convolution is n_oc*n_conv = m*n_oc − n*stride, which is less than the m*n parameters of the full connection. Here n_oc is generally only a small fraction of n (from a few times to tens of times smaller). Han et al. used a parameter-pruning method to reduce the number of fully connected layer parameters; in their experiments, the fully connected layer parameters could be compressed to 20%, which means that in the fully connected parameter matrix only 20% of the elements are non-zero. This experiment also demonstrates the rationality of using one-dimensional convolution.
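The parameter comparison can be checked numerically. The dimensions below are hypothetical, chosen only so that the constraint n = n_oc*(m − n_conv)/stride divides evenly:

```python
# Hypothetical dimensions for illustration
m, n = 4096, 1000            # input and output vector dimensions
stride, n_oc = 4, 16         # convolution step size and output channels

# Kernel size implied by the constraint n = n_oc * (m - n_conv) / stride
n_conv = m - n * stride // n_oc

fc_params = m * n            # fully connected layer: m*n parameters
conv_params = n_oc * n_conv  # one-dimensional convolution: n_oc*n_conv

# Matches the count given in the text: m*n_oc - n*stride
assert conv_params == m * n_oc - n * stride
```

With these numbers the fully connected layer needs 4,096,000 parameters while the convolution needs 61,536, illustrating the scale of the saving.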
In the embodiments of the present invention, the output of the last pooling layer is connected to the first convolution kernel, and the convolution operation is used to obtain the second vector. The value of each element of the second vector can indicate the probability that the target picture belongs to a certain category, representing the picture's possible classification probabilities. For example, if the second vector is [0.01, 0.04, 0.94, 0.02], a higher value indicates that the feature maps are closer to that category. Here 0.94 means the probability that the image is a boat is 94%, indicating that the predicted picture produced a strong activation with the filter and many high-level features, such as sails and oars, were captured. 0.02 means the probability that the image is a bird is 2%, indicating that the predicted picture produced a very weak activation with the filter and few high-level features, such as wings and beaks, were captured.
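Reading off the predicted category from the second vector is then just a matter of taking its largest element; the class names and the vector are the ones from the example above:

```python
classes = ["dog", "cat", "boat", "bird"]
second_vector = [0.01, 0.04, 0.94, 0.02]

# Index of the largest probability selects the predicted category
best = max(range(len(second_vector)), key=lambda i: second_vector[i])
print(classes[best])  # boat
```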
In the embodiments of the present invention, the target picture is the picture to be classified. The target picture is input into the convolution layer of the convolutional neural network, and the last pooling layer of the convolutional neural network outputs the first matrix. A convolution operation is performed on the first matrix according to the first convolution kernel to obtain the first vector, and a convolution operation is performed on the first vector according to the second convolution kernel to obtain the second vector. The value of each element of the second vector can indicate the probability that the target picture belongs to a certain category, so the target picture can be classified by the second vector. This reduces the number of parameters compared with the fully connected layer and lowers the performance requirements on the terminal, so that the network can be deployed on a mobile phone or other embedded systems, solving the technical problem in the prior art that the large number of parameters in the fully connected layer imposes high performance requirements on the terminal and achieving the technical effect of reducing the performance requirements on the terminal deploying the network.
Optionally, performing a convolution operation on the first matrix according to the preset first convolution kernel to obtain the first vector comprises: performing a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix; and rearranging all elements of the second matrix in a preset order to obtain the first vector.
Because of its good data-recovery capability, compressed sensing is currently a very popular field. From the coding point of view, it is a new lossy compression coding method. The idea of compressed sensing is to complete the encoding-decoding process with a general encoding method and a special decoding method. In theory, to compress a vector x obeying a distribution p(x), one possible approach of compressed-sensing coding is to randomly sample the vector and, during decoding, use the L1 norm as the cost function to recover the original vector. It can be proved that, if the distribution meets certain conditions, the error between the recovered vector and the original vector can be very small.
Another reason for using rearrangement is that, because of the local nature of the convolution kernel, a vector produced merely by convolution contains only local information and no global information. After rearrangement, the new vector is convolved, and, according to the compressed-sensing property above, the convolved vector contains all the information of the original vector, so such a convolution result has better global characteristics.
Optionally, before the convolution operation is performed on the first matrix according to the preset first convolution kernel, the method further comprises: acquiring training samples, wherein the training samples comprise a plurality of pictures whose categories are divided in advance, the categories being used to characterize the kinds of things indicated by the training samples; and training the convolutional neural network according to the training samples to obtain the first convolution kernel.
The first convolution kernel is obtained through training. The specific training process may be as follows: input the training samples into the convolution layer of the convolutional neural network; perform a convolution operation on the first matrix according to a convolution kernel in an initial state to obtain the first vector; perform a convolution operation on the first vector according to the second convolution kernel to obtain the second vector; determine the classification result of the training samples according to the value of a target element in the second vector, wherein the value of the target element in the second vector indicates the probability that the category of the second vector is the same as the category corresponding to the target element, and the target element is any element of the second vector; compare the classification result with the category of each picture to obtain a classification error value; judge whether the classification error value is greater than a preset error value; if the classification error value is greater than the preset error value, adjust the weight values of the convolution kernel in the initial state until the classification error value is less than or equal to the preset error value; if the classification error value is less than or equal to the preset error value, end the training and use the current convolution kernel as the first convolution kernel.
损失函数可以帮助网络更新权重值,从而找到想要的特征图像。损失函数的定义方式有很多种,一种常用方式是MSE(mean squared error)均方误差。将图片的真实分类值和图片通过网络训练出来的分类值代入均方误差公式,就得到了损失值。这个损失值在网络刚开始训练的时候可能会很高,这是因为权重值都是随机初始化出来的。最终目的就是想要得到预测值和真实值一样。为了达到这个目的,需要尽量减少损失值,损失值越小就说明预测结果越接近。在这一个过程中,需要不断的调整权重值,来寻找出哪些权重值能使网络的损失减小。可以使用到梯度下降算法来寻找这些权重值。The loss function helps the network update its weight values so as to find the desired feature maps. There are many ways to define a loss function; a common choice is MSE (mean squared error). Substituting the true classification value of a picture and the classification value predicted by the network into the mean squared error formula yields the loss value. This loss value may be high when the network first starts training, because the weight values are randomly initialized. The ultimate goal is for the predicted value to match the true value. To achieve this, the loss value should be made as small as possible; the smaller the loss value, the closer the prediction is to the truth. In this process, the weight values must be adjusted continuously to find the weights that reduce the network's loss. A gradient descent algorithm can be used to find these weight values.
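MSE and the gradient-descent weight update can be illustrated concretely (a hypothetical linear predictor, not the patent's network; the loss starts high with zero-initialized weights and shrinks as the weights are repeatedly adjusted along the negative gradient):

```python
import numpy as np

def mse(pred, true):
    """Mean squared error between predicted and true classification values."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.mean((pred - true) ** 2)

def gradient_step(w, x, true, lr=0.1):
    """One gradient-descent update for a linear predictor pred = x @ w,
    using d(MSE)/dw = 2/N * x.T @ (x @ w - true)."""
    grad = 2.0 / len(true) * x.T @ (x @ w - true)
    return w - lr * grad

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
true = x @ np.array([1.0, -2.0, 0.5])   # hypothetical ground-truth mapping
w = np.zeros(3)                          # initial weights: the loss starts high
for _ in range(100):
    w = gradient_step(w, x, true)        # repeatedly adjust the weight values
```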
可选地,根据第二卷积核对第一向量做卷积运算,得到第二向量包括:获取预先设置的第二卷积核,其中,第二卷积核为一维向量;利用预先设置的第二卷积核对第一向量做卷积运算,得到第二向量。Optionally, performing a convolution operation on the first vector according to the second convolution kernel to obtain the second vector includes: acquiring a preset second convolution kernel, where the second convolution kernel is a one-dimensional vector; and performing a convolution operation on the first vector with the preset second convolution kernel to obtain the second vector.
实施例2Example 2
为了减少全连接层的参数个数,使用如下的方案:In order to reduce the number of parameters of the fully connected layer, the following scheme is used:
1.上一层的输出reshape为一维向量。1. Reshape the output of the previous layer into a one-dimensional vector.
2.以固定的顺序重新排列这个向量;按照卷积核的滑动经过的2维向量依次排列成一维向量。2. Rearrange this vector in a fixed order: the 2-D patches traversed as the convolution kernel slides are concatenated in sequence into a one-dimensional vector.
3.对这个向量做一维卷积。3. Do one-dimensional convolution on this vector.
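The three steps might be sketched as follows (the window size, stride, and row-major sliding order are illustrative assumptions, not values fixed by the text):

```python
import numpy as np

def reorder_then_conv1d(feature_map, kernel, window=2, stride=1):
    """Steps 1-3 above: flatten the previous layer's 2-D output by walking a
    sliding window over it in a fixed order, concatenate the patches into one
    long vector, then apply a 1-D convolution to that vector."""
    h, w = feature_map.shape
    patches = [feature_map[i:i + window, j:j + window].ravel()   # step 2: fixed order
               for i in range(0, h - window + 1, stride)
               for j in range(0, w - window + 1, stride)]
    vec = np.concatenate(patches)                                # steps 1-2: one 1-D vector
    n = len(vec) - len(kernel) + 1
    return np.array([np.dot(vec[i:i + len(kernel)], kernel)      # step 3: 1-D convolution
                     for i in range(n)])
```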
从线性代数的角度讲,全连接层将一个向量空间映射到另一个向量空间,如果映射矩阵W的秩大于输入向量维数,信息不会有丢失。卷积操作从线性代数的角度讲,就是完成上面这个映射操作,而且就卷积神经网络使用的一般方式而言,卷积操作生成的映射矩阵W的秩一般为min{m,n},m为输入向量维度,n为输出向量维度;因此从向量空间角度讲,这意味着卷积操作的信息丢失仅和输出向量维度相关。对于二者的参数个数而言,全连接层需要的参数个数为m*n,卷积层需要的参数个数为noc*nconv,这里noc为输出通道个数,nconv为卷积核大小,事实上,这些参数还要满足如下的等式,stride为卷积每次移动的距离。From the standpoint of linear algebra, a fully connected layer maps one vector space to another; if the rank of the mapping matrix W is no less than the input vector dimension, no information is lost. The convolution operation, from the same standpoint, performs this mapping as well, and in the way convolutional neural networks generally use it, the rank of the mapping matrix W produced by convolution is generally min{m, n}, where m is the input vector dimension and n is the output vector dimension; from the vector-space perspective, this means the information loss of the convolution operation depends only on the output vector dimension. As for the parameter counts of the two, a fully connected layer requires m*n parameters, while a convolution layer requires n_oc*n_conv, where n_oc is the number of output channels and n_conv is the convolution kernel size. In fact, these parameters must also satisfy the following equation, where stride is the distance the convolution moves at each step.
n = n_oc * (m - n_conv) / stride
也就是说,卷积需要的参数个数是m*noc-n*stride,少于全连接参数个数的m*n。这里的noc一般只有n的几分之一到几十分之一。Han等使用参数剪枝方法降低全连接层参数数目,在他们的实验中,全连接层参数可以被压缩到20%,这意味着,在全连接参数矩阵中,仅仅有20%非零元素;这个实验也证明了使用一维卷积的合理性。大部分参数为0,说明参数的稀疏性,而卷积的参数共享正是大大减小了参数量。That is, the number of parameters required by the convolution is m*n_oc - n*stride, fewer than the m*n parameters of the full connection; here n_oc is generally only a small fraction of n, from one part in several to one part in several tens. Han et al. used parameter pruning to reduce the number of fully connected layer parameters; in their experiments, the fully connected layer parameters could be compressed to 20%, meaning that only 20% of the elements of the fully connected parameter matrix are non-zero. That experiment also supports the reasonableness of using a one-dimensional convolution: most parameters being zero shows that the parameters are sparse, and the parameter sharing of convolution is exactly what reduces the parameter count so much.
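The two parameter counts can be checked numerically, using the sliding relation implied by the text, n = n_oc*(m - n_conv)/stride, hence n_conv = m - n*stride/n_oc (the layer dimensions chosen here are hypothetical):

```python
def fc_params(m, n):
    """A fully connected layer stores one weight per (input, output) pair."""
    return m * n

def conv1d_params(m, n, n_oc, stride=1):
    """Parameter count n_oc * n_conv of the 1-D convolution, with the kernel
    size n_conv = m - n * stride / n_oc taken from the sliding relation."""
    n_conv = m - n * stride // n_oc
    return n_oc * n_conv

# hypothetical layer: m = 1024 inputs mapped to n = 128 outputs on n_oc = 8 channels
assert fc_params(1024, 128) == 131072
assert conv1d_params(1024, 128, 8) == 1024 * 8 - 128 * 1   # m*n_oc - n*stride
```

With n_oc far smaller than n, the convolution's m*n_oc - n*stride parameters are a small fraction of the fully connected layer's m*n.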
因为压缩感知良好的数据恢复能力,压缩感知是当前非常热门的领域,从编码看,它是一种新的有损压缩编码方式;压缩感知的想法是使用通用的编码方式,特殊的解码方式完成编解码过程。在理论上,压缩一个服从分布p(x)的向量x,压缩感知编码的一种可能做法是对这个向量随机采样,解码时使用L1范数作为代价方程,恢复原向量,可以证明,如果分布符合某种条件,恢复的向量和原始向量的误差可以很小。Because of its good data-recovery ability, compressed sensing is currently a very popular field. Viewed as coding, it is a new lossy compression coding scheme; the idea of compressed sensing is to complete the encoding-decoding process with a generic encoding and a special decoding. In theory, to compress a vector x obeying a distribution p(x), one possible compressed-sensing encoding is to randomly sample this vector; decoding uses the L1 norm as the cost function to recover the original vector. It can be proved that, if the distribution satisfies certain conditions, the error between the recovered vector and the original vector can be very small.
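The random-sampling encoder and L1-norm decoder can be sketched as a basis-pursuit linear program (a generic illustration using scipy's LP solver, not the patent's procedure; the vector size and sparsity are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(A, y):
    """Basis-pursuit decoding: minimize ||z||_1 subject to A z = y,
    written as a linear program over z = z_pos - z_neg with z_pos, z_neg >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)                      # objective: sum of |z| components
    A_eq = np.hstack([A, -A])               # A z_pos - A z_neg = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(0)
x = np.zeros(20)
x[3], x[11] = 1.5, -2.0                     # sparse original vector
A = rng.normal(size=(12, 20))               # generic encoding: random sampling
z = l1_recover(A, A @ x)                    # special decoding: L1 minimization
```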
使用重排的另一个原因是,因为卷积核的局部性质,仅仅经过卷积产生的向量只有局部信息,没有全局信息,重排后,对新的向量卷积,根据上面的压缩感知性质,卷积后的向量包含原始向量所有的信息,这样的卷积结果会有更好的全局特性。Another reason for the rearrangement is that, because of the local nature of the convolution kernel, a vector produced by convolution alone contains only local information and no global information. After rearrangement, the new vector is convolved again; by the compressed-sensing property above, the convolved vector contains all the information of the original vector, so the convolution result has better global characteristics.
在当前的网络中,常用的分类器是softmax,在分类器之前也有一个全连接层;这个全连接层和其它全连接层的不同之处在于,它的输出维度是确定的,必须和类别数相同(其它的全连接层的输出维度一般由经验自由给定,没有必须的要求)。直接使用上述的过程,会有卷积后输出维度和类别数不匹配的问题,因此,在这里接一个全连接层。In current networks, the commonly used classifier is softmax, and there is also a fully connected layer before the classifier. This fully connected layer differs from the other fully connected layers in that its output dimension is fixed: it must equal the number of categories (the output dimensions of the other fully connected layers are generally set freely from experience, with no hard requirement). If the above procedure were used directly, the output dimension after convolution would not match the number of categories; therefore, a fully connected layer is attached here.
最终的结果是,整个网络只有softmax之前一个全连接层,这会大大减少网络的参数。通过在Mnist上的实验说明了这一点。The end result is that the entire network has only one fully connected layer, the one before softmax, which greatly reduces the network's parameters. This is illustrated by experiments on MNIST.
mnist为手写字体数据集,包含0-9的手写数字数据集及标签,准确率即为网络对手写体数字正确识别的概率。MNIST is a handwritten digit dataset containing images of the digits 0-9 together with labels; the accuracy is the probability that the network correctly recognizes a handwritten digit.
在Mnist上的实验结果:Experimental results on MNIST:

方法 Method           参数个数 Number of parameters
Softmax               7840
Conv+fc+softmax       3273504
Conv+fc+softmax       8556
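The Softmax row can be verified by hand: a softmax classifier alone on MNIST is a single fully connected map from the 784 pixels to the 10 classes (bias terms not counted here):

```python
# MNIST images are 28 x 28 = 784 pixels, and there are 10 digit classes.
inputs, classes = 28 * 28, 10
softmax_only = inputs * classes   # one fully connected softmax layer, biases not counted
assert softmax_only == 7840       # matches the Softmax row of the table
```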
实施例3Example 3
根据本发明实施例,还提供了一种图片分类装置。该图片分类装置可以执行上述图片分类方法,上述图片分类方法也可以通过该图片分类装置实施。According to an embodiment of the invention, a picture classification device is also provided. The picture classification device may perform the picture classification method described above, and the picture classification method may be implemented by the picture classification device.
图2是根据本发明实施例的一种图片分类装置的示意图。如图2所示,该装置包括:输入单元10、第一运算单元20、第二运算单元30、分类单元40。2 is a schematic diagram of a picture classification device according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes an input unit 10, a first arithmetic unit 20, a second arithmetic unit 30, and a sorting unit 40.
输入单元10,用于将待分类的目标图片输入卷积神经网络的卷积层,其中,卷积神经网络至少包括一个卷积层和一个池化层。The input unit 10 is configured to input a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network includes at least one convolution layer and one pooling layer.
第一运算单元20,用于根据预先设置的第一卷积核对第一矩阵做卷积运算,得到第一向量,其中,第一向量为一维向量,第一矩阵是卷积神经网络的最后一个池化层的输出。The first operation unit 20 is configured to perform a convolution operation on the first matrix according to the preset first convolution kernel to obtain a first vector, where the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network.
第二运算单元30,用于根据预先设置的第二卷积核对第一向量做卷积运算,得到第二向量,其中,第二向量为一维向量。The second operation unit 30 is configured to perform a convolution operation on the first vector according to the second convolution kernel set in advance to obtain a second vector, where the second vector is a one-dimensional vector.
分类单元40,用于根据第二向量对目标图片进行分类。The classification unit 40 is configured to classify the target picture according to the second vector.
可选地,第一运算单元20包括:第一运算子单元、排列子单元。第一运算子单元,用于根据第一卷积核对第一矩阵做卷积运算,得到第二矩阵。排列子单元,用于将第二矩阵的所有元素按照预设顺序重新排列,得到第一向量。 Optionally, the first operation unit 20 includes: a first operation subunit, and an arrangement subunit. The first operation subunit is configured to perform a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix. Arranging subunits for rearranging all elements of the second matrix in a predetermined order to obtain a first vector.
可选地,装置还包括:获取单元、训练单元。获取单元,用于在第一运算单元20根据第一卷积核对第一矩阵做卷积运算之前,获取训练样本,其中,训练样本包括预先划分好类别的多张图片,类别用于表征训练样本所指示的事物的种类。训练单元,用于根据训练样本对卷积神经网络进行训练,得到第一卷积核。Optionally, the device further includes: an acquiring unit and a training unit. The acquiring unit is configured to acquire training samples before the first operation unit 20 performs the convolution operation on the first matrix according to the first convolution kernel, where the training samples include a plurality of pictures divided into categories in advance, and the categories characterize the type of thing indicated by the training samples. The training unit is configured to train the convolutional neural network according to the training samples to obtain the first convolution kernel.
可选地,训练单元包括:输入子单元、第二运算子单元、第三运算子单元、第一确定子单元、比较子单元、判断子单元、调整子单元、第二确定子单元。输入子单元,用于将训练样本输入到卷积神经网络的卷积层中。第二运算子单元,用于根据初始状态的卷积核对第一矩阵做卷积运算,得到第一向量。第三运算子单元,用于根据第二卷积核对第一向量做卷积运算,得到第二向量。第一确定子单元,用于根据第二向量中目标元素的值确定训练样本的分类结果,其中,第二向量中目标元素的值指示第二向量的类别与目标元素对应的类别相同的概率大小,其中,目标元素是第二向量中的任意一个元素。比较子单元,用于将分类结果与每张图片的类别进行比较,得到分类误差值。判断子单元,用于判断分类误差值是否大于预设误差值。调整子单元,用于如果分类误差值大于预设误差值,对初始状态的卷积核的权重值进行调整,直至分类误差值小于等于预设误差值。第二确定子单元,用于如果分类误差值小于等于预设误差值,则训练结束,并将当前的卷积核作为第一卷积核。Optionally, the training unit includes: an input subunit, a second operation subunit, a third operation subunit, a first determination subunit, a comparison subunit, a judgment subunit, an adjustment subunit, and a second determination subunit. An input subunit for inputting training samples into the convolutional layer of the convolutional neural network. And a second operation subunit, configured to perform a convolution operation on the first matrix according to the convolution kernel of the initial state to obtain a first vector. And a third operation subunit, configured to perform a convolution operation on the first vector according to the second convolution kernel to obtain a second vector. a first determining subunit, configured to determine, according to a value of the target element in the second vector, a classification result of the training sample, where the value of the target element in the second vector indicates a probability that the category of the second vector is the same as the category corresponding to the target element , wherein the target element is any one of the second vectors. The comparison subunit is used to compare the classification result with the category of each picture to obtain a classification error value. The determining subunit is configured to determine whether the classification error value is greater than a preset error value. 
The adjusting subunit is configured to adjust, if the classification error value is greater than the preset error value, the weight values of the initial-state convolution kernel until the classification error value is less than or equal to the preset error value. The second determining subunit is configured to end the training if the classification error value is less than or equal to the preset error value and take the current convolution kernel as the first convolution kernel.
可选地,第一卷积核满足如下公式:Optionally, the first convolution kernel satisfies the following formula:

n = n_oc * (m - n_conv) / stride

其中,m为卷积神经网络中输入向量维度,n为输出向量维度,stride为第一卷积核的步长,noc为输出通道个数,nconv为第一卷积核大小。Where m is the input vector dimension in the convolutional neural network, n is the output vector dimension, stride is the stride of the first convolution kernel, n_oc is the number of output channels, and n_conv is the size of the first convolution kernel.
在本发明的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for a part not detailed in one embodiment, refer to the related descriptions of the other embodiments.
在本发明所提供的几个实施例中,应该理解到,所揭露的技术内容,可通过其它的方式实现。其中,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,可以为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元或模块的间接耦合或通信连接,可以是电性或其它的形式。In the several embodiments provided by the present invention, it should be understood that the disclosed technical contents may be implemented in other manners. The device embodiments described above are only schematic. For example, the division of the unit may be a logical function division. In actual implementation, there may be another division manner, for example, multiple units or components may be combined or may be Integrate into another system, or some features can be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
以上所述仅是本发明的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本发明的保护范围。The above description covers only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (11)

  1. 一种图片分类方法,其特征在于,包括:A picture classification method, comprising:
    将待分类的目标图片输入到卷积神经网络的卷积层中,其中,所述卷积神经网络至少包括一个卷积层和一个池化层;Inputting a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network includes at least one convolution layer and one pooling layer;
    根据预先设置的第一卷积核对第一矩阵做卷积运算,得到第一向量,其中,所述第一向量为一维向量,所述第一矩阵是所述卷积神经网络的最后一个池化层的输出;Performing a convolution operation on the first matrix according to a preset first convolution kernel to obtain a first vector, wherein the first vector is a one-dimensional vector and the first matrix is the output of the last pooling layer of the convolutional neural network;
    根据预先设置的第二卷积核对所述第一向量做卷积运算,得到第二向量,其中,所述第二向量为一维向量;Performing a convolution operation on the first vector according to a preset second convolution kernel to obtain a second vector, where the second vector is a one-dimensional vector;
    根据所述第二向量对所述目标图片进行分类。The target picture is classified according to the second vector.
  2. 根据权利要求1所述的方法,其特征在于,根据预先设置的第一卷积核对第一矩阵做卷积运算,得到第一向量包括:The method according to claim 1, wherein the first matrix is convoluted according to the first convolution kernel set in advance, and the first vector is obtained:
    根据所述第一卷积核对所述第一矩阵做卷积运算,得到第二矩阵;Performing a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix;
    将所述第二矩阵的所有元素按照预设顺序重新排列,得到所述第一向量。All elements of the second matrix are rearranged in a predetermined order to obtain the first vector.
  3. 根据权利要求1所述的方法,其特征在于,在根据预先设置的第一卷积核对第一矩阵做卷积运算之前,所述方法还包括:The method according to claim 1, wherein before the convolution operation on the first matrix according to the first convolution kernel set in advance, the method further comprises:
    获取训练样本,其中,所述训练样本包括预先划分好类别的多张图片,所述类别用于表征所述训练样本所指示的事物的种类;Obtaining a training sample, wherein the training sample includes a plurality of pictures of a pre-divided category, the category being used to characterize a kind of the thing indicated by the training sample;
    根据所述训练样本对所述卷积神经网络进行训练,得到所述第一卷积核。The convolutional neural network is trained according to the training sample to obtain the first convolution kernel.
  4. 根据权利要求3所述的方法,其特征在于,根据所述训练样本对所述卷积神经网络进行训练,得到所述第一卷积核包括:The method according to claim 3, wherein the convolutional neural network is trained according to the training sample, and obtaining the first convolution kernel comprises:
    将所述训练样本输入到所述卷积神经网络的卷积层中;Importing the training sample into a convolutional layer of the convolutional neural network;
    根据初始状态的卷积核对所述第一矩阵做卷积运算,得到第一向量;Performing a convolution operation on the first matrix according to a convolution kernel of an initial state to obtain a first vector;
    根据所述第二卷积核对所述第一向量做卷积运算,得到第二向量;Performing a convolution operation on the first vector according to the second convolution kernel to obtain a second vector;
    根据所述第二向量中目标元素的值确定所述训练样本的分类结果,其中,所述第二向量中目标元素的值指示所述第二向量的类别与所述目标元素对应的类别相同的概率大小,其中,所述目标元素是所述第二向量中的任意一个元素;Determining a classification result of the training sample according to a value of a target element in the second vector, wherein the value of the target element in the second vector indicates the probability that the category of the second vector is the same as the category corresponding to the target element, and the target element is any one element of the second vector;
    将所述分类结果与每张所述图片的类别进行比较,得到分类误差值;Comparing the classification result with each category of the picture to obtain a classification error value;
    判断所述分类误差值是否大于预设误差值;Determining whether the classification error value is greater than a preset error value;
    如果所述分类误差值大于所述预设误差值,对所述初始状态的卷积核的权重值进行调整,直至分类误差值小于等于所述预设误差值;If the classification error value is greater than the preset error value, adjusting a weight value of the convolution kernel of the initial state until the classification error value is less than or equal to the preset error value;
    如果所述分类误差值小于等于所述预设误差值,则训练结束,并将当前的卷积核作为所述第一卷积核。If the classification error value is less than or equal to the preset error value, the training ends and the current convolution kernel is used as the first convolution kernel.
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述第一卷积核满足如下公式:The method according to any one of claims 1 to 4, wherein the first convolution kernel satisfies the following formula:
    n = n_oc * (m - n_conv) / stride
    其中,m为卷积神经网络中输入向量维度,n为输出向量维度,stride为所述第一卷积核的步长,noc为输出通道个数,nconv为所述第一卷积核大小。Where m is the input vector dimension in the convolutional neural network, n is the output vector dimension, stride is the stride of the first convolution kernel, n_oc is the number of output channels, and n_conv is the size of the first convolution kernel.
  6. 一种图片分类装置,其特征在于,包括:A picture classification device, comprising:
    输入单元,用于将待分类的目标图片输入到卷积神经网络的卷积层中,其中,所述卷积神经网络至少包括一个卷积层和一个池化层;An input unit, configured to input a target picture to be classified into a convolution layer of a convolutional neural network, wherein the convolutional neural network includes at least one convolution layer and one pooling layer;
    第一运算单元,用于根据预先设置的第一卷积核对第一矩阵做卷积运算,得到第一向量,其中,所述第一向量为一维向量,所述第一矩阵是所述卷积神经网络的最后一个池化层的输出;a first operation unit, configured to perform a convolution operation on the first matrix according to the preset first convolution kernel to obtain a first vector, where the first vector is a one-dimensional vector, and the first matrix is the volume The output of the last pooled layer of the neural network;
    第二运算单元,用于根据预先设置的第二卷积核对所述第一向量做卷积运算,得到第二向量,其中,所述第二向量为一维向量;a second operation unit, configured to perform a convolution operation on the first vector according to a preset second convolution kernel to obtain a second vector, where the second vector is a one-dimensional vector;
    分类单元,用于根据所述第二向量对所述目标图片进行分类。a classifying unit, configured to classify the target picture according to the second vector.
  7. 根据权利要求6所述的装置,其特征在于,所述第一运算单元包括:The apparatus according to claim 6, wherein the first arithmetic unit comprises:
    第一运算子单元,用于根据所述第一卷积核对所述第一矩阵做卷积运算,得到第二矩阵;a first operation subunit, configured to perform a convolution operation on the first matrix according to the first convolution kernel to obtain a second matrix;
    排列子单元,用于将所述第二矩阵的所有元素按照预设顺序重新排列,得到所述第一向量。An arranging subunit, configured to rearrange all elements of the second matrix in a preset order to obtain the first vector.
  8. 根据权利要求6所述的装置,其特征在于,所述装置还包括: The device according to claim 6, wherein the device further comprises:
    获取单元,用于在所述第一运算单元根据预先设置的第一卷积核对第一矩阵做卷积运算之前,获取训练样本,其中,所述训练样本包括预先划分好类别的多张图片,所述类别用于表征所述训练样本所指示的事物的种类;An acquiring unit, configured to acquire training samples before the first operation unit performs the convolution operation on the first matrix according to the preset first convolution kernel, wherein the training samples include a plurality of pictures divided into categories in advance, and the categories characterize the type of thing indicated by the training samples;
    训练单元,用于根据所述训练样本对所述卷积神经网络进行训练,得到所述第一卷积核。And a training unit, configured to train the convolutional neural network according to the training sample to obtain the first convolution kernel.
  9. 根据权利要求8所述的装置,其特征在于,所述训练单元包括:The apparatus of claim 8 wherein said training unit comprises:
    输入子单元,用于将所述训练样本输入到所述卷积神经网络的卷积层中;An input subunit, configured to input the training sample into a convolution layer of the convolutional neural network;
    第二运算子单元,用于根据初始状态的卷积核对所述第一矩阵做卷积运算,得到第一向量;a second operation subunit, configured to perform a convolution operation on the first matrix according to a convolution kernel of an initial state to obtain a first vector;
    第三运算子单元,用于根据所述第二卷积核对所述第一向量做卷积运算,得到第二向量;a third operation subunit, configured to perform a convolution operation on the first vector according to the second convolution kernel to obtain a second vector;
    第一确定子单元,用于根据所述第二向量中目标元素的值确定所述训练样本的分类结果,其中,所述第二向量中目标元素的值指示所述第二向量的类别与所述目标元素对应的类别相同的概率大小,其中,所述目标元素是所述第二向量中的任意一个元素;a first determining subunit, configured to determine a classification result of the training sample according to a value of a target element in the second vector, wherein the value of the target element in the second vector indicates the probability that the category of the second vector is the same as the category corresponding to the target element, and the target element is any one element of the second vector;
    比较子单元,用于将所述分类结果与每张所述图片的类别进行比较,得到分类误差值;a comparison subunit, configured to compare the classification result with a category of each of the pictures to obtain a classification error value;
    判断子单元,用于判断所述分类误差值是否大于预设误差值;a determining subunit, configured to determine whether the classification error value is greater than a preset error value;
    调整子单元,用于如果所述分类误差值大于所述预设误差值,对所述初始状态的卷积核的权重值进行调整,直至分类误差值小于等于所述预设误差值;an adjusting subunit, configured to adjust, if the classification error value is greater than the preset error value, the weight values of the initial-state convolution kernel until the classification error value is less than or equal to the preset error value;
    第二确定子单元,用于如果所述分类误差值小于等于所述预设误差值,则训练结束,并将当前的卷积核作为所述第一卷积核。And a second determining subunit, configured to end the training if the classification error value is less than or equal to the preset error value, and use the current convolution kernel as the first convolution kernel.
  10. 根据权利要求6至9任一项所述的装置,其特征在于,所述第一卷积核满足如下公式:The apparatus according to any one of claims 6 to 9, wherein said first convolution kernel satisfies the following formula:
    n = n_oc * (m - n_conv) / stride
    其中,m为卷积神经网络中输入向量维度,n为输出向量维度,stride为所述第一卷积核的步长,noc为输出通道个数,nconv为所述第一卷积核大小。Where m is the input vector dimension in the convolutional neural network, n is the output vector dimension, stride is the stride of the first convolution kernel, n_oc is the number of output channels, and n_conv is the size of the first convolution kernel.
  11. 一种机器人,其特征在于,包括权利要求6至10任一项所述的图片分类装置。 A robot comprising the picture classification device according to any one of claims 6 to 10.
PCT/CN2017/092044 2016-12-29 2017-07-06 Picture classification method, device and robot WO2018120740A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611266567.9 2016-12-29
CN201611266567.9A CN108256544B (en) 2016-12-29 2016-12-29 Picture classification method and device, robot

Publications (1)

Publication Number Publication Date
WO2018120740A1 true WO2018120740A1 (en) 2018-07-05

Family

ID=62707761

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/092044 WO2018120740A1 (en) 2016-12-29 2017-07-06 Picture classification method, device and robot

Country Status (2)

Country Link
CN (1) CN108256544B (en)
WO (1) WO2018120740A1 (en)

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614856A (en) * 2018-10-31 2019-04-12 西安理工大学 Fungi image classification method based on convolutional neural networks
CN109671020A (en) * 2018-12-17 2019-04-23 北京旷视科技有限公司 Image processing method, device, electronic equipment and computer storage medium
CN109800817A (en) * 2019-01-25 2019-05-24 西安电子科技大学 Image classification method based on fusion Semantic Neural Network
CN109828251A (en) * 2019-03-07 2019-05-31 中国人民解放军海军航空大学 Radar target identification method based on feature pyramid light weight convolutional neural networks
CN109858261A (en) * 2019-01-18 2019-06-07 芜湖智久机器人有限公司 A kind of data storage medium, encryption method
CN110222718A (en) * 2019-05-09 2019-09-10 华为技术有限公司 The method and device of image procossing
CN110263965A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Method for early warning, device, computer equipment and storage medium based on video
CN110298394A (en) * 2019-06-18 2019-10-01 中国平安财产保险股份有限公司 A kind of image-recognizing method and relevant apparatus
CN110298346A (en) * 2019-05-23 2019-10-01 平安科技(深圳)有限公司 Image-recognizing method, device and computer equipment based on divisible convolutional network
CN110874627A (en) * 2018-09-04 2020-03-10 华为技术有限公司 Data processing method, data processing apparatus, and computer readable medium
CN110874556A (en) * 2018-09-04 2020-03-10 上海集光安防科技股份有限公司 License plate detecting system in natural scene based on deep learning
CN111046933A (en) * 2019-12-03 2020-04-21 东软集团股份有限公司 Image classification method and device, storage medium and electronic equipment
CN111079639A (en) * 2019-12-13 2020-04-28 中国平安财产保险股份有限公司 Method, device and equipment for constructing garbage image classification model and storage medium
CN111160517A (en) * 2018-11-07 2020-05-15 杭州海康威视数字技术股份有限公司 Convolutional layer quantization method and device of deep neural network
CN111192334A (en) * 2020-01-02 2020-05-22 苏州大学 Trainable compressed sensing module and image segmentation method
CN111191583A (en) * 2019-12-30 2020-05-22 郑州科技学院 Space target identification system and method based on convolutional neural network

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109254946B (en) * 2018-08-31 2021-09-17 郑州云海信息技术有限公司 Image feature extraction method, device and equipment and readable storage medium
CN113302657B (en) * 2018-11-16 2024-04-26 华为技术有限公司 Neural network compression method and device
CN110378372A (en) * 2019-06-11 2019-10-25 中国科学院自动化研究所南京人工智能芯片创新研究院 Diagram data recognition methods, device, computer equipment and storage medium
CN110309837B (en) * 2019-07-05 2021-07-06 北京迈格威科技有限公司 Data processing method and image processing method based on convolutional neural network characteristic diagram
CN112580772B (en) * 2019-09-30 2024-04-26 华为技术有限公司 Compression method and device for convolutional neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679206A (en) * 2013-12-24 2014-03-26 Tcl集团股份有限公司 Image classification method and device
US20140180989A1 (en) * 2012-12-24 2014-06-26 Google Inc. System and method for parallelizing convolutional neural networks
CN104268521A (en) * 2014-09-23 2015-01-07 朱毅 Image recognition method based on convolutional neural network in non-finite category
CN105868785A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Image identification method based on convolutional neural network and image identification system thereof
CN106250911A (en) * 2016-07-20 2016-12-21 南京邮电大学 Picture classification method based on convolutional neural networks

Cited By (63)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112106034A (en) * 2018-07-13 2020-12-18 华为技术有限公司 Convolution method and device for neural network
CN112106034B (en) * 2018-07-13 2024-05-24 华为技术有限公司 Convolution method and device for neural network
CN110874627A (en) * 2018-09-04 2020-03-10 华为技术有限公司 Data processing method, data processing apparatus, and computer readable medium
CN110874556B (en) * 2018-09-04 2024-02-09 上海集光安防科技股份有限公司 License plate detection system in natural scene based on deep learning
CN110874556A (en) * 2018-09-04 2020-03-10 上海集光安防科技股份有限公司 License plate detecting system in natural scene based on deep learning
CN109614856A (en) * 2018-10-31 2019-04-12 西安理工大学 Fungi image classification method based on convolutional neural networks
CN111160517B (en) * 2018-11-07 2024-02-06 杭州海康威视数字技术股份有限公司 Convolutional layer quantization method and device for deep neural network
CN111160517A (en) * 2018-11-07 2020-05-15 杭州海康威视数字技术股份有限公司 Convolutional layer quantization method and device of deep neural network
CN109671020A (en) * 2018-12-17 2019-04-23 北京旷视科技有限公司 Image processing method, device, electronic equipment and computer storage medium
CN109671020B (en) * 2018-12-17 2023-10-24 北京旷视科技有限公司 Image processing method, device, electronic equipment and computer storage medium
CN109858261A (en) * 2019-01-18 2019-06-07 芜湖智久机器人有限公司 Data storage medium and encryption method
CN109800817B (en) * 2019-01-25 2023-03-24 西安电子科技大学 Image classification method based on fusion semantic neural network
CN109800817A (en) * 2019-01-25 2019-05-24 西安电子科技大学 Image classification method based on fusion Semantic Neural Network
CN109828251A (en) * 2019-03-07 2019-05-31 中国人民解放军海军航空大学 Radar target identification method based on feature pyramid lightweight convolutional neural network
CN110263965A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Method for early warning, device, computer equipment and storage medium based on video
CN110222718A (en) * 2019-05-09 2019-09-10 华为技术有限公司 The method and device of image procossing
CN110222718B (en) * 2019-05-09 2023-11-03 华为技术有限公司 Image processing method and device
CN110298346A (en) * 2019-05-23 2019-10-01 平安科技(深圳)有限公司 Image-recognizing method, device and computer equipment based on divisible convolutional network
CN110298394A (en) * 2019-06-18 2019-10-01 中国平安财产保险股份有限公司 Image recognition method and related apparatus
CN110298394B (en) * 2019-06-18 2024-04-05 中国平安财产保险股份有限公司 Image recognition method and related device
CN111783813B (en) * 2019-11-20 2024-04-09 北京沃东天骏信息技术有限公司 Image evaluation method, training image model method, device, equipment and medium
CN111783813A (en) * 2019-11-20 2020-10-16 北京沃东天骏信息技术有限公司 Image evaluation method, image model training device, image model training equipment and medium
CN111046933B (en) * 2019-12-03 2024-03-05 东软集团股份有限公司 Image classification method, device, storage medium and electronic equipment
CN111046933A (en) * 2019-12-03 2020-04-21 东软集团股份有限公司 Image classification method and device, storage medium and electronic equipment
CN111079639A (en) * 2019-12-13 2020-04-28 中国平安财产保险股份有限公司 Method, device and equipment for constructing garbage image classification model and storage medium
CN111079639B (en) * 2019-12-13 2023-09-19 中国平安财产保险股份有限公司 Method, device, equipment and storage medium for constructing garbage image classification model
CN111191583A (en) * 2019-12-30 2020-05-22 郑州科技学院 Space target identification system and method based on convolutional neural network
CN111191583B (en) * 2019-12-30 2023-08-25 郑州科技学院 Space target recognition system and method based on convolutional neural network
CN111192334A (en) * 2020-01-02 2020-05-22 苏州大学 Trainable compressed sensing module and image segmentation method
CN111242228A (en) * 2020-01-16 2020-06-05 武汉轻工大学 Hyperspectral image classification method, device, equipment and storage medium
CN111242228B (en) * 2020-01-16 2024-02-27 武汉轻工大学 Hyperspectral image classification method, hyperspectral image classification device, hyperspectral image classification equipment and storage medium
CN111339871A (en) * 2020-02-18 2020-06-26 中国电子科技集团公司第二十八研究所 Target group distribution pattern studying and judging method and device based on convolutional neural network
CN111339871B (en) * 2020-02-18 2022-09-16 中国电子科技集团公司第二十八研究所 Target group distribution pattern studying and judging method and device based on convolutional neural network
CN111382791A (en) * 2020-03-07 2020-07-07 北京迈格威科技有限公司 Deep learning task processing method, image recognition task processing method and device
CN111382791B (en) * 2020-03-07 2023-12-26 北京迈格威科技有限公司 Deep learning task processing method, image recognition task processing method and device
CN111402217B (en) * 2020-03-10 2023-10-31 广州视源电子科技股份有限公司 Image grading method, device, equipment and storage medium
CN111402217A (en) * 2020-03-10 2020-07-10 广州视源电子科技股份有限公司 Image grading method, device, equipment and storage medium
CN111523561A (en) * 2020-03-19 2020-08-11 深圳市彬讯科技有限公司 Image style recognition method and device, computer equipment and storage medium
CN111428033B (en) * 2020-03-20 2023-04-07 北京邮电大学 Automatic threat information extraction method based on double-layer convolutional neural network
CN111428033A (en) * 2020-03-20 2020-07-17 北京邮电大学 Automatic threat information extraction method based on double-layer convolutional neural network
CN111681292A (en) * 2020-05-18 2020-09-18 陕西科技大学 Task fMRI brain decoding and visualization method based on convolutional neural network
CN111681292B (en) * 2020-05-18 2023-04-07 陕西科技大学 Task fMRI brain decoding and visualization method based on convolutional neural network
CN111881729A (en) * 2020-06-16 2020-11-03 深圳数联天下智能科技有限公司 Living body flow direction discrimination method, device, equipment and storage medium based on thermal imaging
CN111881729B (en) * 2020-06-16 2024-02-06 深圳数联天下智能科技有限公司 Living body flow direction screening method, device, equipment and storage medium based on thermal imaging
CN111967315A (en) * 2020-07-10 2020-11-20 华南理工大学 Human body comprehensive information acquisition method based on face recognition and infrared detection
CN111967315B (en) * 2020-07-10 2023-08-22 华南理工大学 Human body comprehensive information acquisition method based on face recognition and infrared detection
CN112102239A (en) * 2020-08-10 2020-12-18 北京工业大学 Image processing method and system for full-layer brain CT image
CN112102239B (en) * 2020-08-10 2024-05-21 北京工业大学 Image processing method and system for full-layer brain CT image
CN112052758A (en) * 2020-08-25 2020-12-08 西安电子科技大学 Hyperspectral image classification method based on attention mechanism and recurrent neural network
CN112052758B (en) * 2020-08-25 2023-05-23 西安电子科技大学 Hyperspectral image classification method based on attention mechanism and recurrent neural network
CN112529973A (en) * 2020-10-13 2021-03-19 重庆英卡电子有限公司 Animal identification algorithm for snap-shot picture of field self-powered animal
CN112529973B (en) * 2020-10-13 2023-06-02 重庆英卡电子有限公司 Method for identifying field self-powered animal snap-shot pictures
CN113239899B (en) * 2021-06-17 2024-05-28 阿波罗智联(北京)科技有限公司 Method for processing image and generating convolution kernel, road side equipment and cloud control platform
CN113657421B (en) * 2021-06-17 2024-05-28 中国科学院自动化研究所 Convolutional neural network compression method and device, and image classification method and device
CN113239899A (en) * 2021-06-17 2021-08-10 阿波罗智联(北京)科技有限公司 Method for processing image and generating convolution kernel, road side equipment and cloud control platform
CN113657421A (en) * 2021-06-17 2021-11-16 中国科学院自动化研究所 Convolutional neural network compression method and device and image classification method and device
CN113807363A (en) * 2021-09-08 2021-12-17 西安电子科技大学 Image classification method based on lightweight residual network
CN113807363B (en) * 2021-09-08 2024-04-19 西安电子科技大学 Image classification method based on lightweight residual network
CN114297940A (en) * 2021-12-31 2022-04-08 合肥工业大学 Method and device for determining unsteady reservoir parameters
CN114297940B (en) * 2021-12-31 2024-05-07 合肥工业大学 Method and device for determining unsteady state reservoir parameters
CN116050474A (en) * 2022-12-29 2023-05-02 上海天数智芯半导体有限公司 Convolution calculation method, SOC chip, electronic equipment and storage medium
CN116718894B (en) * 2023-06-19 2024-03-29 上饶市广强电子科技有限公司 Circuit stability test method and system for corn lamp
CN116718894A (en) * 2023-06-19 2023-09-08 上饶市广强电子科技有限公司 Circuit stability test method and system for corn lamp

Also Published As

Publication number Publication date
CN108256544A (en) 2018-07-06
CN108256544B (en) 2019-07-23

Similar Documents

Publication Publication Date Title
WO2018120740A1 (en) Picture classification method, device and robot
US11516473B2 (en) Bandwidth compression for neural network systems
WO2020177651A1 (en) Image segmentation method and image processing device
CN110163197B (en) Target detection method, target detection device, computer-readable storage medium and computer equipment
Sameen et al. Classification of very high resolution aerial photos using spectral‐spatial convolutional neural networks
CN113469073B (en) SAR image ship detection method and system based on lightweight deep learning
EP3333768A1 (en) Method and apparatus for detecting target
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
US9710697B2 (en) Method and system for exacting face features from data of face images
CN112308152B (en) Hyperspectral image ground object classification method based on spectrum segmentation and homogeneous region detection
CN109615614B (en) Method for extracting blood vessels in fundus image based on multi-feature fusion and electronic equipment
CN113052216B (en) Oil spill hyperspectral image detection method based on two-way graph U-NET convolutional network
EP4283876A1 (en) Data coding method and related device
CN110287938B (en) Event identification method, system, device and medium based on key fragment detection
CN115565045A (en) Hyperspectral and multispectral image fusion method based on multi-scale space-spectral transformation
CN112801063A (en) Neural network system and image crowd counting method based on neural network system
WO2022228142A1 (en) Object density determination method and apparatus, computer device and storage medium
CN111242228A (en) Hyperspectral image classification method, device, equipment and storage medium
CN108985346B (en) Existing exploration image retrieval method fusing low-level image features and CNN features
CN114612709A (en) Multi-scale target detection method guided by image pyramid characteristics
Jiang et al. Semantic segmentation network combined with edge detection for building extraction in remote sensing images
Demir et al. Phase correlation based redundancy removal in feature weighting band selection for hyperspectral images
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN114782821B (en) Coastal wetland vegetation remote sensing identification method combined with multiple migration learning strategies
WO2023061465A1 (en) Methods, systems, and media for computer vision using 2d convolution of 4d video data tensors

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 17886242

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 24.10.2019)

122 Ep: PCT application non-entry in European phase

Ref document number: 17886242

Country of ref document: EP

Kind code of ref document: A1