CN112183672A - Image classification method, and training method and device of feature extraction network

Image classification method, and training method and device of feature extraction network

Info

Publication number
CN112183672A
CN112183672A · CN202011227392.7A
Authority
CN
China
Prior art keywords
feature
feature extraction
image
features
layer
Prior art date
Legal status
Pending
Application number
CN202011227392.7A
Other languages
Chinese (zh)
Inventor
苏驰
李凯
刘弘也
王育林
Current Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202011227392.7A
Publication of CN112183672A


Classifications

    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G PHYSICS; G06 COMPUTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/24 Classification techniques)
    • G06F18/253 — Fusion techniques of extracted features (G06F18/25 Fusion techniques)


Abstract

The invention provides an image classification method and a training method and apparatus for a feature extraction network. A target image is input into a pre-trained feature extraction network, which outputs image features of the target image; the category of the target image is determined based on the image features. The feature extraction network comprises a plurality of cascaded feature extraction layers, each of which outputs the hierarchical features corresponding to its own level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels. Because the image features contain hierarchical features from at least two levels, the feature hierarchy contained in the image features is richer, and the method can handle image recognition in complex scenes such as live broadcast, so sensitive images can be identified accurately and effectively and the miss rate and false-detection rate for sensitive images are reduced.

Description

Image classification method, and training method and device of feature extraction network
Technical Field
The invention relates to the technical field of image processing, and in particular to an image classification method and a training method and apparatus for a feature extraction network.
Background
To supervise the content of videos transmitted over a network, sensitive images, such as vulgar, pornographic, violent, and horror images, need to be identified in the video. In the related art, sensitive images can be identified through deep learning: high-level semantic features of an image are extracted by a pre-trained detection model, and the image is then classified based on those high-level semantic features. In a live broadcast scene, however, video images come from diverse sources and have complex content, and the above method struggles to identify sensitive images accurately and effectively, so the miss rate and false-detection rate for sensitive images are high.
Disclosure of Invention
In view of the above, the present invention provides an image classification method, a training method for a feature extraction network, and an apparatus thereof, so as to improve the accuracy of identifying a sensitive image.
In a first aspect, an embodiment of the present invention provides an image classification method, where the method includes: inputting the target image into a pre-trained feature extraction network, and outputting the image features of the target image; determining a category of the target image based on the image features; the feature extraction network comprises a plurality of cascaded feature extraction layers; each level feature extraction layer is used for outputting level features corresponding to the current level; the image features are obtained by fusing the hierarchical features corresponding to at least two hierarchies.
Furthermore, the feature extraction network also comprises at least two feature processing modules and a feature fusion module; each feature processing module is connected with a hierarchical feature extraction layer; the feature extraction layers connected with any two feature processing modules are different; inputting a target image into a pre-trained feature extraction network, and outputting the image features of the target image, wherein the step comprises the following steps: processing the hierarchical features output by the feature extraction layer connected with the feature processing module based on an attention mechanism through each feature processing module, and outputting intermediate features; and fusing the intermediate features output by each feature processing module through the feature fusion module to obtain the image features.
Further, the at least two feature processing modules include: the system comprises a first characteristic processing module, a second characteristic processing module and a third characteristic processing module; the first feature processing module is connected with the feature extraction layer of the lowest level; the feature extraction layer of the lowest level is used for inputting a target image; the second feature processing module is connected with the specified feature extraction layer in the middle level; the third feature processing module is connected with the feature extraction layer of the highest level.
Further, the feature processing module comprises a pooling layer, a first full-connection layer and a feature multiplication layer; the step of processing, based on an attention mechanism, the hierarchical features output by the feature extraction layer connected with the feature processing module, and outputting intermediate features, comprises the following steps: performing first pooling on the input hierarchical features through the pooling layer, and outputting a first pooling result; performing first full-connection processing on the first pooling result through the first full-connection layer, and outputting a first full-connection result; multiplying the input hierarchical features by the first full-connection result through the feature multiplication layer to obtain a multiplication result; and outputting the intermediate feature based on the multiplication result.
Further, the feature processing module further comprises a spatial pyramid pooling layer; the spatial pyramid pooling layer is connected with the feature multiplication layer; a step of outputting an intermediate feature based on the multiplication result, including: and performing second pooling on the multiplication result through the spatial pyramid pooling layer, and outputting the intermediate feature of the specified dimension.
Further, the feature fusion module comprises a feature splicing layer, a second full connection layer and a third full connection layer; through the feature fusion module, fusing the intermediate features output by each feature processing module to obtain the image features, wherein the step comprises the following steps of: splicing the intermediate features output by each feature processing module through the feature splicing layer, and outputting splicing features; performing second full-connection processing on the splicing characteristics through a second full-connection layer, and outputting a second full-connection result; and performing third full-connection processing on the second full-connection result through a third full-connection layer, and outputting image characteristics.
Further, the step of determining the category of the target image based on the image features comprises: inputting the image features into a preset normalized exponential function, and outputting a probability distribution vector, where the probability distribution vector comprises a plurality of categories and the probability value corresponding to each category; and determining the category corresponding to the maximum probability value in the probability distribution vector as the category of the target image.
In a second aspect, an embodiment of the present invention provides a method for training a feature extraction network, where the method includes: determining a training sample based on a preset sample set; the training sample comprises a sample image and a class label of the sample image; inputting the sample image into a feature extraction network to obtain sample features of the sample image; the feature extraction network comprises a plurality of cascaded feature extraction layers; each level feature extraction layer is used for outputting level features corresponding to the current level; the sample characteristics are obtained by fusing the hierarchical characteristics corresponding to at least two hierarchies; determining a category identification result of the sample image based on the sample characteristics; determining a loss value based on a preset classification loss function, a class label and a class identification result; updating the network parameters of the feature extraction network based on the loss values; and continuing to execute the step of determining the training sample based on the preset sample set until the loss value is converged to obtain the trained feature extraction network.
In a third aspect, an embodiment of the present invention provides an image classification apparatus, including: the output module is used for inputting the target image into a pre-trained feature extraction network and outputting the image features of the target image; the classification module is used for determining the category of the target image based on the image characteristics; the feature extraction network comprises a plurality of cascaded feature extraction layers; each level feature extraction layer is used for outputting level features corresponding to the current level; the image features are obtained by fusing the hierarchical features corresponding to at least two hierarchies.
In a fourth aspect, an embodiment of the present invention provides a training apparatus for a feature extraction network, where the apparatus includes: the sample determining module is used for determining a training sample based on a preset sample set; the training sample comprises a sample image and a class label of the sample image; the image input module is used for inputting the sample image into the feature extraction network to obtain the sample feature of the sample image; the feature extraction network comprises a plurality of cascaded feature extraction layers; each level feature extraction layer is used for outputting level features corresponding to the current level; the sample characteristics are obtained by fusing the hierarchical characteristics corresponding to at least two hierarchies; the parameter updating module is used for determining a category identification result of the sample image based on the sample characteristics; determining a loss value based on a preset classification loss function, a class label and a class identification result; updating the network parameters of the feature extraction network based on the loss values; and the network determining module is used for continuously executing the step of determining the training sample based on the preset sample set until the loss value is converged to obtain the trained feature extraction network.
In a fifth aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to implement the image classification method of the first aspect or the training method of the feature extraction network of the second aspect.
In a sixth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the image classification method of the first aspect or the training method of the feature extraction network of the second aspect.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides an image classification method and a training method and apparatus for a feature extraction network. A target image is input into a pre-trained feature extraction network, which outputs image features of the target image; the category of the target image is determined based on the image features. The feature extraction network comprises a plurality of cascaded feature extraction layers, each of which outputs the hierarchical features corresponding to its own level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels. Because the image features contain hierarchical features from at least two levels, the feature hierarchy contained in the image features is richer, and the method can handle image recognition in complex scenes such as live broadcast, so sensitive images can be identified accurately and effectively and the miss rate and false-detection rate for sensitive images are reduced.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of an image classification method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another image classification method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a feature extraction network according to an embodiment of the present invention;
Fig. 4 is a flowchart of a training method for a feature extraction network according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an image classification apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a training apparatus for a feature extraction network according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the development of network technology and intelligent mobile platforms, live broadcast and mobile live broadcast have become part of people's daily lives. If the videos transmitted over a network are not supervised, they can easily become a channel for spreading obscene, pornographic, and violent content, harming large numbers of internet users. To supervise the content of videos propagated over a network, sensitive images need to be identified in the videos. However, because the number of live broadcast platforms is huge, manual supervision is time-consuming and labor-intensive and incurs great cost. Traditional methods identify sensitive images through feature matching algorithms, but live broadcast environments are diverse, with strong illumination changes, low resolution, and large differences in human posture, so a simple feature matching algorithm cannot classify accurately. In addition, when the training sample size is too small and the training method too simple, sensitive images with complex and varied content cannot truly be recognized.
In the related art, sensitive images can be identified through deep learning methods such as convolutional neural networks, which have achieved good results in the field of image recognition: a pre-trained detection model extracts high-level semantic features of an image, and the image is then classified based on those high-level semantic features. However, in a live broadcast scene, video images have diverse sources and complex content, and the above method struggles to identify sensitive images accurately and effectively, so the miss rate and false-detection rate for sensitive images are high. Based on this, the image classification method and the training method and apparatus for a feature extraction network provided by the embodiments of the invention can be applied to devices such as mobile phones and computers, and in particular to devices with network live broadcast or network video playing functions.
To facilitate understanding of the present embodiment, a detailed description will be first given of an image classification method disclosed in the present embodiment, as shown in fig. 1, the method includes the following steps:
step S102, inputting a target image into a pre-trained feature extraction network, and outputting image features of the target image;
the target image may be a video image propagated through a network, a video image in a live webcast platform, or the like, for example, a live scene image, which is generally an image containing a person; the pre-trained feature extraction Network may be a Network model such as CNN (Convolutional Neural Networks), RNN (Recurrent Neural Networks), DNN (Deep Neural Networks), and the like; the network may generally comprise a plurality of layers of convolutional networks, and may further comprise a plurality of activation functions, etc. The image features generally include multi-level features of the target image, and may include one or more of, for example, underlying (color, texture, etc.) features, intermediate (shape, etc.) features, or higher (semantic, etc.) features of the target image. The image features may be feature vectors.
Specifically, the target image may be represented as X ∈ R^(H×W×3), where H denotes the height of the target image, W its width, and 3 indicates that the target image is a three-channel image. The target image X of size H×W×3 is input into the feature extraction network, and hierarchical features corresponding to at least two levels are extracted through the multiple levels of feature extraction layers; the hierarchical features are at least two of the low-level (color, texture, etc.), mid-level (shape, action, etc.), or high-level (semantic, etc.) features of the target image. The at least two hierarchical features are then fused to obtain the image features; the feature fusion can specifically be performed by splicing or the like.
Step S104, determining the category of the target image based on the image characteristics; the feature extraction network comprises a plurality of cascaded feature extraction layers; each level feature extraction layer is used for outputting level features corresponding to the current level; the image features are obtained by fusing the hierarchical features corresponding to at least two hierarchies.
The categories of target images can include various types, such as normal images and sensitive images; or normal images, vulgar images, pornographic images, and violent images, where vulgar, pornographic, and violent images all belong to sensitive images. In actual implementation, the image features include multi-level features of the target image, such as the colors of the background and of persons in the target image, the shapes of objects, and the semantics of text; the probability of the target image belonging to each category can therefore be calculated from the image features, and the category of the target image finally determined from the calculated per-category probabilities. The category of the target image can also be obtained from the image features through a classifier.
There are at least two feature extraction layers, and there may be three, four, or more. Generally, the more feature extraction layers there are, the richer the hierarchical features finally extracted from the target image and the better the performance; at the same time, feature extraction takes longer and runs slower. The number of feature extraction layers can therefore be set according to the classification speed and accuracy requirements of the actual application. Each feature extraction layer may include multiple convolutional networks (which may also be called convolutional layers) and multiple activation functions. The convolutional layers perform convolution operations, whose purpose is to extract different features of the target image: the first convolutional layer may only extract low-level image features such as edges, lines, and corners, while deeper convolutional layers iteratively extract more complex image features from these low-level features. Thus, across multiple levels of feature extraction layers, the image information in the hierarchical features output by each level differs. For example, the hierarchical features output by a low-level feature extraction layer include features such as the background color and texture of the target image; the hierarchical features output by a mid-level feature extraction layer include features such as the shape of each object, the actions of persons, skin color, and regions in the target image; and the hierarchical features output by a high-level feature extraction layer include features such as the semantics of text in the target image. The hierarchical features corresponding to the different levels are fused to obtain the image features of the target image. In this way, multiple kinds of hierarchical features can be extracted, making the finally fused image features richer.
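As an illustration of the cascaded structure described above, the following is a minimal sketch in PyTorch; the class names, channel widths, and the use of strided 3×3 convolutions with ReLU activations are assumptions for the example, not specifics from the patent.

    import torch
    import torch.nn as nn

    class FeatureExtractionLayer(nn.Module):
        """One cascade stage: convolutions + activations, halving the resolution."""
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.block(x)

    class CascadedExtractor(nn.Module):
        """Five cascaded feature extraction layers; the forward pass keeps every
        stage's output as that level's hierarchical feature."""
        def __init__(self):
            super().__init__()
            chs = [3, 32, 64, 128, 256, 512]   # assumed channel widths
            self.stages = nn.ModuleList(
                FeatureExtractionLayer(chs[i], chs[i + 1]) for i in range(5)
            )

        def forward(self, x):                  # x: (B, 3, H, W) target image
            feats = []
            for stage in self.stages:
                x = stage(x)
                feats.append(x)                # hierarchical feature of this level
            return feats                       # [f1, f2, f3, f4, f5]

In such a sketch, the outputs of stages 1, 3, and 5 would play the roles of the lowest-level, specified intermediate-level, and highest-level hierarchical features used by the feature processing modules described below.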
The image classification method provided by the embodiment of the invention inputs a target image into a pre-trained feature extraction network and outputs the image features of the target image; the category of the target image is determined based on the image features. The feature extraction network comprises a plurality of cascaded feature extraction layers, each of which outputs the hierarchical features corresponding to its own level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels. Because the image features contain hierarchical features from at least two levels, the feature hierarchy contained in the image features is richer, and the method can handle image recognition in complex scenes such as live broadcast, so sensitive images can be identified accurately and effectively and the miss rate and false-detection rate for sensitive images are reduced.
This embodiment further provides another image classification method, implemented on the basis of the above embodiment. This embodiment mainly describes the specific implementation of the step of inputting a target image into a pre-trained feature extraction network and outputting the image features of the target image (implemented by steps S202 to S204), and the specific implementation of the step of determining the category of the target image based on the image features (implemented by steps S206 to S208).
In this embodiment, the feature extraction network further includes at least two feature processing modules and a feature fusion module. Each feature processing module is connected to the feature extraction layer of one level, and the feature extraction layers connected to any two feature processing modules are different. The feature processing module further processes the hierarchical features output by the corresponding feature extraction layer to obtain more accurate, more discriminative features; the feature fusion module fuses the features output by the feature processing modules. The number of feature processing modules is typically less than or equal to the number of feature extraction layers.
As shown in fig. 2, the method comprises the steps of:
step S202, processing the hierarchical features output by the feature extraction layer connected with the feature processing module based on an attention mechanism through each feature processing module, and outputting intermediate features;
in order to enable the hierarchical features output by the feature extraction layer to be more accurate and the feature information to be more obvious, the hierarchical features can be processed through the feature processing module. The attention mechanism is similar to that of different parts of human retina, and has different degrees of information processing capacity, namely, a target feature needing important attention is obtained by scanning hierarchical features output by a feature extraction layer connected with a feature processing module, then more attention resources are invested in the feature, more detailed information related to the target feature is obtained, and other irrelevant information is ignored. By the mechanism, high-value features can be quickly screened out from a large amount of information of the hierarchical features by using limited attention resources, and then intermediate features are output.
The at least two feature processing modules include: the system comprises a first characteristic processing module, a second characteristic processing module and a third characteristic processing module; the first feature processing module is connected with the feature extraction layer of the lowest level; the feature extraction layer of the lowest level is used for inputting a target image; the second feature processing module is connected with the specified feature extraction layer in the middle level; the third feature processing module is connected with the feature extraction layer of the highest level.
Referring to a schematic structural diagram of the feature extraction network shown in fig. 3, the feature extraction network is described by taking an example that the feature extraction network includes five feature extraction layers connected in series, namely, a feature extraction layer 1, a feature extraction layer 2, a feature extraction layer 3, a feature extraction layer 4, and a feature extraction layer 5; wherein the feature extraction layer 1 corresponds to the feature extraction layer of the lowest level; the feature extraction layer 2, the feature extraction layer 3 and the feature extraction layer 4 correspond to feature extraction layers in the intermediate hierarchy; the feature extraction layer 5 corresponds to the feature extraction layer of the highest hierarchy. In addition, the feature extraction network comprises three feature processing modules, namely a first feature processing module, a second feature processing module and a third feature processing module; respectively connected with the feature extraction layer 1, the feature extraction layer 3 and the feature extraction layer 5.
In addition, as shown in fig. 3, the feature processing module includes a pooling layer, a first full-connection layer, and a feature multiplication layer;
The pooling layer, which may also be called a down-sampling layer, is mainly used to down-sample the input features and thereby reduce the number of parameters. The first full-connection layer (Fully Connected layer, FC for short) plays the role of a "classifier" in the whole convolutional neural network: it performs a weighted sum of the features output by the previous layer and maps the feature space to the sample label space through a linear transformation. The feature multiplication layer (multiply) is mainly used to multiply the hierarchical features by the features output by the first full-connection layer.
One possible implementation:
performing first pooling on the input hierarchical features through a pooling layer, and outputting a first pooling result; performing first full-connection processing on the first pooling result through the first full-connection layer, and outputting a first full-connection result; multiplying the input hierarchical features and the first full-connection result through a feature multiplication layer to obtain a multiplication result; and outputting the intermediate feature based on the multiplication result.
Specifically, referring to the data flow shown in fig. 3 and taking the first feature processing module as an example: first, the target image X ∈ R^(H×W×3) of size H×W×3 is input into the lowest-level feature extraction layer of the feature extraction network to obtain the hierarchical feature, i.e., a feature matrix f1 ∈ R^(h1×w1×c1), where h1 denotes the height of the feature matrix, w1 its width, and c1 its number of channels. The hierarchical feature f1 ∈ R^(h1×w1×c1) is input into the pooling layer of the first feature processing module for the first pooling, which outputs the first pooling result, i.e., a feature vector f1′ ∈ R^c1. The first pooling result f1′ ∈ R^c1 is input into the first full-connection layer for the first full-connection processing, which outputs the first full-connection result f1″ ∈ R^c1. The hierarchical feature f1 ∈ R^(h1×w1×c1) is then multiplied by the first full-connection result f1″ ∈ R^c1 through the feature multiplication layer to obtain the multiplication result f1‴ ∈ R^(h1×w1×c1). Finally, the intermediate feature is output based on the multiplication result; this intermediate feature can represent the low-level features of the target image.
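A minimal sketch of this pooling → full-connection → channel-wise multiplication step follows; the use of global average pooling for the first pooling and of a sigmoid gate on the full-connection result are assumptions, since the patent text only names the pooling, full-connection, and feature multiplication layers.

    import torch
    import torch.nn as nn

    class FeatureProcessingAttention(nn.Module):
        """Pooling layer -> first full-connection layer -> feature multiplication."""
        def __init__(self, channels: int):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)      # first pooling (assumed: global average)
            self.fc = nn.Linear(channels, channels)  # first full-connection layer
            self.gate = nn.Sigmoid()                 # assumed gate on the attention weights

        def forward(self, f):                        # f: (B, c1, h1, w1) hierarchical feature
            b, c, _, _ = f.shape
            w = self.pool(f).view(b, c)              # first pooling result f1'
            w = self.gate(self.fc(w))                # first full-connection result f1''
            return f * w.view(b, c, 1, 1)            # multiplication result f1'''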
Referring to fig. 3, the feature processing module further includes a spatial pyramid pooling layer; the spatial pyramid pooling layer is connected with the feature multiplication layer; the Spatial Pyramid Pooling (SPP) layer mainly functions to process different hierarchical features to obtain features with the same dimension.
One possible implementation of the step of outputting the intermediate feature based on the multiplication result is: performing second pooling on the multiplication result through the spatial pyramid pooling layer, and outputting the intermediate feature of the specified dimension. The specified dimension can be set according to the actual application.
After the multiplication result f1‴ ∈ R^(h1×w1×c1) is obtained by the above method, it can be input into the spatial pyramid pooling layer, which performs the second pooling on the multiplication result and outputs the intermediate feature of the specified dimension, i.e., a feature vector f1″″ ∈ R^c.
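The sketch below illustrates a spatial pyramid pooling layer of this kind: pooling over fixed grids (here 1×1, 2×2, and 4×4, an assumed choice) makes the output length independent of the input's spatial size, and a final linear projection (also an assumption, used so that every feature processing module emits a vector of the same specified dimension c) maps it to out_dim.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SPPLayer(nn.Module):
        """Second pooling: spatial pyramid pooling to a fixed-dimension vector."""
        def __init__(self, channels: int, out_dim: int, levels=(1, 2, 4)):
            super().__init__()
            self.levels = levels
            flat = channels * sum(l * l for l in levels)
            self.proj = nn.Linear(flat, out_dim)     # assumed projection to dimension c

        def forward(self, x):                        # x: (B, C, H, W) multiplication result
            b = x.size(0)
            pooled = [F.adaptive_max_pool2d(x, l).view(b, -1) for l in self.levels]
            return self.proj(torch.cat(pooled, dim=1))  # intermediate feature, (B, out_dim)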
Similarly, taking the second feature processing module as an example, the hierarchical feature f3 ∈ R^(h3×w3×c3) output by the specified intermediate-level feature extraction layer 3 is input into the second feature processing module, which outputs the second intermediate feature f3″″ ∈ R^c; taking the third feature processing module as an example, the hierarchical feature f5 ∈ R^(h5×w5×c5) output by the highest-level feature extraction layer 5 is input into the third feature processing module, which outputs the third intermediate feature f5″″ ∈ R^c. The spatial pyramid pooling layers in the first, second, and third feature processing modules each output intermediate features of the same dimension.
Step S204, fusing the intermediate features output by each feature processing module through a feature fusion module to obtain image features;
specifically, the intermediate features output by each feature processing module, i.e., the first feature processing module, the second feature processing module, and the third feature processing module, are input to the feature fusion module, and each intermediate feature is fused by means of feature splicing and the like to obtain a multi-level fusion feature, i.e., the image feature.
Referring to the structure of the feature extraction network shown in fig. 3, the feature fusion module comprises a feature splicing layer, a second full-connection layer, and a third full-connection layer. The feature splicing layer (concatenate) is mainly used to splice the intermediate features f1″″ ∈ R^c, f3″″ ∈ R^c, and f5″″ ∈ R^c; the second and third full-connection layers function the same as the first full-connection layer.
One possible implementation of the above step of fusing, through the feature fusion module, the intermediate features output by each feature processing module to obtain the image features is:
splicing the intermediate features output by each feature processing module through the feature splicing layer, and outputting splicing features; performing second full-connection processing on the splicing characteristics through a second full-connection layer, and outputting a second full-connection result; and performing third full-connection processing on the second full-connection result through a third full-connection layer, and outputting image characteristics.
Specifically, the intermediate features f1″″ ∈ R^c, f3″″ ∈ R^c, and f5″″ ∈ R^c of the target image X are spliced, and the splicing feature f ∈ R^(3c) is output. The splicing feature f ∈ R^(3c) is input into the second full-connection layer, which performs the second full-connection processing on the splicing feature and outputs the second full-connection result; the second full-connection result is input into the third full-connection layer, which performs the third full-connection processing on it and outputs the output vector of the network, i.e., the image feature z ∈ R^3.
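A minimal sketch of this fusion step follows; the hidden width of the second full-connection layer and the ReLU between the two full-connection layers are assumptions, while the 3-class output matches the worked example above.

    import torch
    import torch.nn as nn

    class FeatureFusion(nn.Module):
        """Splicing layer + second and third full-connection layers."""
        def __init__(self, c: int, hidden: int = 256, num_classes: int = 3):
            super().__init__()
            self.fc2 = nn.Linear(3 * c, hidden)         # second full-connection layer
            self.fc3 = nn.Linear(hidden, num_classes)   # third full-connection layer
            self.act = nn.ReLU(inplace=True)            # assumed activation between them

        def forward(self, f1, f3, f5):                  # each (B, c) intermediate feature
            f = torch.cat([f1, f3, f5], dim=1)          # splicing feature, (B, 3c)
            return self.fc3(self.act(self.fc2(f)))      # image feature z, (B, 3)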
Step S206, inputting the image features into a preset normalized exponential function, and outputting a probability distribution vector; the probability distribution vector comprises a plurality of categories and the probability value corresponding to each category;
the normalized exponential function may be a softmax function; the probability distribution vector may be denoted as p; the probability distribution vector can be specifically calculated by the following formula:
Figure BDA0002763625970000121
wherein p represents a summaryA rate distribution vector; z represents an image feature; m represents an mth feature processing template; p is a radical ofiAnd ziThe ith element representing p and z, respectively; e represents a natural constant;
in step S208, the category corresponding to the maximum probability value in the probability distribution vector is determined as the category of the target image.
Specifically, the formula k = argmax_i(p_i) can be used to determine the coordinate corresponding to the maximum probability value in the probability distribution vector, i.e., the category of the target image. Taking three preset categories as an example, k = 1 indicates that the target image is a normal image, k = 2 that it is a vulgar image, and k = 3 that it is a pornographic image.
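Expressed in code, the classification step above reduces to a softmax followed by an argmax; note that the indices here are 0-based, whereas the text numbers the categories from k = 1.

    import torch

    def classify(z: torch.Tensor) -> torch.Tensor:
        """z: (B, num_classes) image features -> (B,) predicted category indices."""
        p = torch.softmax(z, dim=1)    # probability distribution vector
        return torch.argmax(p, dim=1)  # category with the maximum probability value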
In this method, the multiple feature extraction layers of the feature extraction network, together with the first, second, and third feature processing modules connected to different feature extraction layers, can extract multiple hierarchical features from the target image; the feature processing modules process these hierarchical features into intermediate features, improving their discriminative power so that the image information they contain is more accurate and richer. The method requires no manually designed features for the target image: features are extracted automatically by a convolutional neural network and the effective features of the image are classified, giving the algorithm strong generalization ability and high robustness. By fusing the low-level, mid-level, and high-level intermediate features of the image output by the feature processing modules and performing sensitive image identification for live broadcast scenes with the resulting image features, sensitive images can be identified effectively, the identification precision of the feature extraction network is improved, the miss rate and false-detection rate for sensitive images are reduced, and the accuracy of identifying sensitive images is improved.
For live broadcast scenes, live images can be classified in the above manner to identify sensitive images among them, achieving intelligent supervision of live broadcast rooms while reducing labor costs.
The embodiment also provides a training method of a feature extraction network, as shown in fig. 4, the method includes the following steps:
step S402, determining a training sample based on a preset sample set; the training sample comprises a sample image and a class label of the sample image;
specifically, detailed image classification criteria can be designed online, and can include normal, vulgar, pornography and violence; (for example, the landscape is a normal image, the naked genitalia is a pornographic image, the kiss is a vulgar image, the brute force image caused by knife wound is taken, and the like), the data set D can be obtained by manually marking the massive live pictures according to the standard, and a part of the data set can be used as a training sample D according to a certain proportiontrainThe remaining part is used as a test sample Dtest(ii) a Specifically, the data set D may be represented by 10: 1 into training samples DtrainAnd test specimen Dtest(ii) a The training sample comprises a sample image and four class labels of the sample image, namely, the labeling result can be expressed as y e {1,2,3,4}, 1 represents normal, 2 represents vulgar, 3 represents pornography, and 4 represents violence; the image classification is limited to the classification described herein, and may include other classifications, such as, for example, leaked national secrets, illegal disciplines, etc.
Step S404, inputting the sample image into a feature extraction network to obtain the sample feature of the sample image; the feature extraction network comprises a plurality of cascaded feature extraction layers; each level feature extraction layer is used for outputting level features corresponding to the current level; the sample characteristics are obtained by fusing the hierarchical characteristics corresponding to at least two hierarchies;
step S406, determining a category identification result of the sample image based on the sample characteristics; determining a loss value based on a preset classification loss function, a class label and a class identification result; updating the network parameters of the feature extraction network based on the loss values;
and step S408, continuing to execute the step of determining the training sample based on the preset sample set until the loss value is converged to obtain the trained feature extraction network.
Specifically, the class identification result z ∈ R^3 of the sample image can be input into the softmax function, and the probability distribution vector p is calculated by the formula

    p_i = e^(z_i) / Σ_{m=1}^{M} e^(z_m)

The loss value is then calculated by the formula L = −log(p_y), where y is the class label of the training sample image. Finally, the derivatives ∂L/∂W of the loss value with respect to all network parameters W in the feature extraction network are calculated by the back-propagation algorithm, and the network parameters of the feature extraction network are updated based on the calculated loss value through the stochastic gradient descent algorithm, namely:

    W ← W − α · ∂L/∂W

where α denotes the learning rate (a preset hyper-parameter, usually taking the value 0.01 or 0.001). The parameters of the feature extraction network are updated iteratively in this way until the loss value converges, yielding the trained feature extraction network.
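A minimal training-loop sketch under the formulas above: PyTorch's cross-entropy loss computes −log(p_y) after an internal softmax, and the SGD step applies W ← W − α·∂L/∂W. The model, data loader, and stopping criterion are assumptions; note that the labels are shifted to 0-based (the text's y ∈ {1, 2, 3, 4} becomes {0, 1, 2, 3}).

    import torch
    import torch.nn as nn
    import torch.optim as optim

    def train(model: nn.Module, loader, alpha: float = 0.01, max_epochs: int = 100):
        criterion = nn.CrossEntropyLoss()            # -log(p_y) after softmax
        optimizer = optim.SGD(model.parameters(), lr=alpha)
        for epoch in range(max_epochs):
            for images, labels in loader:            # sample images + 0-based class labels
                z = model(images)                    # class identification result
                loss = criterion(z, labels)          # loss value
                optimizer.zero_grad()
                loss.backward()                      # derivatives w.r.t. all parameters W
                optimizer.step()                     # W <- W - alpha * dL/dW
            # in practice, training stops once the loss value converges (not shown)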
In addition, after training is completed, the trained feature extraction network needs to be tested with the test samples D_test: several test images are selected from the test samples and input into the trained feature extraction network to obtain the output vectors, i.e., the image features; the category of each test image is determined based on the image features and compared with its label, and if a preset condition is met, the trained feature extraction network is obtained.
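As a sketch of this test step, the snippet below measures accuracy on D_test and checks it against a threshold; using accuracy with a fixed threshold as the "preset condition" is an assumption.

    import torch
    import torch.nn as nn

    def evaluate(model: nn.Module, test_loader, threshold: float = 0.95) -> bool:
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in test_loader:
                preds = torch.argmax(model(images), dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        return (correct / total) >= threshold   # preset condition met?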
In the training method for a feature extraction network provided by this embodiment, a training sample is determined based on a preset sample set, the training sample comprising a sample image and the class label of the sample image; the sample image is input into the feature extraction network to obtain the sample features of the sample image, where the feature extraction network comprises a plurality of cascaded feature extraction layers, each of which outputs the hierarchical features corresponding to its own level, and the sample features are obtained by fusing the hierarchical features corresponding to at least two levels; the class identification result of the sample image is determined based on the sample features; a loss value is determined based on a preset classification loss function, the class label, and the class identification result; the network parameters of the feature extraction network are updated based on the loss value; and the step of determining a training sample based on the preset sample set is repeated until the loss value converges, yielding the trained feature extraction network. Because the feature extraction network comprises multiple levels of feature extraction layers, at least two levels of hierarchical features can be extracted from the target image and fused into image features, on which the category of the target image is determined; since the image features contain at least two levels of hierarchical features, the feature hierarchy contained in them is richer, image recognition in complex scenes such as live broadcast can be handled, sensitive images can be identified accurately and effectively, and the miss rate and false-detection rate for sensitive images are reduced.
For live broadcast scenes, massive data is collected and labeled as training samples, and a detailed label classification standard is used to annotate the training samples thoroughly, so the resulting feature extraction network better meets the supervision requirements of live broadcast scenes.
Corresponding to the above method embodiment, this embodiment further provides an image classification apparatus, as shown in fig. 5, the apparatus includes:
an output module 51, configured to input the target image into a pre-trained feature extraction network, and output an image feature of the target image;
a classification module 52 for determining a category of the target image based on the image features; the feature extraction network comprises a plurality of cascaded feature extraction layers; each level feature extraction layer is used for outputting level features corresponding to the current level; the image features are obtained by fusing the hierarchical features corresponding to at least two hierarchies.
The image classification apparatus provided by the embodiment of the invention inputs a target image into a pre-trained feature extraction network and outputs the image features of the target image; the category of the target image is determined based on the image features. The feature extraction network comprises a plurality of cascaded feature extraction layers, each of which outputs the hierarchical features corresponding to its own level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels. Because the image features contain hierarchical features from at least two levels, the feature hierarchy contained in the image features is richer, and the apparatus can handle image recognition in complex scenes such as live broadcast, so sensitive images can be identified accurately and effectively and the miss rate and false-detection rate for sensitive images are reduced.
Further, the feature extraction network further comprises at least two feature processing modules and a feature fusion module; each feature processing module is connected with a hierarchical feature extraction layer; the feature extraction layers connected with any two feature processing modules are different; the output module is further configured to: processing the hierarchical features output by the feature extraction layer connected with the feature processing module based on an attention mechanism through each feature processing module, and outputting intermediate features; and fusing the intermediate features output by each feature processing module through the feature fusion module to obtain the image features.
Further, the at least two feature processing modules include: the system comprises a first characteristic processing module, a second characteristic processing module and a third characteristic processing module; the first feature processing module is connected with the feature extraction layer of the lowest level; the feature extraction layer of the lowest level is used for inputting a target image; the second feature processing module is connected with the specified feature extraction layer in the middle level; the third feature processing module is connected with the feature extraction layer of the highest level.
Further, the feature processing module comprises a pooling layer, a first full-connection layer, and a feature multiplication layer; the output module is further configured to: perform first pooling on the input hierarchical features through the pooling layer and output a first pooling result; perform first full-connection processing on the first pooling result through the first full-connection layer and output a first full-connection result; multiply the input hierarchical features by the first full-connection result through the feature multiplication layer to obtain a multiplication result; and output the intermediate feature based on the multiplication result.
Further, the feature processing module further includes a spatial pyramid pooling layer; the spatial pyramid pooling layer is connected with the feature multiplication layer; the output module is further configured to: and performing second pooling on the multiplication result through the spatial pyramid pooling layer, and outputting the intermediate feature of the specified dimension.
Further, the feature fusion module comprises a feature splicing layer, a second full connection layer and a third full connection layer; the output module is further configured to: splicing the intermediate features output by each feature processing module through the feature splicing layer, and outputting splicing features; performing second full-connection processing on the splicing characteristics through a second full-connection layer, and outputting a second full-connection result; and performing third full-connection processing on the second full-connection result through a third full-connection layer, and outputting image characteristics.
Further, the classification module is further configured to: input the image features into a preset normalized exponential function and output a probability distribution vector, the probability distribution vector comprising a plurality of categories and the probability value corresponding to each category; and determine the category corresponding to the maximum probability value in the probability distribution vector as the category of the target image.
The image classification device provided by the embodiment of the invention has the same technical characteristics as the image classification method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
Corresponding to the above method embodiment, this embodiment further provides a training apparatus for a feature extraction network, as shown in fig. 6, the apparatus includes:
a sample determining module 61, configured to determine a training sample based on a preset sample set; the training sample comprises a sample image and a class label of the sample image;
the image input module 62 is configured to input the sample image into the feature extraction network to obtain a sample feature of the sample image; the feature extraction network comprises a plurality of cascaded feature extraction layers; each level feature extraction layer is used for outputting level features corresponding to the current level; the sample characteristics are obtained by fusing the hierarchical characteristics corresponding to at least two hierarchies;
a parameter updating module 63, configured to determine a category identification result of the sample image based on the sample feature; determining a loss value based on a preset classification loss function, a class label and a class identification result; updating the network parameters of the feature extraction network based on the loss values;
and a network determining module 64, configured to continue to perform the step of determining the training sample based on the preset sample set until the loss value converges, so as to obtain the trained feature extraction network.
The training device for the feature extraction network provided by the embodiment of the invention has the same technical features as the training method for the feature extraction network provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
The embodiment also provides an electronic device, which comprises a processor and a memory, wherein the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to implement the image classification method or the training method of the feature extraction network.
Referring to fig. 7, the electronic device includes a processor 100 and a memory 101, where the memory 101 stores machine executable instructions capable of being executed by the processor 100, and the processor 100 executes the machine executable instructions to implement the image classification method or the training method of the feature extraction network.
Further, the electronic device shown in fig. 7 further includes a bus 102 and a communication interface 103, and the processor 100, the communication interface 103, and the memory 101 are connected through the bus 102.
The Memory 101 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), using the internet, a wide area network, a local area network, a metropolitan area network, or the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in fig. 7, but this does not mean there is only one bus or one type of bus.
Processor 100 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits in hardware or by instructions in the form of software in the processor 100. The Processor 100 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or executed by such a processor. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory 101, and the processor 100 reads the information in the memory 101 and completes the steps of the method of the foregoing embodiments in combination with its hardware.
The present embodiments also provide a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the image classification method described above or the training method of the feature extraction network.
The computer program product of the image classification method, the training method of the feature extraction network, and the devices provided in the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the methods described in the foregoing method embodiments, and the specific implementations may be found in those method embodiments and are not repeated here.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted", "connected", and "coupled" are to be construed broadly: a connection may be, for example, fixed, removable, or integral; mechanical or electrical; direct, indirect through an intervening medium, or an internal communication between two elements. For those skilled in the art, the specific meanings of the above terms in the present invention can be understood according to specific circumstances.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In the description of the present invention, it should be noted that terms indicating orientation or positional relationships, such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer", are based on the orientations or positional relationships shown in the drawings and are used only for convenience and brevity of description; they do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present invention. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the foregoing embodiments are merely illustrative of the technical solutions of the present invention and are not restrictive. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not depart from the spirit and scope of the embodiments of the present invention and are intended to be covered thereby. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A method of image classification, the method comprising:
inputting a target image into a pre-trained feature extraction network, and outputting the image features of the target image; determining a category of the target image based on the image features;
wherein the feature extraction network comprises a plurality of cascaded feature extraction layers; each level of the feature extraction layer is used for outputting level features corresponding to the current level; the image features are obtained by fusing the hierarchical features corresponding to at least two hierarchies.
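For illustration only, the following is a minimal PyTorch-style sketch of the method of claim 1. The three-stage backbone, the channel widths, and fusion by global pooling plus concatenation are assumptions made for this sketch, not details mandated by the claim.

```python
import torch
import torch.nn as nn

class CascadedExtractor(nn.Module):
    """Cascaded feature extraction layers whose hierarchical features are fused."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Each stage outputs the hierarchical feature of its own level.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        # The image feature fuses hierarchical features of at least two levels.
        self.classifier = nn.Linear(16 + 32 + 64, num_classes)

    def forward(self, x):
        f1 = self.stage1(x)   # low-level hierarchical feature
        f2 = self.stage2(f1)  # mid-level hierarchical feature
        f3 = self.stage3(f2)  # high-level hierarchical feature
        fused = torch.cat([self.pool(f).flatten(1) for f in (f1, f2, f3)], dim=1)
        return self.classifier(fused)  # class scores from the fused image feature
```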
2. The method of claim 1, wherein the feature extraction network further comprises at least two feature processing modules and a feature fusion module; each feature processing module is connected with the feature extraction layer of one hierarchy; any two feature processing modules are connected with different feature extraction layers;
the step of inputting the target image into a pre-trained feature extraction network and outputting the image features of the target image comprises the following steps:
processing the hierarchical features output by the feature extraction layer connected with the feature processing module based on an attention mechanism through each feature processing module, and outputting intermediate features;
and fusing the intermediate features output by each feature processing module through the feature fusion module to obtain the image features.
3. The method of claim 2, wherein the at least two feature processing modules comprise: the system comprises a first characteristic processing module, a second characteristic processing module and a third characteristic processing module;
the first feature processing module is connected with the feature extraction layer of the lowest level; the lowest-level feature extraction layer is used for inputting the target image;
the second feature processing module is connected with a specified feature extraction layer in the middle level; the third feature processing module is connected with the feature extraction layer of the highest level.
4. The method of claim 2, wherein the feature processing module comprises a pooling layer, a first fully connected layer, and a feature multiplication layer;
the step of processing the hierarchical features output by the feature extraction layer connected with the feature processing module based on the attention mechanism and outputting intermediate features comprises the following steps:
performing first pooling on the input hierarchical features through the pooling layer, and outputting a first pooling result; performing first full-connection processing on the first pooling result through the first full-connection layer, and outputting a first full-connection result;
multiplying the input hierarchical features and the first full-connection result through the feature multiplication layer to obtain a multiplication result; outputting the intermediate feature based on the multiplication result.
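One plausible reading of claim 4 is squeeze-and-excitation-style channel attention; the sigmoid gating and the channel width below are assumptions of this sketch, not details fixed by the claim.

```python
import torch
import torch.nn as nn

class FeatureProcessing(nn.Module):
    """Pooling layer -> first fully connected layer -> feature multiplication layer."""
    def __init__(self, channels=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # performs the first pooling
        self.fc = nn.Linear(channels, channels)  # first fully connected layer
        self.gate = nn.Sigmoid()                 # assumed gating nonlinearity

    def forward(self, f):
        w = self.pool(f).flatten(1)              # first pooling result
        w = self.gate(self.fc(w))                # first full-connection result
        # Feature multiplication layer: reweight the input hierarchical feature.
        return f * w.view(f.size(0), -1, 1, 1)   # multiplication result
```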
5. The method of claim 4, wherein the feature processing module further comprises a spatial pyramid pooling layer; the spatial pyramid pooling layer is connected with the feature multiplication layer;
the step of outputting the intermediate feature based on the multiplication result includes:
and performing second pooling on the multiplication result through the spatial pyramid pooling layer, and outputting the intermediate feature with the specified dimensionality.
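A sketch of claim 5's spatial pyramid pooling: pooling the multiplication result over several grids and splicing the results yields an intermediate feature of a fixed, specified dimensionality regardless of the input resolution. The (1, 2, 4) pyramid is an assumed choice.

```python
import torch
import torch.nn as nn

def spatial_pyramid_pool(f, levels=(1, 2, 4)):
    # f: (batch, channels, H, W) multiplication result from the previous step.
    parts = [nn.functional.adaptive_max_pool2d(f, s).flatten(1) for s in levels]
    return torch.cat(parts, dim=1)  # fixed dim: channels * (1 + 4 + 16)

f = torch.randn(2, 64, 37, 53)        # arbitrary spatial size
print(spatial_pyramid_pool(f).shape)  # torch.Size([2, 1344]) for any H, W
```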
6. The method of claim 2, wherein the feature fusion module comprises a feature stitching layer, a second fully connected layer, and a third fully connected layer;
the step of obtaining the image features by fusing the intermediate features output by each feature processing module through the feature fusion module includes:
splicing the intermediate features output by each feature processing module through the feature splicing layer, and outputting splicing features;
performing second full-connection processing on the splicing characteristics through the second full-connection layer, and outputting a second full-connection result; and performing third full-connection processing on the second full-connection result through the third full-connection layer, and outputting the image characteristics.
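Claim 6's fusion can be sketched as concatenation followed by two fully connected layers; the input and hidden widths below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Feature splicing layer -> second fully connected layer -> third fully connected layer."""
    def __init__(self, in_dims=(1344, 1344, 1344), hidden=512, out=256):
        super().__init__()
        self.fc2 = nn.Linear(sum(in_dims), hidden)  # second fully connected layer
        self.fc3 = nn.Linear(hidden, out)           # third fully connected layer

    def forward(self, intermediates):
        spliced = torch.cat(intermediates, dim=1)       # splicing feature
        return self.fc3(torch.relu(self.fc2(spliced)))  # image feature
```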
7. The method of claim 1, wherein the step of determining the class of the target image based on the image feature comprises:
inputting the image features into a preset normalized exponential function, and outputting a probability distribution vector; wherein the probability distribution vector comprises a plurality of categories and a probability value corresponding to each category;
and determining the category corresponding to the maximum probability value in the probability distribution vector as the category of the target image.
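Claim 7 in code: the normalized exponential function (softmax) turns the image features, assumed here to already be per-class scores, into a probability distribution vector, and the category with the maximum probability value is the prediction.

```python
import torch

scores = torch.tensor([[1.2, -0.3, 0.5]])  # image features as class scores (assumed)
probs = torch.softmax(scores, dim=1)       # probability distribution vector
category = probs.argmax(dim=1)             # category with the maximum probability value
print(probs, category)
```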
8. A method of training a feature extraction network, the method comprising:
determining a training sample based on a preset sample set; wherein the training sample comprises a sample image and a class label for the sample image;
inputting the sample image into a feature extraction network to obtain sample features of the sample image; wherein the feature extraction network comprises a plurality of cascaded feature extraction layers; each level of the feature extraction layer is used for outputting level features corresponding to the current level; the sample characteristics are obtained by fusing the hierarchical characteristics corresponding to at least two hierarchies;
determining a class identification result of the sample image based on the sample feature; determining a loss value based on a preset classification loss function, the class label and the class identification result; updating network parameters of the feature extraction network based on the loss values;
and continuing to execute the step of determining a training sample based on a preset sample set until the loss value converges, so as to obtain the trained feature extraction network.
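A minimal sketch of the claim-8 training loop, assuming cross-entropy as the preset classification loss function and a simple loss threshold standing in for the convergence test; `model` may be any network shaped like the sketch after claim 1.

```python
import torch
import torch.nn as nn

def train(model, sample_loader, steps=1000, lr=1e-3):
    loss_fn = nn.CrossEntropyLoss()  # preset classification loss function (assumed)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for step, (images, labels) in zip(range(steps), sample_loader):
        logits = model(images)          # sample features -> class identification result
        loss = loss_fn(logits, labels)  # loss from class label vs. identification result
        opt.zero_grad()
        loss.backward()
        opt.step()                      # update network parameters based on the loss
        if loss.item() < 1e-3:          # crude stand-in for the convergence test
            break
    return model
```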
9. An image classification apparatus, characterized in that the apparatus comprises:
the output module is used for inputting the target image into a pre-trained feature extraction network and outputting the image features of the target image;
a classification module for determining a category of the target image based on the image features;
wherein the feature extraction network comprises a plurality of cascaded feature extraction layers; each level of the feature extraction layer is used for outputting level features corresponding to the current level; the image features are obtained by fusing the hierarchical features corresponding to at least two hierarchies.
10. An apparatus for training a feature extraction network, the apparatus comprising:
the sample determining module is used for determining a training sample based on a preset sample set; wherein the training sample comprises a sample image and a class label for the sample image;
the image input module is used for inputting the sample image into a feature extraction network to obtain the sample feature of the sample image; wherein the feature extraction network comprises a plurality of cascaded feature extraction layers; each level of the feature extraction layer is used for outputting level features corresponding to the current level; the sample characteristics are obtained by fusing the hierarchical characteristics corresponding to at least two hierarchies;
the parameter updating module is used for determining a category identification result of the sample image based on the sample characteristics; determining a loss value based on a preset classification loss function, the class label and the class identification result; updating network parameters of the feature extraction network based on the loss values;
and the network determining module is used for continuously executing the step of determining the training sample based on the preset sample set until the loss value is converged to obtain the trained feature extraction network.
11. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the method of image classification of any one of claims 1-7 or the method of training a feature extraction network of claim 8.
12. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the image classification method of any one of claims 1 to 7, or the training method of the feature extraction network of claim 8.
CN202011227392.7A 2020-11-05 2020-11-05 Image classification method, and training method and device of feature extraction network Pending CN112183672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011227392.7A CN112183672A (en) 2020-11-05 2020-11-05 Image classification method, and training method and device of feature extraction network

Publications (1)

Publication Number Publication Date
CN112183672A true CN112183672A (en) 2021-01-05

Family

ID=73917609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011227392.7A Pending CN112183672A (en) 2020-11-05 2020-11-05 Image classification method, and training method and device of feature extraction network

Country Status (1)

Country Link
CN (1) CN112183672A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836076A (en) * 2021-01-27 2021-05-25 京东方科技集团股份有限公司 Image tag generation method, device and equipment
CN112800978A (en) * 2021-01-29 2021-05-14 北京金山云网络技术有限公司 Attribute recognition method, and training method and device for part attribute extraction network
CN113836992A (en) * 2021-06-15 2021-12-24 腾讯科技(深圳)有限公司 Method for identifying label, method, device and equipment for training label identification model
CN113836992B (en) * 2021-06-15 2023-07-25 腾讯科技(深圳)有限公司 Label identification method, label identification model training method, device and equipment
CN113592031A (en) * 2021-08-17 2021-11-02 全球能源互联网研究院有限公司 Image classification system, violation tool identification method and device
CN113592031B (en) * 2021-08-17 2023-11-28 全球能源互联网研究院有限公司 Image classification system, and method and device for identifying violation tool
CN113807362A (en) * 2021-09-03 2021-12-17 西安电子科技大学 Image classification method based on interlayer semantic information fusion deep convolutional network
CN113807362B (en) * 2021-09-03 2024-02-27 西安电子科技大学 Image classification method based on interlayer semantic information fusion depth convolution network
CN114140655A (en) * 2022-01-29 2022-03-04 深圳市中讯网联科技有限公司 Image classification method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination