CN113569986B - Computer vision data classification method, device, electronic equipment and storage medium


Info

Publication number
CN113569986B
CN113569986B
Authority
CN
China
Prior art keywords
data
training
confusion
computer vision
classification
Prior art date
Legal status
Active
Application number
CN202110948959.8A
Other languages
Chinese (zh)
Other versions
CN113569986A
Inventor
杨杨
姜波
胡光龙
唐景群
吴凯琳
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202110948959.8A
Publication of CN113569986A
Application granted
Publication of CN113569986B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a computer vision data classification method, a computer vision data classification device, an electronic apparatus, and a computer-readable storage medium, relating to the field of computer technology. The method comprises the following steps: extracting computer vision data features from a plurality of input computer vision data to obtain first features and second features; further extracting the second feature through a primary classification model to obtain a third feature; performing primary classification on the plurality of computer vision data based on the third feature through the primary classification model to obtain predicted values of the plurality of computer vision data belonging to each category; determining confusion data and confusion classes in the plurality of computer vision data based on the predicted values; and performing secondary classification on the confusion data based on the confusion classes, the first feature and the third feature through a secondary classification model to obtain a classification result of the confusion data. The present disclosure can address the unreasonable consumption of classification time and resources caused by failing to distinguish among the input data of the model.

Description

Computer vision data classification method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a computer vision data classification method, a computer vision data classification device, an electronic apparatus, and a computer-readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the development of computer technology, deep learning is increasingly applied to various specific fields. Taking the field of data classification as an example, the deep learning can meet various data classification requirements such as image classification, text classification, video classification and the like.
However, in practical classification scenarios, the data to be classified can often be divided into a distinct data set A with clear distinguishing features and a confusing data set B whose features are not easily distinguished. For set A, the classification model needs only a short time to accurately identify the features of the data and thus achieve good classification accuracy; but for set B, it often takes a long time to perform multiple rounds of iterative recognition, and it is still difficult to obtain satisfactory classification results. Existing data classification methods do not distinguish among the input data of the classification model, so that, on the one hand, the model spends extra recognition time and machine resources on data that could be recognized quickly, and on the other hand, no targeted feature recognition can be performed on the confusing data that are difficult to distinguish, leaving the model's classification capability weak.
Disclosure of Invention
In view of this, there is a need for a computer vision data classification scheme that can solve, at least to some extent, the problems of unreasonable consumption of resources and time and of poor model classification capability caused by failing to distinguish the input data of the model.
In this context, embodiments of the present disclosure desirably provide a computer vision data classification method, a computer vision data classification apparatus, an electronic device, and a computer-readable storage medium.
According to a first aspect of the present disclosure, there is provided a computer vision data classification method, comprising: extracting computer vision data features from the input plurality of computer vision data to obtain first features and second features; extracting the second feature through a primary classification model to obtain a third feature; performing primary classification on the plurality of computer vision data based on the third feature through the primary classification model to obtain predicted values of the plurality of computer vision data belonging to each category; determining confusion data and confusion classes in the plurality of computer vision data based on the predicted values, wherein the confusion classes are classes to which the confusion data are judged by the primary classification model; and performing secondary classification on the confusion data based on the confusion class, the first feature and the third feature through a secondary classification model to obtain a classification result of the confusion data.
Optionally, the loss function of the primary classification model includes a cross-entropy function, and the loss function of the secondary classification model includes an inter-confusion-class distance function component and a cross-entropy function component, where a confusion class is a class to which the confusion data is discriminated by the primary classification model.
Optionally, the extracting data features from the input plurality of computer vision data includes: a data feature is extracted from the plurality of computer vision data using a three-dimensional deep neural network or a two-dimensional deep neural network.
Optionally, the determining, based on the predicted values, confusion data and confusion classes in the plurality of computer vision data includes: determining a maximum value of the predicted values of each computer vision data and comparing it with the true label of that computer vision data; determining the computer vision data whose maximum value does not match the true label as confusion data; determining the product of the maximum value and a hyperparameter as a reference value; and determining the categories corresponding to predicted values greater than the reference value as the confusion classes.
Optionally, the classifying the confusion data based on the first feature and the third feature includes: and carrying out feature fusion processing on the first feature and the third feature to obtain a fourth feature, and carrying out secondary classification on the confusion data based on the fourth feature.
Optionally, before the extracting the computer vision data features from the input plurality of computer vision data, the method further comprises: a primary training step, which comprises inputting a plurality of computer vision training data, and training the primary classification model based on the plurality of computer vision training data to determine confusion training data and confusion training classes in the plurality of computer vision training data, wherein the confusion training classes are classes to which the confusion training data are judged by the primary classification model; a first training condition judgment step of determining whether a first training condition is reached; if not, returning to the primary training step; if yes, turning to a secondary training step, wherein the secondary training step comprises training the secondary classification model based on the confusion training data and the confusion training class; a second training condition judgment step of determining whether a second training condition is reached; if not, returning to the secondary training step; if yes, the method goes to a third training condition judging step, wherein the third training condition judging step comprises determining whether a third training condition is reached; if not, returning to the primary training step; if so, the training is stopped.
Optionally, the training the primary classification model based on the plurality of computer vision training data includes: extracting computer vision data features from the plurality of computer vision training data to obtain first training features and second training features; extracting the second training features through a primary classification model to obtain third training features; performing primary training classification on the plurality of computer vision training data based on the third training features through the primary classification model to obtain training predicted values of the plurality of computer vision training data belonging to each training class; determining the confusion training data and the confusion training class in the plurality of computer vision training data based on the training predictor.
Optionally, the training the secondary classification model based on the confusion training data and the confusion training class includes: and performing secondary training classification on the confusion training data through the secondary classification model and based on the confusion training class, the first training feature and the third training feature so as to obtain a classification result of the confusion training data.
Optionally, the determining whether the first training condition is reached includes: determining whether the number of times of training the primary classification model reaches a preset primary classification training threshold; the determining whether the second training condition is reached includes: and determining whether the number of times of training the secondary classification model reaches a preset secondary classification training threshold.
Optionally, the determining whether the third training condition is reached includes: determining whether the sum of the number of times the primary classification model is trained and the number of times the secondary classification model is trained reaches a preset total training number; or whether the classification accuracy reaches a preset accuracy threshold, wherein the classification accuracy is a ratio of the number of correctly classified computer vision training data to the number of the plurality of computer vision training data.
According to a second aspect of the present disclosure, there is provided a computer vision data classifying apparatus, comprising: the feature extraction module is used for extracting data features from a plurality of computer vision data input by a user to obtain a first feature and a second feature; the primary classification module is used for extracting the second features to obtain third features; and performing primary classification on the plurality of computer vision data based on the third feature to obtain predicted values of the plurality of computer vision data belonging to each category; the data judging module is used for determining confusion data and confusion classes in the plurality of computer vision data based on the predicted values, wherein the confusion classes are the classes to which the confusion data are judged by the primary classification model; and a secondary classification module, configured to perform secondary classification on the confusion data based on the first feature and the third feature, so as to obtain a classification result of the confusion data.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any of the above via execution of the executable instructions.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
According to the computer vision data classification method, the computer vision data classification device, the electronic equipment and the computer-readable storage medium, on the one hand, by determining the confusion data among the input computer vision data and classifying the confusion data separately, classifying the distinct data no longer consumes the same time and resources as classifying the confusion data, which improves classification efficiency and saves machine resources. On the other hand, by inputting multi-level features of different dimensions, namely the first feature and the third feature, for the secondary classification of the confusion data, the classification accuracy for the confusion data is effectively improved, and the classification capability of the classification model improves correspondingly.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of an application scenario of a computer vision data classification method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a computer vision data classification method in accordance with one embodiment of the present disclosure;
FIG. 3 schematically illustrates an architecture diagram of a 3-dimensional network for feature extraction of input data according to one embodiment of the disclosure;
FIG. 4 schematically illustrates an architecture diagram of a network for implementing a computer vision data classification method in accordance with one embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram for training a classification model according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a block diagram of a computer vision data classification device in accordance with one embodiment of the present disclosure;
fig. 7 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present disclosure may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the present disclosure, a computer vision data classification method, a computer vision data classification apparatus, an electronic device, and a computer-readable storage medium are provided.
Any number of elements in the figures is for illustration rather than limitation, and any naming is used only for distinction and carries no limiting meaning.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments thereof.
Summary of the Invention
In the related art for classifying data based on deep learning, a distinct data set A with clear distinguishing features and a confusing data set B whose features are not easily distinguished are often mixed together for classification. That is, since data set A and data set B cannot be separated manually, the related art feeds the classification model a mixture of distinct data and confusion data. As a result, even distinct data, which needs only a short time to be correctly classified, takes as long to classify as the confusion data, which on the one hand reduces classification efficiency and on the other hand consumes additional hardware resources. In addition, the classification capability of a classification model is often measured by its accuracy in correctly classifying confusion data, so mixing confusion data with distinct data also leads to inaccurate classification results for the confusion data and a lower measured classification capability.
Similarly, the above manner of classifying without distinguishing between distinct data and confusion data suffers from the same problems when training the classification model. For data set A, the classification model can fit the features of the data after only a short period of training, achieving a good training effect and recognition accuracy. For data set B, however, multiple rounds of training are typically required, taking a long time to barely fit the features of the data, and the effect is often unsatisfactory. Therefore, when data set A and data set B are mixed together as training samples, in order to obtain better classification accuracy for the confusion data, the training regimen suited to confusion data is applied to all samples, which consumes unnecessary training time and computer hardware resources; meanwhile, it is difficult to perform targeted feature mapping for the confusion samples, so the fitting capability of the classification model remains weak.
The inventors found that the above problems can be solved well by dividing the input data into distinct data and aliased data and classifying the distinct data and the aliased data by different classification models, respectively.
Based on the above, the basic idea of the present disclosure is that: using a primary classification model special for classifying the distinct data, performing primary classification on the input data based on high-level features extracted from the input data, and determining which data belong to the distinct data and which data belong to the confusing data according to the result of the primary classification; for the confusion data, a secondary classification model special for classifying the confusion data is used, and the lower-layer characteristics and the higher-layer characteristics are introduced for secondary classification, so that a classification result for the confusion data is obtained.
According to the technical scheme described above, on the one hand, by determining the confusion data among the input computer vision data and classifying them separately, classifying the distinct data no longer consumes the same time and resources as classifying the confusion data, which improves classification efficiency and saves machine resources. On the other hand, by inputting multi-level features of different dimensions, namely the first feature and the third feature, for the secondary classification of the confusion data, the classification accuracy for the confusion data is effectively improved, and the classification capability of the classification model improves correspondingly.
Having described the basic principles of the present disclosure, various non-limiting embodiments of the present disclosure are specifically described below.
Application scene overview
It should be noted that the following application scenarios are only shown for facilitating understanding of the spirit and principles of the present disclosure, and embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
Fig. 1 shows an application scenario of a computer vision data classification method according to an embodiment of the present disclosure, wherein a system architecture 100 may comprise one or more of terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber optic cables. The terminal devices 101, 102, 103 may be various electronic devices having data computing and processing capabilities, including, but not limited to, desktop computers, portable computers, personal digital assistant (PDA) devices, tablet computers, and the like. It should be understood that the numbers of terminal devices, networks and servers in fig. 1 are merely illustrative; there may be any number of each as required in practice. For example, the server 105 may be a server cluster formed by a plurality of servers.
For example, in one exemplary embodiment, a user may input data (e.g., pictures, videos, texts, etc.) to be classified to the terminal device 101, 102, or 103, and the terminal device 101, 102, or 103 may upload the data to be classified to the server 105 through the network 104, and the server 105 may complete classification of the data based on the computer vision data classification method according to the embodiment of the present disclosure, and issue the classification result to the terminal device 101, 102, or 103, so as to feed back the classification result to the user.
In addition, after receiving the data to be classified input by the user, the terminal device 101, 102 or 103 may itself complete the classification based on the computer vision data classification method according to the embodiments of the disclosure and feed the result back to the user, provided that the terminal device 101, 102 or 103 has the processing and computing capability required by the algorithm.
It should be understood by those skilled in the art that the above application scenario is only for example, and the present exemplary embodiment is not limited thereto.
By the method for classifying the computer vision data, the problems of unreasonable consumption of classification time and hardware machine resources and weaker classification capability of the model caused by the fact that input data of the model are not distinguished can be solved.
Exemplary method
A computer vision data classification method according to an aspect of an exemplary embodiment of the present disclosure is described with reference to fig. 2.
The present example embodiment provides a computer vision data classification method. Referring to fig. 2, the computer vision data classification method may include the steps of:
s210, extracting computer vision data features from a plurality of input computer vision data to obtain first features and second features;
s220, extracting the second features through a primary classification model to obtain third features;
s230, performing primary classification on the plurality of computer vision data based on the third feature through the primary classification model to obtain predicted values of the plurality of computer vision data belonging to each category;
s240, determining confusion data and confusion classes in the plurality of computer vision data based on the predicted values, wherein the confusion classes are classes to which the confusion data are judged by the primary classification model;
s250, performing secondary classification on the confusion data through a secondary classification model and based on the confusion class, the first feature and the third feature, so as to obtain a classification result of the confusion data.
In the computer vision data classification method provided above, a three-dimensional (3D) deep neural network or a two-dimensional (2D) deep neural network may be selected according to the type of input data to extract the first and second features from the input data; the second feature is further extracted, through a fully connected (FC) layer of a primary classification model dedicated to classifying distinct data, to obtain a third feature; the primary classification model performs primary classification on the input data through another fully connected layer based on the third feature, and confusion data are determined based on the predicted values of each input datum belonging to each category; the first feature and the third feature are then fused (concat), the confusion data are taken as the input of a secondary classification model, and the secondary classification model performs secondary classification on the confusion data through a fully connected layer based on the fused feature, so as to obtain the classification result of the confusion data. This computer vision data classification method can address the unreasonable consumption of classification time and hardware resources and the weak classification capability caused by failing to distinguish the input data of the classification model.
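This overall flow can be summarized in a short sketch. The following PyTorch-style outline is illustrative only: the names backbone, fc1, fc2, fc4 and gamma, and the availability of true labels (used by the data discrimination of step S240, as described below), are assumptions rather than the patented implementation.

```python
import torch

def two_stage_classify(x, labels, backbone, primary, secondary, gamma=0.6):
    """Illustrative sketch of steps S210-S250 (hypothetical module names)."""
    first, second = backbone(x)                 # S210: bottom-level and high-level input features
    third = primary.fc1(second)                 # S220: third feature (e.g. 4096 -> 1024 dims)
    scores = primary.fc2(third)                 # S230: predicted value per category
    pred = scores.argmax(dim=1)
    confused = pred != labels                   # S240: misclassified samples are confusion data
    max_s = scores.max(dim=1, keepdim=True).values
    confusion_classes = scores > gamma * max_s  # classes close to the maximum predicted value
    fused = torch.cat([first, third], dim=1)    # S250: fuse the first and third features
    secondary_scores = secondary.fc4(fused[confused])
    return pred, confused, confusion_classes, secondary_scores
```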
The above steps are described in more detail below.
In step S210, computer vision data features are extracted from the input plurality of computer vision data to obtain first features and second features.
In this example embodiment, the computer vision data input by the user may include, but is not limited to, text, pictures, video, etc., and a deep neural network may be employed to extract computer vision data features from the input. In one example, different types of deep neural networks (DNNs) may be selected according to the type of the input data. For example, when the input data are of a picture type, a two-dimensional (2D) deep neural network may be selected to extract features from the input picture data, since pictures are two-dimensional image data; when the input data are of a video type, a three-dimensional (3D) deep neural network may be selected to extract features from the input video data, since video is image data with a temporal dimension. In this way, an appropriate deep neural network can be configured for each type of input data to perform feature extraction, ensuring the accuracy of the extracted features. In subsequent embodiments of the present disclosure, the method according to the present disclosure is described taking video data as the input, with features extracted accordingly using a three-dimensional convolutional neural network (3D CNN).
As shown in fig. 3, fig. 3 shows a schematic architecture diagram of a three-dimensional convolutional neural network 310 for extracting computer vision data features from input video data. The three-dimensional convolutional neural network 310 may include, for example, a plurality of convolutional layers, a plurality of pooled layers, and at least one fully-connected layer, and the plurality of convolutional layers and the plurality of pooled layers may be, for example, alternately arranged. The input video data may be, for example, 320×240 resolution video data, and may be subjected to a frame extraction process prior to input into the three-dimensional convolutional neural network 310, for example, 16 frames of data samples may be acquired from the input video data at fixed time intervals; the fixed time interval may be flexibly set according to the time length of the video and the actual requirement of feature extraction, for example, the fixed time interval may be set to 5 seconds or other time periods, and the total frame number of the data samples collected from the video data may also be flexibly set according to the actual requirement, which is not particularly limited in this exemplary embodiment.
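As an illustration of the fixed-interval frame extraction described above, the following helper is a sketch only: the evenly spaced indexing policy and the array layout are assumptions, while the 16-frame count follows the example in the text.

```python
import numpy as np

def sample_frames(video_frames, num_samples=16):
    """Pick num_samples frames at (approximately) fixed intervals.

    video_frames is assumed to be an array of shape (T, H, W, C); the exact
    sampling policy here is only an illustrative assumption.
    """
    total = len(video_frames)
    idx = np.linspace(0, total - 1, num_samples).astype(int)  # evenly spaced frame indices
    return video_frames[idx]
```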
After the data sample frames of the video data are input into the three-dimensional convolutional neural network 310, they are processed sequentially by each convolutional layer and pooling layer to complete feature extraction. The extracted features are in vector form, and their dimensionality may change each time they pass through a convolutional layer. For example, "128" in the convolutional layer 2a 301 in fig. 3 indicates that after feature extraction by the convolutional layer 2a 301, the extracted features have 128 dimensions; similarly, "4096" in the fully connected layer 6 302 indicates that the features have 4096 dimensions after processing by the fully connected layer 6 302.
In the three-dimensional convolutional neural network 310, the feature that is extracted or processed by any one of the convolutional layer, the pooling layer, or the full-connection layer may be used as the first feature or the second feature. In a subsequent embodiment of the present disclosure, the feature output by the pooling layer 4 is taken as a first feature, and the feature output by the fully connected layer 6 302 is taken as a second feature. It should be noted that the determined first feature and the second feature are not limited to the above exemplary case, and features output by other convolution layers, pooling layers, or full-connection layers may be used as the first feature or the second feature; also in the above example, the first feature and the second feature each correspond to one feature of one layer output of the three-dimensional convolutional neural network 310, however, the first feature and the second feature may also each correspond to a plurality of features of a plurality of layer outputs, and the plurality of features may be combined to form the first feature or the second feature, which is not particularly limited in this example embodiment.
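One conceivable way to expose the outputs of the pooling layer 4 and the fully connected layer 6 302 as the first and second features is with forward hooks. This is only an illustrative sketch: the handles net, low_layer and high_layer are hypothetical, not part of the embodiment.

```python
import torch

def tap_features(net, x, low_layer, high_layer):
    """Capture two intermediate outputs (e.g. pooling layer 4 and fully
    connected layer 6) in a single forward pass through the 3D CNN."""
    captured = {}
    h1 = low_layer.register_forward_hook(lambda m, i, o: captured.update(first=o))
    h2 = high_layer.register_forward_hook(lambda m, i, o: captured.update(second=o))
    try:
        net(x)  # run the network once; the hooks record the two features
    finally:
        h1.remove()
        h2.remove()
    return captured["first"], captured["second"]
```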
In step S220, the second feature is extracted by the primary classification model to obtain a third feature.
In this example embodiment, as shown in fig. 4, the three-dimensional convolutional neural network 410 for extracting computer vision data features may include a pooling layer 4 and a fully connected layer 6. The features output by the pooling layer 4 may be input as first features to the convolutional layer a of the secondary classification model 430 and then reduced in dimension through the pooling layer a and the fully connected layer 3 of the secondary classification model 430, yielding bottom-level features with 512 dimensions after the reduction. The features output by the fully connected layer 6 may be input as second features to the fully connected layer 1 of the primary classification model 420. Where the feature output by the fully connected layer 6 is D-dimensional (D equal to 4096, for example), the fully connected layer 1 may be set to a D × D1 vector matrix (D1 equal to 1024, for example), so that the D-dimensional second feature output from the fully connected layer 6, after being input to the fully connected layer 1, is multiplied by the D × D1 vector matrix, outputting a D1-dimensional (1024-dimensional) high-level feature, i.e., the third feature.
The reason why bottom-level features are extracted from the pooling layer 4 and the fully connected layer 6 of the three-dimensional convolutional neural network 410, and one of them is further refined into a high-level feature via the fully connected layer 1 of the primary classification model 420, is as follows: unlike the more significant differences between distinct data, confusion data often differ only in local detail (for example, a video about cricket, the sport, and a video about a cricket, the insect), so global information and local information usually need to be considered together when distinguishing confusion data. Numerous studies have shown that high-level features are suited to capturing global semantic information, while bottom-level features are suited to capturing detailed information. Therefore, when distinguishing distinct data, the differences between them can be well expressed through global semantic information, so recognition and classification can be performed based on the third feature, i.e., the high-level feature obtained by further extracting the second feature. When distinguishing confusion data, in addition to recognition and classification based on the detailed information carried by the bottom-level features, the high-level feature, i.e., the third feature, can be concatenated as an auxiliary recognition signal, which improves the accuracy of recognizing and classifying the confusion data and correspondingly reduces the number and dimensionality of bottom-level features that need to be used.
In step S230, the plurality of computer vision data is primarily classified by the primary classification model and based on the third feature, so as to obtain predicted values of the plurality of computer vision data belonging to each category.
In this example embodiment, as shown in fig. 4, the primary classification model 420 may further include a fully connected layer 2, which can perform primary classification of the input computer vision data based on the third feature. After the D1-dimensional (1024-dimensional) third feature is obtained through the fully connected layer 1, the fully connected layer 2 may be set to D1 × N, where N is the number of categories to be classified for the input data; the value of N may be set to 101, for example, in this example embodiment. The output of the fully connected layer 2 is the predicted value of the input data for each category. It should be noted that, in the example shown in fig. 4, the primary classification model 420 may include the fully connected layer 1 and the fully connected layer 2; however, this form with two fully connected layers is only the simplest exemplary form required to implement the functionality of the primary classification model. Indeed, the primary classification model 420 may also include other fully connected layers or take other forms based on this simplest exemplary form, which this example embodiment does not particularly limit.
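The two fully connected layers just described can be sketched as follows, a minimal illustration with D = 4096, D1 = 1024 and N = 101 as in the example; the layer names and the absence of activations between them are assumptions.

```python
import torch.nn as nn

class PrimaryClassifier(nn.Module):
    """Sketch of the primary classification model: fully connected layer 1
    maps the second feature (D dims) to the third feature (D1 dims), and
    fully connected layer 2 maps the third feature to N class scores."""
    def __init__(self, d=4096, d1=1024, n_classes=101):
        super().__init__()
        self.fc1 = nn.Linear(d, d1)          # second feature -> third feature
        self.fc2 = nn.Linear(d1, n_classes)  # third feature -> predicted values

    def forward(self, second_feature):
        third = self.fc1(second_feature)
        return third, self.fc2(third)
```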
As described above, when the input computer vision data include both distinct data and confusion data, the results of the primary classification may accordingly include both correctly classified and misclassified cases. In one example, the loss function loss_1 of the primary classification model may be set to include a cross-entropy function, namely:

loss_1 = -Σ_i y_i′ × log(P_i) (Equation 1)

wherein:

P_i = exp(s_i) / Σ_j exp(s_j) (Equation 2)

where s_i is the predicted value, output by the primary classification model, that the data currently being classified belongs to the i-th class, ranging between 0 and 1; P_i is the probability that the currently classified data belongs to the i-th class, obtained by normalizing s_i; and y_i′ is the real label indicating that the input data belongs to the i-th class. The real label may take the form of a one-hot vector; for example, if the categories to be classified are apple, banana and pear, the real labels of the three may be 100, 010 and 001 respectively. That is, a real label of 100 indicates that the input data belongs to the category "apple"; 010 indicates the category "banana"; and 001 indicates the category "pear". The number of bits of the one-hot vector corresponds to the number of categories to be classified. P_i, correspondingly, may take a form such as (0.4, 0.6, 1, 0.2, …), with the number of probability values corresponding to the number of categories to be classified.
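Equations 1 and 2 can be restated directly in code. The sketch below assumes scores of shape (batch, N) and one-hot labels; it simply applies softmax normalization followed by the cross-entropy sum.

```python
import torch
import torch.nn.functional as F

def primary_loss(scores, one_hot_labels):
    """loss_1 = -sum_i y_i' * log(P_i), with P_i = softmax(s_i) (Equations 1-2)."""
    p = F.softmax(scores, dim=1)  # Equation 2: normalize s_i into probabilities
    return -(one_hot_labels * torch.log(p + 1e-12)).sum(dim=1).mean()  # Equation 1
```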
By arranging the loss function loss_1 of the primary classification model to include a cross-entropy function, the primary classification model can reflect the similarity and difference between the true label distribution and the label distribution it predicts for the input data, which facilitates accurately distinguishing correctly classified data from misclassified data and, in turn, the further secondary classification of the identified confusion data downstream. Moreover, the user is not required to pay special attention to the classification performance of the primary classification model over the entire dataset.
In step S240, confusion data and confusion classes in the plurality of computer vision data are determined based on the predicted values, wherein the confusion classes are classes to which the confusion data are discriminated by the primary classification model.
In the present exemplary embodiment, based on the predicted values s_i of the input data for each category obtained through the fully connected layer 2, the data discrimination module 440 shown in fig. 4 can be used to determine which data are correctly classified after the primary classification, that is, distinct data, and which data are misclassified after the primary classification, that is, confusion data. In other words, the data discrimination module 440 may be used to discriminate between distinct data and confusion data. In addition, it can also be used to discriminate the confusion classes, that is, the classes into which the confusion data are erroneously discriminated by the primary classification model; for example, when the primary classification model classifies a video about apples into the category of tomatoes, the tomato category is a confusion class. The data discrimination module 440 may be implemented as a hardware unit or as a software unit provided with a preset instruction program or discrimination algorithm.
In one example, when discriminating between distinct data and confusion data, for each piece of data the data discrimination module 440 may use the predicted values s_i output by the primary classification model together with the data's true label to decide whether the data is distinct or confusing. The data discrimination module 440 may determine the maximum of the current data's predicted values, i.e., max(s_i), and compare the class achieving this maximum with the current data's true label gt. If gt equals the class corresponding to max(s_i), that is, the primary classification model predicts that the current data belongs to class i and this result matches the true label, the current data can be determined to be distinct data correctly classified by the primary classification model. If gt does not equal the class corresponding to max(s_i), that is, the prediction does not match the true label, the current data can be determined to be confusion data misclassified by the primary classification model. For distinct data that has been correctly classified, as shown in fig. 4, the classification result can be output directly once the data discrimination module 440 has identified it.
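The gt-versus-max(s_i) rule just described amounts to comparing the arg-max class with the true label. A sketch, with tensor shapes as assumptions:

```python
import torch

def split_distinct_confusing(scores, labels):
    """Split a batch into correctly classified (distinct) and misclassified
    (confusion) samples. scores is (B, N); labels holds class indices."""
    pred = scores.argmax(dim=1)      # class achieving max(s_i)
    distinct = pred == labels        # gt matches the arg-max class
    return distinct, ~distinct       # second mask marks the confusion data
```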
After the confusion data are determined, the confusion classes may also be determined, so that the downstream secondary classification model can more accurately classify the confusion data into the correct classes by using the confusion classes as error-comparison sample classes. For this purpose, the product of the maximum predicted value max(s_i) and a hyperparameter γ may be determined as a reference value, and the reference value compared with each predicted value s_i. The classes corresponding to predicted values s_i greater than the reference value may be determined as the confusion classes confionD, i.e., confionD = {s_i > γ × max(s_i)}. However, since the determination condition s_i > γ × max(s_i) is not differentiable, a ReLU activation function can be introduced to make it so. Thus, the confusion classes confionD may be:

confionD = {ReLU(s_i > γ × max(s_i))} (Equation 3)

In addition to the above manner, the product of the hyperparameter γ and the predicted value s_i^gt corresponding to the true label of the data currently being classified may instead be taken as the reference value, with the classes corresponding to predicted values s_i greater than this reference value determined as the confusion classes; that is, the confusion classes may be:

confionD = {ReLU(s_i > γ × s_i^gt)} (Equation 4)

Compared with the former manner of discriminating confusion classes, the latter usually needs to satisfy the precondition that the predicted value corresponding to the true label is relatively large. Therefore, in practical applications the former manner is preferred for its generality.
In the above two manners of discriminating confusion classes, the hyperparameter γ may range between 0 and 1 and can be adjusted continuously according to the quality of the classification results of the system formed by the primary and secondary classification models, that is, according to the system's classification results for distinct data and confusion data, so as to tune the proportion of confusion classes among all classes and screen out as much of the confusion data in the original input as possible. For example, assume there are 5 pieces of data whose predicted values s_i of belonging to the i-th class are [0.1, 0.4, 0.05, 0.3, 0.25] respectively; then max(s_i) = 0.4. If γ is set to 0.6, the data corresponding to predicted values s_i greater than γ × max(s_i), i.e., greater than 0.24, are determined as confusion data; that is, data 2, 4 and 5 are all so determined. The hyperparameter γ can be regarded as the tolerance of the data discrimination module 440 toward confusion data: the smaller the value of γ, the more classes are judged as confusion classes, that is, the more data are judged as confusion data and input into the secondary classification model; conversely, the larger the value of γ, the fewer classes are judged as confusion classes, that is, the less data are judged as confusion data and input into the secondary classification model.
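The γ-thresholding of Equation 3 and the worked example above can be checked with a few lines of code. This is an illustrative sketch: a hard boolean threshold is used here rather than the differentiable ReLU form.

```python
import torch

def confusion_classes(scores, gamma=0.6):
    """Classes whose predicted value exceeds gamma * max(s_i) (cf. Equation 3).
    For scores [0.1, 0.4, 0.05, 0.3, 0.25] and gamma = 0.6 the reference value
    is 0.24, selecting indices 1, 3 and 4 (i.e. data 2, 4 and 5 above)."""
    s = torch.as_tensor(scores)
    return (s > gamma * s.max()).nonzero().flatten()
```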
Through the above example, the confusion data and confusion classes among the computer vision data are determined, so that classification results are directly output for the easily classified distinct data, while the hard-to-classify confusion data are separated from the distinct data and sent, together with the determined confusion classes, to the subsequent secondary classification flow for further classification. In this way, classifying the distinct data no longer consumes the same time and resources as the confusion data; instead, the confusion data are classified separately, improving classification efficiency and saving machine resources.
In step S250, the confusion data are secondarily classified by the secondary classification model based on the confusion classes, the first feature and the third feature, so as to obtain the classification result of the confusion data.
In the present exemplary embodiment, the secondary classification model classifies only the confusion data; that is, although the confusion data input to the secondary classification model are fewer in number than the original input data, they were misclassified by the primary classification model, so correctly classifying them is more difficult. To this end, as shown in fig. 4, the confusion data and confusion classes determined by the data discrimination module 440 may be input into the secondary classification model 430, and the 512-dimensional first feature, a bottom-level feature, and the D1-dimensional (1024-dimensional) third feature, a high-level feature, may also be input to the secondary classification model 430 as the basis for the secondary classification. In one example, the first feature may be fused with the third feature in the secondary classification model 430, for example in a concat manner, so that the feature map information of the first and third features can be integrated together, without increasing the amount of information under each feature, for further data classification based on the integrated feature map information. After the fusion process, a 1536-dimensional fourth feature can be obtained. The secondary classification model 430 may also include a fully connected layer 4, through which it can secondarily classify the confusion data based on the confusion classes and the fourth feature, which includes both the bottom-level and high-level features of the input data.
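The concat fusion and fully connected layer 4 described above can be sketched as follows; this is an assumption-laden illustration with 512 + 1024 = 1536 input dimensions as in the example, and a single linear head.

```python
import torch
import torch.nn as nn

class SecondaryClassifier(nn.Module):
    """Sketch of the secondary classification model: concatenate the 512-dim
    first feature and the 1024-dim third feature into a 1536-dim fourth
    feature, then classify it through fully connected layer 4."""
    def __init__(self, d_low=512, d_high=1024, n_classes=101):
        super().__init__()
        self.fc4 = nn.Linear(d_low + d_high, n_classes)

    def forward(self, first_feature, third_feature):
        fourth = torch.cat([first_feature, third_feature], dim=1)  # concat fusion
        return self.fc4(fourth)
```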
In one example, since the secondary classification model 430 handles the more complex task of distinguishing confusion data, in order to make it focus more on learning the confusion classes, its loss function may be set to include an inter-confusion-class distance function component and a cross-entropy function component. The cross-entropy function component may be set, for example, to:

loss_21 = -Σ_i y_i′ × log(P_i) (Equation 5)

wherein:

P_i = exp(s_i) / Σ_j exp(s_j) (Equation 6)

and the inter-confusion-class distance function component may be set, for example, to:

loss_22 = Σ_{i∈confionD} ReLU(P_i - P_gt) (Equation 7)

wherein:

P_gt = exp(s_gt) / Σ_j exp(s_j) (Equation 8)

where s_i is the predicted value that the data currently being classified belongs to the i-th class, ranging between 0 and 1; P_i is the probability that the currently classified data belongs to the i-th class, obtained by normalizing s_i; y_i′ is the real label of the input data belonging to the i-th class; and P_gt is the probability, predicted by the secondary classification model, that the data currently being classified belongs to the category indicated by the real label.

The loss function loss_2 of the secondary classification model 430 may then be set to:

loss_2 = loss_21 + λ × loss_22 (Equation 9)
The value of the parameter λ may range from 0 to 1 and can be adjusted continuously according to the accuracy of the secondary classification model's results; an exemplary preferred value is 0.2. By setting the loss function of the secondary classification model 430 to include the inter-confusion-class distance function component and the cross-entropy function component, the probability P_gt that the data belongs to the category indicated by the real label can be continuously increased during training while the probabilities P_i that the data belongs to the confusion classes are continuously decreased, that is, P_i - P_gt is made as small as possible, so that the secondary classification model pays more attention to learning the confusion classes, further improving its ability to handle confusion samples. By correctly classifying the confusion data, the secondary classification model can also correct the confusion classes, so that the classified confusion data correspond to the correct categories.
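Equations 5-9 combine into a single loss. The sketch below is one plausible reading; the tensor shapes, the per-sample confusion-class mask, and the masking scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def secondary_loss(scores, one_hot_labels, confusion_mask, lam=0.2):
    """loss_2 = loss_21 + lambda * loss_22 (cf. Equations 5-9).

    confusion_mask is a (B, N) 0/1 tensor marking each sample's confusion
    classes confionD; lam is the lambda parameter (exemplary value 0.2)."""
    p = F.softmax(scores, dim=1)                                   # Equations 6 and 8
    loss21 = -(one_hot_labels * torch.log(p + 1e-12)).sum(dim=1)   # Equation 5
    p_gt = (p * one_hot_labels).sum(dim=1, keepdim=True)           # P_gt
    loss22 = (F.relu(p - p_gt) * confusion_mask).sum(dim=1)        # Equation 7
    return (loss21 + lam * loss22).mean()                          # Equation 9
```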
In one example, the computer vision data classification method according to the present disclosure may further include a step of training the primary classification model and the secondary classification model, for example, a primary training step, a first training condition judgment step, a secondary training step, a second training condition judgment step, and a third training condition judgment step, before classifying the input plurality of computer vision data using the primary classification model and the secondary classification model. FIG. 5 schematically illustrates the decomposition steps of a process for training a primary classification model and a secondary classification model, wherein:
the primary training steps may include, for example, steps S501, S502, S503, and S504. Wherein, at S501, a plurality of computer vision training data for training, such as training video data having a resolution of 320×240, etc., may be input to the three-dimensional convolutional neural network as described above;
At S502, computer vision data features may be extracted from training data through a three-dimensional convolutional neural network in a manner as described above, resulting in first training features and second training features; and the second training features can be further extracted through the full connection layer 1 of the primary classification model, for example, so as to obtain third training features; wherein the first training feature is, for example, a bottom layer training feature and the third training feature is, for example, a high layer training feature;
in S503, the primary classification model may be trained based on the third training feature, for example, the primary classification model may perform primary training classification on the input multiple pieces of computer vision training data through the full-connection layer 2 based on the third training feature, so as to obtain training prediction values of which the training data output by the full-connection layer 2 belong to each training class;
at S504, the data discrimination module may be utilized to discriminate based on the training predictors in the manner described above, thereby determining the confusion training data and the confusion training class in the training data, where the confusion training class is the class to which the confusion training data is discriminated by the primary classification model.
The first training condition judgment step may include, for example, step S505. Therein, at S505, it may be determined whether a first training condition is reached. In one example, the first training condition may be, for example, that the number of times the primary classification model is trained reaches a preset primary classification training threshold, which may be set to, for example, 5; that is, when the process of inputting training data and training the primary classification model is repeatedly performed 5 times, it is determined that the first training condition is reached; otherwise, it is determined that the first training condition is not reached. If it is determined that the first training condition is not met, a return may be made to the primary training step and the "input training data-training" process repeated; if it is determined that the first training condition has been reached, a transition may be made to a secondary training step, i.e. the training of the secondary classification model is in turn performed.
The secondary training step may include, for example, step S506. In S506, the confusion training data and the corresponding confusion training class determined in S504 may be input into the secondary classification model in batches. For example, training is performed 5 times when training the primary classification model, and 5 batches of confusion training data and confusion training classes are correspondingly obtained; when the secondary classification model is input, the obtained confusion training data and confusion training classes can be input into the secondary classification model in batches, and each batch of confusion training data and confusion training classes corresponds to one training process of the secondary classification model. In addition to the confusion training data and the confusion training class, the first training feature and the third training feature may also be input into the secondary classification model, so that the secondary classification model performs secondary training classification on the confusion training data of each batch through the full connection layer 4 and based on the confusion training class and the first training feature and the third training feature, thereby obtaining a classification result of the confusion training data.
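A matching sketch of the secondary branch used in S506 follows; concatenation as the fusion of the first and third features, and the dimensions, are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SecondaryClassifier(nn.Module):
    """Sketch of the secondary branch: the bottom-layer first feature and
    the high-level third feature are fused into a fourth feature, which
    full connection layer 4 classifies. Concatenation is an assumed
    fusion; both inputs are assumed already pooled to vectors."""
    def __init__(self, first_dim=256, third_dim=256, num_classes=10):
        super().__init__()
        self.fc4 = nn.Linear(first_dim + third_dim, num_classes)  # full connection layer 4

    def forward(self, first_feature, third_feature):
        fourth_feature = torch.cat([first_feature, third_feature], dim=1)
        return self.fc4(fourth_feature)
```

Each banked batch of confusion training data and confusion training classes then drives one training pass of this branch, with a loss of the kind sketched after the λ discussion above.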
The second training condition judgment step may include, for example, step S507. Wherein, at S507, it may be determined whether a second training condition is reached. In one example, the second training condition may be, for example, that the number of times the secondary classification model is trained reaches a preset secondary classification training threshold, which may be set to, for example, 5; that is, when the process of inputting the confusion training data and the confusion training class and training the secondary classification model is repeatedly performed 5 times, it is determined that the second training condition is reached; otherwise, it is determined that the second training condition is not reached. If it is determined that the second training condition is not met, a return may be made to the secondary training step and the "input confusion training data-training" process repeated; if it is determined that the second training condition has been reached, it may be shifted to the third training condition judgment step.
The third training condition judgment step may include, for example, step S508, where it may be determined whether a third training condition is reached. In one example, the third training condition may include, for example: a) the sum of the number of times the primary classification model is trained and the number of times the secondary classification model is trained reaching a preset total training number; or b) the classification accuracy reaching a preset accuracy threshold. For case a), it may be determined empirically that a total training number of 50, that is, repeating the process of "train the primary classification model 5 times, then train the secondary classification model 5 times" 5 times, generally yields a sufficiently accurate training classification result. The third training condition judgment step may therefore determine whether the combined number of times the primary classification model and the secondary classification model have been trained has reached 50; if not, a return may be made to the primary training step and the "train primary classification model - train secondary classification model" process repeated; if the total training number has been reached, training may be stopped. For case b), the classification accuracy of the system may be calculated as the ratio of the number of correctly classified computer vision training data to the total number of computer vision training data, that is:
Classification accuracy = (number of correctly classified non-confusion training data + number of correctly classified confusion training data) / total number of training data (Equation 10)
The classification accuracy may reflect the training maturity of the primary classification model and the secondary classification model. For example, the accuracy threshold may be set to 0.95; that is, when the classification accuracy is not lower than 95%, the primary classification model and the secondary classification model are considered to have satisfactory classification ability, the third training condition is reached, and training may be stopped accordingly. Otherwise, the classification abilities of the primary classification model and the secondary classification model are considered not yet up to standard, and the method returns to the primary training step to repeat the process of "training the primary classification model - training the secondary classification model".
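Putting S501 to S508 together, the cross-type schedule might be organized as below. The two helpers are hypothetical callables of this sketch: train_primary_once runs one primary pass and returns the resulting confusion batch plus the current classification accuracy per Equation 10, and train_secondary_once runs one secondary pass on a banked batch and returns the updated accuracy:

```python
def cross_train(train_primary_once, train_secondary_once,
                primary_rounds=5, secondary_rounds=5,
                total_rounds=50, acc_threshold=0.95):
    """Sketch of the cross-type training schedule: 5 primary passes,
    then 5 secondary passes on the banked confusion batches, repeated
    until the total training number (condition a) or the accuracy
    threshold (condition b) is reached."""
    rounds, banked, acc = 0, [], 0.0
    while rounds < total_rounds:                        # condition a), S508
        for _ in range(primary_rounds):                 # S501-S505
            confusion_batch, acc = train_primary_once()
            banked.append(confusion_batch)
            rounds += 1
        for batch in banked[-secondary_rounds:]:        # S506-S507
            acc = train_secondary_once(batch)
            rounds += 1
        if acc >= acc_threshold:                        # condition b), S508
            break
    return acc
```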
It should be noted that the specific values listed in this example are example values given for illustrative purposes. In an application scenario, each value may be set flexibly according to actual requirements, for example, the total training number may be set to 80 and the accuracy threshold to 0.97; the values are not limited to those listed in this example.
The above example provides a "cross-type" model training approach, i.e., training the primary classification model and the secondary classification model alternately until a training-stop condition is reached. Traditional training methods may include "individual" training, in which the primary classification model is fully trained first and the secondary classification model is then trained in turn, and "joint" training, in which the process of "input training data, train the primary classification model once, train the secondary classification model once" is iterated continuously. The problem with "individual" training is that, because a fully trained primary classification model sits upstream of the secondary classification model, very little confusion training data reaches the secondary classification model for training, so the secondary classification model is under-fitted and its classification performance on confusion data is greatly reduced. The problem with "joint" training is that, in the initial stage of training, almost all training data may be judged as confusion training data, so a large amount of training data enters the secondary classification model directly, the primary classification model cannot be trained well, and its classification performance is reduced.
Compared with these two training modes, the cross training mode provided in this example can dynamically adjust the number of iterative training passes in real time according to the learning conditions of the primary classification model and the secondary classification model, and fully exploit the roles of both models, so that the two models achieve relatively balanced training effects during the training process and their classification performance improves alternately and continuously until a satisfactory training effect is finally reached. In this way, training efficiency can be improved while the post-training classification performance of both the primary classification model and the secondary classification model is ensured, avoiding the situation in which a single classification model ends up with low classification performance.
Exemplary apparatus
Having introduced the computer vision data classification method according to the exemplary embodiments of the present disclosure, a computer vision data classification apparatus according to an exemplary embodiment of the present disclosure is described next with reference to fig. 6. The apparatus embodiment inherits the related descriptions from the method embodiment, and is therefore supported by the detailed descriptions given there.
Referring to fig. 6, the computer vision data classification apparatus 600 may include a feature extraction module 610, a primary classification module 620, a data discrimination module 630, and a secondary classification module 640, wherein:
the feature extraction module 610 may be configured to extract data features from a plurality of computer vision data input by a user to obtain a first feature and a second feature;
the primary classification module 620 may be configured to extract the second feature to obtain a third feature, and to perform primary classification on the plurality of computer vision data based on the third feature to obtain predicted values of the plurality of computer vision data belonging to each category;
the data discriminating module 630 may be configured to determine, based on the predicted value, confusion data and confusion class in the plurality of computer vision data, where the confusion class is a class to which the confusion data is discriminated by the primary classification model; and
the secondary classification module 640 may be configured to perform secondary classification on the confusion data based on the first feature and the third feature to obtain a classification result of the confusion data.
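For illustration, the four modules of apparatus 600 might be wired together as follows; the class name, call signatures, and the boolean-mask interface are assumptions of this sketch rather than the apparatus's defined interfaces:

```python
class ComputerVisionDataClassifier:
    """Hypothetical composition of the four modules of apparatus 600."""
    def __init__(self, feature_extraction, primary, data_discrimination, secondary):
        self.feature_extraction = feature_extraction    # module 610
        self.primary = primary                          # module 620
        self.data_discrimination = data_discrimination  # module 630
        self.secondary = secondary                      # module 640

    def classify(self, data):
        first, second = self.feature_extraction(data)       # first and second features
        third, predicted_values = self.primary(second)      # third feature + predicted values
        is_confusion, _ = self.data_discrimination(predicted_values)
        # Only the confusion data is re-classified by the secondary module.
        results = self.secondary(first[is_confusion], third[is_confusion])
        return predicted_values, results
```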
Since each functional module of the computer vision data classification apparatus in the embodiment of the present disclosure is the same as in the above method embodiment, further description is omitted here.
Exemplary electronic device
Next, an electronic device of an exemplary embodiment of the present disclosure will be described. The electronic device of the exemplary embodiment of the disclosure comprises the computer vision data classification device.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein generally as a "circuit," "module," or "system."
In some possible embodiments, an electronic device according to the present disclosure may include at least one processing unit, and at least one storage unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps in the computer vision data classification method according to various exemplary embodiments of the disclosure described in the "method" section of the present specification. For example, the processing unit may perform steps S210 to S250 as described in fig. 2.
An electronic device 700 according to such an embodiment of the present disclosure is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 7, the electronic device 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data required for system operation. The CPU 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a Cathode Ray Tube (CRT) display or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read therefrom can be installed into the storage section 708 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 709, and/or installed from the removable medium 711. When executed by the Central Processing Unit (CPU) 701, the computer program performs the various functions defined in the methods and systems of the present disclosure.
Exemplary program product
In some possible embodiments, the various aspects of the present disclosure may also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to perform the steps of the computer vision data classification method according to the various exemplary embodiments of the present disclosure described in the "method" section of this specification. For example, the terminal device may perform steps S210 to S250 as described in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical disk, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In addition, as technology develops, the notion of a readable storage medium should be interpreted accordingly.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that while several modules or sub-modules of a computer vision data classification device are mentioned in the detailed description above, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more modules described above may be embodied in one module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module described above may be further divided into a plurality of modules to be embodied.
Furthermore, although the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features in these aspects cannot be combined; this division is adopted for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (12)

1. A method of classifying computer vision data, comprising:
extracting computer vision data features from the input plurality of computer vision data to obtain first features and second features;
extracting the second feature through a primary classification model to obtain a third feature; the third feature is a high-level feature for capturing global semantic information;
performing primary classification on the plurality of computer vision data based on the third feature through the primary classification model to obtain predicted values of the plurality of computer vision data belonging to each category;
determining confusion data and confusion classes in the plurality of computer vision data based on the predicted values, wherein the confusion classes are classes to which the confusion data are judged by the primary classification model;
and performing secondary classification on the confusion data based on the confusion class, the first feature and the third feature through a secondary classification model to obtain a classification result of the confusion data.
2. The computer vision data classification method of claim 1, wherein the loss function of the primary classification model comprises a cross entropy function and the loss function of the secondary classification model comprises a distance function component and a cross entropy function component between confusion classes.
3. The computer-vision data classification method of claim 1, wherein the determining confusion data and confusion classes in the plurality of computer-vision data based on the predicted values comprises:
determining a maximum value of the predicted values of each computer vision data and comparing the maximum value with a true label of each computer vision data;
determining computer vision data corresponding to a maximum value unequal to the real label as confusion data; and
determining the product of the maximum value and a super parameter as a reference value;
and determining the category corresponding to the predicted value larger than the reference value as the confusion category.
4. The computer vision data classification method of claim 1, wherein prior to said extracting computer vision data features from the input plurality of computer vision data, the method further comprises:
a primary training step, which comprises inputting a plurality of computer vision training data, and training the primary classification model based on the plurality of computer vision training data to determine confusion training data and confusion training classes in the plurality of computer vision training data, wherein the confusion training classes are classes to which the confusion training data are judged by the primary classification model;
A first training condition judgment step of determining whether a first training condition is reached;
if not, returning to the primary training step;
if yes, turning to a secondary training step, wherein the secondary training step comprises training the secondary classification model based on the confusion training data and the confusion training class;
a second training condition judgment step of determining whether a second training condition is reached;
if not, returning to the secondary training step;
if yes, the method goes to a third training condition judging step, wherein the third training condition judging step comprises determining whether a third training condition is reached;
if not, returning to the primary training step;
if so, the training is stopped.
5. The computer vision data classification method of claim 4, wherein the training the primary classification model based on the plurality of computer vision training data comprises:
extracting computer vision data features from the plurality of computer vision training data to obtain first training features and second training features;
Extracting the second training features through a primary classification model to obtain third training features;
performing primary training classification on the plurality of computer vision training data based on the third training features through the primary classification model to obtain training predicted values of the plurality of computer vision training data belonging to each training class;
determining the confusion training data and the confusion training class in the plurality of computer vision training data based on the training predictor.
6. The computer-vision data classification method of claim 5, wherein the training the secondary classification model based on the confusion training data and the confusion training class comprises:
and performing secondary training classification on the confusion training data through the secondary classification model and based on the confusion training class, the first training feature and the third training feature so as to obtain a classification result of the confusion training data.
7. The computer vision data classification method of claim 4, wherein the determining whether the first training condition is reached comprises:
determining whether the number of times of training the primary classification model reaches a preset primary classification training threshold;
The determining whether the second training condition is reached includes:
determining whether the number of times of training the secondary classification model reaches a preset secondary classification training threshold;
the determining whether the third training condition is reached includes:
determining whether the sum of the number of times the primary classification model is trained and the number of times the secondary classification model is trained reaches a preset total training number; or
whether the classification accuracy reaches a preset accuracy threshold, wherein the classification accuracy is the ratio of the number of correctly classified computer vision training data to the number of the plurality of computer vision training data.
8. The computer vision data classification method of claim 1, wherein the extracting data features from the input plurality of computer vision data comprises:
a data feature is extracted from the plurality of computer vision data using a three-dimensional deep neural network or a two-dimensional deep neural network.
9. The computer-vision data classification method of claim 6, wherein the sub-classifying the confounding data based on the first feature and the third feature comprises:
And carrying out feature fusion processing on the first feature and the third feature to obtain a fourth feature, and carrying out secondary classification on the confusion data based on the fourth feature.
10. A computer vision data classification device, comprising:
the feature extraction module is used for extracting data features from a plurality of computer vision data input by a user to obtain a first feature and a second feature;
the primary classification module is used for extracting the second features to obtain third features; and performing primary classification on the plurality of computer vision data based on the third feature to obtain predicted values of the plurality of computer vision data belonging to each category; the third feature is a high-level feature for capturing global semantic information;
the data judging module is used for determining confusion data and confusion classes in the plurality of computer vision data based on the predicted value, wherein the confusion classes are the classes to which the confusion data are judged by the primary classifying module; and
and the secondary classification module is used for carrying out secondary classification on the confusion data based on the first feature and the third feature so as to obtain a classification result of the confusion data.
11. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the computer vision data classification method of any of claims 1-9 based on instructions stored in the memory.
12. A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a computer-vision data classification method as claimed in any one of claims 1-9.
CN202110948959.8A 2021-08-18 2021-08-18 Computer vision data classification method, device, electronic equipment and storage medium Active CN113569986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110948959.8A CN113569986B (en) 2021-08-18 2021-08-18 Computer vision data classification method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110948959.8A CN113569986B (en) 2021-08-18 2021-08-18 Computer vision data classification method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113569986A CN113569986A (en) 2021-10-29
CN113569986B true CN113569986B (en) 2023-06-30

Family

ID=78172032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110948959.8A Active CN113569986B (en) 2021-08-18 2021-08-18 Computer vision data classification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113569986B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298391A (en) * 2019-06-12 2019-10-01 同济大学 A kind of iterative increment dialogue intention classification recognition methods based on small sample
CN110414561A (en) * 2019-06-26 2019-11-05 武汉大学 A kind of construction method of the natural scene data set suitable for machine vision
CN112766511A (en) * 2019-11-01 2021-05-07 伊姆西Ip控股有限责任公司 Method, apparatus and computer program product for model adaptation
CN111104980A (en) * 2019-12-19 2020-05-05 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for determining classification result
CN111414951A (en) * 2020-03-16 2020-07-14 中国人民解放军国防科技大学 Method and device for finely classifying images
CN112308090A (en) * 2020-09-21 2021-02-02 北京沃东天骏信息技术有限公司 Image classification method and device
CN112183994A (en) * 2020-09-23 2021-01-05 南方电网数字电网研究院有限公司 Method and device for evaluating equipment state, computer equipment and storage medium
CN113240013A (en) * 2021-05-17 2021-08-10 平安科技(深圳)有限公司 Model training method, device and equipment based on sample screening and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Classification-error-guided hierarchical B-CNN model for fine-grained classification; Shen Haihong; Yang Xing; Wang Lingfeng; Pan Chunhong; Journal of Image and Graphics (Issue 07); 48-56 *

Also Published As

Publication number Publication date
CN113569986A (en) 2021-10-29

Similar Documents

Publication Publication Date Title
US10395147B2 (en) Method and apparatus for improved segmentation and recognition of images
CN108229419B (en) Method and apparatus for clustering images
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN107608964B (en) Live broadcast content screening method, device, equipment and storage medium based on barrage
WO2020253127A1 (en) Facial feature extraction model training method and apparatus, facial feature extraction method and apparatus, device, and storage medium
Tu et al. Enhancing the alignment between target words and corresponding frames for video captioning
US11915500B2 (en) Neural network based scene text recognition
CN108154191B (en) Document image recognition method and system
US20220092407A1 (en) Transfer learning with machine learning systems
CN110826494A (en) Method and device for evaluating quality of labeled data, computer equipment and storage medium
CN116049397B (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN112395487A (en) Information recommendation method and device, computer-readable storage medium and electronic equipment
CN116089648B (en) File management system and method based on artificial intelligence
CN115359308B (en) Model training method, device, equipment, storage medium and program for identifying difficult cases
CN113762303B (en) Image classification method, device, electronic equipment and storage medium
Zhou et al. STI-Net: Spatiotemporal integration network for video saliency detection
CN110826616B (en) Information processing method and device, electronic equipment and storage medium
CN116109907B (en) Target detection method, target detection device, electronic equipment and storage medium
CN110059743B (en) Method, apparatus and storage medium for determining a predicted reliability metric
CN113569986B (en) Computer vision data classification method, device, electronic equipment and storage medium
CN115482436B (en) Training method and device for image screening model and image screening method
CN111738290A (en) Image detection method, model construction and training method, device, equipment and medium
Ji et al. A survey of methods for addressing the challenges of referring image segmentation
US20210342642A1 (en) Machine learning training dataset optimization
CN116030375A (en) Video feature extraction and model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant