CN115130573A - Data processing method, device, storage medium, equipment and product - Google Patents

Info

Publication number
CN115130573A
CN115130573A
Authority
CN
China
Prior art keywords
sample
sample data
label
feature
data
Prior art date
Legal status
Pending
Application number
CN202210734509.3A
Other languages
Chinese (zh)
Inventor
翁运鹏
陈亮
何秀强
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210734509.3A
Publication of CN115130573A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method, apparatus, storage medium, device and product, the method comprising: acquiring at least one sample data and a category label of each sample data, where one sample data contains sample features under one or more feature dimensions, and the category label of any sample data is obtained by performing feature conversion on the sample features of that sample data under the different feature dimensions; fitting the sample features of each sample data under the different feature dimensions to the corresponding category label to obtain a fitting result, which indicates the importance of the different feature dimensions in the process of adding the corresponding category label to each sample data; and, according to the importance indicated by the fitting result, taking the feature dimensions that satisfy a selection condition as the conversion basis of the feature conversion process, where the conversion basis is used to generate an interpretability analysis result of the feature conversion. By the method, interpretability analysis of the feature conversion process can be realized.

Description

Data processing method, device, storage medium, equipment and product
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, apparatus, storage medium, device, and product.
Background
Complex models such as deep neural networks can abstract sample data into a new vector space through feature conversion, which strengthens their ability to extract information from the original data and thus yields excellent performance on most tasks. However, such complex models are equivalent to a black box: people cannot understand why the black-box model makes certain decisions, so interpretability analysis of the feature conversion process of the black-box model is necessary.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, a storage medium, equipment and a product, and can realize interpretable analysis of a feature conversion process.
In one aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring at least one sample data and a category label of any sample data; one sample data contains sample characteristics under one or more characteristic dimensions, and the category label of any sample data is obtained by performing characteristic conversion on the sample characteristics of any sample data under different characteristic dimensions;
fitting sample features of each sample data under different feature dimensions to the corresponding category label to obtain a fitting result, so as to indicate the importance of different feature dimensions in the process of adding the corresponding category label to each sample data;
according to the importance indicated by the fitting result, using the feature dimensions that satisfy a selection condition as the conversion basis of the feature conversion process; the conversion basis is used to generate an interpretability analysis result of the feature conversion.
In one aspect, an embodiment of the present application provides a data processing apparatus, where the apparatus includes:
an obtaining unit, configured to obtain at least one sample data and a category label of any sample data; one sample data contains sample features under one or more feature dimensions, and the category label of any sample data is obtained by performing feature conversion on the sample features of that sample data under different feature dimensions;
a processing unit, configured to fit sample features of each sample data under different feature dimensions to the corresponding category label to obtain a fitting result, so as to indicate the importance of different feature dimensions in the process of adding the corresponding category label to each sample data;
the processing unit is further configured to use the feature dimensions that satisfy the selection condition as the conversion basis of the feature conversion process according to the importance indicated by the fitting result; the conversion basis is used to generate an interpretability analysis result of the feature conversion.
In one aspect, the present application provides a computer device, where the computer device includes a processor, a communication interface, and a memory, where the processor, the communication interface, and the memory are connected to each other, where the memory stores a computer program, and the processor is configured to invoke the computer program to execute the data processing method according to any one of the foregoing possible implementation manners.
In one aspect, the present application provides a computer-readable storage medium, where a computer program is stored, and when executed by a processor, the computer program implements the data processing method of any one of the possible implementations.
In one aspect, an embodiment of the present application further provides a computer program product, where the computer program product includes a computer program or computer instructions, and the computer program or the computer instructions are executed by a processor to implement the steps of the data processing method provided in the embodiment of the present application.
In an aspect, an embodiment of the present application further provides a computer program, where the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method provided in the embodiment of the present application.
In the embodiments of the application, at least one sample data and the category label of each sample data can be acquired, where the category label of any sample data is obtained by performing feature conversion on the sample features of that sample data under different feature dimensions. The sample features of each sample data under different feature dimensions are fitted to the corresponding category label to obtain a fitting result, and the fitting result can indicate the importance of different feature dimensions in the process of adding the corresponding category label to each sample data. Therefore, through the importance indicated by the fitting result, the feature dimensions that satisfy the selection condition can be taken as the conversion basis of the feature conversion process; that is, the feature dimensions that satisfy the selection condition are the important factors for distinguishing sample data in the feature conversion process, and an interpretability analysis result of the feature conversion can be generated based on the conversion basis. By the embodiments of the application, interpretability analysis of the feature conversion process can be realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without inventive labor.
Fig. 1 is a schematic diagram of data processing according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 3 is a schematic flow chart of another data processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a target network model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of obtaining at least one sample data according to an embodiment of the present application;
fig. 6 is a schematic flowchart of another data processing method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a decision path of a target network model according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating a decision path of another target network model according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it should be understood that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a data processing method, which can realize interpretable analysis of a feature conversion process and can be applied to various fields or scenarios such as cloud technology, artificial intelligence, blockchain, Internet of vehicles, intelligent transportation, smart home and the like. In one embodiment, the data processing method can be implemented based on machine learning techniques within artificial intelligence. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation and the like. Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. It specializes in studying how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied throughout all fields of artificial intelligence.
Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning. For example, in the embodiments of the present application, a machine learning technique is used to fit sample data to the corresponding category labels, so as to determine the conversion basis of the feature conversion process from the resulting fitting result.
The execution subject of the data processing method provided by the application is computer equipment, and the computer equipment can comprise one or more of a terminal, a server and the like. That is, the data processing method proposed in the embodiment of the present application may be executed by a terminal, may be executed by a server, or may be executed by both a terminal and a server capable of communicating with each other.
The terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform.
Complex models such as deep neural networks contain a large number of nonlinear network layers, and the combination of these layers can extract and characterize the original data at various levels of abstraction; however, such models have high complexity and a large number of parameters, and people cannot understand how decisions are made in this end-to-end manner. A transparent model, by contrast, is a model with a simple, intuitively understandable structure, such as a logistic regression model, a decision tree model or a naive Bayes model, and can explain every link in the process from data input to output prediction. Therefore, as shown in fig. 1, in the data processing method provided by the present application, after performing feature conversion on sample data to obtain a sample representation, a computer device may add a category label to the sample data according to the category attribute of the sample data reflected by the sample representation, and then fit the sample data to the corresponding category label through a transparent model to obtain a fitting result. The fitting result is in effect an interpretability analysis result of the transparent model, and can reflect the degree to which the different feature dimensions of the sample data influence the fitting of the corresponding category label.
Therefore, the importance of different feature dimensions in the process of adding the corresponding category label to each sample data is indicated by the fitting result. Through that importance, it can be known which feature dimensions of one sample data are important factors in the feature conversion process, so that the conversion basis of the feature conversion process is obtained and the interpretability analysis result of the feature conversion is generated based on that conversion basis, thereby realizing interpretability analysis of the feature conversion process.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a data processing method according to an embodiment of the present application. The method may be applied to the computer device of fig. 1 described above, and includes the following steps:
S201, obtaining at least one sample data and a category label of any sample data; one sample data contains sample characteristics under one or more characteristic dimensions, and the category label of any sample data is obtained by performing characteristic conversion on the sample characteristics of any sample data under different characteristic dimensions.
In the embodiment of the present application, a sample data may be used to describe an object, which refers to an objectively existing thing, such as a person, a flower, an animal, and the like. One or more characteristic dimensions are obtained by multi-dimensionally dividing the object, for example, a person can be divided according to height and weight, a flower can be divided according to color, size and growth period, and an animal can be divided according to flower pattern and body temperature. The sample feature in the feature dimension refers to a value corresponding to the feature dimension, for example, a height of a person is 156 cm, a color of a flower is red, and a body temperature of an animal is constant. It should be noted that each sample data in the at least one sample data includes one or more same feature dimensions.
In an embodiment, the computer device may perform feature transformation on sample features of any sample data in different feature dimensions to obtain a feature vector corresponding to any sample data. Then, based on the feature vector corresponding to any sample data, the sample data with high similarity between corresponding feature vectors is divided into one data set, and the sample data with larger difference of corresponding feature vectors is divided into different data sets. Finally, the class label of any sample data is determined based on the data set where the sample data is located, wherein one class label may correspond to (indicate) one data set.
The process of performing feature conversion on sample data is equivalent to the processing of a black-box model: the black-box model analyzes the sample features under each feature dimension of the sample data and determines which category attribute the sample data tends to belong to, thereby generating a feature vector that makes it easy to distinguish the sample data as that category attribute. Taking a common image classification model as an example, if the real label of an image is a cat, an image feature that is easy to identify as a cat is generated; meanwhile, this image feature differs greatly from the image features of images whose real labels are other categories, so that the model can classify them. Therefore, the higher the similarity between the corresponding feature vectors, the higher the probability that the sample data is classified into the same category attribute, and the larger the difference between the corresponding feature vectors, the lower that probability. The above process of partitioning at least one sample data by feature vectors is equivalent to determining to which category attribute (indicated by a category label) the feature conversion process tends to attribute each sample data. Since the feature conversion process often involves a large number of nonlinear mappings, people cannot know which feature dimensions of the sample data the analysis is based on when the features are converted.
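The grouping of sample data by feature-vector similarity described above can be sketched as follows. This is a minimal pure-Python illustration that assumes a simple k-means-style grouping (the patent does not prescribe a specific grouping algorithm); the function name and toy data are illustrative only.

```python
import math
import random

def kmeans_labels(vectors, k, iters=20, seed=0):
    """Group feature vectors by similarity and return a class label
    (cluster index) for each vector, as in the pseudo-labelling step:
    similar vectors land in the same data set / category label."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # assign each feature vector to the nearest cluster center
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: math.dist(v, centers[c]))
        # move each center to the mean of its assigned vectors
        for c in range(k):
            members = [v for i, v in enumerate(vectors) if labels[i] == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels

# two well-separated groups of "feature vectors" produced by feature conversion
vecs = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
labels = kmeans_labels(vecs, k=2)
```

Vectors with high mutual similarity receive the same label while dissimilar vectors receive different labels, matching the division into data sets described above.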
S202, fitting sample features of each sample data under different feature dimensions to the corresponding category label to obtain a fitting result, so as to indicate the importance of different feature dimensions in the process of adding the corresponding category label to each sample data.
In the embodiment of the application, the sample features of each sample data under different feature dimensions are fitted to the corresponding category label to obtain a fitting result. The fitting process is equivalent to a process that takes as input the sample features of any sample data under different feature dimensions and outputs the category label of that sample data; thus the essence of both the fitting process and the feature conversion process is to judge, by analyzing the sample features under each feature dimension, which category label the sample data tends to receive. The fitting process can be realized through a transparent model, and since a transparent model can explain every link in the process from data input to output prediction, interpretability analysis of the feature conversion process can be carried out through the interpretability result of the transparent model. The transparent model, that is, the fitting process, is analyzed for interpretability through the fitting result: the fitting result can reflect the degree to which the different feature dimensions of any sample data influence the fitting of the corresponding category label, and this degree of influence reflects the importance of each feature dimension in the process of adding the corresponding category label to each sample data.
Specifically, a fitting model can be determined through the fitting process. The sample features of any sample data under different feature dimensions are used as the input of the fitting model and are transformed by the target weights w_jm of the fitting model to obtain its output, where the output can indicate the category label of that sample data. For example, when one sample data is input into the fitting model, the output of the fitting model is the prediction probability of adding each of the different category labels to that sample data, and the category label with the maximum prediction probability is the category label of that sample data. Here, the absolute value of the target weight w_jm reflects the degree to which the jth feature dimension of the sample data influences the prediction probability when the fitting model determines the prediction probability of adding the mth category label to the sample data. In addition, if the target weight w_jm is positive, the jth feature dimension is positively correlated with the prediction probability, that is, the larger the sample feature under the jth feature dimension, the larger the prediction probability; if the target weight w_jm is negative, the jth feature dimension is negatively correlated with the prediction probability, that is, the larger the sample feature under the jth feature dimension, the smaller the prediction probability. Here, j and m are integers greater than 0.
From the above target weights w_jm it can be seen that if the target weight w_jm is positive and larger, then the larger the sample feature under the jth feature dimension of one sample data, the more easily the mth category label is added to that sample data. If the target weight w_jm is negative and its absolute value is larger, then the larger the sample feature under the jth feature dimension, the less easily the mth category label is added. If the absolute value of the target weight w_jm is small, the sample feature under the jth feature dimension has little influence on whether the mth category label is added. Thus, taking the target weights w_jm as the fitting result, it can be known which feature dimensions have higher importance in the feature conversion process, and how those feature dimensions influence the determination of the category labels of the sample data.
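The role of the target weights w_jm can be illustrated with a small transparent model. The sketch below fits a logistic regression model (one of the transparent models named earlier) by plain gradient descent; for simplicity it uses the binary case, so the learned weight w_j stands in for w_jm with m fixed. The data and learning-rate settings are illustrative assumptions, not from the patent.

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit a transparent linear model; the learned weight w[j] plays the
    role of the target weight w_jm in the text (binary case, m fixed)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # prediction probability
            g = p - yi                        # gradient of the log loss
            b -= lr * g
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
    return w, b

# feature dimension 0 decides the label; dimension 1 is noise
X = [[0.0, 1.0], [0.1, 0.0], [1.0, 1.0], [0.9, 0.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
# |w[0]| dominates |w[1]|: dimension 0 has high importance, and w[0] > 0
# means the larger the feature, the larger the prediction probability
```

After fitting, reading the sign and magnitude of each weight gives exactly the per-dimension interpretation described in the paragraph above.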
S203, according to the importance indicated by the fitting result, taking the feature dimensions that satisfy the selection condition as the conversion basis of the feature conversion process; the conversion basis is used to generate an interpretability analysis result of the feature conversion.
Sample data can be abstracted into a high-dimensional space (such as a vector space) through feature conversion, but the information learned in the feature conversion process is difficult to interpret: people cannot know which feature dimension or dimensions play an important role in the feature conversion process, or how they play that role; that is, the conversion basis of the feature conversion process is unknown.
The fitting process likewise determines, from the sample features under different feature dimensions, which category label one sample data tends to receive, so interpretability analysis of the feature conversion step can be realized based on the fitting of the sample data; that is, the feature dimensions that satisfy the selection condition can be taken as the conversion basis of the feature conversion process through the importance indicated by the fitting result. Here, the importance indicated by the fitting result is quantified by the target weights w_jm. Understandably, when the target weight w_jm is positive and larger, the larger the sample feature under the jth feature dimension of one sample data, the more easily the mth category label is added. Since each category label corresponds to a data set, in the feature conversion the larger the sample feature under the jth feature dimension of one sample data, the more easily that sample data generates features similar to those of the sample data in the data set corresponding to the mth category label.
In one implementation, a target number of feature dimensions are selected in descending order of the importance indicated by the fitting result, and the selected feature dimensions are taken as the conversion basis of the feature conversion process. Specifically, assuming that one sample data contains sample features under d feature dimensions, there exists, for the mth category label, a target weight set W_m = {w_1m, w_2m, ..., w_dm}. A target number (which can be set manually) of target weights can be selected from W_m in descending order, and the feature dimensions corresponding to those target weights are taken as the conversion basis of the feature conversion process. In the feature conversion process, the larger the sample features of one sample data under the feature dimensions corresponding to the selected target weights, the more easily that sample data generates features similar to those of the sample data in the data set corresponding to the mth category label; this is the interpretability analysis result of the feature conversion.
In another implementation, the feature dimensions whose importance is greater than an importance threshold are selected according to the importance indicated by the fitting result, and the selected feature dimensions are taken as the conversion basis of the feature conversion process. Specifically, assuming that one sample data contains sample features under d feature dimensions, there exists, for the mth category label, a target weight set W_m = {w_1m, w_2m, ..., w_dm}. The target weights in W_m that are greater than a weight threshold can be selected, and the feature dimensions corresponding to them are taken as the conversion basis. In the feature conversion process, the larger the sample features of one sample data under the feature dimensions corresponding to target weights greater than the weight threshold, the more easily that sample data generates features similar to those of the sample data in the data set corresponding to the mth category label; this is the interpretability analysis result of the feature conversion.
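The two selection strategies above (top-n by descending target weight, and weight-threshold filtering) can be sketched as follows; function names are hypothetical and the weight values are a toy W_m.

```python
def conversion_basis_topn(weights, n):
    """Indices (0-based j) of the n largest target weights w_jm,
    i.e. the top-n feature dimensions taken as the conversion basis."""
    return sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:n]

def conversion_basis_threshold(weights, thr):
    """Feature dimensions whose target weight exceeds the weight threshold."""
    return [j for j, w in enumerate(weights) if w > thr]

W_m = [0.1, 2.3, -0.7, 1.5]                 # target weight set for the mth label
print(conversion_basis_topn(W_m, 2))        # → [1, 3]
print(conversion_basis_threshold(W_m, 1.0)) # → [1, 3]
```

Both strategies return the same dimensions here; in general the threshold variant returns a variable number of dimensions while the top-n variant fixes the count in advance.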
In the embodiment of the application, after feature conversion is performed on the sample data to obtain a sample representation (namely a feature vector), a category label is added to the sample data according to the category attribute of the sample data reflected by the sample representation, and the sample data is then fitted to the corresponding category label to obtain a fitting result, where the fitting result can indicate the importance of different feature dimensions in the process of adding the corresponding category label to each sample data. Through the importance indicated by the fitting result, it can be known which feature dimensions are important in the feature conversion process and how those feature dimensions influence the determination of the category label of the sample data, so it can be determined how the feature dimensions influence the feature conversion of the sample data; the conversion basis and the interpretability analysis result of the feature conversion process are thus obtained, and interpretability analysis of the feature conversion process can be realized.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating another data processing method according to an embodiment of the present disclosure. The method can be applied to the computer device in fig. 1, and the method comprises the following steps:
s301, obtaining at least one sample data and a category label of any sample data; one sample data contains sample characteristics under one or more characteristic dimensions, and the category label of any sample data is obtained by performing characteristic conversion on the sample characteristics of any sample data under different characteristic dimensions.
In an embodiment, feature transformation may be performed on sample features of each sample data in different feature dimensions to obtain a feature vector corresponding to each sample data. And the feature vector corresponding to each sample data is obtained by calling the ith network hidden layer of the target network model containing the classification function to perform feature conversion. The target network model may include N network hidden layers, where N is a positive integer greater than or equal to 1, and i is a positive integer greater than 0 and less than or equal to N.
The target network model may be a simple classification model. For example, the sample data may involve a plurality of feature dimensions such as the size, color and height of a flower, and the target network model may predict what kind of flower the sample data describes; alternatively, the target network model may predict, from the sample data, the likelihood that an object will purchase a commodity recommended to it. Feature dimensions such as the number of commodity purchases per month and the average purchase price of an object are acquired only after obtaining the permission or consent of the user.
As shown in fig. 4, the target network model may be a deep neural network including N network hidden layers. The sample features of each sample data under different feature dimensions may be input into the target network model, and the first network hidden layer of the target network model may transform and abstract the input sample features into a new vector space to output a feature vector h_1 corresponding to each sample data. The feature vector h_1 is then input into the next network hidden layer to output a feature vector h_2 corresponding to each sample data, and by stacking N network hidden layers, the feature vector h_N corresponding to each sample data can be obtained. In one embodiment, calling the ith network hidden layer of the target network model containing a classification function to perform feature conversion to obtain the feature vector corresponding to each sample data includes: inputting the sample features of each sample data under different feature dimensions into the target network model, where the first i-1 layers of the target network model, stacked layer by layer, transform and abstract them into a feature vector h_{i-1}; and then inputting the feature vector h_{i-1} into the ith network hidden layer to obtain the feature vector h_i corresponding to each sample data.
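The layer-by-layer conversion described above can be sketched as follows. This is a minimal illustration rather than the patented model: a small fully connected network with tanh activations and randomly initialized (hypothetical) weights, collecting the feature vector h_i output by every network hidden layer.

```python
import numpy as np

def forward_collect(x, weights, biases):
    """Pass the input through stacked hidden layers, keeping each layer's
    output h_1 ... h_N (the sample characterizations / feature vectors)."""
    hs = []
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(h @ W + b)   # transform and abstract into a new vector space
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
d, widths = 4, [8, 8, 3]         # input dimension and hidden-layer widths (illustrative)
Ws, bs = [], []
prev = d
for w in widths:
    Ws.append(rng.normal(size=(prev, w)))
    bs.append(np.zeros(w))
    prev = w

X = rng.normal(size=(5, d))      # 5 sample data, each with d feature dimensions
hs = forward_collect(X, Ws, bs)  # hs[i-1] is the feature vector set h_i
```

Each element of `hs` is what one network hidden layer hands to the interpretability analysis below.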
It should be noted that the process by which the ith network hidden layer transforms the input feature vector h_{i-1} into the feature vector h_i is a feature conversion process, so the interpretability analysis result obtained by performing feature conversion on at least one sample data based on the ith network hidden layer serves as the interpretability analysis result of the ith network hidden layer. The subsequent steps in effect perform interpretability analysis on the ith network hidden layer.
Further, based on the feature vector corresponding to each sample data, a clustering operation is performed on the at least one sample data so as to divide it into different data sets. A clustering algorithm is an unsupervised method: the real labels of the sample data are not needed during training, and sample data with similar features are grouped into the same data set purely through those features. In a feasible implementation, a clustering algorithm may be applied to the feature vector corresponding to each sample data; the clustering algorithm may be the K-means algorithm (a hard clustering algorithm), the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, and so on. Taking the K-means algorithm as an example: (1) select M (M is an integer greater than 0) initial cluster centers {v_1, v_2, ..., v_M}; (2) compute the distance between the feature vector corresponding to each sample data and the M cluster centers, and assign each feature vector to the cluster whose center is closest; (3) for each cluster, compute the mean of the feature vectors currently in the cluster as the new cluster center; (4) repeat steps (2) and (3) until a termination condition is reached, where the termination condition may be a preset upper limit on the number of iterations, or that the cluster assignment of the feature vectors no longer changes.
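The four K-means steps can be sketched as below. This is a minimal illustration over synthetic feature vectors, not tied to any particular network layer.

```python
import numpy as np

def kmeans(H, M, max_iters=100, seed=0):
    """Minimal K-means over feature vectors H (one row per sample),
    following steps (1)-(4): init centers, assign, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centers = H[rng.choice(len(H), size=M, replace=False)]      # step (1)
    for _ in range(max_iters):
        dist = np.linalg.norm(H[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)                            # step (2)
        new = np.array([H[assign == m].mean(axis=0) if np.any(assign == m)
                        else centers[m] for m in range(M)])     # step (3)
        if np.allclose(new, centers):                           # step (4): converged
            break
        centers = new
    return assign, centers

# two well-separated groups of synthetic feature vectors
rng = np.random.default_rng(1)
H = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])
assign, centers = kmeans(H, M=2)
# feature vectors in the same cluster -> their sample data form one data set
data_sets = {m: np.where(assign == m)[0] for m in range(2)}
```

The `data_sets` mapping corresponds to dividing the at least one sample data into M data sets, one per cluster.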
Through this clustering processing, the feature vector corresponding to each sample data is assigned to one of the clusters; then, through the correspondence between feature vectors and sample data, the clustering operation on the at least one sample data is completed: sample data whose feature vectors belong to the same cluster are placed in the same data set, so that the at least one sample data is divided into different data sets. Understandably, since there are M cluster centers, M clusters are obtained, and hence M data sets.
Wherein one data set corresponds to one cluster label. In an implementation manner, after clustering operation is performed on at least one sample data, the total amount of the obtained data sets is determined, and a set number corresponding to each obtained data set is determined based on the total amount of the data sets. For example, there are three data sets, and the set numbers corresponding to the three data sets are 1, 2, and 3, respectively. The set number corresponding to one data set can be used as a cluster label corresponding to one data set, and then the cluster label corresponding to the data set where any sample data is located is used as a category label of any sample data. Therefore, the class label of any sample data may indicate the data set where the sample data is located, for example, if the class label of any sample data is 3, it indicates that the set number and the cluster label of the data set where the sample data is located are 3.
In a feasible embodiment, at least one reference sample data used when performing interpretability analysis on the (i-1)th network hidden layer of the target network model may be acquired, together with the at least two reference sample sets obtained by the clustering operation performed on that reference sample data during the analysis of the (i-1)th network hidden layer; the reference sample data contained in any one of those reference sample sets may then be used as the acquired at least one sample data. As shown in fig. 5, when interpretability analysis is performed on the (i-1)th network hidden layer, n (n is an integer greater than 1) reference sample sets are obtained by partitioning; when interpretability analysis is performed on the ith network hidden layer, any one of these reference sample sets may be used as the at least one sample data for that analysis, and in the process each reference sample set corresponding to the (i-1)th network hidden layer may be further partitioned into a plurality of reference sample sets corresponding to the ith network hidden layer.
There may be a real label for each sample data, which may be determined based on the classification task of the target network model. For example, when the target network model judges whether a person purchases a commodity, the real labels may include purchase and non-purchase; when the target network model identifies the group to which an animal belongs, the real labels may include amphibian, mammal and reptile. In one implementation, M may be the number of real labels; for example, if the real labels are purchase and non-purchase, M is 2; if the real labels are amphibian, mammal and reptile, M is 3. The real label of each sample data contained in any data set can be obtained, and the number of samples corresponding to the same real label determined. The sample proportion corresponding to the same real label is then calculated from the number of samples corresponding to that real label and the total amount of sample data in the data set, that is, the sample proportion corresponding to one real label equals the number of samples corresponding to that real label divided by the total amount of sample data in the data set, thereby obtaining the sample proportions corresponding to the different real labels. Further, the real label with the maximum sample proportion in any data set can be obtained. If the real labels with the maximum sample proportion differ across the M data sets, and each such proportion is greater than a proportion threshold (which can be set manually), it can be determined that the target network model already has good feature conversion capability up to the ith network hidden layer and can complete the classification task well; at this time, pruning processing can be performed on the target network model, retaining only the ith network hidden layer and the network hidden layers before it.
For example, if the real label with the maximum sample proportion in data set 1 is amphibian with a proportion of 98%, that in data set 2 is mammal with a proportion of 98%, and that in data set 3 is reptile with a proportion of 98%, the target network model is considered to have good feature conversion capability up to the ith network hidden layer.
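The majority-label purity check described above can be sketched as follows; the label names and the 0.9 threshold are illustrative.

```python
from collections import Counter

def majority_label_ratio(labels):
    """Real label with the maximum sample proportion in one data set,
    and that proportion (count / total amount of sample data)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# illustrative real labels of the samples that fell into two data sets
data_set_1 = ["mammal"] * 49 + ["reptile"] * 1
data_set_2 = ["reptile"] * 48 + ["mammal"] * 2

ratios = [majority_label_ratio(s) for s in (data_set_1, data_set_2)]
majority_labels = [label for label, _ in ratios]
threshold = 0.9   # proportion threshold, set manually
# distinct majority labels, all above the threshold -> the layers up to
# layer i already separate the classes, so the model may be pruned there
can_prune = (len(set(majority_labels)) == len(majority_labels)
             and all(r > threshold for _, r in ratios))
```

Here `can_prune` being true corresponds to the condition under which only the ith network hidden layer and the layers before it are retained.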
S302, obtaining a target algorithm for performing interpretability analysis.
The target algorithm can be a transparent model with self-interpretation capability, a transparent model being a model that is simple in structure and can be intuitively understood. In the present application the transparent model is exemplified by a logistic regression model. Assume that one sample data includes sample features x = (x_1, x_2, ..., x_d) under d (d is an integer greater than 0) feature dimensions. The expression of the logistic regression model is the mapping from sample data x to prediction probability f(x), learned by training a set of values θ = (W, b), as shown in the following formula (1).
f(x) = σ(W^T x + b)    (1)
Wherein W = (w_1, w_2, ..., w_d) denotes the prediction weights and b denotes the bias term; W and b are adjustable model parameters, obtained by parameter initialization. T denotes the transposition operation, i.e. W^T is obtained by transposing W, and x represents the input sample data. σ() represents the sigmoid function, which limits the value range of f(x) to [0, 1]; f(x) represents the prediction probability that the sample data x is a positive sample, and 1-f(x) the prediction probability that it is a negative sample. The component w_j of W may reflect the magnitude of the influence of the jth feature dimension on the prediction probability f(x). In addition, if w_j is positive, the corresponding feature dimension is positively correlated with the prediction probability f(x), that is, the larger the sample feature under the jth feature dimension in one sample data, the larger the prediction probability f(x); otherwise it is negatively correlated.
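A minimal numeric sketch of formula (1), with hypothetical weights:

```python
import math

def sigmoid(z):
    # limits the output to the range [0, 1]
    return 1.0 / (1.0 + math.exp(-z))

def f(x, W, b):
    """f(x) = sigma(W^T x + b): prediction probability that x is a positive sample."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(W, x)) + b)

W = [2.0, -1.0, 0.5]   # w_1 > 0: dimension 1 positively correlated with f(x)
b = 0.0
p = f([1.0, 0.0, 2.0], W, b)   # W^T x + b = 2 + 0 + 1 = 3, so p = sigma(3)
```

Increasing x_1 here pushes f(x) toward 1, while increasing x_2 pushes it toward 0, matching the sign interpretation of w_j above.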
The above formula (1) is the expression of a logistic regression model applied to binary classification, and the following formula (2) is the expression of a multi-class logistic regression model constructed from a plurality of binary logistic regression models.
P(y = m | x, W) = exp(W_m^T x) / Σ_{k=1}^{M} exp(W_k^T x)    (2)
Where P(y = m | x, W) represents the prediction probability that one sample data x is classified into the mth class, and W_m = (w_{1m}, w_{2m}, ..., w_{dm}) is the prediction weight for the mth class. The absolute value of w_{jm} in W_m may reflect the degree of influence of the jth feature dimension on the prediction probability P(y = m | x, W) when one sample data x is classified into the mth class.
And S303, fitting the sample characteristics of each sample datum under different characteristic dimensions according to the corresponding class label by adopting a target algorithm to obtain a fitting result so as to indicate the importance degree corresponding to different characteristic dimensions in the process of adding the corresponding class label to each sample datum.
In an embodiment, adopting the target algorithm to perform fitting processing on the sample features of each sample data under different feature dimensions against the corresponding category labels to obtain a fitting result includes: performing label prediction processing on the sample features of each sample data under different feature dimensions with the target algorithm to obtain a class prediction result for each sample data. Specifically, formula (2) may be converted into the following formula (3).
P(y = m | x, W) = exp(W_m^T x) / Σ_{k=1}^{M} exp(W_k^T x),  m = 1, 2, ..., M    (3)
The sample features of each sample data under different feature dimensions are taken as the input x = (x_1, x_2, ..., x_d) of the multi-class logistic regression model shown in formula (3), which outputs the class prediction result for each sample data: P(y = 1 | x, W), P(y = 2 | x, W), ..., P(y = M | x, W). The class prediction result for each sample data may indicate the prediction probability that the sample data is classified under each of the different category labels. Because there are M data sets, and one data set corresponds to one category label, there are M category labels.
Further, the prediction weights of the target algorithm may be adjusted using the class prediction result of each sample data and the category label of each sample data, that is, the prediction weights W_1, W_2, ..., W_M in formula (3) are adjusted. For this, a loss function suitable for multi-classification may be used, such as the cross-entropy loss function whose expression is shown in the following formula (4).
loss1 = -(1/S) Σ_{j=1}^{S} Σ_{m=1}^{M} y_{jm} · log P(y = m | x_j, W)    (4)
Where loss1 represents the first loss value, S represents the number of samples in the at least one sample data, x_j denotes the jth sample data, P(y = m | x_j, W) represents the prediction probability that the jth sample data is given the mth category label, and y_{jm} represents a sign function taking the value 0 or 1: if the category label of the sample data x_j is the mth category label, y_{jm} is 1, and otherwise 0.
Then, the class prediction result of each sample data is substituted into P(y = m | x_j, W) in formula (4), and y_{jm} in formula (4) is determined based on the category label of each sample data, so that formula (4) outputs the first loss value. The prediction weights W_1, W_2, ..., W_M in formula (3) can then be adjusted by the first loss value together with the stochastic gradient descent method, the adjustment being repeated until a stop condition is met, for example a specified number of adjustments is reached or the cross-entropy loss function converges. The adjusted prediction weights are taken as the fitting result. The adjusted prediction weights W_1 = (w_{11}, w_{21}, ..., w_{d1}), W_2 = (w_{12}, w_{22}, ..., w_{d2}), ..., W_M = (w_{1M}, w_{2M}, ..., w_{dM}) are the target weights in S202 and S203 above, and the multi-class logistic regression model containing these target weights is the fitting model in S202.
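Under stated assumptions (the softmax form of formula (3), plain batch gradient descent in place of stochastic gradient descent, a constant 1 appended to each x to play the role of a bias term, and toy clustered data), the fitting step might look like:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_multiclass_lr(X, y, M, lr=0.5, steps=300, seed=0):
    """Adjust the prediction weights W_1..W_M by minimizing the
    cross-entropy loss of formula (4) with gradient descent."""
    rng = np.random.default_rng(seed)
    S, d = X.shape
    W = rng.normal(scale=0.01, size=(d, M))       # column m is W_m
    Y = np.eye(M)[y]                              # y_jm: 1 for the true category label
    for _ in range(steps):
        P = softmax(X @ W)                        # P(y = m | x_j, W)
        W -= lr * X.T @ (P - Y) / S               # gradient of the cross-entropy loss
    return W

# three toy "data sets" from clustering -> category labels 0, 1, 2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
X = np.hstack([X, np.ones((len(X), 1))])          # appended 1 acts as a bias term
y = np.repeat([0, 1, 2], 30)

W = fit_multiclass_lr(X, y, M=3)                  # adjusted prediction weights
acc = (softmax(X @ W).argmax(axis=1) == y).mean() # fit quality on the samples
```

Full-batch gradient descent replaces stochastic gradient descent purely to keep the sketch short, and the stop condition is a fixed number of steps rather than a convergence test.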
In another embodiment, if the number of label types among the obtained category labels is at least one, adopting the target algorithm to perform fitting processing on the sample features of each sample data under different feature dimensions against the corresponding category labels to obtain a fitting result includes: selecting the category label of a target label type from the obtained category labels, and taking the selected category label of the target label type as a fitting target. Specifically, one label type may be selected in turn from the at least one label type as the target label type, so as to select the category label of the target label type.
Further, based on the fitting target, the target algorithm is adopted to perform fitting processing on the sample features of each sample data under different feature dimensions against the corresponding category labels to obtain a fitting result. Assuming that category label 1 is taken as the fitting target, the sample features of one sample data under different feature dimensions are taken as the input x = (x_1, x_2, ..., x_d) of the binary logistic regression model shown in formula (1), which outputs the prediction probability f(x) of the sample data x. Further, the expression of a log loss function suitable for binary classification is obtained, shown in the following formula (5).
loss2 = -(1/S) Σ_{j=1}^{S} [ y_j · log f(x_j) + (1 - y_j) · log(1 - f(x_j)) ]    (5)
Wherein loss2 represents the second loss value, f(x_j) denotes the prediction probability that the jth sample data is a positive sample, and y_j represents the label of the jth sample data: if the category label of the jth sample data is the fitting target, y_j is 1, and otherwise 0. For example, when category label 1 is the fitting target, if the category label of the jth sample data is category label 1, y_j is 1, and otherwise 0.
In an embodiment, the prediction probability of each sample data can be substituted into f(x_j) in formula (5), and y_j in formula (5) determined based on whether the category label of each sample data is the fitting target, so that formula (5) outputs the second loss value. The prediction weight W = (w_1, w_2, ..., w_d) in formula (1) can then be adjusted by the second loss value together with the stochastic gradient descent method, the adjustment being repeated until a stop condition is met, for example a specified number of adjustments is reached or the log loss function converges. The adjusted prediction weight W = (w_1, w_2, ..., w_d) is the fitting result obtained by fitting processing based on the fitting target, i.e. the fitting result associated with the fitting target; that is, the fitting result associated with category label 1. Category label 2 can then be taken as the fitting target, and the fitting result associated with category label 2 obtained in the same way. It can be understood that, assuming there are M category labels, each with an associated fitting result, the fitting result W = (w_1, w_2, ..., w_d) associated with the mth category label corresponds to W_m in formula (2) above, i.e. the target weights in S202 and S203. Meanwhile, a plurality of binary logistic regression models are obtained, which are the fitting models in S202.
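A sketch of this one-vs-rest variant, under the same assumptions as before (batch gradient descent in place of stochastic gradient descent, illustrative toy data and labels):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_against_target(X, labels, target, lr=0.5, steps=300, seed=0):
    """Fit formula (1) with one category label as the fitting target:
    y_j = 1 when sample j carries the target label, else 0 (formula (5))."""
    rng = np.random.default_rng(seed)
    S, d = X.shape
    W = rng.normal(scale=0.01, size=d)
    b = 0.0
    y = (labels == target).astype(float)
    for _ in range(steps):
        f = sigmoid(X @ W + b)
        W -= lr * X.T @ (f - y) / S               # gradient of the log loss
        b -= lr * (f - y).mean()
    return W, b

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(3.0, 0.3, (30, 2))])
labels = np.repeat([1, 2], 30)                    # two category labels from clustering
fits = {m: fit_against_target(X, labels, m) for m in (1, 2)}   # one fit per label
W1, b1 = fits[1]                                  # fitting result for category label 1
```

Each entry of `fits` corresponds to one binary logistic regression model, one per category label, as described above.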
S304, according to the importance indicated by the fitting result, taking the feature dimensions meeting a selection condition as the conversion basis of the feature conversion process; the conversion basis is used to generate the interpretability analysis result of the feature conversion.
In an implementation manner, sequentially selecting a target number of feature dimensions in descending order of the importance indicated by the fitting result, and taking the selected feature dimensions as the conversion basis of the feature conversion process, includes: assuming that one sample data contains sample features under d feature dimensions, there exists for the mth category label a target weight set W_m = {w_{1m}, w_{2m}, ..., w_{dm}}; a target number (which can be set manually) of target weights are selected from W_m in descending order, and the feature dimensions corresponding to those target weights are taken as the conversion basis of the feature conversion process. In the feature conversion process, the larger the sample features of one sample data under the feature dimensions corresponding to the selected target weights, the more easily that sample data comes to share similar features with the sample data in the data set corresponding to the mth category label; this is the interpretability analysis result of the feature conversion.
In another implementation manner, selecting the feature dimensions whose corresponding importance is greater than an importance threshold according to the importance indicated by the fitting result, and taking the selected feature dimensions as the conversion basis of the feature conversion process, includes: assuming that one sample data contains sample features under d feature dimensions, there exists for the mth category label a target weight set W_m = {w_{1m}, w_{2m}, ..., w_{dm}}; the feature dimensions whose target weights are greater than a weight threshold are taken as the conversion basis of the feature conversion process. In the feature conversion process, the larger the sample features of one sample data under the feature dimensions whose target weights exceed the weight threshold, the more easily that sample data comes to share similar features with the sample data in the data set corresponding to the mth category label; this is the interpretability analysis result of the feature conversion.
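Both selection rules reduce to picking feature dimensions out of the target weight set W_m; a small sketch (the weights and thresholds are illustrative):

```python
import numpy as np

def important_dimensions(W_m, top_k=None, threshold=None):
    """Feature dimensions used as the conversion basis for the mth category
    label: either the top_k dimensions by target weight, in descending
    order, or every dimension whose target weight exceeds the threshold."""
    w = np.asarray(W_m, dtype=float)
    if top_k is not None:
        return np.argsort(w)[::-1][:top_k].tolist()   # largest weights first
    return [j for j, w_j in enumerate(w) if w_j > threshold]

W_m = [0.1, 2.3, -0.4, 1.7, 0.05]                 # hypothetical target weights
top_two = important_dimensions(W_m, top_k=2)      # target-number rule
above = important_dimensions(W_m, threshold=1.0)  # weight-threshold rule
```

If negative weights should also count as important (as the absolute-value reading of w_{jm} suggests), `np.abs(w)` can be ranked instead of `w`.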
The interpretability analysis result of the feature conversion is the interpretability analysis result of the ith network hidden layer. In a possible embodiment, the decision logic of the ith network hidden layer in the target network model (another form of interpretability analysis) may also be analyzed. Specifically, the real label of each sample data contained in any data set is obtained, and the number of samples corresponding to the same real label is determined. Based on the number of samples corresponding to the same real label and the total amount of sample data in the data set, the sample proportion corresponding to that real label is calculated, that is, the sample proportion corresponding to one real label equals the number of samples corresponding to that real label divided by the total amount of sample data in the data set, thereby obtaining the sample proportions corresponding to the different real labels. The real label with the maximum sample proportion in the data set is then determined from these sample proportions. Given that the larger the sample features of one sample data under the reference feature dimensions, the more easily that sample data comes to share similar features with the sample data in the data set, it can be deduced that the larger the sample features of that sample data under the reference feature dimensions, the more the real label predicted for it through the ith network hidden layer tends toward the real label with the maximum sample proportion in the data set, and that during feature conversion the ith network hidden layer makes the similarity between the feature vector obtained by converting that sample data and the feature vectors corresponding to the sample data under the real label with the maximum sample proportion greater than a threshold.
Thus, the real label with the maximum sample proportion is used to indicate that the similarity between the feature vectors generated by the feature conversion and the feature vectors corresponding to the sample data under that real label is greater than a threshold.
In summary, the flow of the data processing method provided by the embodiment of the present application is shown in fig. 6 and includes: (1) obtain a target network model, which may include a plurality of network hidden layers; (2) input at least one sample data into the target network model to obtain the sample characterization set X = {x_1, x_2, ..., x_n} output by the Lth network hidden layer, where x_j represents the sample characterization (i.e. the corresponding feature vector) of the jth sample data; (3) cluster the sample characterization set with a clustering algorithm, such as the K-means algorithm, thereby dividing the sample characterizations into two clusters, and divide the at least one sample data into data set X_0 and data set X_1 according to the clusters to which their sample characterizations belong (more data sets may be partitioned; two are taken here as an example); (4) determine the category label of each sample data based on the cluster label corresponding to the data set where the sample data is located; (5) fit the sample data against their category labels with the target algorithm to obtain a fitting result, obtain from the fitting result the important feature dimensions of the Lth network hidden layer, i.e. the feature dimensions meeting the selection condition, determine from these the conversion basis of the Lth network hidden layer during feature conversion, and determine the interpretability analysis result of the Lth network hidden layer based on the conversion basis.
(6) The data set X_0 is used as the at least one sample data when performing interpretability analysis on the (L+1)th network hidden layer, and the data set X_1 is likewise used as the at least one sample data when performing interpretability analysis on the (L+1)th network hidden layer, thereby obtaining the subdivision logic of data set X_0 and data set X_1. In this way, the interpretability analysis result of each network hidden layer in the target network model can be obtained.
In this embodiment, after the network hidden layer of the target network model performs feature conversion on the sample data to obtain sample characterizations, the sample characterizations are clustered according to the class attributes of the sample data they reflect, so that the sample data are divided into different data sets, and the category label of each sample data is determined based on the data set where it is located. The sample data are then fitted against the corresponding category labels with the logistic regression model to obtain the target weights, and from the target weights the importance (degree of influence) of the different feature dimensions in the process of adding the corresponding category label to each sample data can be known. It can thus be determined which feature dimensions matter in the feature conversion process of the network hidden layer, and how those dimensions affect the discrimination of the category label of the sample data; therefore, how the feature dimensions influence the feature conversion of the sample data can be determined, the conversion basis and the interpretability analysis result of the feature conversion process are obtained, and interpretability analysis of the feature conversion process is realized.
In an embodiment, after all network hidden layers in the target network model obtain the interpretability analysis result, the decision path of the target network model may be determined based on the interpretability analysis result of each network hidden layer.
The following description is provided with reference to fig. 7 as an example. The census-income data set comprises sample data sampled from a census bureau database; each sample data carries a real label indicating whether a person's annual electronic resource acquisition amount exceeds $50,000, and the feature dimensions include basic attributes such as gender together with information such as work type and years of education. Each sample data is acquired only after obtaining the corresponding user's permission or consent. In addition, the target network model includes a first network hidden layer and a second network hidden layer.
After each sample data in the census-income data set is input into the first network hidden layer for feature conversion, the feature vector corresponding to each sample data is obtained; a clustering operation (implementable with the K-means algorithm) is then performed on the sample data using these feature vectors, dividing the sample data into different data sets. Two data sets are partitioned here, with positive samples (annual electronic resource acquisition amount greater than $50,000) accounting for 23.53% of data set 1 and 100% of data set 2. Meanwhile, the part indicated by 71 in fig. 7 is the conversion basis of the feature conversion performed by the first network hidden layer, i.e. its important feature dimensions: the larger the sample features of one sample data under the feature dimensions indicated for data set 1, the more easily that sample data is divided into data set 1, and the larger the sample features under the feature dimensions indicated for data set 2, the more easily it is divided into data set 2. It can be seen that if a person is unmarried, has a larger capital loss, works in the private sector, is a high-school graduate, and is female, the first network hidden layer tends to decide that this person's electronic resource acquisition amount does not exceed $50,000; the higher the capital gain in one sample data and the longer the years of education, and if the person is married, male, and holds a bachelor's or master's degree, the first network hidden layer tends to decide that this person's electronic resource acquisition amount exceeds $50,000.
Further, since data set 2 consists entirely of positive samples, only data set 1 needs further analysis; that is, each sample data in data set 1 serves as the at least one sample data for interpretability analysis of the second network hidden layer. Similarly, by partitioning the data set and fitting the logistic regression model, data set 3 and data set 4 are obtained, along with the conversion basis of the feature conversion of the second network hidden layer (indicated by 72 in fig. 7); the proportion of positive samples in data set 3 is 0.7%, and in data set 4 it is 36.4%. If a person has never married, works as a cleaner or in agriculture, and has a highest education level of 11th grade, the target network model considers it unlikely that this person's electronic resource acquisition amount exceeds $50,000; the longer a person works weekly, and if the person is male and a professional, the more the target network model tends to consider it likely that this person's electronic resource acquisition amount exceeds $50,000.
From this example, the conversion basis of each network hidden layer reveals how each layer, layer by layer, predicts whether a person's electronic resource acquisition amount exceeds $50,000. A decision path of the target network model can thereby be obtained, which alleviates the lack of transparency of complex models such as deep neural networks and improves the interpretability of the target network model.
In addition, the description continues with fig. 8 as an example. The target network model used in fig. 8 determines whether a user purchases a candidate item (e.g., a fund): a sample is positive when the user purchases the candidate item and negative when the user does not. Each sample data used by the target network model is obtained only after acquiring the corresponding user's permission or consent. The larger the sample features of one sample data under the feature dimensions indicated for cluster 0, the more easily that sample data is divided into cluster 0; the larger the sample features under the feature dimensions indicated for cluster 1, the more easily it is divided into cluster 1.
Specifically, the description is given with reference to path 1 in fig. 8. In the first network hidden layer of the target network model, the important positive factors are the number of days the user accessed the "my assets" page in the last 14 days, the number of visits to the main site in the last 92 days, the yesterday subscription amount, and the like, which reflect the user's activity level; generally, the more active the user, the more likely the user is to subscribe. The fund's average Sharpe ratio, its yield over the last year, and the like reflect the quality of the candidate item, and a higher-quality fund is more likely to interest the user. In addition, the user's search click count for the fund in one day directly reflects the user's preference for the fund, and a higher count is better. If a user scores low on these positive indicators, but the fund company's maximum drawdown on its managed funds is high, the downside volatility of the fund manager's managed funds is high, and the user frequently accesses insurance-type fund pages, the current fund may be performing poorly and exceed the user's risk tolerance, so the probability of the user subscribing may be relatively low. (This does not mean that all users routed through this layer into cluster 0 are non-subscribing users; finer-grained partitioning is required.)
Through the first network hidden layer, the positive-sample proportion of cluster 0 is 0.4448, clearly lower than the 0.8849 of cluster 1, so the model has already distinguished some high-conversion samples. After the second network hidden layer, the users in cluster 0 can be further divided into two user groups. If the fund was exposed to the user many times within 30 days (the user has some awareness of it), and the fund's upside volatility, Calmar ratio, and yesterday subscription amount are all high, the fund is performing well; such a user still has a high probability of purchasing and is assigned to cluster 1 on the second layer of the path, where the positive-sample proportion is 0.6354. Conversely, if these indicators are low, the fund's maximum drawdown over the last year is high, its fee rate is high, and its recent yield ranking value is large (the lower the yield, the larger the ranking value), the probability of the user subscribing is low; such a user enters cluster 0 on this path, where the positive-sample proportion is only 0.1013.
In the second network hidden layer, the positive-sample proportion of cluster 0 is already low, but the target network model still needs to distinguish further. If the fund was exposed to the user many times within 7 days (recent impressions), the average displayed yield of the fund manager's managed funds is high (fund manager level), the user's search click count for the fund within 7 days is high (user interest), and the probability that the fund is profitable over a 3-month holding period is high, the user still has some probability of purchasing. Conversely, if the fund's fee rate is high, the user redeemed many funds in the last day (possibly because the market is falling and the user wants to exit), and the user had many months of online loan repayments in the last 12 months (the user lacks spare funds), the probability that the user purchases the candidate fund is extremely low; such a sample reaches the end point of path 1, where the positive-sample proportion is only 0.006.
Other paths can be analyzed in the same way; here, only path 6 and path 8 are analyzed as cases.
For path 6, a user with higher positive feature values in the first network hidden layer enters cluster 1 of the first layer (the first node's distinguishing logic is the same as in path 2). If the fund's yesterday click UV is higher, its yesterday subscription amount is higher, and its yesterday exposure/conversion counts are greater, the fund is performing well; meanwhile, if the user's historical subscription amount is larger and the average subscription amount of users who subscribe to this fund is also larger, a high-subscription user may tend to buy this fund. A user with a high degree of match enters cluster 1 of the second layer, where the positive-sample proportion reaches 0.9461. Conversely, if these feature values are lower, but the fund's near-15-day sliding yield ranking value is large (the yield is low) and the user subscribed to index funds many times within 31 days, the user enters cluster 0, where the positive-sample proportion is 0.7693.
The positive-sample proportion of the current node is still relatively high, indicating that the user still has a strong willingness to subscribe, and the model needs to further distinguish the user's asset preference: factors such as a high number of micro securities accesses in the last 30 days, a high retained historical Tenn fund amount, and a high preference score in the stock-securities RFM model indicate that the user tends toward aggressive financing. If the fund being ranked is an aggressive financing product, the user has held funds for many days in the last 6 months, and there is a recent search click behavior, the probability of the user purchasing is greatly increased; such a user enters the end point of path 6, where the positive-sample proportion reaches 0.8833.
For path 8, in the first layer, a user with higher positive feature values enters cluster 1 of the first layer (the first node's distinguishing logic is the same as in paths 2 and 6). In the second layer, a user with higher positive feature values enters cluster 1 of the second layer, with the same distinguishing logic as in path 6 (the fund performs better and the user has more money to purchase). In the third layer, if the user's electronic resource amount plus accumulated income is higher (the retained amount is higher), and the number of days on which the average electronic resource amount plus retained amount of users purchasing this fund exceeds 10,000 is large, the probability of the user purchasing is higher; meanwhile, if the fund has a high annualized yield, the user holds many stable bond-fund positions, and the fund also belongs to stable financing, the probability of the user purchasing is higher (the user's preference matches the asset).
Therefore, the method provides strong interpretability for a target network model used in a financial scenario and can output the decision path from input to output. In addition, through the decision path of the target network model, it can be known which feature dimensions the model mainly uses for classification prediction, and whether the model has grasped meaningful features can be judged, so that the feature selection algorithm of the target network model can be improved. For example, it can be seen from fig. 8 which feature dimensions each network hidden layer of the target network model mainly attends to; other feature dimensions may receive little attention from the model, so the target network model may use only the feature dimensions included in its decision path when performing classification prediction.
It can be understood that the specific implementations of the present application involve related data such as sample data. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
While the method of the embodiments of the present application has been described in detail above, to facilitate better performing this method, an apparatus of the embodiments of the present application is provided below accordingly. Referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus 90 may include:
an obtaining unit 901, configured to obtain at least one sample data and the category label of any sample data; one sample data contains sample features under one or more feature dimensions, and the category label of any sample data is obtained by performing feature conversion on the sample features of that sample data under different feature dimensions;
the processing unit 902 is configured to perform fitting processing on the sample features of each sample data under different feature dimensions according to the corresponding category labels to obtain a fitting result, so as to indicate the importance degrees corresponding to different feature dimensions in the process of adding the corresponding category label to each sample data;
the processing unit 902 is further configured to use, according to the importance degrees indicated by the fitting result, the feature dimensions meeting the selection condition as the conversion basis in the feature conversion process; the conversion basis is used to generate an interpretability analysis result of the feature conversion.
In an embodiment, the obtaining unit 901 is specifically configured to: obtaining a target algorithm for performing interpretable analysis;
the processing unit 902 is specifically configured to: and fitting the sample characteristics of each sample datum under different characteristic dimensions according to the corresponding class label by adopting the target algorithm to obtain a fitting result.
In an embodiment, the processing unit 902 is specifically configured to: perform feature conversion on the sample features of each sample data under different feature dimensions to obtain a feature vector corresponding to each sample data; perform a clustering operation on the at least one sample data based on the feature vector corresponding to each sample data, so as to divide the at least one sample data into different data sets, where one data set corresponds to one cluster label; and take the cluster label corresponding to the data set in which any sample data is located as the category label of that sample data.
In an embodiment, the feature vector corresponding to each sample data in the at least one sample data is obtained by calling an ith network hidden layer of a target network model with a classification function to perform feature conversion, wherein the target network model comprises N network hidden layers; wherein N is a positive integer greater than or equal to 1, and i is a positive integer greater than 0 and less than or equal to N; and taking an interpretability analysis result obtained by performing feature conversion on the at least one sample data based on the ith network hidden layer as an interpretability analysis result of the ith network hidden layer.
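The feature vectors produced by the i-th of N network hidden layers can be sketched as below; the layer count, widths, random weights, and ReLU activation are illustrative stand-ins for a trained target network model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target network: N = 3 hidden layers with fixed random
# weights standing in for a trained classification model.
layer_sizes = [6, 8, 8, 4]           # input dim followed by the 3 hidden widths
weights = [rng.normal(size=(a, b)) for a, b in zip(layer_sizes, layer_sizes[1:])]

def hidden_features(x, i):
    """Feature vectors output by the i-th hidden layer (1-based, i <= N)."""
    h = x
    for w in weights[:i]:
        h = np.maximum(h @ w, 0.0)   # ReLU hidden layers
    return h

samples = rng.normal(size=(10, 6))
feats = hidden_features(samples, 2)  # i = 2: the second hidden layer's output
print(feats.shape)                   # (10, 8)
```

Running the clustering and fitting steps on `hidden_features(samples, i)` for each i from 1 to N yields one interpretability analysis result per hidden layer, matching the layer-by-layer analysis described above.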
In an embodiment, the obtaining unit 901 is specifically configured to: acquire at least one reference sample data used for performing interpretability analysis on the (i-1)-th network hidden layer in the target network model, and acquire at least two reference sample sets obtained after a clustering operation is performed on the at least one reference sample data in the process of performing interpretability analysis on the (i-1)-th network hidden layer;
the processing unit 902 is specifically configured to: take the reference sample data contained in any one of the reference sample sets as the acquired at least one sample data, respectively.
In an embodiment, the obtaining unit 901 is specifically configured to: acquiring the total amount of the data sets obtained after the clustering operation is carried out on the at least one sample data, and determining the set number corresponding to each acquired data set based on the data set total amount;
the processing unit 902 is specifically configured to: and taking the set number corresponding to one data set as the cluster label corresponding to the one data set.
In an embodiment, the obtaining unit 901 is specifically configured to: acquire the true label of each sample data contained in any data set, and determine the number of samples corresponding to the same true label;
the processing unit 902 is specifically configured to: calculate the sample proportion corresponding to the same true label based on the number of samples corresponding to that true label and the total amount of sample data in the data set, thereby obtaining the sample proportions corresponding to different true labels; and determine the true label corresponding to the maximum sample proportion according to the sample proportions corresponding to the different true labels; the true label with the maximum sample proportion is used to indicate that the similarity between the feature vectors generated by feature conversion and the feature vectors corresponding to the sample data under that true label is less than a threshold value.
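The per-label proportion computation described in this embodiment (count samples per true label within one data set, divide by the set's total, take the label with the maximum proportion) can be sketched with stdlib tools; the label names are illustrative:

```python
from collections import Counter

def dominant_true_label(dataset_labels):
    """For one clustered data set, return the sample proportion of each
    true label and the true label with the maximum sample proportion."""
    counts = Counter(dataset_labels)        # samples per true label
    total = len(dataset_labels)             # total sample data in this set
    proportions = {label: n / total for label, n in counts.items()}
    best = max(proportions, key=proportions.get)
    return proportions, best

# Illustrative data set: the true labels of the samples clustered together.
props, best = dominant_true_label(["pos", "pos", "neg", "pos", "neg"])
print(props, best)   # {'pos': 0.6, 'neg': 0.4} pos
```

These proportions are exactly the per-cluster positive-sample ratios (e.g., 0.4448 vs. 0.8849) quoted in the fig. 8 walkthrough above.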
In an embodiment, the processing unit 902 is specifically configured to: performing label prediction processing on sample characteristics of each sample data under different characteristic dimensions by adopting the target algorithm to obtain a category prediction result of each sample data; based on the class prediction result of each sample data and the corresponding class label, the prediction weight of the target algorithm is adjusted; and taking the adjusted prediction weight as a fitting result.
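The fitting loop of this embodiment (predict the category labels from the sample features, adjust the prediction weights from the prediction error, and take the adjusted weights as the fitting result) can be sketched as a minimal logistic regression trained by gradient descent; the data, learning rate, and step count are illustrative:

```python
import numpy as np

def fit_surrogate(features, category_labels, lr=0.5, steps=500):
    """Predict the category labels from the sample features, adjust the
    prediction weights by gradient descent, and return the adjusted
    weights as the fitting result."""
    n, d = features.shape
    w = np.zeros(d)
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-(features @ w)))          # label prediction
        w -= lr * features.T @ (pred - category_labels) / n   # weight adjustment
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
labels = (X[:, 1] > 0).astype(float)     # category labels driven by dimension 1
weights = fit_surrogate(X, labels)
print(np.argmax(np.abs(weights)))        # dimension 1 dominates
```

The magnitude of each adjusted weight then serves as the importance degree of the corresponding feature dimension in the selection step below.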
In an embodiment, the obtaining unit 901 is specifically configured to: selecting a class label of a target label type from the obtained class labels, and taking the selected class label of the target label type as a fitting target;
the processing unit 902 is specifically configured to: and based on the fitting target, performing fitting processing on the sample characteristics of each sample datum under different characteristic dimensions according to the corresponding category label by adopting the target algorithm to obtain a fitting result, wherein the fitting result obtained based on the fitting processing of the fitting target is a fitting result associated with the fitting target.
In an embodiment, the number of label types of the obtained category labels is at least one, and the processing unit 902 is specifically configured to: sequentially select one label type from the at least one label type as the target label type, so as to select the category labels of the target label type.
In an embodiment, the processing unit 902 is specifically configured to: sequentially select a target number of feature dimensions in descending order of the importance degrees indicated by the fitting result, and take the selected feature dimensions as the conversion basis in the feature conversion process; or select the feature dimensions whose corresponding importance degrees are greater than an importance threshold according to the importance degrees indicated by the fitting result, and take the selected feature dimensions as the conversion basis in the feature conversion process.
It can be understood that the functions of the functional units of the data processing apparatus described in the embodiments of the present application can be specifically implemented according to the method in the foregoing method embodiments, and the specific implementation process of the method can refer to the description related to the foregoing method embodiments, which is not described herein again.
In the embodiment of the present application, at least one sample data and the category label of any sample data can be obtained, where the category label of any sample data is obtained after feature conversion is performed on the sample features of that sample data under different feature dimensions. A fitting result can be obtained by fitting the sample features of each sample data under different feature dimensions according to the corresponding category labels, and the fitting result can indicate the importance degrees corresponding to different feature dimensions in the process of adding the corresponding category label to each sample data. Therefore, according to the importance degrees indicated by the fitting result, the feature dimensions meeting the selection condition can be used as the conversion basis in the feature conversion process, and the conversion basis can be used to generate an interpretability analysis result of the feature conversion. Through the embodiment of the present application, interpretability analysis of the feature conversion process can be realized.
As shown in fig. 10, fig. 10 is a schematic structural diagram of a computer device provided in an embodiment of the present application, and an internal structure of the computer device 100 is shown in fig. 10, and includes: one or more processors 1001, memory 1002, and a communications interface 1003. The processor 1001, the memory 1002, and the communication interface 1003 may be connected by a bus 1004 or in other manners, and in the embodiment of the present application, the connection by the bus 1004 is taken as an example.
The processor 1001 (or CPU) is the computing core and control core of the computer device 100, and can analyze various instructions in the computer device 100 and process various data of the computer device 100. For example, the CPU may analyze a power on/off instruction sent by the user to the computer device 100 and control the computer device 100 to perform a power on/off operation; for another example, the CPU may transfer various types of interactive data between the internal structures of the computer device 100, and so on. The communication interface 1003 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.), and is controlled by the processor 1001 to transmit and receive data. The memory 1002 is a storage device in the computer device 100 for storing computer programs and data. It can be understood that the memory 1002 here may comprise both the built-in memory of the computer device 100 and, of course, the expansion memory supported by the computer device 100. The memory 1002 provides storage space that stores the operating system of the computer device 100, which may include, but is not limited to: a Windows system, a Linux system, an Android system, an iOS system, etc., which are not limited in this application. The processor 1001 performs the following operations by executing the computer program stored in the memory 1002:
acquiring at least one sample data and a category label of any sample data; one sample data contains sample characteristics under one or more characteristic dimensions, and the class label of any sample data is obtained by performing characteristic conversion on the sample characteristics of any sample data under different characteristic dimensions;
fitting the sample features of each sample data under different feature dimensions according to the corresponding category labels to obtain a fitting result, so as to indicate the importance degrees corresponding to different feature dimensions in the process of adding the corresponding category label to each sample data;
according to the importance degrees indicated by the fitting result, using the feature dimensions meeting the selection condition as the conversion basis in the feature conversion process; the conversion basis is used to generate an interpretability analysis result of the feature conversion.
In an embodiment, the processor 1001 is specifically configured to: obtaining a target algorithm for performing interpretable analysis; and fitting the sample characteristics of each sample datum under different characteristic dimensions according to the corresponding class label by adopting the target algorithm to obtain a fitting result.
In an embodiment, the processor 1001 is specifically configured to: performing characteristic conversion on the sample characteristics of each sample data under different characteristic dimensions to obtain a characteristic vector corresponding to each sample data; performing clustering operation on the at least one sample data based on the feature vector corresponding to each sample data to divide the at least one sample data into different data sets, wherein one data set corresponds to one clustering label; and taking the cluster label corresponding to the data set where any sample data is located as the category label of the sample data.
In an embodiment, the feature vector corresponding to each sample data in the at least one sample data is obtained by calling the ith network hidden layer of the target network model containing the classification function to perform feature conversion, and the target network model contains N network hidden layers; wherein N is a positive integer greater than or equal to 1, and i is a positive integer greater than 0 and less than or equal to N; and taking an interpretability analysis result obtained by performing feature conversion on the at least one sample data based on the ith network hidden layer as an interpretability analysis result of the ith network hidden layer.
In an embodiment, the processor 1001 is specifically configured to: acquiring at least one reference sample data used for carrying out interpretable analysis on the (i-1) th network hidden layer in the target network model, and acquiring at least two reference sample sets obtained after clustering operation is carried out on the at least one reference sample data in the process of carrying out interpretable analysis on the (i-1) th network hidden layer; and respectively taking the reference sample data contained in any one reference sample set as at least one sample data obtained by acquisition.
In an embodiment, the processor 1001 is specifically configured to: acquiring the total amount of the data sets obtained after the clustering operation is carried out on the at least one sample data, and determining the set number corresponding to each acquired data set based on the data set total amount; and taking the set number corresponding to one data set as the cluster label corresponding to the one data set.
In an embodiment, the processor 1001 is specifically configured to: acquire the true label of each sample data contained in any data set, and determine the number of samples corresponding to the same true label; calculate the sample proportion corresponding to the same true label based on the number of samples corresponding to that true label and the total amount of sample data in the data set, thereby obtaining the sample proportions corresponding to different true labels; and determine the true label corresponding to the maximum sample proportion according to the sample proportions corresponding to the different true labels; the true label with the maximum sample proportion is used to indicate that the similarity between the feature vectors generated by feature conversion and the feature vectors corresponding to the sample data under that true label is less than a threshold value.
In an embodiment, the processor 1001 is specifically configured to: performing label prediction processing on sample characteristics of each sample data under different characteristic dimensions by adopting the target algorithm to obtain a category prediction result of each sample data; based on the class prediction result of each sample data and the corresponding class label, the prediction weight of the target algorithm is adjusted; and taking the adjusted prediction weight as a fitting result.
In an embodiment, the processor 1001 is specifically configured to: selecting a class label of a target label type from the obtained class labels, and taking the selected class label of the target label type as a fitting target; based on the fitting target, the target algorithm is adopted, and fitting processing is carried out on the sample characteristics of each sample datum under different characteristic dimensions according to the corresponding category label to obtain a fitting result, wherein the fitting result obtained based on the fitting processing of the fitting target is the fitting result associated with the fitting target.
In an embodiment, the number of label types of the obtained category labels is at least one; the processor 1001 is specifically configured to: sequentially select one label type from the at least one label type as the target label type, so as to select the category labels of the target label type.
In an embodiment, the processor 1001 is specifically configured to: sequentially select a target number of feature dimensions in descending order of the importance degrees indicated by the fitting result, and take the selected feature dimensions as the conversion basis in the feature conversion process; or select the feature dimensions whose corresponding importance degrees are greater than an importance threshold according to the importance degrees indicated by the fitting result, and take the selected feature dimensions as the conversion basis in the feature conversion process.
In a specific implementation, the processor 1001, the memory 1002, and the communication interface 1003 described in this embodiment may execute an implementation manner described in a data processing method provided in this embodiment, and may also execute an implementation manner described in a data processing apparatus provided in this embodiment, which is not described herein again.
In the embodiment of the present application, at least one sample data and the category label of any sample data can be obtained, where the category label of any sample data is obtained after feature conversion is performed on the sample features of that sample data under different feature dimensions. A fitting result can be obtained by fitting the sample features of each sample data under different feature dimensions according to the corresponding category labels, and the fitting result can indicate the importance degrees corresponding to different feature dimensions in the process of adding the corresponding category label to each sample data. Therefore, according to the importance degrees indicated by the fitting result, the feature dimensions meeting the selection condition can be used as the conversion basis in the feature conversion process, and the conversion basis can be used to generate an interpretability analysis result of the feature conversion. Through the embodiment of the present application, interpretability analysis of the feature conversion process can be realized.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program runs on a computer device, the computer device is caused to execute the data processing method of any one of the foregoing possible implementation manners. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
Embodiments of the present application further provide a computer program product, where the computer program product includes a computer program or computer instructions, and when the computer program or the computer instructions are executed by a processor, the steps of the data processing method provided in the embodiments of the present application are implemented. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
The embodiment of the present application further provides a computer program, where the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method provided in the embodiment of the present application. For a specific implementation, reference may be made to the foregoing description, which is not repeated herein.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The above disclosure describes only some examples of the present application and certainly should not be taken as limiting its scope; therefore, equivalent modifications made according to the claims of the present application still fall within the scope of the present application.

Claims (15)

1. A method of data processing, the method comprising:
acquiring at least one sample data and a class label of any sample data; one sample data contains sample characteristics under one or more characteristic dimensions, and the class label of any sample data is obtained by performing characteristic conversion on the sample characteristics of any sample data under different characteristic dimensions;
fitting the sample features of each sample data under different feature dimensions according to the corresponding class labels to obtain a fitting result, so as to indicate the importance degrees corresponding to different feature dimensions in the process of adding the corresponding class label to each sample data;
according to the importance degrees indicated by the fitting result, using the feature dimensions meeting the selection condition as a conversion basis in the feature conversion process; wherein the conversion basis is used to generate an interpretability analysis result of the feature conversion.
2. The method according to claim 1, wherein performing fitting processing on the sample features of each sample data under different feature dimensions according to the corresponding class labels to obtain a fitting result comprises:
acquiring a target algorithm for performing interpretability analysis;
and fitting the sample characteristics of each sample datum under different characteristic dimensions according to the corresponding class label by adopting the target algorithm to obtain a fitting result.
3. The method of claim 1, wherein the class label of any sample data is obtained by:
performing feature conversion on the sample features of each sample data in different feature dimensions to obtain a feature vector corresponding to each sample data;
performing a clustering operation on the at least one sample data based on the feature vector corresponding to each sample data, so as to divide the at least one sample data into different data sets, one data set corresponding to one cluster label;
and taking the cluster label corresponding to the data set in which any sample data is located as the class label of that sample data.
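As an illustrative sketch only, the clustering step of claim 3 — group feature vectors into data sets and use the set assignment as a class label — could look like the following toy k-means. The algorithm choice, the function names, and the data are all hypothetical; the claim covers any clustering operation:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: assigns each feature vector a cluster label
    (its data-set index), standing in for the claimed clustering."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest center's data set
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # update step: move each center to the mean of its data set
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

# feature vectors as produced by a (hypothetical) feature conversion
vectors = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
cluster_labels = kmeans(vectors, k=2)  # one cluster label per sample data
```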
4. The method according to claim 3, wherein the feature vector corresponding to each of the at least one sample data is obtained by calling an i-th network hidden layer of a target network model having a classification function to perform feature conversion, the target network model comprising N network hidden layers, N being a positive integer greater than or equal to 1, and i being a positive integer greater than 0 and less than or equal to N;
and an interpretability analysis result obtained by performing feature conversion on the at least one sample data based on the i-th network hidden layer is taken as the interpretability analysis result of the i-th network hidden layer.
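For illustration, "calling the i-th network hidden layer to perform feature conversion" can be read as a partial forward pass that stops after layer i. The tiny ReLU MLP below is a hypothetical stand-in for the target network model; the weights and the 1-indexed layer convention are assumptions, not taken from the patent:

```python
def hidden_features(x, layers, i):
    """Return the output of the i-th hidden layer (1-indexed) of a
    small MLP. `layers` is a list of (weight_matrix, bias_vector)
    pairs; ReLU activations throughout."""
    h = list(x)
    for w, b in layers[:i]:
        h = [max(0.0, sum(wi * xi for wi, xi in zip(row, h)) + bi)
             for row, bi in zip(w, b)]
    return h

# hypothetical model with N = 2 hidden layers; convert with layer i = 1
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
          ([[1.0, 1.0]], [0.0])]
vec = hidden_features([2.0, 1.0], layers, i=1)  # feature vector of one sample
```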
5. The method of claim 4, wherein the acquiring at least one sample data comprises:
acquiring at least one reference sample data used for performing interpretability analysis on the (i-1)-th network hidden layer in the target network model, and acquiring at least two reference sample sets obtained after a clustering operation is performed on the at least one reference sample data in the process of performing interpretability analysis on the (i-1)-th network hidden layer;
and taking the reference sample data contained in any one of the reference sample sets as the acquired at least one sample data.
6. The method of claim 3, wherein determining a cluster label for a data set comprises:
acquiring the total number of data sets obtained after the clustering operation is performed on the at least one sample data, and determining, based on that total number, a set number corresponding to each acquired data set;
and taking the set number corresponding to one data set as the cluster label corresponding to that data set.
7. The method of claim 3, further comprising:
acquiring a real label of each sample data contained in any data set, and determining the number of samples corresponding to each real label;
calculating, based on the number of samples corresponding to each real label and the total amount of sample data in the data set, the sample proportion corresponding to each real label;
and determining, from the sample proportions corresponding to the different real labels, the real label corresponding to the maximum sample proportion, wherein the real label with the maximum sample proportion is used to indicate that the similarity between the feature vectors generated by the feature conversion and the feature vectors corresponding to the sample data under that real label is less than a threshold value.
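The proportion computation in claim 7 is straightforward; as an illustrative sketch (function and variable names are hypothetical), per-label sample counts become proportions of the set total, and the argmax label is returned:

```python
from collections import Counter

def dominant_real_label(real_labels):
    """For one data set, compute each real label's sample proportion
    (count / set total) and return the label with the maximum
    proportion together with all proportions (claim 7's steps)."""
    counts = Counter(real_labels)           # samples per real label
    total = len(real_labels)                # total sample data in the set
    ratios = {label: n / total for label, n in counts.items()}
    return max(ratios, key=ratios.get), ratios

# toy data set: real labels of the sample data it contains
label, ratios = dominant_real_label(["cat", "cat", "cat", "dog"])
```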
8. The method according to claim 2, wherein the performing, by the target algorithm, fitting processing on the sample features of each sample data in different feature dimensions according to the corresponding class label to obtain a fitting result comprises:
performing label prediction processing on the sample features of each sample data in different feature dimensions by the target algorithm to obtain a class prediction result of each sample data;
adjusting the prediction weights of the target algorithm based on the class prediction result of each sample data and the corresponding class label;
and taking the adjusted prediction weights as the fitting result.
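The predict-then-adjust loop of claim 8 resembles how any linear learner is trained. A minimal perceptron-style sketch is shown below, assuming the target algorithm is a linear model whose learned weight magnitudes serve as the fitting result; this is one possible instantiation, not the patent's prescribed algorithm:

```python
def fit_weights(samples, labels, lr=0.1, epochs=50):
    """Minimal perceptron-style fit: predict a class, compare with the
    class label, and adjust the prediction weights from the error.
    The adjusted weights are returned as the fitting result; their
    magnitudes indicate per-dimension importance."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            err = y - pred                        # class label vs. prediction
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

# toy data: class 1 is driven by dimension 0, class 0 by dimension 1
weights = fit_weights([(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)],
                      [1, 1, 0, 0])
```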
9. The method according to claim 2, wherein the performing, by the target algorithm, fitting processing on the sample features of each sample data in different feature dimensions according to the corresponding class label to obtain a fitting result comprises:
selecting a class label of a target label type from the obtained class labels, and taking the selected class label of the target label type as a fitting target;
and performing, based on the fitting target and by the target algorithm, fitting processing on the sample features of each sample data in different feature dimensions according to the corresponding class label to obtain a fitting result, wherein a fitting result obtained by fitting processing based on the fitting target is a fitting result associated with that fitting target.
10. The method of claim 9, wherein the obtained class labels have at least one label type, and the selecting a class label of the target label type from the obtained class labels comprises:
sequentially taking one label type of the at least one label type as the target label type, so as to select the class label of that target label type.
11. The method according to claim 1, wherein the taking, according to the importance indicated by the fitting result, the feature dimensions meeting the selection condition as a conversion basis in the feature conversion process comprises:
selecting a target number of feature dimensions in descending order of the importance indicated by the fitting result, and taking the selected feature dimensions as the conversion basis in the feature conversion process; or,
selecting, according to the importance indicated by the fitting result, the feature dimensions whose importance is greater than an importance threshold, and taking the selected feature dimensions as the conversion basis in the feature conversion process.
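Both branches of claim 11 — top-k selection in descending importance, or thresholding on importance — can be sketched in one helper. The mapping and names below are illustrative assumptions:

```python
def select_dimensions(importance, top_k=None, threshold=None):
    """Select feature dimensions either as the top_k most important ones
    (first branch of claim 11) or as all dimensions whose importance
    exceeds a threshold (second branch). `importance` maps each
    dimension name to its fitted importance score."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    if top_k is not None:
        return ranked[:top_k]
    return [d for d in ranked if importance[d] > threshold]

# hypothetical importance scores from a fitting result
scores = {"age": 0.7, "income": 0.2, "clicks": 0.9}
```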
12. A data processing apparatus, characterized in that the apparatus comprises:
an acquisition unit, configured to acquire at least one sample data and a class label of each sample data, wherein one sample data contains sample features in one or more feature dimensions, and the class label of any sample data is obtained by performing feature conversion on the sample features of that sample data in different feature dimensions;
and a processing unit, configured to perform fitting processing on the sample features of each sample data in different feature dimensions according to the corresponding class label to obtain a fitting result, the fitting result indicating the importance of each feature dimension in the process of assigning the corresponding class label to each sample data;
the processing unit being further configured to take, according to the importance indicated by the fitting result, the feature dimensions meeting a selection condition as a conversion basis in the feature conversion process, wherein the conversion basis is used to generate an interpretability analysis result of the feature conversion.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a data processing method according to any one of claims 1 to 11.
14. A computer device, comprising a memory, a communication interface, and a processor, the memory, the communication interface, and the processor being interconnected, wherein the memory stores a computer program and the processor invokes the computer program to implement the data processing method according to any one of claims 1 to 11.
15. A computer program product, characterized in that the computer program product comprises a computer program or computer instructions which, when executed by a processor, implement the data processing method of any one of claims 1-11.
CN202210734509.3A 2022-06-24 2022-06-24 Data processing method, device, storage medium, equipment and product Pending CN115130573A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210734509.3A CN115130573A (en) 2022-06-24 2022-06-24 Data processing method, device, storage medium, equipment and product

Publications (1)

Publication Number Publication Date
CN115130573A true CN115130573A (en) 2022-09-30

Family

ID=83380624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210734509.3A Pending CN115130573A (en) 2022-06-24 2022-06-24 Data processing method, device, storage medium, equipment and product

Country Status (1)

Country Link
CN (1) CN115130573A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618748A (en) * 2022-11-29 2023-01-17 支付宝(杭州)信息技术有限公司 Model optimization method, device, equipment and storage medium
CN115618748B (en) * 2022-11-29 2023-05-02 支付宝(杭州)信息技术有限公司 Model optimization method, device, equipment and storage medium
WO2024113932A1 (en) * 2022-11-29 2024-06-06 支付宝(杭州)信息技术有限公司 Model optimization method and apparatus, and device and storage medium
CN116468096A (en) * 2023-03-30 2023-07-21 之江实验室 Model training method, device, equipment and readable storage medium
CN116468096B (en) * 2023-03-30 2024-01-02 之江实验室 Model training method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
WO2021159776A1 (en) Artificial intelligence-based recommendation method and apparatus, electronic device, and storage medium
US11487941B2 (en) Techniques for determining categorized text
US20210303970A1 (en) Processing data using multiple neural networks
CN110852881B (en) Risk account identification method and device, electronic equipment and medium
CN115130573A (en) Data processing method, device, storage medium, equipment and product
CN109766454A (en) A kind of investor's classification method, device, equipment and medium
US12020267B2 (en) Method, apparatus, storage medium, and device for generating user profile
Wei [Retracted] A Method of Enterprise Financial Risk Analysis and Early Warning Based on Decision Tree Model
Chen et al. Research on credit card default prediction based on k-means SMOTE and BP neural network
CN112257841A (en) Data processing method, device and equipment in graph neural network and storage medium
CN113449012A (en) Internet service mining method based on big data prediction and big data prediction system
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
Hain et al. The promises of Machine Learning and Big Data in entrepreneurship research
CN118035800A (en) Model training method, device, equipment and storage medium
CN113761837B (en) Entity relationship type determining method, device and equipment and storage medium
Pai et al. A group decision classifier with particle swarm optimization and decision tree for analyzing achievements in mathematics and science
CN115129863A (en) Intention recognition method, device, equipment, storage medium and computer program product
KR102637198B1 (en) Method, computing device and computer program for sharing, renting and selling artificial intelligence model through artificial intelligence model production platform
CN114782206B (en) Method, device, computer equipment and storage medium for predicting claim label
CN114281994B (en) Text clustering integration method and system based on three-layer weighting model
CN116628236B (en) Method and device for delivering multimedia information, electronic equipment and storage medium
Alipourfard et al. DoGR: Disaggregated Gaussian Regression for Reproducible Analysis of Heterogeneous Data
US20240256589A1 (en) Systems and methods for data narration
CN116976928A (en) Training method of intent evaluation model, intent evaluation method and device
Tanwar et al. Classification of Web Services for Efficient Performability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination