CN110598065A - Data mining method and device and computer readable storage medium - Google Patents

Data mining method and device and computer readable storage medium

Info

Publication number
CN110598065A
CN110598065A CN201910801360.4A
Authority
CN
China
Prior art keywords
data
cluster
node
purity
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910801360.4A
Other languages
Chinese (zh)
Inventor
余莉萍
石楷弘
王吉
陈志博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Cloud Computing Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Cloud Computing Beijing Co Ltd filed Critical Tencent Cloud Computing Beijing Co Ltd
Priority to CN201910801360.4A priority Critical patent/CN110598065A/en
Publication of CN110598065A publication Critical patent/CN110598065A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a data mining method, a data mining device, and a computer-readable storage medium. The method comprises: performing feature extraction on a data set to be processed to construct a feature space; extracting node features from the feature space to generate graph data of the data set, the graph data comprising at least one node; screening out the data cluster corresponding to a node from the graph data; calculating the data purity of the data cluster to obtain its intra-cluster purity; and, when the intra-cluster purity is lower than a preset purity threshold, acquiring the data corresponding to the node in the data set to obtain the mined data. Because this scheme examines all the feature information in a data cluster and evaluates bad cases (Badcases) through the intra-cluster purity, over-reliance on the feature representation is reduced, bad cases in the data can be mined more quickly, efficiently, and accurately, and the hit rate of bad cases in the data is improved.

Description

Data mining method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of communication, in particular to a data mining method, a data mining device and a computer readable storage medium.
Background
In data mining scenarios, pure data is required whether the data is image data or text data. Limited by the representation capability of the model, however, bad cases (Badcases) arise when data is produced by classification, clustering, and similar processes, so the purity of a data cluster cannot be guaranteed.
In the research and practice of the prior art, the inventors found that manual searching consumes a great deal of labor, while simply calculating the pairwise distances between features in a data cluster makes it difficult to distinguish bad cases from normal data, owing to the diversity of types and the large differences within data clusters; the hit rate for bad cases is therefore low.
Disclosure of Invention
The embodiment of the invention provides a data mining method, a data mining device, and a computer-readable storage medium that can improve the hit rate of bad cases in data mining.
A method of data mining, comprising:
extracting features of a data set to be processed to construct a feature space;
extracting node features in the feature space to generate graph data of the data set to be processed, wherein the graph data at least comprises one node;
screening out data clusters corresponding to the nodes from the graph data;
calculating the data purity of the data cluster to obtain the intra-cluster purity of the data cluster;
and when the intra-cluster purity is lower than a preset intra-cluster purity threshold value, acquiring data corresponding to the nodes in the data set to be processed to obtain mined data.
Correspondingly, an embodiment of the present invention provides a data mining apparatus, including:
the extraction unit is used for extracting the features of the data set to be processed so as to construct a feature space;
a generating unit, configured to extract node features in the feature space to generate graph data of the to-be-processed data set, where the graph data includes at least one node;
the screening unit is used for screening out the data clusters corresponding to the nodes from the graph data;
the computing unit is used for computing the data purity of the data cluster to obtain the intra-cluster purity of the data cluster;
and the acquisition unit is used for acquiring the corresponding data of the nodes in the data set to be processed to obtain the mined data when the purity in the cluster is lower than a preset purity threshold value.
Optionally, in some embodiments, the calculating unit is specifically configured to perform feature extraction on the data cluster by using a trained graph recognition model to obtain data information of the data cluster, classify data in the data cluster according to the data information, and calculate data purity of the data cluster according to a classification result to obtain intra-cluster purity of the data cluster.
Optionally, in some embodiments, the calculating unit is specifically configured to obtain, according to the classification result, the number of each category of data and the total number of data in the data cluster in the data information, screen the data with the largest number from among the numbers of each category of data to serve as target data, and calculate a ratio between the target data and the total number of data in the data cluster to obtain the intra-cluster purity of the data cluster.
Optionally, in some embodiments, the computing unit is specifically configured to collect a plurality of data set samples, where the data set samples include data clusters with labeled cluster purity, predict the cluster purity of the data set samples by using a preset graph recognition model to obtain predicted cluster purity, and converge the preset graph recognition model according to the predicted cluster purity and the labeled cluster purity to obtain a trained graph recognition model.
Optionally, in some embodiments, the obtaining unit is specifically configured to, when the cluster purity is lower than a preset intra-cluster purity threshold, determine a target node corresponding to the data cluster, screen graph data corresponding to the target node from the graph data of the to-be-processed data set, obtain data corresponding to the node from the to-be-processed data set according to the graph data corresponding to the node, and use the data as data to be mined from the to-be-processed data set.
Optionally, in some embodiments, the screening unit is specifically configured to search the graph data for a neighboring node corresponding to the node, cluster the node and the corresponding neighboring node in the graph data to obtain a cluster map of the node, and screen a data cluster corresponding to the node in the cluster map.
Optionally, in some embodiments, the extracting unit is specifically configured to extract node features in the feature space, classify the node features, and generate graph data of the to-be-processed data set according to a classification result.
Optionally, in an embodiment, the extracting unit is specifically configured to extract node information in the node features of each category according to a classification result, construct a relationship tree according to the node information, and generate graph data of the to-be-processed data set based on the constructed relationship tree.
In addition, an embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores an application program, and the processor is configured to run the application program in the memory to implement the data mining method provided in the embodiment of the present invention.
In addition, the embodiment of the present invention further provides a computer-readable storage medium, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the data mining methods provided by the embodiment of the present invention.
The method comprises: performing feature extraction on a data set to be processed to construct a feature space; extracting node features from the feature space to generate graph data of the data set, the graph data comprising at least one node; screening out the data cluster corresponding to a node from the graph data; calculating the data purity of the data cluster to obtain its intra-cluster purity; and, when the intra-cluster purity is lower than a preset purity threshold, acquiring the data corresponding to the node in the data set to obtain the mined data. Because this scheme examines all the feature information in a data cluster and evaluates bad cases (Badcases) through the intra-cluster purity, over-reliance on the feature representation is reduced, bad cases in the data can be mined more quickly, efficiently, and accurately, and the hit rate of bad cases in the data is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic scene diagram of a data mining method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a data mining method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a structure of graph data provided by an embodiment of the present invention;
FIG. 4 is a schematic flow chart of intra-cluster purity calculation for a data cluster provided by an embodiment of the present invention;
FIG. 5 is another schematic flow chart diagram of a data mining method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a data mining apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an extraction unit of the data mining device according to the embodiment of the present invention;
fig. 8 is a schematic structural diagram of a generating unit of the data mining apparatus according to the embodiment of the present invention;
fig. 9 is a schematic structural diagram of a screening unit of the data mining apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a computing unit of the data mining device according to the embodiment of the present invention;
fig. 11 is another schematic structural diagram of a data mining device according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a data mining method, a data mining device and a computer readable storage medium. The data mining device may be integrated in an electronic device, and the electronic device may be a server or a terminal.
Data mining can be the process of extracting implicit, previously unknown but potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data; it can also be the process of finding and extracting bad cases (Badcases) from massive data. A bad case may arise when several different types of data exist in one data cluster: because one cluster should accommodate only one class of data or one type of file, the data become confused when multiple types coexist in the cluster. Therefore, when processing data, the Badcases in the data need to be mined. In the embodiment of the invention, Badcases are mainly mined from massive data.
For example, referring to fig. 1, taking the case in which the data mining device is integrated in an electronic device: the electronic device performs feature extraction on a data set to be processed to construct a feature space; extracts node features from the feature space to generate graph data of the data set, the graph data including at least one node; screens out the data cluster corresponding to a node from the graph data; calculates the data purity of the data cluster to obtain its intra-cluster purity; and, when the intra-cluster purity is lower than a preset purity threshold, obtains the data corresponding to the node in the data set to obtain the mined data.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
In this embodiment, the data mining apparatus will be described from the perspective of the apparatus itself. The data mining apparatus may be specifically integrated in an electronic device, and the electronic device may be a server or a terminal; the terminal may include a tablet computer, a notebook computer, a personal computer (PC), and other devices.
A method of data mining, comprising: the method comprises the steps of extracting features of a data set to be processed to construct a feature space, extracting node features from the feature space to generate graph data of the data set to be processed, wherein the graph data at least comprises one node, screening out data clusters corresponding to the node from the graph data, calculating the data purity of the data clusters to obtain the intra-cluster purity of the data clusters, and when the intra-cluster purity is lower than a preset purity threshold, obtaining the data corresponding to the node in the data set to be processed to obtain mined data.
As shown in fig. 2, the specific flow of the data mining method is as follows:
101. and performing feature extraction on the data set to be processed to construct a feature space.
The feature space may be the space in which all feature vectors exist. It stores every feature of the data set to be processed in feature-vector form, including the relationship attributes between features (the relationships between features may, for example, be represented by nodes) as well as the attributes of the features themselves.
(1) Acquiring a data set to be processed;
for example, the data set to be processed may be obtained in various ways: data may be downloaded or collected from the Internet and composed into a data set, or a user may upload data to a server, from which the data mining device obtains the uploaded data to compose the data set. The data set may include one type of data or multiple types of data.
(2) Extracting features of a data set to be processed to construct a feature space;
for example, there are various methods for extracting features from the data set to be processed. A deep residual network may be used to extract feature information of the data, such as the structure of the data, the relationships between data items, and/or the types of the data. The extracted feature information is arranged and stored in feature-vector form to construct a feature space: the overall structure of the feature space is built from the relationships between the features, where a connection position in the structure may be an intersection point or node between features; the space is then completed or enriched with the attributes of the features themselves, so that a feature space containing all the extracted feature information is formed and all the feature information is stored in it.
102. And extracting node features in the feature space to generate graph data of the data set to be processed, wherein the graph data at least comprises one node.
The node features may be the intersection points formed by features and the interrelations between them, or the feature information of the nodes; node features may include the information of one or more nodes. Graph data is one kind of data structure, also referred to simply as a graph, which comprises nodes and edges: a node may have two or more neighboring elements, and the connection between two nodes is called an edge.
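For ease of understanding, the graph-data structure just described (nodes carrying attributes, edges connecting pairs of nodes) may be sketched as a minimal adjacency-list class; the class and method names here are illustrative and are not part of the described scheme.

```python
class GraphData:
    """Minimal adjacency-list graph: nodes carry attribute dicts,
    and an undirected edge connects a pair of nodes."""

    def __init__(self):
        self.nodes = {}   # node id -> attribute dict (feature information)
        self.adj = {}     # node id -> set of neighboring node ids

    def add_node(self, nid, attrs=None):
        self.nodes.setdefault(nid, attrs or {})
        self.adj.setdefault(nid, set())

    def add_edge(self, u, v):
        # the connection between two nodes is an edge
        self.add_node(u)
        self.add_node(v)
        self.adj[u].add(v)
        self.adj[v].add(u)

    def neighbours(self, nid):
        return self.adj[nid]

# a node with two neighboring elements, as in the definition above
g = GraphData()
g.add_edge("n1", "n2")
g.add_edge("n1", "n3")
```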
for example, the node features are extracted from the feature space and classified in various ways: hierarchical clustering may be used, or a K-nearest-neighbor algorithm may be applied, to obtain different types of node features. For instance, node features may be classified into head node features, middle node features, and tail node features according to their positions in the feature space.
Corresponding node information is extracted from each class of node features according to the classification result: for example, the information of one or more head nodes is extracted from the head node features, of one or more middle nodes from the middle node features, and of one or more tail nodes from the tail node features. A relationship tree is then constructed from the extracted node information of each class. For example, the interrelations among the extracted node information are obtained (e.g. among one or more pieces of tail-part node information), the information of the root node is determined from these interrelations, and the root of the relationship tree is built from the root-node information. Starting from the root node, the nodes on the trunk of the relationship tree are found in order according to the obtained root-node information; the nodes on the branches corresponding to each trunk node are then searched for among the remaining node information and connected to one another, forming the branches and trunk of the relationship tree; finally, the information of the leaf nodes on each branch is obtained from the node information on the branches and trunk, completing the construction of the relationship tree.
Graph data of the data set to be processed is then generated from the constructed relationship tree. For example, the feature attributes of the nodes may be filled or fused into the relationship tree so that each node in the tree includes one or more data items, and the data in the features may be mapped onto each connecting line of the tree to form the edges of the graph data. The generated graph data structures and visualizes the data in the data set to be processed: it directly reflects interrelations such as positional and structural relationships between data items, and can also carry attribute information of the data. A common graph-data structure is shown in fig. 3, where each node or vertex may be a data item in the data set and the relationship between data items is represented by an edge; if the relationship between data items is annotated on an edge, the edges of the graph data may themselves represent some data in the data set.
A relationship tree, also called a tree structure, is a data structure with one-to-many tree-shaped relations among its data elements and is an important nonlinear data structure. In a tree structure, elements are connected through nodes to form a tree: the root node has no predecessor, while every other node has exactly one predecessor; leaf nodes have no successors, while each of the remaining nodes may have one or more successors.
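The tree structure just described — a root with no predecessor, every other node with exactly one predecessor, and leaves with no successors — can be sketched with a minimal node class; the names used are illustrative only.

```python
class TreeNode:
    """A node in a relationship tree: one predecessor (parent),
    zero or more successors (children)."""

    def __init__(self, info):
        self.info = info
        self.parent = None     # the root node keeps parent == None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)

root = TreeNode("root")        # root: no predecessor
trunk = TreeNode("trunk")      # trunk node hung off the root
leaf = TreeNode("leaf")        # a leaf has no successor
root.add_child(trunk)
trunk.add_child(leaf)
```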
103. And screening data clusters corresponding to the nodes in the graph data.
A data cluster, also called a cluster, may be the smallest storage management unit in data storage. For example, a file is usually stored in one or more clusters and occupies at least one cluster; that is, two different files are not stored in the same cluster. In short, files or data are stored in data clusters in a computer system, and one file, or one class of files, is stored in the same data cluster.
For example, a node is randomly selected from the graph data as the target node, and the neighboring nodes corresponding to the target node are searched for. There are various methods for finding neighboring nodes; for example, the K nearest neighbors of the target node can be found through cosine similarity, where K may be any value. The neighboring nodes may include nodes directly adjacent to the target node, and may also include nodes whose distance from the target node in the graph data is within a preset distance threshold. For instance, the target node and the remaining nodes are converted into vectors in the space of the graph data; the cosine of the angle between the vector of each remaining node and the vector of the target node is used as the measure of the cosine distance between them; this cosine distance is compared with the preset distance threshold, and the remaining nodes whose cosine distance falls within the threshold are taken as the neighbors of the target node.
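One possible sketch of the cosine-similarity neighbor search described above (the function names are illustrative, not from the embodiment):

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between vectors a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def k_nearest(target_vec, node_vecs, k):
    """Return the ids of the k nodes whose vectors have the smallest
    cosine distance to the target node's vector."""
    ranked = sorted(node_vecs.items(),
                    key=lambda kv: cosine_distance(target_vec, kv[1]))
    return [nid for nid, _ in ranked[:k]]

# the K = 2 nearest neighbors of a target vector pointing along the x axis
node_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
neighbors = k_nearest([1.0, 0.0], node_vecs, 2)
```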
Hierarchical clustering is then performed on the target node and its neighboring nodes in the graph data to obtain a cluster map of the target node. The merging algorithm of hierarchical clustering combines the two most similar of all data points by calculating the similarity between pairs of data points, and iterates this process. In brief, it determines the similarity between the data points of each category by calculating the distance between them: the smaller the distance, the higher the similarity. The two closest data points or categories are merged to generate the cluster map, thereby completing the classification of the data. For example, in the graph data each node may be regarded as one or more data items, so the obtained target node and its neighbors may contain multiple items or multiple types of data; by calculating the similarities among these data and repeatedly merging the two most similar items, the data of the target node and its neighbors are finally divided into two or more classes, and the cluster map is generated.
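The merging process described above — repeatedly combining the two closest clusters — can be sketched for one-dimensional points as follows; this is a simplified single-linkage illustration under assumed inputs, not the exact embodiment.

```python
def agglomerate(points, stop_k):
    """Repeatedly merge the two closest clusters (smallest pairwise
    distance between any of their members) until stop_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > stop_k:
        best = None   # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

# two tight groups of points collapse into two clusters
result = agglomerate([1.0, 1.1, 5.0, 5.2], 2)
```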
The data cluster corresponding to the target node is then screened in the cluster map: for example, a cluster subgraph of the target node is generated from the target node in the cluster map, and this cluster subgraph is taken as the data cluster corresponding to the target node. The cluster subgraph can be regarded as a sub-topology formed by the one or more data items closest or most similar to the target node in the cluster map. The data cluster thus contains a plurality of data items, which may be of one type or of several types.
104. And calculating the data purity of the data cluster to obtain the intra-cluster purity of the data cluster.
The data purity may include a ratio between a data amount and a total data amount of each type or category of data in the data cluster. The intra-cluster purity may include a ratio between a type of data having the most amount of data within a data cluster and a total amount of data.
For example, feature extraction is performed on the data cluster by using the trained graph recognition model to obtain data information of the data cluster. For example, feature extraction may be performed on a data cluster through a Graph Convolutional neural Network (GCN) to obtain data information of the data cluster, which may specifically be as follows:
each node in the data cluster transforms its own feature information and sends it to its neighboring nodes (here, the nodes directly connected to it by an edge); each node then gathers the feature information sent by its neighbors and fuses it, yielding the data information of the data cluster, which includes the total amount of all data in the cluster and the attribute information of all the data.
The selected target node v_j serves as the central vertex of the GCN model. The GCN model takes as input the cluster subgraph (data cluster) formed by the target node v_j together with the set of vertices associated with it on the visual features, and performs feature extraction on this cluster subgraph to obtain its data information. The calculation formula is:

F_{l+1}(P_i) = σ( D̃^{-1/2} ( A(P_i) + I ) D̃^{-1/2} F_l(P_i) W_l )

where A(P_i) is the cluster subgraph formed by the target node v_j and the set of vertices associated with it on the visual features (its entries giving the data information of the nodes with which the target node is associated), D̃ is a diagonal degree matrix, I is the identity matrix, F_l(P_i) is the feature expression of layer l, W_l is the feature map learned by the l-th GCN layer, and σ is the activation function. It should be noted that ReLU (an activation function) may be selected as σ.
After acquiring the data information in the data cluster through feature extraction, the GCN model classifies the data in the data cluster according to the data information, for example according to the attribute information of the data, or according to the structure of the data (data with the same structure may be grouped into one class). The number of items in each category and the total amount of data in the cluster are then obtained from the data information according to the classification result. For example, the data cluster may be divided into class A, class B, and class C data, where class A includes data 1 and data 2, class B includes data 3 and data 4, and class C includes data 5 and data 6; the quantities of data 1 through data 6 are obtained from the data information, the number of class A items is obtained as the sum of the numbers of data 1 and data 2, the numbers of class B and class C items are obtained likewise, and the total amount of all data in the cluster is obtained as well.
The data with the largest count is then screened from the per-category counts and used as the target data. For example, as shown in fig. 4, squares represent class A data, circles represent class B data, and pentagons represent class C data. Assuming the number of class A items is 100, of class B items 20, and of class C items 10, class A (the squares in fig. 4) is screened out of classes A, B, and C as the most numerous data and taken as the target data; it may also be taken as the class represented by the data cluster, and the intra-cluster purity of the data cluster is then the data purity of class A (the squares in fig. 4). The ratio between the target data and the total amount of data in the cluster is calculated to obtain the intra-cluster purity; as shown in fig. 4, this is the ratio of the number of class A items to the total number of A, B, and C items in the cluster. The calculation formula is:

purity(P_i, C_gt) = max_j |w_k ∩ c_j| / |w_k|

where purity(P_i, C_gt) is the intra-cluster purity of the target data, w_k is the set of all data in the data cluster (|w_k| being its total amount), and C_gt = {c_1, c_2, …, c_M} is the original classification result.
For example, if the number of the class a data is 100, the number of the class B data is 20, and the number of the class C data is 10, and the target data is the class a data, the intra-cluster purity of the data cluster is a ratio of 100 numbers of the class a data to 130 numbers of total numbers of data in the data cluster, and is approximately equal to 0.769.
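The intra-cluster purity computation above reduces to the ratio of the largest class count to the cluster size, and can be sketched directly (the names are illustrative):

```python
from collections import Counter

def intra_cluster_purity(labels):
    """Size of the most numerous class in the cluster divided by
    the total amount of data in the cluster."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# the worked example: 100 class A, 20 class B, 10 class C items
labels = ["A"] * 100 + ["B"] * 20 + ["C"] * 10
purity = intra_cluster_purity(labels)   # 100 / 130, approximately 0.769
```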
Optionally, the trained graph recognition model may be set by operation and maintenance personnel in advance, or may be obtained by self-training of the data mining device. Before the step of performing feature extraction on the data cluster by using the trained graph recognition model, the data mining method may further include:
(1) a plurality of data set samples are collected, the data set samples including data clusters with labeled intra-cluster purities.
For example, there may be various manners of collecting the plurality of data set samples: data of known types and quantities may be downloaded from the internet and composed into data clusters, the data clusters composed into data set samples, and the intra-cluster purity of each data cluster calculated according to the calculation formula and labeled; the data set samples and the labeled intra-cluster purities corresponding to their data clusters may then be uploaded to the data mining device.
(2) And predicting the intra-cluster purity of the data set sample by adopting a preset graph recognition model to obtain the predicted intra-cluster purity.
For example, feature extraction may be specifically performed on a data set sample to construct a feature space, node features are extracted from the feature space to generate graph data of the data set sample, where the graph data includes at least one node, a data cluster corresponding to the node is screened out from the graph data, and data purity of the data cluster is calculated to obtain predicted intra-cluster purity of the data cluster in the data set.
(3) And converging the preset graph recognition model according to the predicted intra-cluster purity and the marked cluster purity to obtain the trained graph recognition model.
In the embodiment of the invention, the preset graph recognition model can be converged according to the intra-cluster purity and the predicted intra-cluster purity of the data cluster marked in the data set sample by the interpolation loss function, so as to obtain the trained graph recognition model. For example, the following may be specifically mentioned:
A Dice function (a loss function) is adopted to adjust the parameters used for calculating the intra-cluster purity output in the graph recognition model, according to the labeled intra-cluster purity of the data cluster in the data set sample and the predicted intra-cluster purity; an interpolation loss function may likewise be adopted to adjust these parameters, so as to obtain the trained graph recognition model.
Optionally, in order to improve the accuracy of the context feature, besides the Dice function, other loss functions such as a cross entropy loss function may be used for convergence, which may specifically be as follows:
A cross entropy loss function is adopted to adjust the parameters used for calculating the intra-cluster purity output in the graph recognition model, according to the labeled intra-cluster purity of the data cluster in the data set sample and the predicted intra-cluster purity; an interpolation loss function may likewise be adopted to adjust these parameters, so as to obtain the trained graph recognition model.
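For illustration, a minimal sketch of how a Dice loss and a cross entropy loss could compare predicted and labeled intra-cluster purities; this is a generic formulation under assumed definitions, not the exact loss of the scheme (the `dice_loss` and `bce_loss` helpers and the sample purity values are illustrative):

```python
import math

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2·Σ(p·t) / (Σp + Σt)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def bce_loss(pred, target, eps=1e-12):
    """Soft binary cross-entropy between predicted and labeled purity."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

predicted = [0.769, 0.9]   # predicted intra-cluster purities (hypothetical)
labeled   = [0.8,   0.9]   # labeled intra-cluster purities (hypothetical)
loss = dice_loss(predicted, labeled) + bce_loss(predicted, labeled)
print(loss > 0.0)  # → True
```

In practice the combined loss would drive gradient updates of the parameters producing the intra-cluster purity output.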
105. And when the purity in the cluster is lower than a preset purity threshold value, acquiring data corresponding to the nodes in the data set to be processed to obtain mined data.
(1) When the purity in the cluster is lower than a preset purity threshold value, acquiring data corresponding to the nodes in the data set to be processed to obtain mined data;
For example, when the intra-cluster purity is lower than the preset purity threshold, the target node corresponding to the data is determined. For example, if the intra-cluster purity of the data cluster corresponding to node A is calculated to be 0.769 and the preset intra-cluster purity threshold is 0.8, then the intra-cluster purity of the data cluster corresponding to node A is lower than the threshold, which indicates that node A is the target node corresponding to the data to be mined.
The graph data corresponding to the target node is screened from the graph data of the data set to be processed. For example, after the target node is determined to be node A, the position of node A in the cluster map of the graph data is obtained according to the cluster subgraph of node A; based on this position, the target area of node A in the graph data can be obtained; and according to that target area, the data corresponding to node A in the data set is obtained and used as the data to be mined in the data set to be processed, i.e. the Badcase to be mined. After the mining is completed, the intra-cluster purity of the data cluster corresponding to the next node continues to be calculated until the intra-cluster purities of the data clusters corresponding to all the nodes in the graph data have been calculated.
(2) And when the intra-cluster purity is not lower than the preset intra-cluster purity threshold, continuing the intra-cluster purity calculation of the data cluster corresponding to the next node.
For example, when the intra-cluster purity is not lower than the preset intra-cluster purity threshold, the data cluster corresponding to the next target node is acquired. For example, the intra-cluster purity of the data cluster corresponding to the node a is 0.9, and the preset intra-cluster purity threshold is 0.8, then the data cluster corresponding to the node a does not contain Badcase, the data cluster corresponding to the node B in the graph data is obtained, the intra-cluster purity of the data cluster corresponding to the node B is calculated, and the remaining nodes in the graph data corresponding to the data set to be processed are sequentially processed until the intra-cluster purity of the data cluster corresponding to all the nodes is calculated.
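The threshold check in steps (1) and (2) above can be sketched as a simple loop; a minimal illustration (node names, the `mine_badcases` helper, and the purity values are hypothetical):

```python
def mine_badcases(cluster_purities, threshold=0.8):
    """Walk every node's data cluster; clusters whose intra-cluster
    purity falls below the preset threshold are flagged as containing
    Badcase data to be mined."""
    mined = []
    for node, purity in cluster_purities.items():
        if purity < threshold:
            mined.append(node)   # fetch this node's data from the data set
        # otherwise continue with the next node's data cluster
    return mined

# Hypothetical purities from the example: node A at 0.769, node B at 0.9.
print(mine_badcases({"A": 0.769, "B": 0.9}))  # → ['A']
```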
As can be seen from the above, in the embodiment of the present invention, feature extraction is performed on a data set to be processed to construct a feature space; node features are extracted from the feature space to generate graph data of the data set to be processed, where the graph data includes at least one node; a data cluster corresponding to the node is screened out from the graph data; the data purity of the data cluster is calculated to obtain the intra-cluster purity of the data cluster; and when the intra-cluster purity is lower than a preset purity threshold, the data corresponding to the node in the data set to be processed is acquired to obtain the mined data. This scheme not only examines all the feature information in the data cluster, but also evaluates bad cases through the intra-cluster purity so as to mine them, reducing excessive dependence on feature representation; bad cases (Badcase) in the data can thus be mined more quickly, efficiently, and accurately, improving the hit rate of bad cases in the data.
The method described in the above examples is further illustrated in detail below by way of example.
In this embodiment, the data mining apparatus will be described by taking an example in which the data mining apparatus is specifically integrated in an electronic device.
Training of graph recognition model
Firstly, the electronic device collects a plurality of data set samples. For example, data of known types and quantities may be downloaded from the internet and composed into data clusters, the data clusters composed into data set samples, and the intra-cluster purity of each data cluster calculated according to the calculation formula and labeled; the data set samples and the labeled intra-cluster purities corresponding to their data clusters may then be uploaded to the data mining device.
Secondly, the electronic device can input the data set sample into a preset graph recognition model, feature extraction is carried out on the data set sample to construct a feature space, node features are extracted from the feature space to generate graph data of the data set sample, the graph data at least comprise one node, a data cluster corresponding to the node is screened out from the graph data, the data purity of the data cluster is calculated, and the predicted intra-cluster purity of the data cluster in the data set is obtained.
And thirdly, the electronic device converges the preset graph recognition model according to the predicted intra-cluster purity and the labeled intra-cluster purity to obtain the trained graph recognition model. For example, a Dice function (a loss function) may specifically be adopted to adjust the parameters used for calculating the intra-cluster purity output in the graph recognition model, according to the labeled intra-cluster purity of the data cluster in the data set sample and the predicted intra-cluster purity; an interpolation loss function may likewise be adopted to adjust these parameters, so as to obtain the trained graph recognition model.
Optionally, in order to improve the accuracy of the context feature, besides the Dice function, other loss functions such as a cross entropy loss function may be used for convergence, which may specifically be as follows:
A cross entropy loss function is adopted to adjust the parameters used for calculating the intra-cluster purity output in the graph recognition model, according to the labeled intra-cluster purity of the data cluster in the data set sample and the predicted intra-cluster purity; an interpolation loss function may likewise be adopted to adjust these parameters, so as to obtain the trained graph recognition model.
Next, the intra-cluster purity of the data cluster corresponding to each node in the graph data of the data set to be processed is calculated through the trained graph recognition model; when the intra-cluster purity of a data cluster is lower than the preset purity threshold, the data corresponding to the node in the data set to be processed is acquired to obtain the mined data.
As shown in fig. 5, a data mining method specifically includes the following steps:
201. the electronic device obtains a dataset to be processed.
For example, the electronic device may obtain data from the internet, such as by downloading or collection, and compose the data into a data set. Alternatively, the user may upload data to a server, and the data mining apparatus obtains the data uploaded by the user from the server and composes it into a data set. The data set may include one type of data or multiple types of data.
202. The electronic device performs feature extraction on the data set to be processed to construct a feature space.
For example, the electronic device may perform feature extraction on the data set to be processed by using a deep residual network, extracting feature information of the data in the data set, such as the structure of the data, the relationships between data, and/or the type of the data. The extracted feature information is arranged and stored in the form of feature vectors to construct a feature space. For example, an overall structure of the feature space is constructed by using the relationships between the features, where a connection position in the overall structure may be an intersection point or a node between features; the feature space is then perfected or enriched by using the attributes of the features themselves, forming a feature space that contains all the extracted feature information, and all the feature information is stored in this feature space.
203. The electronic equipment extracts node features in the feature space to generate graph data of the data set to be processed, wherein the graph data at least comprises one node.
For example, the electronic device extracts node features in the feature space, may classify the node features in a hierarchical clustering manner, and may also classify the node features by using a K-neighbor algorithm to obtain different types of node features, for example, the node features may be classified into a head node feature, a middle node feature, and a tail node feature according to different positions of the node features in the feature space.
The electronic device extracts corresponding node information from each kind of node feature according to the classification result, for example extracting one or more pieces of head node information from the head node features, one or more pieces of middle node information from the middle node features, and one or more pieces of tail node information from the tail node features. A relationship tree is then constructed from the extracted node information of each kind. For example, the mutual relationships among the extracted pieces of node information are obtained, the node information of the root node is determined from these relationships, and the root of the relationship tree is built on it. Starting from the root node, the nodes on the trunk of the relationship tree are found in turn according to the obtained root node information; the nodes on the branches corresponding to the trunk nodes are then searched for among the remaining node information and connected to one another, forming the branches and trunk of the relationship tree; finally, the information of the leaf nodes on the branches is obtained from the node information on the branches and trunk, completing the construction of the relationship tree.
According to the constructed relationship tree, the electronic device fills or fuses the feature attributes of the nodes into the tree, so that each node in the relationship tree contains one or more pieces of data; the data in the features can then be mapped onto each connecting line in the relationship tree to form the edges of the graph data, finally generating the graph data of the data set to be processed. The generated graph data structures and visualizes the data in the data set to be processed; the mutual relations between the data, such as positional and structural relations, can be read intuitively from the graph data, which may also include attribute information of the data.
204. And the electronic equipment screens the data clusters corresponding to the nodes in the graph data.
For example, the electronic device randomly selects a node in the graph data and uses it as the target node, then finds the K neighboring nodes of the target node through cosine similarity, where K may be any value. The neighboring nodes may include nodes directly adjacent to the target node, and may also include nodes whose distance from the target node in the graph data is within a preset distance threshold. For example, the target node and the remaining nodes are converted into vectors in the space of the graph data; the cosine of the angle between each remaining node's vector and the target node's vector serves as the measure of the cosine distance between that node and the target node; each cosine distance is compared against the preset distance threshold, and the remaining nodes whose cosine distance is within the threshold are taken as the neighboring nodes of the target node.
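The cosine-distance screening just described can be sketched as follows; a minimal pure-Python illustration (the node vectors and the 0.2 distance threshold are hypothetical):

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def neighbors(target, others, threshold):
    """Nodes whose cosine distance to the target vector lies within the
    preset distance threshold are treated as its neighboring nodes."""
    return [name for name, vec in others.items()
            if cosine_distance(target, vec) <= threshold]

# Hypothetical node embeddings.
nodes = {"B": [1.0, 0.1], "C": [0.0, 1.0], "D": [0.9, 0.2]}
print(neighbors([1.0, 0.0], nodes, threshold=0.2))  # → ['B', 'D']
```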
The electronic device performs hierarchical clustering on the target node and its corresponding neighboring nodes in the graph data to obtain a cluster map of the target node. For example, in the graph data each node may be regarded as one or more pieces of data, so the target node and its neighboring nodes together may contain data of multiple kinds. By calculating the similarity between these pieces of data, the two most similar pieces among all the data are merged; this process is iterated repeatedly, finally dividing the data of the target node and its neighboring nodes into two or more classes and generating the cluster map.
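The hierarchical clustering step — repeatedly merging the two most similar groups — can be sketched as follows; a minimal agglomerative illustration using centroid cosine similarity (the stopping criterion of k remaining clusters and the sample points are assumptions for the demo):

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def centroid(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def hierarchical_cluster(points, k):
    """Repeatedly merge the two most similar clusters (by centroid
    cosine similarity) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine_sim(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

pts = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(sorted(len(c) for c in hierarchical_cluster(pts, 2)))  # → [2, 2]
```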
The electronic device screens the data clusters corresponding to the target node in the cluster map, for example, in the cluster map, a cluster subgraph of the target node is generated according to the target node, and the cluster subgraph is used as the data clusters corresponding to the target node. The clustering subgraph can be regarded as a sub-topological graph formed by one or more data which are closest to or most similar to the target node clusters in the clustering graph. It can be seen that the data cluster contains a plurality of data, and the plurality of data may be one type or multiple types.
205. And the electronic equipment calculates the data purity of the data cluster to obtain the intra-cluster purity of the data cluster.
For example, the electronic device performs feature extraction on the data cluster by using the trained graph recognition model to obtain data information of the data cluster. For example, feature extraction may be performed on a data cluster through a Graph Convolutional neural Network (GCN) to obtain data information of the data cluster, which may specifically be as follows:
the electronic equipment transforms and sends the characteristic information of each node in the data cluster to respective adjacent nodes, the adjacent nodes at the moment comprise adjacent nodes directly connected through a sideline, each node gathers the characteristic information sent by each adjacent node to perform fusion of the characteristic information, and the data information in the data cluster is obtained, wherein the data information comprises the total amount of all data in the data cluster and the attribute information of all the data.
Wherein the selected target node $v_j$ serves as the central vertex of the GCN model. The GCN model takes as input the cluster subgraph (data cluster) formed by the target node $v_j$ and its associated vertex set on the visual features, and performs feature extraction on the cluster subgraph (data cluster) to obtain its data information. The calculation formula is as follows:

$$F_{l+1}(P_i) = \sigma\left(\tilde{D}^{-\frac{1}{2}}\left(A(P_i) + I\right)\tilde{D}^{-\frac{1}{2}}\, F_l(P_i)\, W_l\right)$$

wherein $A(P_i)$ is the adjacency of the cluster subgraph formed by the target node $v_j$ and its visually associated vertex set, $\tilde{D}$ is a diagonal degree matrix, $I$ is an identity matrix, $F_l(P_i)$ is the feature expression of layer $l$, $W_l$ is the feature map learned by the $l$-th GCN layer, and $\sigma$ is the activation function. It should be noted here that the activation function may be ReLU (an activation function).
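The propagation formula above is the standard GCN layer; a minimal numpy sketch (the toy subgraph, one-hot features, and constant weights standing in for the learned $W_l$ are all illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gcn_layer(A, F, W):
    """One GCN propagation step over the cluster subgraph:
    F_{l+1} = sigma( D~^{-1/2} (A + I) D~^{-1/2} F_l W_l )."""
    A_hat = A + np.eye(A.shape[0])                           # add self-loops (I)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))   # diagonal degree matrix
    return relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ F @ W)

# Hypothetical 3-node cluster subgraph: target node 0 linked to nodes 1 and 2.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
F = np.eye(3)                 # one-hot node features
W = np.ones((3, 2)) * 0.5     # stand-in for the learned W_l
H = gcn_layer(A, F, W)
print(H.shape)  # → (3, 2)
```

Each node's output row fuses the feature information sent by its neighbors (and itself), matching the aggregation described above.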
After the GCN model in the electronic device obtains the data information in the data cluster through feature extraction, the data in the data cluster is classified according to the data information, for example, the data is classified according to the attribute information of the data, and the data can also be classified according to the structure of the data, for example, the data with the same data structure can be classified into one type according to the structure of the data. And acquiring the quantity of each category of data and the total quantity of data of the data cluster in the data information according to the classification result. For example, according to the classification result, the data cluster can be classified into class a data, class B data and class C data, where the class a data includes data 1 and data 2, the class B data includes data 3 and data 4, the class C data includes data 5 and data 6, the number of the data 1 to the data 6 is obtained in the data information, based on the obtained data 1 to the data 6, the number of the class a data can be obtained as the sum of the number of the data 1 and the number of the data 2, and similarly, the number of the class B data and the class C data can be obtained, and the total number of all the data in the data cluster can also be obtained.
And screening the data with the largest count among the per-category counts as the target data. For example, if the number of class A data is 100, the number of class B data is 20, and the number of class C data is 10, the most numerous class A data is screened out from the class A, B and C data and used as the target data, or as the class represented by the data cluster; the intra-cluster purity of the data cluster is then the data purity of the class A data. The ratio of the target data to the total amount of data in the data cluster is calculated to obtain the intra-cluster purity of the data cluster. The calculation formula is as follows:
$$\mathrm{purity}(P_i, C_{gt}) = \frac{\max_j |w_k \cap c_j|}{|w_k|}$$

wherein $\mathrm{purity}(P_i, C_{gt})$ is the intra-cluster purity of the target data, $w_k$ is the total data in the data cluster, $C_{gt}=\{c_1, c_2, \dots, c_M\}$ is the result of the original classification, and $c_j$ is the amount of each kind of data in the data cluster.
For example, if the number of class A data is 100, the number of class B data is 20, and the number of class C data is 10, and the target data is the class A data, then the intra-cluster purity of the data cluster is the ratio of the 100 class A data to the 130 total data in the data cluster, which is approximately 0.769.
206. And when the purity in the cluster is lower than a preset purity threshold value, the electronic equipment acquires the data corresponding to the nodes in the data set to be processed to obtain the mined data.
For example, when the intra-cluster purity is lower than the preset purity threshold, the electronic device determines the target node corresponding to the data. For example, if the intra-cluster purity of the data cluster corresponding to node A is calculated to be 0.769 and the preset intra-cluster purity threshold is 0.8, then the intra-cluster purity of the data cluster corresponding to node A is lower than the threshold, which indicates that node A is the target node corresponding to the data to be mined.
The electronic device screens the graph data corresponding to the target node from the graph data of the data set to be processed. For example, after the target node is determined to be node A, the position of node A in the cluster map of the graph data is obtained according to the cluster subgraph of node A; based on this position, the target area of node A in the graph data can be obtained; and according to that target area, the data corresponding to node A in the data set is obtained and used as the data to be mined in the data set to be processed, i.e. the Badcase to be mined. After the mining is completed, the intra-cluster purity of the data cluster corresponding to the next node continues to be calculated until the intra-cluster purities of the data clusters corresponding to all the nodes in the graph data have been calculated.
207. And when the intra-cluster purity is not lower than the preset intra-cluster purity threshold, continuing the intra-cluster purity calculation of the data cluster corresponding to the next node.
For example, when the intra-cluster purity is not lower than a preset intra-cluster purity threshold, the electronic device acquires a data cluster corresponding to a next target node. For example, the intra-cluster purity of the data cluster corresponding to the node a is 0.9, and the preset intra-cluster purity threshold is 0.8, then the data cluster corresponding to the node a does not contain Badcase, the data cluster corresponding to the node B in the graph data is obtained, the intra-cluster purity of the data cluster corresponding to the node B is calculated, and the remaining nodes in the graph data corresponding to the data set to be processed are sequentially processed until the intra-cluster purity of the data cluster corresponding to all the nodes is calculated.
As can be seen from the above, in this embodiment, the electronic device performs feature extraction on a data set to be processed to construct a feature space; extracts node features in the feature space to generate graph data of the data set to be processed, where the graph data includes at least one node; screens out the data cluster corresponding to a node from the graph data; calculates the data purity of the data cluster to obtain its intra-cluster purity; and, when the intra-cluster purity is lower than a preset purity threshold, acquires the data corresponding to the node in the data set to be processed, obtaining the mined data. This scheme not only examines all the feature information in the data cluster, but also evaluates bad cases through the intra-cluster purity so as to mine them, reducing excessive dependence on feature representation; bad cases (Badcase) in the data can thus be mined more quickly, efficiently, and accurately, improving the hit rate of bad cases in the data.
In order to better implement the above method, an embodiment of the present invention further provides a data mining apparatus, which may be integrated in an electronic device, such as a server or a terminal, and the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 6, the data mining apparatus may include an extracting unit 301, a generating unit 302, a screening unit 303, a calculating unit 304, and an obtaining unit 305, as follows:
(1) an extraction unit 301;
an extracting unit 301, configured to perform feature extraction on the data set to be processed to construct a feature space.
The extracting unit 301 may include an obtaining subunit 3011 and an extracting subunit 3012, as shown in fig. 7, specifically as follows:
an obtaining subunit 3011, configured to obtain a data set to be processed;
a first extraction subunit 3012, configured to perform feature extraction on the data set to be processed to construct a feature space.
For example, the acquisition subunit 3011 acquires a data set to be processed, and the extraction subunit 3012 performs feature extraction on the data set to be processed to construct a feature space.
(2) A generation unit 302;
a generating unit 302, configured to extract node features in the feature space to generate graph data of the to-be-processed data set, where the graph data includes at least one node.
Wherein, the generating unit 302 may include a second extracting subunit 3021, a first classifying subunit 3022, and a generating subunit 3023, as shown in fig. 8;
a second extraction subunit 3021 configured to extract node features in the feature space;
a first classification subunit 3022, configured to classify the node features;
a generating subunit 3023, configured to generate graph data of the data set to be processed according to the classification result.
For example, the second extraction subunit 3021 extracts node features in the feature space, the first classification subunit 3022 classifies the node features, and the generation subunit 3023 generates graph data of the dataset to be processed according to the classification result.
(3) A screening unit 303;
a screening unit 303, configured to screen a data cluster corresponding to a node from the graph data;
wherein, the screening unit 303 may include a searching subunit 3031, a clustering subunit 3032 and a screening subunit 3033, as shown in fig. 9,
a search subunit 3031, configured to search for a neighboring node corresponding to a node in the graph data;
a clustering subunit 3032, configured to cluster the nodes and the corresponding neighboring nodes in the graph data to obtain a cluster graph of the nodes;
and a screening subunit 3033, configured to screen a data cluster corresponding to the node in the cluster map.
For example, the searching subunit 3031 searches for neighboring nodes corresponding to the node in the graph data, the clustering subunit 3032 clusters the node and the corresponding neighboring nodes in the graph data to obtain a cluster map of the node, and the screening subunit 3033 screens a data cluster corresponding to the node in the cluster map.
(4) A calculation unit 304;
and the calculating unit 304 is configured to calculate the data purity of the data cluster, and obtain the intra-cluster purity of the data cluster.
The calculating unit 304 may include a third extracting unit 3041, a second classifying unit 3042, and a calculating subunit 3043, as shown in fig. 10, specifically as follows:
a third extracting unit 3041, configured to perform feature extraction on the data cluster by using the trained graph recognition model to obtain data information of the data cluster;
a second classification subunit 3042, configured to classify the data in the data cluster according to the data information;
and a calculating subunit 3043, configured to calculate the data purity of the data cluster according to the classification result, so as to obtain the intra-cluster purity of the data cluster.
For example, the third extracting unit 3041 extracts features of the data cluster by using the trained graph recognition model to obtain data information of the data cluster, the second classifying subunit 3042 classifies data in the data cluster according to the data information, and the calculating subunit 3043 calculates the data purity of the data cluster according to the classification result to obtain the intra-cluster purity of the data cluster.
(5) An acquisition unit 305;
the acquiring unit 305 is configured to acquire data corresponding to the node in the to-be-processed data set when the intra-cluster purity is lower than a preset intra-cluster purity threshold, and use the data as data to be mined.
For example, when the cluster purity is lower than a preset intra-cluster purity threshold, determining a target node corresponding to a data cluster, screening graph data corresponding to the target node from the graph data of the data set to be processed, acquiring data corresponding to the node from the data set to be processed according to the graph data corresponding to the node, and taking the data as data to be mined in the data set to be processed; and when the intra-cluster purity is not lower than the preset intra-cluster purity threshold, continuing the intra-cluster purity calculation of the data cluster corresponding to the next node.
Optionally, the trained graph recognition model may be set by operation and maintenance personnel in advance, or may be obtained by self-training of the data mining apparatus. That is, as shown in fig. 11, the data mining apparatus may further include a collection unit 306 and a training unit 307, as follows:
A collection unit 306, configured to collect a plurality of data set samples, the data set samples including data clusters with labeled intra-cluster purities.
For example, the collection unit 306 may download data of known types and quantities from the internet to form data clusters, compose the data clusters into data set samples, calculate and label the intra-cluster purity of each data cluster according to the calculation formula, and upload the data set samples and the intra-cluster purities corresponding to their data clusters to the data mining apparatus.
And the training unit 307 is configured to predict the intra-cluster purity of the data set samples by using a preset graph recognition model to obtain a predicted intra-cluster purity, and to converge the preset graph recognition model according to the predicted intra-cluster purity and the labeled intra-cluster purity to obtain the trained graph recognition model.
For example, the training unit 307 may specifically perform feature extraction on a data set sample to construct a feature space, extract node features in the feature space to generate graph data of the data set sample (the graph data including at least one node), screen out the data cluster corresponding to a node from the graph data, and calculate the data purity of the data cluster to obtain the predicted intra-cluster purity of the data cluster in the data set sample; the preset graph recognition model is then converged according to the predicted intra-cluster purity and the labeled intra-cluster purity to obtain the trained graph recognition model.
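A toy stand-in for this convergence step: the patent uses a graph recognition model (such as a GCN), but here a single linear weight replaces it, purely to illustrate fitting the predicted intra-cluster purity to the labeled purity by minimizing their squared error. All names and the choice of model are illustrative assumptions:

```python
def train_purity_model(samples, lr=0.01, epochs=1000):
    """Converge a toy purity predictor: a linear map from one cluster
    feature (e.g. its size) to a predicted intra-cluster purity, updated
    by gradient descent on the squared error between the predicted and
    labeled purity of each sample."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for feature, labeled_purity in samples:
            predicted = w * feature + b
            error = predicted - labeled_purity
            # Gradient of (error ** 2) with respect to w and b.
            w -= lr * 2 * error * feature
            b -= lr * 2 * error
    return w, b
```

In the patent's scheme the same loop shape applies, with the linear map replaced by the preset graph recognition model operating on the sample's graph data.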
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, in this embodiment, the extraction unit 301 performs feature extraction on a data set to be processed to construct a feature space; the generation unit 302 extracts node features in the feature space to generate graph data of the data set to be processed, the graph data including at least one node; the screening unit 303 screens out the data cluster corresponding to a node from the graph data; the calculation unit 304 calculates the data purity of the data cluster to obtain the intra-cluster purity of the data cluster; and the acquisition unit 305 acquires the data corresponding to the node in the data set to be processed when the intra-cluster purity is lower than a preset intra-cluster purity threshold, so as to obtain the mined data. In this scheme, not only is all the feature information in the data cluster examined, but bad cases are also evaluated through the intra-cluster purity so that they can be mined; this reduces excessive dependence on the feature representation, allows bad cases (Badcase) in the data to be mined more quickly, efficiently and accurately, and improves the hit rate of bad cases in the data.
An embodiment of the present invention further provides an electronic device, as shown in fig. 12, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 12 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further includes a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions such as managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include one or more DC or AC power sources, a recharging system, power failure detection circuitry, a power converter or inverter, a power status indicator, and any other such components.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
the method comprises the steps of extracting features of a data set to be processed to construct a feature space, extracting node features from the feature space to generate graph data of the data set to be processed, wherein the graph data at least comprises one node, screening out a data cluster corresponding to the node from the graph data, calculating the data purity of the data cluster to obtain the intra-cluster purity of the data cluster, and when the intra-cluster purity is lower than a preset purity threshold, obtaining the data corresponding to the node in the data set to be processed to obtain the mined data.
For example, the data may be acquired from the Internet (for example, by downloading or collection) and formed into a data set, or the data may be uploaded by a user to a server, from which the data mining apparatus acquires the data set formed by the uploaded data. Feature extraction is performed on the data set to be processed by using a deep residual network to extract feature information of the data in the data set; node features are extracted in the feature space and classified, for example by hierarchical clustering or by a K-nearest-neighbor algorithm, to obtain different types of node features; corresponding node information is extracted from each type of node feature according to the classification result; a relationship tree is constructed from the extracted node information of each type; and graph data of the data set to be processed is generated from the constructed relationship tree. A node is then randomly selected in the graph data as a target node, the adjacent nodes corresponding to the target node are searched for, hierarchical clustering is performed on the target node and its adjacent nodes in the graph data to obtain a cluster graph of the target node, and the data cluster corresponding to the target node is screened out of the cluster graph. Feature extraction is performed on the data cluster by using the trained graph recognition model (for example, a GCN model) to obtain data information of the data cluster, and the data in the data cluster is classified according to the data information, for example according to attribute information of the data or according to the structure of the data. According to the classification result, the quantity of each class of data and the total data quantity of the data cluster are obtained from the data information; the class with the most data is screened out as the target data; and the ratio of the quantity of the target data to the total data quantity of the data cluster is calculated to obtain the intra-cluster purity of the data cluster. When the intra-cluster purity is lower than a preset intra-cluster purity threshold, the data corresponding to the node in the data set to be processed is acquired to obtain the mined data; when the intra-cluster purity is not lower than the preset intra-cluster purity threshold, the intra-cluster purity calculation continues with the data cluster corresponding to the next node.
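The step of gathering a target node's cluster from the graph data can be sketched as a simple one-hop neighborhood gather. This is an illustrative approximation only: the text describes hierarchical clustering over the target node and its adjacent nodes, and the function name and adjacency-mapping representation are assumptions:

```python
def cluster_for_node(graph, target_node):
    """Collect the target node together with its adjacent nodes in the
    graph data (represented as an adjacency mapping from node to the set
    of its neighbors) to form the candidate data cluster for that node."""
    return {target_node} | set(graph.get(target_node, ()))
```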
Optionally, the trained graph recognition model may be set in advance by an operation and maintenance person, or may be obtained by self-training of the data mining device, that is, the instruction may further perform the following steps:
the method comprises the steps of collecting a plurality of data set samples, predicting the intra-cluster purity of the data set samples by adopting a preset graph recognition model to obtain the predicted intra-cluster purity, and converging the preset graph recognition model according to the predicted intra-cluster purity and the labeled cluster purity to obtain a trained graph recognition model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, in the embodiment of the present invention, feature extraction is performed on a data set to be processed to construct a feature space; node features are extracted in the feature space to generate graph data of the data set to be processed, the graph data including at least one node; the data cluster corresponding to a node is screened out from the graph data; the data purity of the data cluster is calculated to obtain the intra-cluster purity of the data cluster; and when the intra-cluster purity is lower than a preset intra-cluster purity threshold, the data corresponding to the node in the data set to be processed is acquired, so as to obtain the mined data. In this scheme, not only is all the feature information in the data cluster examined, but bad cases are also evaluated through the intra-cluster purity so that they can be mined; this reduces excessive dependence on the feature representation, allows bad cases (Badcase) in the data to be mined more quickly, efficiently and accurately, and improves the hit rate of bad cases in the data.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute steps in any data mining method provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
the method comprises the steps of extracting features of a data set to be processed to construct a feature space, extracting node features from the feature space to generate graph data of the data set to be processed, wherein the graph data at least comprises one node, screening out a data cluster corresponding to the node from the graph data, calculating the data purity of the data cluster to obtain the intra-cluster purity of the data cluster, and when the intra-cluster purity is lower than a preset purity threshold, obtaining the data corresponding to the node in the data set to be processed to obtain the mined data.
For example, the data may be acquired from the Internet (for example, by downloading or collection) and formed into a data set, or the data may be uploaded by a user to a server, from which the data mining apparatus acquires the data set formed by the uploaded data. Feature extraction is performed on the data set to be processed by using a deep residual network to extract feature information of the data in the data set; node features are extracted in the feature space and classified, for example by hierarchical clustering or by a K-nearest-neighbor algorithm, to obtain different types of node features; corresponding node information is extracted from each type of node feature according to the classification result; a relationship tree is constructed from the extracted node information of each type; and graph data of the data set to be processed is generated from the constructed relationship tree. A node is then randomly selected in the graph data as a target node, the adjacent nodes corresponding to the target node are searched for, hierarchical clustering is performed on the target node and its adjacent nodes in the graph data to obtain a cluster graph of the target node, and the data cluster corresponding to the target node is screened out of the cluster graph. Feature extraction is performed on the data cluster by using the trained graph recognition model (for example, a GCN model) to obtain data information of the data cluster, and the data in the data cluster is classified according to the data information, for example according to attribute information of the data or according to the structure of the data. According to the classification result, the quantity of each class of data and the total data quantity of the data cluster are obtained from the data information; the class with the most data is screened out as the target data; and the ratio of the quantity of the target data to the total data quantity of the data cluster is calculated to obtain the intra-cluster purity of the data cluster. When the intra-cluster purity is lower than a preset intra-cluster purity threshold, the data corresponding to the node in the data set to be processed is acquired to obtain the mined data; when the intra-cluster purity is not lower than the preset intra-cluster purity threshold, the intra-cluster purity calculation continues with the data cluster corresponding to the next node.
Optionally, the trained graph recognition model may be set in advance by an operation and maintenance person, or may be obtained by self-training of the data mining device, that is, the instruction may further perform the following steps:
the method comprises the steps of collecting a plurality of data set samples, predicting the intra-cluster purity of the data set samples by adopting a preset graph recognition model to obtain the predicted intra-cluster purity, and converging the preset graph recognition model according to the predicted intra-cluster purity and the labeled cluster purity to obtain a trained graph recognition model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any data mining method provided by the embodiment of the present invention, the beneficial effects that can be achieved by any data mining method provided by the embodiment of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The data mining method, the data mining device and the computer-readable storage medium provided by the embodiments of the present invention are described in detail above, and the principles and the embodiments of the present invention are explained in detail herein by applying specific examples, and the description of the embodiments above is only used to help understanding the method and the core ideas of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of data mining, comprising:
extracting features of a data set to be processed to construct a feature space;
extracting node features in the feature space to generate graph data of the data set to be processed, wherein the graph data at least comprises one node;
screening out data clusters corresponding to the nodes from the graph data;
calculating the data purity of the data cluster to obtain the intra-cluster purity of the data cluster;
and when the intra-cluster purity is lower than a preset intra-cluster purity threshold value, acquiring data corresponding to the nodes in the data set to be processed to obtain mined data.
2. The data mining method of claim 1, wherein calculating the data purity of the data cluster to obtain the intra-cluster purity of the data cluster comprises:
extracting the characteristics of the data cluster by adopting a trained graph recognition model to obtain the data information of the data cluster;
classifying the data in the data cluster according to the data information;
and calculating the data purity of the data cluster according to the classification result to obtain the intra-cluster purity of the data cluster.
3. The data mining method of claim 2, wherein calculating the data purity of the data cluster according to the classification result to obtain the intra-cluster purity of the data cluster comprises:
acquiring the quantity of each category of data and the total quantity of data of the data cluster in the data information according to the classification result;
screening the data with the largest quantity from the quantity of each category of data to serve as target data;
and calculating the ratio of the target data to the total data quantity of the data clusters to obtain the intra-cluster purity of the data clusters.
4. The data mining method of claim 2, wherein before the feature extraction of the data clusters using the trained graph recognition model, the method further comprises:
collecting a plurality of data set samples, wherein the data set samples comprise data clusters labeled with intra-cluster purity;
predicting the intra-cluster purity of the data set samples by adopting a preset graph recognition model to obtain a predicted intra-cluster purity;
and converging the preset graph recognition model according to the predicted intra-cluster purity and the labeled intra-cluster purity to obtain a trained graph recognition model.
5. The data mining method according to any one of claims 1 to 4, wherein when the intra-cluster purity is lower than a preset intra-cluster purity threshold, acquiring data corresponding to the node in the to-be-processed data set, and using the data as data to be mined, comprises:
when the intra-cluster purity is lower than a preset intra-cluster purity threshold value, determining a target node corresponding to the data cluster;
screening graph data corresponding to the target node from the graph data of the data set to be processed;
and acquiring data corresponding to the nodes in the data set to be processed according to the graph data corresponding to the nodes, and taking the data as the data to be mined in the data set to be processed.
6. The data mining method of any one of claims 1 to 4, wherein screening the graph data for data clusters corresponding to the nodes comprises:
searching the graph data for a neighboring node corresponding to the node;
clustering the nodes and the corresponding adjacent nodes in the graph data to obtain a cluster graph of the nodes;
and screening the data clusters corresponding to the nodes in the cluster map.
7. The data mining method of any one of claims 1 to 4, wherein extracting node features in the feature space to generate graph data of the dataset to be processed, the graph data including at least one node, comprises:
extracting node features from the feature space;
classifying the node features;
and generating graph data of the data set to be processed according to the classification result.
8. The data mining method of claim 7, wherein generating graph data of the dataset to be processed according to the classification result comprises:
extracting node information in the node characteristics of each type according to the classification result;
constructing a relation tree according to the node information;
and generating graph data of the data set to be processed based on the constructed relation tree.
9. A data mining device, comprising:
the extraction unit is used for extracting the features of the data set to be processed so as to construct a feature space;
a generating unit, configured to extract node features in the feature space to generate graph data of the to-be-processed data set, where the graph data includes at least one node;
the screening unit is used for screening out the data clusters corresponding to the nodes from the graph data;
the computing unit is used for computing the data purity of the data cluster to obtain the intra-cluster purity of the data cluster;
and the acquisition unit is used for acquiring the data corresponding to the nodes in the data set to be processed to obtain the mined data when the intra-cluster purity is lower than a preset intra-cluster purity threshold value.
10. A computer readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the data mining method of any one of claims 1 to 8.
CN201910801360.4A 2019-08-28 2019-08-28 Data mining method and device and computer readable storage medium Pending CN110598065A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910801360.4A CN110598065A (en) 2019-08-28 2019-08-28 Data mining method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910801360.4A CN110598065A (en) 2019-08-28 2019-08-28 Data mining method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN110598065A true CN110598065A (en) 2019-12-20

Family

ID=68855899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910801360.4A Pending CN110598065A (en) 2019-08-28 2019-08-28 Data mining method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110598065A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160468A (en) * 2019-12-30 2020-05-15 深圳市商汤科技有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN111211864A (en) * 2019-12-25 2020-05-29 安徽机电职业技术学院 Data transmission error processing method and system
CN111339212A (en) * 2020-02-13 2020-06-26 深圳前海微众银行股份有限公司 Sample clustering method, device, equipment and readable storage medium
CN111340084A (en) * 2020-02-20 2020-06-26 北京市商汤科技开发有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN113408564A (en) * 2020-10-21 2021-09-17 腾讯科技(深圳)有限公司 Graph processing method, network training method, device, equipment and storage medium
CN113408945A (en) * 2021-07-15 2021-09-17 广西中烟工业有限责任公司 Method and device for detecting purity of flue-cured tobacco, electronic equipment and storage medium
CN115587140A (en) * 2022-12-09 2023-01-10 四川新迎顺信息技术股份有限公司 Electronic engineering project data visual management method and device based on big data
CN117273765A (en) * 2023-11-21 2023-12-22 广州欧派创意家居设计有限公司 Multistage dealer circulation data processing method and system based on automatic check


Similar Documents

Publication Publication Date Title
CN110598065A (en) Data mining method and device and computer readable storage medium
WO2022068196A1 (en) Cross-modal data processing method and device, storage medium, and electronic device
KR101130524B1 (en) Automatic data perspective generation for a target variable
CN110782015A (en) Training method and device for network structure optimizer of neural network and storage medium
CN112765477B (en) Information processing method and device, information recommendation method and device, electronic equipment and storage medium
CN111667022A (en) User data processing method and device, computer equipment and storage medium
CN108875090B (en) Song recommendation method, device and storage medium
CN108280236B (en) Method for analyzing random forest visual data based on LargeVis
CN111708823B (en) Abnormal social account identification method and device, computer equipment and storage medium
Mall et al. Representative subsets for big data learning using k-NN graphs
CN113761359B (en) Data packet recommendation method, device, electronic equipment and storage medium
CN111382190A (en) Object recommendation method and device based on intelligence and storage medium
CN112000763A (en) Method, device, equipment and medium for determining competition relationship of interest points
CN112817563B (en) Target attribute configuration information determining method, computer device, and storage medium
CN113094448B (en) Analysis method and analysis device for residence empty state and electronic equipment
Hu et al. [Retracted] Evaluation Method of Wushu Teaching Quality Based on Fuzzy Clustering
CN114238764A (en) Course recommendation method, device and equipment based on recurrent neural network
CN112418256A (en) Classification, model training and information searching method, system and equipment
CN111310072B (en) Keyword extraction method, keyword extraction device and computer-readable storage medium
CN110909193B (en) Image ordering display method, system, device and storage medium
CN111709473A (en) Object feature clustering method and device
CN111382793A (en) Feature extraction method and device and storage medium
CN111768214A (en) Product attribute prediction method, system, device and storage medium
CN109977030A (en) A kind of test method and equipment of depth random forest program
CN114064897A (en) Emotion text data labeling method, device and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40019560
Country of ref document: HK
SE01 Entry into force of request for substantive examination