CN113722554A - Data classification method and device and computing equipment - Google Patents

Data classification method and device and computing equipment

Info

Publication number
CN113722554A
CN113722554A
Authority
CN
China
Prior art keywords
data
similarity
components
communication
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110375946.6A
Other languages
Chinese (zh)
Inventor
李高
刘肖
吴鸣
李志颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110375946.6A priority Critical patent/CN113722554A/en
Publication of CN113722554A publication Critical patent/CN113722554A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F 16/90 Details of database functions independent of the retrieved data types
                        • G06F 16/903 Querying
                        • G06F 16/906 Clustering; Classification
                • G06F 18/00 Pattern recognition
                    • G06F 18/20 Analysing
                        • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
                        • G06F 18/22 Matching criteria, e.g. proximity measures
                        • G06F 18/23 Clustering techniques
                            • G06F 18/232 Non-hierarchical techniques
                                • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
                                    • G06F 18/23213 Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering
                        • G06F 18/24 Classification techniques
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 20/00 Machine learning

Abstract

The embodiment of the application provides a data classification method, a data classification device and a computing device. The method comprises the following steps: acquiring N data to be classified and determining the similarity between every two of the N data; according to these similarities, determining each two data whose similarity is greater than a first preset value as a relationship pair, so as to obtain M relationship pairs; pre-clustering the N data according to the association relationships and the arrangement order of the M relationship pairs to obtain P first connected components; then obtaining Q second connected components according to the association relationships between the P first connected components; and determining the data in each of the Q second connected components as data of the same class. When the connected components are computed, this avoids the high computational complexity and long computation time caused by an excessive number of nodes and edges.

Description

Data classification method and device and computing equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a data classification method, a data classification device and computing equipment.
Background
With the development of big data, large volumes of data need to be classified. For example, when a merchant conducts transactions through a payment platform, the merchant must first register on the platform, which yields a large amount of merchant registration information. To facilitate management of merchants, this registration information needs to be classified.
One way to classify data is a clustering algorithm based on similarity calculation: a similarity matrix of the data is computed, the similarity matrix is used as the input for graph construction to obtain an undirected graph, connected components are searched for in the undirected graph, and the data within the same connected component are determined to be of the same class, thereby classifying the data.
However, computing the connected components currently takes a long time, resulting in low data classification efficiency.
Disclosure of Invention
The embodiment of the application provides a data classification method, a data classification device and a computing device, which are used to reduce the computation time of connected components and thereby improve data classification efficiency.
In a first aspect, an embodiment of the present application provides a data classification method, including:
acquiring N data to be classified, and determining the similarity between every two data in the N data;
determining two data with the similarity larger than a first preset value as a relationship pair according to the similarity between every two data in the N data to obtain M relationship pairs;
pre-clustering the N data according to the association relationships and the arrangement order of the M relationship pairs to obtain P first connected components;
obtaining Q second connected components according to the association relationships between the first connected components in the P first connected components;
determining the data in each of the Q second connected components as data of the same class;
wherein N, M, P, Q are all positive integers, and Q is less than or equal to P.
In a second aspect, an embodiment of the present application provides a data classification apparatus, including:
an acquisition unit configured to acquire N data to be classified;
a similarity determining unit, configured to determine a similarity between every two pieces of data in the N pieces of data;
a relationship pair determining unit, configured to determine, according to the similarity between every two of the N data, each two data whose similarity is greater than a first preset value as a relationship pair, to obtain M relationship pairs;
a pre-clustering unit, configured to pre-cluster the N data according to the association relationships and the arrangement order of the M relationship pairs to obtain P first connected components;
a connected component determining unit, configured to obtain Q second connected components according to the association relationships between the first connected components in the P first connected components;
the classification unit is used for determining the data in each second connected component in the Q second connected components as the same type of data;
Wherein N, M, P, Q are all positive integers, and Q is less than or equal to P.
In some embodiments, the pre-clustering unit is specifically configured to, for the ith relationship pair in the arrangement order of the M relationship pairs, where i is a positive integer less than or equal to M: if the ith relationship pair contains first data that already belongs to a first connected component of the current pre-clustering, connect the ith relationship pair to that first data in the first connected component; if the ith relationship pair contains no data belonging to a first connected component of the current pre-clustering, construct a new first connected component with the data in the ith relationship pair as its starting nodes; and if the two data in the ith relationship pair are located in two different first connected components, determine the ith relationship pair as a relationship pair between those two first connected components.
In some embodiments, the connected component determining unit is further configured to determine, according to a similarity between every two data of the N data, two data whose similarity is smaller than or equal to the first preset value as independent nodes, to obtain K independent nodes, where K is a positive integer; determining each of the independent nodes as a first connected component.
In some embodiments, the connected component determining unit is specifically configured to, for each first connected component in the P first connected components: if a relationship pair exists between the first connected component and another first connected component, combine the first connected component and the other first connected component into one second connected component; and if no relationship pair exists between the first connected component and any other first connected component, determine the first connected component as a second connected component, where the other first connected components are the first connected components other than the first connected component among the P first connected components.
In some embodiments, the connected component determining unit is specifically configured to take each of the P first connected components as a node, take a connection line of a relationship pair between two different first connected components as an edge, and perform connected component calculation to obtain the Q second connected components.
In some embodiments, the connected component determining unit is specifically configured to use each first connected component in the P first connected components as a node, use the connection line of a relationship pair between two different first connected components as an edge, and obtain the Q second connected components using breadth-first search or depth-first search.
In some embodiments, the similarity determining unit is specifically configured to divide the N data into at least one first data group according to attribute information of each data of the N data; and sending the data of each first data group to different computing equipment for similarity calculation, and obtaining the similarity between every two data in each first data group from the different computing equipment.
In some embodiments, the similarity determining unit is specifically configured to, for each first data group in the at least one first data group, if the data amount in the first data group is greater than a second preset value, divide the data in the first data group into F data blocks according to a preset per-block data amount threshold, where F is a positive integer; and, for every two data blocks in the F data blocks, send the two data blocks to a first computing device so that the first computing device calculates the similarity between every two data across the two data blocks.
In some embodiments, the similarity determining unit is specifically configured to add 1 to the integer-division result of the data amount in the first data group and the data amount threshold, to obtain the number of data blocks corresponding to the first data group; and to divide the data in the first data group evenly into F data blocks according to the data amount in the first data group and the number of data blocks corresponding to the first data group.
In some embodiments, the similarity determining unit is specifically configured to obtain R pairwise data block combinations from the different pairwise combinations of the F data blocks, where the R pairwise combinations include both combinations of a data block with itself and combinations of two different data blocks, and R is a positive integer; and, for each of the R pairwise data block combinations, to send the two data blocks in the combination to the first computing device.
In some embodiments, the similarity determining unit is specifically configured to obtain an F×F data block matrix from the different pairwise combinations of the F data blocks, and to determine the pairwise combinations located in the upper triangle of the F×F data block matrix as the R pairwise data block combinations.
In some embodiments, the similarity determining unit is further configured to send the data in the first data group to a second computing device for similarity calculation if the data amount in the first data group is less than or equal to the second preset value.
In some embodiments, if the attribute information of the data includes a naming pattern, the obtaining unit is specifically configured to perform word segmentation and part-of-speech tagging on each of the N data to obtain the naming pattern of each data, and to divide the data with the same naming pattern among the N data into one first data group.
In a third aspect, embodiments of the present application provide a computing device, comprising a processor and a memory;
the memory for storing a computer program;
the processor is configured to execute the computer program to implement the method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium including computer instructions that, when executed by a computer, cause the computer to implement the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which includes a computer program, the computer program being stored in a readable storage medium, from which the computer program can be read by at least one processor of a computer, and the execution of the computer program by the at least one processor causes the computer to implement the method of the first aspect.
According to the data classification method, the data classification device and the computing device, the similarity between every two of the N data is determined, and each two data whose similarity is greater than the first preset value are determined as a relationship pair, yielding M relationship pairs. The N data are pre-clustered according to the association relationships and the arrangement order of the M relationship pairs to obtain P first connected components; Q second connected components are then obtained according to the association relationships between the P first connected components, and the data in each of the Q second connected components are determined as data of the same class, thereby classifying the data. Because the data are first pre-clustered and the connected components are then computed on top of the pre-clustering, the high computational complexity and long computation time caused by an excessive number of nodes and edges are avoided.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is an alternative structural diagram of a distributed system applied to a blockchain system according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the principle of a clustering algorithm based on similarity calculation;
fig. 3 is a schematic flowchart of a data classification method according to an embodiment of the present application;
FIG. 4 is a schematic illustration of a connected component;
FIG. 5A is a schematic diagram of a first connected component;
FIG. 5B is a schematic diagram of an undirected graph;
FIG. 6 is a flowchart illustrating a data classification method according to another embodiment of the present application;
FIG. 7 is a schematic diagram of the partitioning of data groups;
FIG. 8 is a schematic diagram illustrating a similarity calculation principle according to an embodiment of the present disclosure;
FIG. 9 is a flowchart illustrating a data classification method according to another embodiment of the present application;
fig. 10 is a schematic view of an application scenario according to an embodiment of the present application;
FIG. 11 is a schematic diagram comparing the effects before and after optimization of Spark-based similarity calculation;
FIG. 12 is a schematic diagram comparing the effects before and after optimization of connected component calculation;
Fig. 13 is a schematic structural diagram of a data classification apparatus according to an embodiment of the present application;
Fig. 14 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be understood that, in the present embodiment, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may be determined from A and/or other information.
In the description of the present application, "plurality" means two or more than two unless otherwise specified.
In addition, to describe the technical solutions of the embodiments clearly, terms such as "first" and "second" are used in the embodiments of the present application to distinguish between identical or similar items with substantially the same functions and effects. Those skilled in the art will appreciate that the terms "first," "second," etc. do not limit quantity or execution order, nor do they denote relative importance.
In order to facilitate understanding of the embodiments of the present application, the related concepts related to the embodiments of the present application are first briefly described as follows:
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. AI software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of AI. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
Machine learning can be divided into unsupervised learning and supervised learning, both of which are commonly used in industry. In industrial applications, both often face the problems of large data volumes and insufficient computing power; distributed systems are currently used to address these problems.
The system related to the embodiment of the application can be a distributed system formed by connecting a client, a plurality of nodes (any form of computing equipment in an access network, such as a server and a user terminal) through a network communication mode.
In some embodiments, taking a blockchain system as an example of a distributed system, referring to fig. 1, fig. 1 is an optional structural diagram of the distributed system 100 applied to a blockchain system provided in this embodiment of the present application. The system is formed by a plurality of nodes (computing devices in any form within the access network, such as servers and user terminals) and clients. A Peer-to-Peer (P2P) network is formed between the nodes, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join and become a node; a node comprises a hardware layer, a middle layer, an operating system layer and an application layer.
In some embodiments, referring to the functionality of each node in the blockchain system shown in fig. 1, the functions involved include:
1) routing, a basic function that a node has, is used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) Application: deployed in a blockchain to implement specific services according to actual business requirements. Data related to these functions is recorded to form record data; the record data carries a digital signature indicating the source of the task data and is sent to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block once its source and integrity are verified successfully.
3) Blockchain: a series of blocks chained to one another in the chronological order of their generation. Once added to the blockchain, a new block cannot be removed; the blocks record the record data submitted by nodes in the blockchain system.
Fig. 2 is a schematic diagram of the principle of a clustering algorithm based on similarity calculation. As shown in fig. 2, the process mainly includes: first, a similarity matrix of the data is computed; the similarity matrix is then used as the input for graph construction, and the connected components of the graph are computed; finally, the algorithm result is analyzed and applied.
However, the existing clustering algorithm based on similarity calculation has high algorithmic complexity and takes a long time when computing the connected components.
To solve this technical problem, the present application determines the similarity between every two of the N data and, according to these similarities, determines each two data whose similarity is greater than a first preset value as a relationship pair, obtaining M relationship pairs. The N data are pre-clustered according to the association relationships and the arrangement order of the M relationship pairs to obtain P first connected components; Q second connected components are then obtained according to the association relationships between the P first connected components, and the data in each of the Q second connected components are determined as data of the same class, thereby classifying the data. Because the data are first pre-clustered and the connected components are then computed on top of the pre-clustering, the high computational complexity and long computation time caused by an excessive number of nodes and edges are avoided.
Example 1
The technical solutions of the embodiments of the present application are described in detail below with reference to some embodiments. The following several embodiments may be combined with each other and may not be described in detail in some embodiments for the same or similar concepts or processes.
Fig. 3 is a flowchart illustrating a data classification method according to an embodiment of the present application. As shown in fig. 3, the method of the embodiment of the present application includes:
s301, N data to be classified are obtained, and the similarity between every two data in the N data is determined, wherein N is a positive integer.
The execution subject of the embodiments of the present application is a device having a data classification function, for example, a data classification apparatus. In some embodiments, the data classification apparatus is a computing device, such as a node in fig. 1. In other embodiments, it is a unit with data processing functions inside a computing device, for example a processor in the computing device. The embodiments below take a computing device as the execution subject by way of example.
The embodiment of the application can be used in an unsupervised model or a deep learning model, for example, in an off-line training process of the model or an on-line computing process.
The present application does not limit the data, which may come from any scenario; for example, in merchant registration, the data may be the registration information of merchants.
Similarity calculation is the basis of many machine learning algorithms (e.g., KNN (K-Nearest Neighbor) classification, K-Means clustering, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise)). For a set D of entities (e.g., users or merchants) containing N data, also referred to as N elements, the similarity between every two elements in the set D is calculated.
In some embodiments, the similarity between two elements is the distance between the two elements.
In some embodiments, if the amount of data is large, distributed similarity calculation may be performed using a Cartesian product.
The specific way of calculating the similarity between two elements is not limited in the present application.
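Since the concrete metric is left open, the following Python sketch uses a generic string ratio from the standard library purely as a stand-in similarity function; the toy set D and the choice of metric are illustrative, not from the patent:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the patent leaves the concrete metric open."""
    return SequenceMatcher(None, a, b).ratio()

# Pairwise similarity over a small entity set D (toy data, for illustration):
D = ["Acme Store No.1", "Acme Store No.2", "Beta Cafe"]
for i in range(len(D)):
    for j in range(i + 1, len(D)):
        print(D[i], "<->", D[j], round(similarity(D[i], D[j]), 2))
```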
S302, according to the similarity between every two data in the N data, determining the two data with the similarity larger than the first preset value as a relation pair to obtain M relation pairs.
S301 determines the similarity between every two of the N data; this step then determines which two data can form a relationship pair according to their similarity. Specifically, any two of the N data whose similarity is greater than the first preset value are determined as a relationship pair, yielding M relationship pairs, where M is a positive integer.
For example, the N data include data a, data B, and data C, where the similarity between the data a and the data B is 0.8, the similarity between the data a and the data C is 0.7, and the similarity between the data B and the data C is 0.4. Assume that the first preset value is 0.6, so that data a and data B form a relationship pair (a, B), data a and data C form a relationship pair (a, C), and data B and data C cannot form a relationship pair, which is a single data node.
It should be noted that the first preset value of 0.6 is an example, and the value of the first preset value includes, but is not limited to, 0.6, and the specific value of the first preset value is not limited in this application and is specifically determined according to actual needs.
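A minimal sketch of S302 under the example above; the function name is illustrative, and the 0.6 threshold is taken from the example:

```python
def build_relation_pairs(data, sim, threshold=0.6):
    """Relationship pairs (similarity > first preset value) and
    independent nodes (data appearing in no pair)."""
    pairs = []
    in_a_pair = set()
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if sim(data[i], data[j]) > threshold:
                pairs.append((data[i], data[j]))
                in_a_pair.update((data[i], data[j]))
    independent = [d for d in data if d not in in_a_pair]
    return pairs, independent

# With the similarities of the example (A-B: 0.8, A-C: 0.7, B-C: 0.4) and
# threshold 0.6, this yields pairs [("A","B"), ("A","C")] and no independent
# nodes, since every datum appears in at least one relationship pair.
```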
S303, pre-clustering the N data according to the association relationships and the arrangement order of the M relationship pairs to obtain P first connected components.
The concepts related to connected components are first introduced.
In graph theory, an undirected graph G is defined by a non-empty finite set of vertices V(G) and a finite set of edges E(G), expressed as formula (1):
G = (V(G), E(G))    (1)
where each element of E(G) is an unordered pair of vertices in V(G), called an edge of G.
Connectivity of vertices: in the undirected graph G, if there is a path from vertex vi to vertex vj, the vertices vi and vj are said to be connected.
Connected graph: in the undirected graph G, if any two different vertices in V(G) are connected (i.e., joined by a path), G is called a connected graph.
Connected component: in graph theory, a maximal connected subgraph of the undirected graph G is called a connected component; within it, any two vertices are connected to each other by a path. A connected graph has exactly one connected component, namely the graph itself, while a disconnected undirected graph consists of several connected components. As shown in FIG. 4, graph G consists of 3 maximal connected subgraphs. Any two vertices within each of the 3 connected components are connected to each other, and no path joins different connected components.
The methods for calculating connected components of a graph according to the embodiments of the present application include, but are not limited to, breadth-first search (BFS) or depth-first search (DFS). Taking the depth-first search algorithm as an example, the logic of the algorithm for searching connected components of the undirected graph is as follows:
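A minimal Python sketch of this logic (illustrative; the patent presents the listing as a figure):

```python
def connected_components(vertices, edges):
    """All connected components of undirected graph G = (V, E), via iterative
    DFS. Complexity O(V + E), matching the analysis below."""
    adj = {v: [] for v in vertices}
    for u, w in edges:
        adj[u].append(w)
        adj[w].append(u)
    visited, components = set(), []
    for v in vertices:
        if v in visited:
            continue
        stack, comp = [v], []
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            comp.append(node)
            stack.extend(n for n in adj[node] if n not in visited)
        components.append(comp)  # one maximal connected subgraph
    return components
```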
The DFS starts from some vertex v of graph G and visits any vertex w1 adjacent to v; it then proceeds from w1, visiting adjacent vertices that have not yet been visited, until no unvisited adjacent vertex remains. It then backtracks one step to the previously visited vertex and checks whether it has any other adjacent vertices that have not been visited. This process repeats until every vertex in the undirected graph has been visited. All connected components of the undirected graph can be obtained through DFS.
As can be seen from the above, conventional connected component computation first constructs an undirected graph and then searches it for connected components. For example, the complexity of obtaining connected components by DFS is O(V + E), where V is the number of nodes in the graph and E is the number of edges. When the number of edges is too large, the overall complexity rises and the computation time increases greatly. To solve this technical problem, the present application optimizes the computation of connected components: it performs pre-clustering while constructing the undirected graph to obtain one or more first connected components, and obtains the final second connected components on the basis of the first connected components.
When the undirected graph is constructed, the M relationship pairs are input one by one in their arrangement order; the arrangement order of the M relationship pairs can also be understood as their input order. As the relationship pairs are input, the N data are pre-clustered according to the association relationships and the arrangement order of the M relationship pairs, obtaining P first connected components.
Pre-clustering in the present application means connecting the input relationship pairs that have association relationships, in their arrangement order, to form a first connected component; one first connected component can be understood as one cluster.
For example, as shown in fig. 5A, M is 16 and the arrangement order of the 16 relationship pairs is: (a1, a2), (a2, a3), (a3, a4), (a4, a5), (a4, a6), (a6, a7), (a6, a8), (a8, a9), (a8, a10), (a12, a13), (a13, a14), (a13, a15), (a14, a15), (a11, a10), (a16, a17), (a16, a18). The 16 relationship pairs are input into the undirected graph construction one by one in this order. The relationship pairs (a1, a2), (a2, a3), (a3, a4), (a4, a5), (a4, a6), (a6, a7), (a6, a8), (a8, a9), (a8, a10) have association relationships and can be connected together to form first connected component 1 shown in fig. 5A. When the relationship pair (a12, a13) is input, it has no association relationship with any relationship pair in first connected component 1; construction of first connected component 1 therefore ends, and (a12, a13) is used as the starting node to begin constructing a new first connected component 2. Specifically, (a12, a13) and the subsequently input relationship pairs (a13, a14), (a13, a15), (a14, a15) are associated and can be connected to form first connected component 2. When the relationship pair (a11, a10) is input, one of its elements, a11, is located in first connected component 2 and the other, a10, is located in first connected component 1, so a relationship pair (a11, a10) exists between first connected component 1 and first connected component 2. When the relationship pair (a16, a17) is input, it has no association relationship with any relationship pair in first connected component 2; construction of first connected component 2 therefore ends, and (a16, a17) is used as the starting node to begin constructing a new first connected component 3. Specifically, (a16, a17) and the subsequently input relationship pair (a16, a18) are associated and can be connected to form first connected component 3.
Further, the process of constructing the first connected components is the same for each of the M relationship pairs; S303 is described below by taking any one of the M relationship pairs as an example.
In some embodiments, for the ith relationship pair in the arrangement order of the M relationship pairs, where i is a positive integer less than or equal to M: if the ith relationship pair contains first data that already belongs to a first connected component of the current pre-clustering, the ith relationship pair is connected to that first data in the first connected component; if the ith relationship pair contains no data belonging to a first connected component of the current pre-clustering, a new first connected component is constructed with the data in the ith relationship pair as its starting nodes; and if the two data in the ith relationship pair are located in two different first connected components, the ith relationship pair is determined as a relationship pair between those two first connected components.
Continuing with the example shown in fig. 5A, assume the ith relationship pair is (a8, a10) and the first connected component of the current pre-clustering is first connected component 1. It is determined whether the ith relationship pair (a8, a10) contains first data already in first connected component 1; since the first data a8 is in first connected component 1, the relationship pair (a8, a10) is connected to the first data a8 in first connected component 1, that is, the data a10 is connected to the first data a8.
Assume the ith relationship pair is (a12, a13) and the first connected component of the current pre-clustering is first connected component 1. It is determined whether the ith relationship pair (a12, a13) contains first data already in first connected component 1; since it does not, the relationship pair (a12, a13) is used as the starting node of a new first connected component 2.
Assume the ith relationship pair is (a10, a11), where data a10 is located in first connected component 1 and data a11 is located in first connected component 2; then the relationship pair (a10, a11) is determined as the relationship pair between first connected component 1 and first connected component 2.
The method described above is performed for each of the M relationship pairs, yielding one or more first connected components.
A first connected component formed in this way contains at least two data.
In some embodiments, as shown in fig. 5A, a first connected component may also contain exactly one datum; for example, first connected component 4 comprises only the data a19.
Specifically, according to the similarity between every two of the N data, data whose similarity with every other data is less than or equal to the first preset value are determined as independent nodes, yielding K independent nodes, where K is a positive integer; each independent node is determined as a first connected component.
That is, M relationship pairs and K independent nodes are obtained according to the similarity between every two of the N data. By the method above, one or more first connected components containing at least two data each are obtained from the M relationship pairs. In addition, each of the K independent nodes is determined as one first connected component. Finally, P first connected components of the N data are obtained, each containing at least one datum.
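The three cases of S303 can be written as a short sketch. The following Python is illustrative; names such as pre_cluster and the bridge list are ours, not the patent's:

```python
def pre_cluster(relation_pairs, independent_nodes=()):
    """Incremental pre-clustering of relationship pairs, processed in
    arrangement order. Returns the first connected components and the
    relationship pairs bridging two different components."""
    comp_of = {}      # datum -> id of its first connected component
    components = {}   # component id -> set of data
    bridges = []      # relationship pairs between two different components
    next_id = 0
    for a, b in relation_pairs:
        ca, cb = comp_of.get(a), comp_of.get(b)
        if ca is None and cb is None:
            # neither datum seen yet: start a new first connected component
            components[next_id] = {a, b}
            comp_of[a] = comp_of[b] = next_id
            next_id += 1
        elif ca is not None and cb is not None and ca != cb:
            # the two data lie in two different components: record a bridge
            bridges.append((ca, cb))
        else:
            # one datum known (or both in the same component): attach the pair
            cid = ca if ca is not None else cb
            components[cid].update((a, b))
            comp_of[a] = comp_of[b] = cid
    for node in independent_nodes:
        components[next_id] = {node}  # an independent node is its own component
        next_id += 1
    return components, bridges
```

For instance, a pair whose two data already lie in different components, such as (a11, a10) in fig. 5A, is recorded as a bridge between those components rather than merged immediately.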
S304, obtaining Q second connected components according to the association relationships between the first connected components in the P first connected components.
In the present application, the N data are pre-clustered to obtain P first connected components, and Q second connected components are obtained according to the association relationships between the P first connected components. Compared with the approach shown in fig. 5B, which searches for connected components only after the undirected graph has been fully constructed, the present application computes connected components during the construction of the undirected graph, which reduces the computational complexity, saves computation time, and improves the efficiency of connected component computation.
In the above S304, the ways of obtaining the Q second connected components according to the association relationships between the P first connected components include, but are not limited to, the following:
In a first approach, S304 includes: for each first connected component in the P first connected components, if a relationship pair exists between the first connected component and another first connected component, combining the first connected component and the other first connected component into one second connected component; and if no relationship pair exists between the first connected component and any other first connected component, determining the first connected component as a second connected component, where the other first connected components are the first connected components other than the first connected component among the P first connected components.
For example, as shown in fig. 5A, a relationship pair (a11, a10) exists between first connected component 1 and first connected component 2; therefore, first connected component 1 and first connected component 2 can be combined into one second connected component 1. First connected component 3 has no relationship pair with first connected component 1 or first connected component 2, so first connected component 3 is determined as second connected component 2.
In a second approach, each first connected component in the P first connected components is used as a node, the connection line of a relationship pair between two different first connected components is used as an edge, and connected component computation is performed to obtain the Q second connected components.
For example, each first connected component in the P first connected components is used as a node, the connection line of a relationship pair between two different first connected components is used as an edge, and a connected component computation model is applied to obtain the Q second connected components output by the model, where Q is a positive integer less than or equal to P.
In one example, each first connected component in the P first connected components is used as a node, the connection line of a relationship pair between two different first connected components is used as an edge, and breadth-first search or depth-first search is used to obtain the Q second connected components.
For example, as shown in fig. 5A, first connected components 1, 2, 3 and 4 are each used as a node, the relationship-pair connection line between first connected component 1 and first connected component 2 is used as an edge, and 3 second connected components are finally obtained by computation. A computation over the original 19 nodes and 17 edges is thus converted into a computation over 4 nodes and 1 edge, which greatly reduces the amount of computation and thereby improves the efficiency of connected component computation.
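A sketch of S304 that reuses the connected_components routine sketched earlier: each first connected component becomes one node and each bridging relationship pair one edge (names illustrative):

```python
def second_components(components, bridges):
    """Q second connected components from the P first components and their
    bridging pairs, via DFS over the much smaller component-level graph."""
    groups = connected_components(list(components), bridges)
    merged = []
    for group in groups:
        cls = set()
        for cid in group:
            cls |= components[cid]  # union the data of the merged components
        merged.append(cls)          # one class of same-type data
    return merged                   # len(merged) == Q
```

In the fig. 5A example, this component-level graph has 4 nodes and 1 edge instead of the original 19 nodes and 17 edges.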
S305, determining the data in each second connected component in the Q second connected components as the same type of data.
As can be seen from the above, all data within one second connected component are linked through associated relationship pairs, so the data located in one second connected component are of the same class. By computing the Q second connected components of the N data, the N data are therefore divided into Q classes.
In this embodiment, during the calculation of the connected components, N data are pre-clustered according to the association relationship and the arrangement order between the relationship pairs to obtain P first connected components, then Q second connected components are obtained according to the association relationship between the first connected components in the P first connected components, and the data in each second connected component in the Q second connected components are determined as the same type of data, so as to implement the classification of the N data. Therefore, when the connected component is calculated, the connected component is firstly pre-clustered, and on the basis of pre-clustering, the connected component is calculated, so that the problem that the calculation is time-consuming due to the fact that the number of nodes and the number of edges are too many is solved, and the efficiency of data classification is improved.
Example 2
The determination of the similarity between each two data of the N data in S301 described above will be described in detail below.
Fig. 6 is a schematic flowchart of a data classification method according to another embodiment of the present application, and as shown in fig. 6, the step S301 includes:
S401, dividing the N data into at least one first data group according to the attribute information of each data in the N data.
S402, sending the data of each first data group to different computing equipment for similarity calculation, and obtaining the similarity between every two data in each first data group from different computing equipment.
For a service set D containing N data, calculating the similarity (e.g., the distance) between every two data in D has computational complexity O(N²), where |D| = N. As the set size N grows, the amount of computation therefore grows quadratically.
The number of data collected in a real service often reaches millions or even tens of millions. Therefore, in a practical algorithm, a distributed framework such as Spark is used and the data are first grouped according to their attribute information; for example, the N data are divided into at least one first data group according to the attribute information of each of the N data.
In some embodiments, the data include the registration information of merchants, and the registration information includes the merchant name. In this case, the attribute information of the data includes a naming pattern, which can be understood as the naming style of the merchant name. In S401, grouping the N data according to their attribute information may be done by performing word segmentation and part-of-speech tagging on each of the N data to obtain the naming pattern of each data, and dividing the data with the same naming pattern among the N data into one first data group.
Data in different first data groups are not similar to each other.
The data of each first data group is distributed to different computing devices for similarity calculation, for example, the data of different first data groups are distributed to different nodes shown in fig. 1 for similarity calculation.
In one example, as shown in fig. 7, N data are divided into 3 first data groups, which are respectively denoted as a first data group 1, a first data group 2, and a first data group 3. Sending the first data group 1 to the computing device 1, so that the computing device 1 calculates the similarity between every two data in the first data group 1; sending the first data group 2 to the computing device 2, so that the computing device 2 calculates the similarity between every two data in the first data group 2; the first data set 3 is sent to the computing device 3 such that the computing device 3 calculates the similarity between every two data within the first data set 3.
In the present application, grouped distributed computation can greatly improve the efficiency of similarity calculation, but data skew may occur.
Data skew is a problem often encountered in distributed data processing. Under normal conditions, the amount of data distributed to each machine node (e.g., a node in fig. 1) for processing should be roughly the same, so that the processing times of the nodes do not differ much and the distributed processing achieves the best performance. In practice, however, the amount of data distributed to each node is often uneven for various reasons; when the amount of data distributed to some nodes is significantly larger than that of other nodes, the situation is called data skew.
When data skew occurs in distributed processing, the completion time of the whole task is determined by the node that processes the most data. A node with a large amount of data has a large amount of computation and a long computation time, which prolongs the execution time of the whole task, while other nodes have already finished processing their distributed data, leaving their resources idle and unused.
In the similarity calculation process, grouped distributed parallel computation can effectively improve overall efficiency, but in practice some groups contain far more elements than others, which causes data skew during processing. For this data skew situation, the present application proposes a distributed optimization method based on a distribution threshold, which can effectively avoid data skew and guarantee the execution efficiency of the whole task.
In some embodiments, S402 includes the following S402-1 to S402-3:
s402-1, aiming at each first data group in at least one first data group, if the data volume in the first data group is larger than a second preset value, dividing the data in the first data group into F data blocks according to a data volume threshold value in a preset data block, wherein F is a positive integer;
s402-2, aiming at every two data blocks in the F data blocks, sending the two data blocks to the first computing device, so that the first computing device can compute the similarity between every two data in the two data blocks.
S402-3, if the data volume in the first data group is smaller than or equal to a second preset value, sending the data in the first data group to second computing equipment for similarity calculation.
In the present application, the process of calculating the similarity of the data in each first data group is the same, and one first data group is taken as an example.
Whether a first data group is further divided is judged according to its data amount. Specifically, if the data amount in the first data group is greater than the preset per-block data amount threshold S, the data in the first data group are divided into F data blocks, and every two data blocks of the F data blocks are sent to different first computing devices, so that each first computing device calculates the similarity between every two data across its two data blocks. If the data amount in the first data group is less than or equal to the threshold S, the data in the first data group are sent to a second computing device for similarity calculation.
In a possible implementation manner, the dividing, in the step S402-1, the data in the first data group into F data blocks according to the preset data amount threshold in the data block includes the following steps S402-11 and S402-12:
S402-11, adding 1 to the integer division result of the data volume in the first data group and the data volume threshold value to obtain the number of data blocks corresponding to the first data group;
s402-12, averagely dividing the data in the first data group into F data blocks according to the data amount in the first data group and the data block amount corresponding to the first data group.
For example, the number F of data blocks corresponding to the first data group is calculated according to the following formula (2):
F=N1//S+1 (2)
where N1 is the data amount in the first data group, S is the preset per-block data amount threshold, and "//" denotes integer division.
The data in the first data group are divided evenly into F data blocks according to the data amount N1 in the first data group and the number of data blocks F corresponding to the first data group. For example, as shown in fig. 8, first data group 3 is divided into 3 data blocks.
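Formula (2) and the even split can be written as a short sketch (the function name and the ceiling-based split are illustrative assumptions):

```python
def split_into_blocks(group, s):
    """F = N1 // S + 1 data blocks of roughly equal size."""
    n1 = len(group)
    f = n1 // s + 1
    per_block = -(-n1 // f)  # ceil(n1 / f)
    return [group[i * per_block:(i + 1) * per_block] for i in range(f)]

# e.g. a group of 2500 records with S = 1000 gives F = 3 blocks
# of sizes 834, 834 and 832.
```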
At this time, the sending of two data blocks to the first computing device for every two data blocks in the F data blocks in the above S402-2 includes S402-21 and S402-22:
S402-21, obtaining R pairwise data block combinations according to the different pairwise combinations of the F data blocks, where the R pairwise combinations include both combinations of a data block with itself and combinations of two different data blocks, and R is a positive integer;
S402-22, for each of the R pairwise data block combinations, sending the two data blocks in the combination to the first computing device.
For example, assuming that the F data blocks include data block 1, data block 2 and data block 3, the R pairwise data block combinations are: (data block 1, data block 1), (data block 1, data block 2), (data block 1, data block 3), (data block 2, data block 2), (data block 2, data block 3), (data block 3, data block 3). (Data block 1, data block 1) is sent to first computing device 1, so that first computing device 1 calculates the similarity between every two data within data block 1; (data block 1, data block 2) is sent to first computing device 2, so that first computing device 2 calculates the similarity between every two data across data block 1 and data block 2; (data block 1, data block 3) is sent to first computing device 3, so that first computing device 3 calculates the similarity between every two data across data block 1 and data block 3; (data block 2, data block 3) is sent to first computing device 4, so that first computing device 4 calculates the similarity between every two data across data block 2 and data block 3; (data block 2, data block 2) is sent to first computing device 5, so that first computing device 5 calculates the similarity between every two data within data block 2; and (data block 3, data block 3) is sent to first computing device 6, so that first computing device 6 calculates the similarity between every two data within data block 3.
In a possible implementation, the above S402-21 includes: obtaining an F×F data block matrix from the different pairwise combinations of the F data blocks; determining the pairwise combinations located in the upper triangle of the F×F data block matrix as the R pairwise data block combinations; and, for each of the R pairwise data block combinations, sending the two data blocks in the combination to the first computing device.
For example, as shown in fig. 8, assume F is 3, so the first data group is divided into 3 data blocks. The different pairwise combinations of the 3 data blocks give a 3×3 data block matrix. As shown in fig. 8, the pairwise combinations located in the upper triangle of the 3×3 matrix are determined as 6 pairwise data block combinations: (data block 1, data block 1), (data block 1, data block 2), (data block 1, data block 3), (data block 2, data block 2), (data block 2, data block 3), (data block 3, data block 3). The two data blocks in each of the 6 combinations are distributed to different first computing devices for similarity computation, in the same manner as described above for S402-21 and S402-22.
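The upper-triangle selection corresponds exactly to combinations with replacement; a sketch (function name illustrative):

```python
from itertools import combinations_with_replacement

def block_combinations(f):
    """Upper triangle of the F x F block matrix, self-pairs included:
    R = F * (F + 1) / 2 pairwise block combinations."""
    return list(combinations_with_replacement(range(f), 2))

# For F = 3: [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)], i.e. R = 6.
# Each (i, j) pair is dispatched to a different first computing device, which
# computes similarities between every two data across (or within) blocks i and j.
```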
After the similarity calculation is completed as described above, steps S302 to S305 are executed to compute the connected components and thereby classify the data; for the specific process of computing the connected components, refer to the description of S302 to S304, which is not repeated here.
Example 3
Fig. 9 is a schematic flowchart of a data classification method according to another embodiment of the present application, and as shown in fig. 9, the method includes:
S701, acquiring N data to be classified.
Fig. 10 is a schematic view of an application scenario of an embodiment of the present application. The method may be applied to a merchant name similarity model in a risk control center: by optimizing the offline model training scheme, offline training performance is greatly improved, and the computing-power problem of big data calculation is solved.
As shown in fig. 10, when a merchant conducts transactions through the payment platform, it must first complete registration, that is, the merchant uploads registration information to the payment platform, where the registration information includes basic information such as the merchant name. The payment platform sends the merchant's registration information to the risk control center, and the risk control center performs policy interception: for example, it judges whether the merchant is a legitimate merchant and whether the merchant is already registered, by matching the merchant's registration information against the registered historical groups. If the merchant's registration information exists in the historical groups, the merchant is already registered and is not registered again. If the merchant is not found in the historical groups, the merchant is determined to be a new merchant, and the registration information of the new merchant is sent to the background computing device. Since multiple merchants register with the payment platform, the background computing device can obtain the registration information of multiple merchants, for example, the registration information of N merchants, and the background computing device executes the method of the embodiments of the present application to complete the classification of the merchants.
As can be seen from the above, in the application scenario shown in fig. 10, the registration information of N merchants can be understood as N data to be classified.
S702, performing word segmentation and part-of-speech tagging on each data in the N data to obtain a naming mode of each data.
S703, dividing the data with the same naming mode in the N data into one first data group.
As described above, the registration information of a merchant includes the merchant name, and different types of merchant names may follow different naming modes. Based on this, the computing device performs word segmentation and part-of-speech tagging on each data (for example, each merchant name) in the N data to obtain the naming mode of each data. The data with the same naming mode are divided into one first data group, so that one or more first data groups are obtained; a hedged sketch of this grouping step is given below.
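As an illustration only (the patent does not name a tokenizer; jieba, the function names, and the pattern encoding below are assumptions), the naming mode can be modeled as the sequence of part-of-speech tags of a merchant name:

```python
from collections import defaultdict
import jieba.posseg as pseg  # jieba's word segmentation with part-of-speech tagging

def naming_mode(name):
    """Encode the naming mode of a merchant name as its sequence of
    part-of-speech tags (e.g. place name / noun / organization suffix)."""
    return "-".join(w.flag for w in pseg.cut(name))

def group_by_naming_mode(names):
    """Divide data with the same naming mode into one first data group."""
    groups = defaultdict(list)
    for name in names:
        groups[naming_mode(name)].append(name)
    return list(groups.values())
```

Names whose tag sequences coincide land in the same first data group, so only structurally comparable names are compared for similarity downstream.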
S704, for each first data group in the at least one first data group, judging whether the data amount in the first data group is greater than a second preset value; if the data amount in the first data group is greater than the second preset value, executing S705 to S708, and if the data amount in the first data group is less than or equal to the second preset value, executing S709.
S705, dividing the data in the first data group into F data blocks according to a preset data amount threshold per data block.
S706, obtaining an F×F data block matrix according to the different pairwise combinations of the F data blocks.
S707, combining the pairwise data blocks located in the upper triangle of the F×F data block matrix to determine R pairwise data block combinations.
S708, for each pairwise data block combination in the R pairwise data block combinations, sending the two data blocks in the combination to the first computing device, so that the first computing device computes the similarity between every two data in the two data blocks.
For each first data group in the at least one first data group, if the data amount in the first data group is greater than the second preset value, the data in the first data group is divided into F data blocks according to a preset data amount threshold S per data block, for example by dividing the data in the first data group evenly into F data blocks. An F×F data block matrix is obtained according to the different pairwise combinations of the F data blocks, and the pairwise data blocks located in the upper triangle of the matrix are determined as the R pairwise data block combinations. For each of the R pairwise data block combinations, the two data blocks in the combination are sent to the first computing device, so that the first computing device can compute the similarity between every two data in the two data blocks. This solves the problem of data skew, improves the computing efficiency of the data similarity, balances computing resources, and reduces the waste of computing resources. A minimal sketch of the block division is given below.
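Following the integer-division rule described for this step (the number of blocks is the data amount integer-divided by the threshold, plus 1; see claim 9), a hedged Python sketch of the even split could look as follows; the function name is an assumption:

```python
def split_into_blocks(items, threshold):
    """Divide a first data group into F data blocks, where
    F = (len(items) // threshold) + 1, splitting as evenly as possible."""
    f = len(items) // threshold + 1
    base, extra = divmod(len(items), f)
    blocks, start = [], 0
    for k in range(f):
        size = base + (1 if k < extra else 0)  # the first `extra` blocks get one more item
        blocks.append(items[start:start + size])
        start += size
    return blocks

# e.g. 10 items with a per-block threshold of 4 give F = 3 blocks of sizes 4, 3, 3
print([len(b) for b in split_into_blocks(list(range(10)), 4)])
```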
S709, if the data amount in the first data group is less than or equal to the second preset value, sending the data in the first data group to a second computing device for similarity calculation.
That is, if the data amount in the first data group is less than or equal to the second preset value, the data in the first data group does not need to be divided and is directly sent to the corresponding second computing device for similarity calculation.
After calculating the similarity between every two data of the N data according to the above S705 to S709, the following steps are performed to calculate the connected component based on the similarity.
S710, according to the similarity between every two data in the N data, determining the two data with the similarity larger than the first preset value as a relation pair to obtain M relation pairs.
S711, for the ith relation pair in the arrangement sequence of the M relation pairs, judging whether the ith relation pair has first data which is the same as data in a first connected component of the current pre-clustering; if so, executing S712, and if not, executing S713, wherein i is a positive integer less than or equal to M.
And S712, connecting the ith relation pair with the first data in the first connected component.
S713, constructing a new first connected component by taking the data in the ith relation pair as the starting node of the new first connected component.
S714, judging whether i is equal to M; if i is not equal to M, setting i = i + 1 and continuing to perform S711 to S713, and if i is equal to M, performing S715.
Through the above S710 to S714, the N data are pre-clustered according to the similarity between every two data, and P first connected components are obtained.
S715, taking each first connected component in the P first connected components as a node, taking the connecting line of a relation pair between two different first connected components as an edge, and calculating the connected components to obtain Q second connected components.
For example, each of the P first connected components is taken as a node, the connecting line of a relation pair between two different first connected components is taken as an edge, and these are input into a connected component calculation model, which outputs the Q second connected components.
And S716, determining the data in each second connected component in the Q second connected components as the same type of data.
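To make the two-stage flow of S710 to S716 concrete, the following is a hedged Python sketch (the data structures and function names are assumptions, not the patent's verbatim code): stage one scans the relation pairs in order and grows first connected components as in S711 to S714; stage two merges first connected components linked by a relation pair, as in S715, using union-find:

```python
def pre_cluster(relation_pairs):
    """Stage one (S711-S714): scan relation pairs in arrangement order.
    A pair sharing a data item with an existing first connected component is
    attached to it; otherwise it starts a new component. Pairs whose two data
    items already sit in two different components become cross-component edges."""
    components = []        # each first connected component is a set of data items
    index = {}             # data item -> component id
    cross_pairs = []       # relation pairs between two different first components
    for a, b in relation_pairs:
        ca, cb = index.get(a), index.get(b)
        if ca is None and cb is None:
            cid = len(components)
            components.append({a, b})
            index[a] = index[b] = cid
        elif ca is not None and cb is not None and ca != cb:
            cross_pairs.append((ca, cb))
        else:
            cid = ca if ca is not None else cb
            components[cid].update((a, b))
            index[a] = index[b] = cid
    return components, cross_pairs

def merge_components(components, cross_pairs):
    """Stage two (S715): union-find over the P first connected components,
    with cross-component relation pairs as edges, yields the Q second ones."""
    parent = list(range(len(components)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in cross_pairs:
        parent[find(u)] = find(v)
    merged = {}
    for cid, comp in enumerate(components):
        merged.setdefault(find(cid), set()).update(comp)
    return list(merged.values())  # each set is one class of same-type data
```

Because stage one already collapses most pairs into P components, stage two runs on far fewer nodes and edges than a flat connected component computation over all N data, which is consistent with the node and edge reductions reported with fig. 12 below.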
In some embodiments, the method of calculating the connected component may be divided into a connected component calculation based on the similarity calculation and a connected component calculation based on the similarity result, depending on the size of the data volume within the first data group.
1. Connected component calculation based on similarity calculation
For a first data group with a small data amount, the similarity calculation may be performed on a single node. An exemplary outline of the connected component optimization logic is as follows:
[The original publication presents this logic as an image (Figure BDA0003011167440000221), which is not reproduced here.]
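As the image is not recoverable, the following Python sketch is offered only as a stand-in consistent with the surrounding description (single-node pairwise similarity over a small group, feeding the two-stage connected component routine sketched above); the similarity metric and threshold are assumptions:

```python
from difflib import SequenceMatcher
from itertools import combinations

def single_node_classify(items, first_preset_value=0.8):
    """Single-node path for a small first data group: compute all pairwise
    similarities locally, keep pairs above the first preset value as relation
    pairs, then reuse pre_cluster / merge_components from the sketch above."""
    relation_pairs = [
        (a, b)
        for a, b in combinations(items, 2)
        if SequenceMatcher(None, a, b).ratio() > first_preset_value
    ]
    first_components, cross_pairs = pre_cluster(relation_pairs)
    return merge_components(first_components, cross_pairs)
```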
As this logic shows, the similarity calculation for a first data group with a small data amount can be performed on a single node.
2. Connected component calculation based on similarity results
For a first data group with a large data amount, the Cartesian product can be used for distributed similarity calculation; on this basis, the calculated similarities are used as the input of the optimized connected component algorithm to compute the connected components. An exemplary outline of the connected component optimization logic is as follows:
[The original publication presents this logic as an image (Figure BDA0003011167440000231), which is not reproduced here.]
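Again as a stand-in for the lost image: a hedged PySpark sketch of the distributed pairwise similarity for one pairwise data block combination, using a Cartesian product (crossJoin) and, purely as an example metric, a normalized Levenshtein similarity; the column names and threshold are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, greatest, length, levenshtein

spark = SparkSession.builder.appName("pairwise-similarity").getOrCreate()

def block_pair_similarity(block_a, block_b, first_preset_value=0.8):
    """Cross-join the two data blocks of one pairwise combination and keep
    the pairs whose similarity exceeds the first preset value."""
    left = spark.createDataFrame([(x,) for x in block_a], ["name_a"])
    right = spark.createDataFrame([(x,) for x in block_b], ["name_b"])
    crossed = left.crossJoin(right)
    if block_a is block_b:
        # diagonal combination: keep each unordered pair only once
        crossed = crossed.where(col("name_a") < col("name_b"))
    similarity = 1 - levenshtein(col("name_a"), col("name_b")) / greatest(length("name_a"), length("name_b"))
    return (crossed.withColumn("similarity", similarity)
                   .where(col("similarity") > first_preset_value))
```

The rows that survive the threshold are exactly the relation pairs fed into the connected component stage above.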
As this logic shows, for a first data group with a large data amount, the distributed similarity calculation may be performed using the Cartesian product.
The following are the beneficial effects produced by the technical scheme of the application.
Fig. 11 is a schematic diagram comparing Spark-based similarity calculation before and after optimization; as can be seen from fig. 11, the technical scheme of the embodiment of the present application is the optimized scheme. For example, for a data volume of 300,000 records, the similarity calculation takes 180 minutes with the prior art, whereas it takes 20 minutes with the technical solution of the present application.
Fig. 12 is a schematic diagram comparing the effect before and after the connected component optimization; as can be seen from fig. 12, with the technical scheme of the present application, the number of nodes and the number of edges can be reduced when calculating the connected components, thereby saving computation time. For example, for the first set of experiments, the prior art processes 23408 nodes and 7560995 edges and takes 17.3 seconds (s) when performing the connected component calculation, whereas the technical solution of the present application processes 9752 nodes and 548 edges and takes 32.1 milliseconds (ms).
The preferred embodiments of the present application have been described in detail with reference to the accompanying drawings; however, the present application is not limited to the details of the above embodiments. Various simple modifications can be made to the technical solution of the present application within its technical idea, and these simple modifications all fall within the protection scope of the present application. For example, the various features described in the foregoing detailed description may be combined in any suitable manner without contradiction; to avoid unnecessary repetition, the possible combinations are not separately described. Likewise, the various embodiments of the present application may be combined with one another arbitrarily, and such combinations should also be regarded as disclosed by the present application as long as they do not depart from its concept.
It should also be understood that, in the various method embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an execution sequence, and the execution sequence of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Example 4
Method embodiments of the present application are described in detail above with reference to fig. 3-12, and apparatus embodiments of the present application are described in detail below with reference to fig. 13-14.
Fig. 13 is a schematic structural diagram of a data classification apparatus according to an embodiment of the present application. The apparatus may be a computing device or may be a component of a computing device (e.g., an integrated circuit, a chip, etc.). As shown in fig. 13, the data sorting apparatus 10 may include:
an obtaining unit 11, configured to obtain N data to be classified;
a similarity determining unit 12, configured to determine a similarity between every two data in the N data;
a relation pair determining unit 13, configured to determine, according to a similarity between every two pieces of data in the N pieces of data, two pieces of data with a similarity greater than a first preset value as a relation pair, so as to obtain M relation pairs;
a pre-clustering unit 14, configured to pre-cluster the N data according to the association relationships and the arrangement order between the M relationship pairs, so as to obtain P first connected components;
A connected component determining unit 15, configured to obtain Q second connected components according to an association relationship between first connected components in the P first connected components;
a classification unit 16, configured to determine data in each of the Q second connected components as data of the same type;
wherein N, M, P, Q are all positive integers, and Q is less than or equal to P.
In some embodiments, the pre-clustering unit 14 is specifically configured to: for the ith relation pair in the arrangement order of the M relation pairs, if first data that is the same as data in a first connected component of the current pre-clustering exists in the ith relation pair, connect the ith relation pair with the first data in the first connected component; if no such first data exists in the ith relation pair, construct a new first connected component by taking the data in the ith relation pair as the starting node of the new first connected component, where i is a positive integer less than or equal to M.
In some embodiments, the connected component determining unit 15 is further configured to determine, according to a similarity between every two data of the N data, two data whose similarity is smaller than or equal to the first preset value as independent nodes, so as to obtain K independent nodes, where K is a positive integer; determining each of the independent nodes as a first connected component.
In some embodiments, the connected component determining unit 15 is specifically configured to, for each first connected component in the P first connected components: if a relation pair exists between the first connected component and another first connected component, merge the first connected component and the other first connected component into one second connected component; and if no relation pair exists between the first connected component and the other first connected components, determine the first connected component as a second connected component, where the other first connected components are the first connected components among the P first connected components except the first connected component itself.
In some embodiments, the connected component determining unit 15 is specifically configured to take each of the P first connected components as a node, take a connection line of a relationship pair between two different first connected components as an edge, and perform connected component calculation to obtain the Q second connected components.
In some embodiments, the connected component determining unit 15 is specifically configured to take each first connected component in the P first connected components as a node, take the connecting line of a relation pair between two different first connected components as an edge, and obtain the Q second connected components by using a breadth-first search method or a depth-first search method.
In some embodiments, the similarity determining unit 12 is specifically configured to divide the N data into at least one first data group according to attribute information of each data in the N data; and sending the data of each first data group to different computing equipment for similarity calculation, and obtaining the similarity between every two data in each first data group from the different computing equipment.
In some embodiments, the similarity determining unit 12 is specifically configured to, for each first data group in the at least one first data group, if the data amount in the first data group is greater than a second preset value, divide the data in the first data group into F data blocks according to a data amount threshold in a preset data block, where F is a positive integer; for every two data blocks in the F data blocks, sending the two data blocks to a first computing device so that the first computing device calculates the similarity between every two data blocks in the two data blocks.
In some embodiments, the similarity determining unit 12 is specifically configured to add 1 to an integer division result of the data amount in the first data group and the data amount threshold, as the number of data blocks corresponding to the first data group; and averagely dividing the data in the first data group into F data blocks according to the data quantity in the first data group and the data block quantity corresponding to the first data group.
In some embodiments, the similarity determining unit 12 is specifically configured to obtain R pairwise data block combinations according to the different pairwise combinations of the F data blocks, where the R pairwise data block combinations include combinations of a data block with itself and combinations of two different data blocks among the F data blocks, and R is a positive integer; and, for each pairwise data block combination in the R pairwise data block combinations, send the two data blocks in the combination to the first computing device.
In some embodiments, the similarity determining unit 12 is specifically configured to obtain an F×F data block matrix according to the different pairwise combinations of the F data blocks; and combine the pairwise data blocks located in the upper triangle of the F×F data block matrix to determine the R pairwise data block combinations.
In some embodiments, the similarity determining unit 12 is further configured to send the data in the first data group to a second computing device for similarity calculation if the data amount in the first data group is less than or equal to the second preset value.
In some embodiments, if the attribute information of the data includes a naming mode, the obtaining unit 11 is specifically configured to perform word segmentation and part-of-speech tagging on each data of the N data to obtain the naming mode of each data; and dividing the data with the same naming mode in the N data into a first data group.
It is to be understood that the apparatus embodiments and the method embodiments correspond to one another, and similar descriptions can refer to the method embodiments; to avoid repetition, details are not repeated here. Specifically, the apparatus shown in fig. 13 can perform the above method embodiments, and the foregoing and other operations and/or functions of each module in the apparatus respectively implement the corresponding method embodiments, which are not described herein again for brevity.
The apparatus of the embodiments of the present application is described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Example 5
Fig. 14 is a block diagram of a computing device according to an embodiment of the present application, where the computing device is configured to execute the data classification method according to the foregoing embodiment, and refer to the description in the foregoing method embodiment specifically.
The computing device 200 shown in fig. 14 includes a memory 201, a processor 202, and a communication interface 203, which are communicatively connected to each other, for example via a network connection. Alternatively, the computing device 200 may also include a bus 204, in which case the memory 201, the processor 202, and the communication interface 203 are connected to each other by the bus 204, as shown in fig. 14.
The memory 201 may be a Read-Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 201 may store a program; when the program stored in the memory 201 is executed by the processor 202, the processor 202 and the communication interface 203 are used to perform the above-described method.
The processor 202 may be implemented as a general-purpose Central Processing Unit (CPU), a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits.
The processor 202 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method of the present application may be completed by integrated logic circuits of hardware in the processor 202 or by instructions in the form of software. The processor 202 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 201, and the processor 202 reads the information in the memory 201 and completes the method of the embodiments of the present application in combination with its hardware.
The communication interface 203 enables communication between the computing device 200 and other devices or communication networks using transceiver modules such as, but not limited to, transceivers. For example, the data set may be acquired through the communication interface 203.
When computing device 200 includes bus 204, as described above, bus 204 may include a pathway to transfer information between various components of computing device 200 (e.g., memory 201, processor 202, communication interface 203).
According to an aspect of the present application, there is provided a computer storage medium having a computer program stored thereon, which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above-described method embodiment.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via a wired connection (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless connection (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)).
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
In summary, the foregoing is only a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art could readily conceive of changes or substitutions within the technical scope disclosed in the present application, and all such changes or substitutions shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of data classification, comprising:
acquiring N data to be classified, and determining the similarity between every two data in the N data;
Determining two data with the similarity larger than a first preset value as a relationship pair according to the similarity between every two data in the N data to obtain M relationship pairs;
pre-clustering the N data according to the incidence relation and the arrangement sequence among the M relation pairs to obtain P first connected components;
obtaining Q second connected components according to the incidence relation among the first connected components in the P first connected components;
determining data in each of the Q second connected components as the same type of data;
wherein N, M, P, Q are all positive integers, and Q is less than or equal to P.
2. The method according to claim 1, wherein the pre-clustering the N data according to the association relationship and the arrangement order between the M relationship pairs to obtain P first connected components comprises:
for the ith relationship pair in the arrangement order of the M relationship pairs, if first data which is the same as data in a first connected component of the current pre-clustering exists in the ith relationship pair, connecting the ith relationship pair with the first data in the first connected component, wherein i is a positive integer less than or equal to M;
if no first data which is the same as data in a first connected component of the current pre-clustering exists in the ith relationship pair, constructing a new first connected component by taking the data in the ith relationship pair as a starting node of the new first connected component;
and if the two data in the ith relationship pair are respectively located in two different first connected components, determining the ith relationship pair as the relationship pair between the two different first connected components.
3. The method of claim 2, further comprising:
determining two data with the similarity smaller than or equal to the first preset value as independent nodes according to the similarity between every two data in the N data to obtain K independent nodes, wherein K is a positive integer;
determining each of the independent nodes as a first connected component.
4. The method according to any one of claims 1 to 3, wherein the obtaining Q second connected components according to the correlation between each first connected component in the P first connected components comprises:
for each first connected component in the P first connected components, if a relationship pair exists between the first connected component and another first connected component, combining the first connected component and the other first connected component into one second connected component;
and if no relationship pair exists between the first connected component and the other first connected components, determining the first connected component as a second connected component, wherein the other first connected components are the first connected components among the P first connected components except the first connected component.
5. The method according to any one of claims 1 to 3, wherein the obtaining Q second connected components according to the correlation between each first connected component in the P first connected components comprises:
and taking each first connected component in the P first connected components as a node, taking a connecting line of a relationship pair between two different first connected components as an edge, and performing connected component calculation to obtain the Q second connected components.
6. The method according to claim 5, wherein the performing connected component calculation by taking each of the P first connected components as a node and taking a connection line of a relationship pair between two different first connected components as an edge to obtain the Q second connected components comprises:
and taking each first connected component in the P first connected components as a node, taking a connecting line of a relationship pair between two different first connected components as an edge, and obtaining the Q second connected components by using a breadth-first search method or a depth-first search method.
7. The method of claim 1, wherein determining the similarity between each two of the N data comprises:
dividing the N data into at least one first data group according to the attribute information of each data in the N data;
and sending the data of each first data group to different computing equipment for similarity calculation, and obtaining the similarity between every two data in each first data group from the different computing equipment.
8. The method of claim 7, wherein sending the data of each first data group to a different computing device for similarity calculation comprises:
for each first data group in the at least one first data group, if the data volume in the first data group is greater than a second preset value, dividing the data in the first data group into F data blocks according to a data volume threshold value in a preset data block, wherein F is a positive integer;
for every two data blocks in the F data blocks, sending the two data blocks to a first computing device so that the first computing device calculates the similarity between every two data blocks in the two data blocks.
9. The method of claim 8, wherein the dividing the data in the first data group into F data blocks according to a preset threshold of data amount in the data blocks comprises:
adding 1 to the result of the integral division of the data volume in the first data group and the data volume threshold value to obtain the number of data blocks corresponding to the first data group;
and averagely dividing the data in the first data group into F data blocks according to the data quantity in the first data group and the data block quantity corresponding to the first data group.
10. The method of claim 8, wherein sending two of the F data blocks to the first computing device for each two of the F data blocks comprises:
obtaining R pairwise data block combinations according to the different pairwise combinations of the F data blocks, wherein the R pairwise data block combinations comprise combinations of a same data block with itself and combinations of two different data blocks among the F data blocks, and R is a positive integer;
and aiming at each pairwise data block combination in the R pairwise data block combinations, sending two data blocks in the pairwise data block combinations to the first computing device.
11. The method of claim 10, wherein obtaining R pairwise combinations of data blocks from different combinations of pairwise data blocks of the F data blocks comprises:
obtaining an F×F data block matrix according to the different pairwise combinations of the F data blocks;
and combining the pairwise data blocks located in the upper triangle of the F×F data block matrix to determine the R pairwise data block combinations.
12. The method of claim 8, further comprising:
and if the data volume in the first data group is less than or equal to the second preset value, sending the data in the first data group to second computing equipment for similarity calculation.
13. The method of claim 7, wherein if the attribute information of the data includes a naming mode, the dividing the N data into at least one first data group according to the attribute information of each of the N data comprises:
performing word segmentation and part-of-speech tagging on each data in the N data to obtain a naming mode of each data;
and dividing the data with the same naming mode in the N data into a first data group.
14. A data sorting apparatus, comprising:
an acquisition unit configured to acquire N data to be classified;
a similarity determining unit, configured to determine a similarity between every two pieces of data in the N pieces of data;
the relation pair determining unit is used for determining two data with the similarity larger than a first preset value as a relation pair according to the similarity between every two data in the N data to obtain M relation pairs;
the pre-clustering unit is used for pre-clustering the N data according to the incidence relation and the arrangement sequence among the M relation pairs to obtain P first connected components;
a connected component determining unit, configured to obtain Q second connected components according to an association relationship between first connected components in the P first connected components;
the classification unit is used for determining the data in each second connected component in the Q second connected components as the same type of data;
wherein N, M, P, Q are all positive integers, and Q is less than or equal to P.
15. A computing device, comprising: a memory, a processor;
the memory for storing a computer program;
The processor for executing the computer program to implement the method of any one of the preceding claims 1 to 13.
CN202110375946.6A 2021-04-08 2021-04-08 Data classification method and device and computing equipment Pending CN113722554A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110375946.6A CN113722554A (en) 2021-04-08 2021-04-08 Data classification method and device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110375946.6A CN113722554A (en) 2021-04-08 2021-04-08 Data classification method and device and computing equipment

Publications (1)

Publication Number Publication Date
CN113722554A true CN113722554A (en) 2021-11-30

Family

ID=78672629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110375946.6A Pending CN113722554A (en) 2021-04-08 2021-04-08 Data classification method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN113722554A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116566995A (en) * 2023-07-10 2023-08-08 安徽中科晶格技术有限公司 Block chain data transmission method based on classification and clustering algorithm
CN116566995B (en) * 2023-07-10 2023-09-22 安徽中科晶格技术有限公司 Block chain data transmission method based on classification and clustering algorithm


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination