CN115563097A - Data processing method, data processing apparatus, storage medium, and program product - Google Patents


Info

Publication number
CN115563097A
CN115563097A (application CN202110749898.2A)
Authority
CN
China
Prior art keywords
partition
data
dynamic index
partitions
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110749898.2A
Other languages
Chinese (zh)
Inventor
殷晖
陈秦星
胥皇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN202110749898.2A priority Critical patent/CN115563097A/en
Priority to PCT/CN2022/095268 priority patent/WO2023273727A1/en
Publication of CN115563097A publication Critical patent/CN115563097A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The application relates to a data processing method, apparatus, storage medium, and program product. The method is applied to a distributed system comprising a distribution guide node and a plurality of processing nodes, and comprises the following steps: the distribution guide node receives data and selects, according to the parameters of each partition in a dynamic index, the partition to which the data belongs from among the plurality of partitions; the distribution guide node outputs the data to the processing node corresponding to the selected partition, where data belonging to the same partition is clustered on that processing node; and the distribution guide node updates the dynamic index according to the data belonging to the selected partition. The data processing method can increase the probability that data belonging to the same community flows into the same node, reduce the adverse effects of the "cluster splitting problem", and improve the streaming clustering effect.

Description

Data processing method, data processing apparatus, storage medium, and program product
Technical Field
The present application relates to the field of data analysis, and in particular, to a data processing method, apparatus, storage medium, and program product.
Background
Clustering is a technique for the statistical analysis of data. Conventional clustering algorithms are generally applied to static offline data: offline cluster analysis divides the static data into different communities so that data in the same community has similar attributes. In fields such as e-commerce and finance, offline analysis takes too long and its timeliness cannot support business targets; data often needs to be divided into communities within a short time so that corresponding business measures can be taken promptly for the different communities. That is, cluster analysis must be completed in real time, for example at the second or minute level, on streaming data as it arrives; this is streaming clustering.
To handle massive data while guaranteeing computing performance (high concurrency, low latency, and scalability), distributed systems are often used to carry services in fields such as e-commerce and risk control; a distributed system can receive massive real-time data and perform real-time service computation on it with high performance. Therefore, to ensure that performance meets the standard, streaming clustering of massive data also needs to be supported by a distributed system. In a distributed system there may be multiple nodes with computing capability. Gathering all the data on a single node for cluster analysis would, first, violate the idea of a distributed architecture, and second, exceed what single-node computing resources (such as the central processing unit, memory, and disk) can support for real-time cluster analysis of massive data. On a distributed system, however, streaming clustering faces a "cluster splitting problem": data that would otherwise gather in the same community (having similar attributes) flows to different nodes because of the structure of the distributed system; this natural physical isolation prevents the data from gathering into the same community and instead divides it into different communities, so the streaming clustering effect is not ideal.
Therefore, how to increase the probability that data belonging to the same community flows into the same node, reduce the adverse effects of the "cluster splitting problem", and improve the streaming clustering effect has become a research focus in the field.
Disclosure of Invention
In view of this, a data processing method, apparatus, storage medium, and program product are provided. The data processing method according to the embodiments of the present application can increase the probability that data belonging to the same community flows into the same node, reduce the adverse effects of the "cluster splitting problem", and improve the streaming clustering effect.
In a first aspect, an embodiment of the present application provides a data processing method applied to a distributed system, where the distributed system includes a distribution guide node and multiple processing nodes, the distribution guide node stores a dynamic index, the dynamic index includes multiple partitions, and each partition corresponds to one processing node. The method includes: the distribution guide node receives data and selects, according to the parameters of each partition in the dynamic index, the partition to which the data belongs from among the multiple partitions; the distribution guide node outputs the data to the processing node corresponding to the selected partition, where data belonging to the same partition is clustered on that processing node; and the distribution guide node updates the dynamic index according to the data belonging to the selected partition.
According to the data processing method of the embodiments of the present application, the partition to which data belongs can be selected from the plurality of partitions of the dynamic index using the parameters of each partition, so that when the data is output to the processing node corresponding to the selected partition according to the correspondence between partitions and processing nodes, data already attributed to the selected partition is present on that node and data belonging to the same partition can be distributed to the same node. After the partition to which the data belongs is selected, the dynamic index can be updated according to the data attributed to the selected partition, so that when subsequently received data is output to processing nodes according to the updated dynamic index, distribution accuracy is further improved. This overcomes the inability of a fixed partitioning mechanism to correct its errors, increases the probability that data belonging to the same community flows into the same node, reduces the adverse effects of the "cluster splitting problem", and improves both the streaming clustering effect and the adaptability to different application scenarios.
In a first possible implementation of the data processing method according to the first aspect, the parameters of each partition include a core and a compactness, where the core indicates the center position of the partition and the compactness indicates the degree of closeness between the data belonging to the partition.
In this way, the partition parameters reflect information about the data in the partition, which helps ensure that the partition to which data belongs is selected accurately according to those parameters.
In a second possible implementation of the data processing method according to the first possible implementation of the first aspect, the selecting, from the plurality of partitions according to the parameters of each partition in the dynamic index, of the partition to which the data belongs includes: determining the proximity of the data to the core of each of the plurality of partitions; and selecting the partition whose core is closest to the data as the partition to which the data belongs.
In this way, the partition to which the received data belongs can be determined. The closer the data is to a partition's core, the more accurately the data can be attributed to that partition; moreover, measuring the proximity between a partition core and the data is simple, which reduces the complexity of selecting the partition to which the data belongs.
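As a concrete sketch of this selection step (the vector representation of the data, the Euclidean distance measure, and the partition layout below are assumptions for illustration, not details fixed by the patent):

```python
import math

# Hypothetical dynamic index: partition id -> core position M.
cores = {1: (0.0, 0.0), 2: (5.0, 5.0), 3: (10.0, 0.0)}

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_partition(data_point, cores):
    """Return the id of the partition whose core is nearest to data_point."""
    return min(cores, key=lambda pid: euclidean(data_point, cores[pid]))
```

Under this sketch, a point at (4.6, 4.9) would be attributed to partition 2, since of the three cores, that partition's core is nearest.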
In a third possible implementation of the data processing method according to the first aspect or any one of the above possible implementations of the first aspect, the updating of the dynamic index according to the data belonging to the selected partition includes: updating the core and the compactness of the selected partition according to the data attributed to the selected partition.
In this way, the parameters of a partition can be updated without changing the partitioning scheme, so the dynamic index is updated simply and conveniently. After the partition to which the data belongs is selected, the set of data attributed to the selected partition changes, and the core and the compactness of the selected partition are updated according to the changed data; when new data is subsequently received, the partition to which it belongs can be selected according to the updated core and compactness, that is, whether the new data is strongly proximate to the selected partition is judged using the updated core and compactness, improving the accuracy of data partitioning.
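One simple way to realize such a parameter update (the running-mean formulation below is an assumption for illustration; the patent does not fix the update formula) is to keep the core as the mean of the partition's points and the compactness as the mean distance of those points from the core:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def update_partition(points):
    """Recompute core (mean of points) and compactness (mean distance
    to the core) for one partition after new data has been attributed."""
    n = len(points)
    dim = len(points[0])
    core = tuple(sum(p[i] for p in points) / n for i in range(dim))
    compactness = sum(euclidean(p, core) for p in points) / n
    return core, compactness
```

With this formulation the compactness value shrinks as the attributed data packs more tightly around the core, matching the radius-like reading of D given later in the description.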
According to the first aspect or the first or second possible implementation of the first aspect, in a fourth possible implementation of the data processing method, the updating of the dynamic index according to the data belonging to the selected partition includes: updating the compactness of the selected partition according to the data belonging to the selected partition; when the updated compactness is determined to be greater than a first threshold, splitting the selected partition into a plurality of partitions in the dynamic index; and updating the core and the compactness of each partition in the dynamic index according to the data belonging to it.
Partition splitting changes the partitioning scheme, the partition parameters change with it, and the update of the dynamic index is thereby completed. Because splitting is performed when the compactness fails the threshold condition, the partitions obtained after updating the dynamic index by splitting have better compactness, so selecting the partition to which data belongs according to the updated dynamic index is more accurate.
According to the fourth possible implementation of the first aspect, in a fifth possible implementation of the data processing method, the updating of the dynamic index according to the data belonging to the selected partition further includes: when the total number of partitions in the dynamic index is determined to be greater than a second threshold, selecting a plurality of partitions in the dynamic index to merge; and updating the core and the compactness of the merged partition according to the data belonging to the merged partition.
In this way, the partitioning scheme and the partition parameters are changed and the update of the dynamic index is completed. Because partitions are merged when the total number of partitions fails the threshold condition, the number of partitions in the dynamic index obtained after merging is closer to optimal. The total number of partitions determines the number of partition parameters in the dynamic index, and when the number of partitions is closer to optimal, the time and computation cost of selecting the partition to which data belongs according to those parameters is also closer to optimal; therefore, when the partition to which data belongs is selected according to the updated dynamic index, higher computational efficiency can be achieved while still ensuring sufficiently high accuracy.
According to the fifth possible implementation of the first aspect, in a sixth possible implementation of the data processing method, selecting multiple partitions in the dynamic index to merge includes: determining the proximity of each pair of partitions in the dynamic index; and merging the two closest partitions into one partition.
In this way, partition merging can be accomplished, so that the reduced total number of partitions after merging comes closer to satisfying the threshold condition.
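A minimal sketch of this merge step (representing each partition only by its core, measuring partition proximity as the distance between cores, and taking the midpoint as the merged core are simplifying assumptions):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_closest(cores):
    """Given partition id -> core, merge the two partitions whose cores
    are nearest: remove both and insert one partition at the midpoint."""
    ids = list(cores)
    i, j = min(
        ((a, b) for a in ids for b in ids if a < b),
        key=lambda pair: euclidean(cores[pair[0]], cores[pair[1]]),
    )
    merged_core = tuple((x + y) / 2 for x, y in zip(cores[i], cores[j]))
    out = {k: v for k, v in cores.items() if k not in (i, j)}
    out[max(cores) + 1] = merged_core
    return out
```

Each call reduces the partition count by one, so repeating it drives the total back under the second threshold.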
According to the first aspect or any one of the above possible implementations of the first aspect, in a seventh possible implementation of the data processing method, the method further includes: determining whether the data attributed to the partitions of the dynamic index is still timely; and when part or all of the data loses its timeliness, deleting the dynamic index, or updating the dynamic index according to the data that is still timely.
In this way, the adverse effect of stale data on the dynamic index can be removed, further improving the accuracy of selecting the partition to which data belongs according to the dynamic index.
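A sketch of this aging step (the timestamped-point representation and the fixed time window are assumptions for illustration):

```python
def expire(points, now, window):
    """Keep only the data that is still timely, i.e. whose timestamp lies
    within `window` time units of `now`. If nothing survives, the caller
    can delete the partition's index entry outright; otherwise it can
    recompute the core and compactness from the surviving points."""
    return [(t, p) for (t, p) in points if now - t <= window]
```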
In a second aspect, an embodiment of the present application provides a data processing apparatus applied in a distributed system, where the distributed system includes a distribution guide node and a plurality of processing nodes, the distribution guide node stores a dynamic index, the dynamic index includes a plurality of partitions, and each partition corresponds to one processing node. The apparatus includes:
a partition determining module, arranged on the distribution guide node and configured to receive data and select, according to the parameters of each partition in the dynamic index, the partition to which the data belongs from among the plurality of partitions;
a data output module, arranged on the distribution guide node and configured to output the data to the processing node corresponding to the selected partition, where data belonging to the same partition is clustered on that processing node;
and an index updating module, arranged on the distribution guide node and configured to update the dynamic index according to the data belonging to the selected partition.
In a first possible implementation of the data processing apparatus according to the second aspect, the parameters of each partition include a core and a compactness, where the core indicates the center position of the partition and the compactness indicates the degree of closeness between the data belonging to the partition.
In a second possible implementation of the data processing apparatus according to the first possible implementation of the second aspect, the selecting, from the plurality of partitions according to the parameters of each partition in the dynamic index, of the partition to which the data belongs includes: determining the proximity of the data to the core of each of the plurality of partitions; and selecting the partition whose core is closest to the data as the partition to which the data belongs.
In a third possible implementation of the data processing apparatus according to the second aspect or any one of the above possible implementations of the second aspect, the updating of the dynamic index according to the data belonging to the selected partition includes: updating the core and the compactness of the selected partition according to the data attributed to the selected partition.
According to the second aspect or the first or second possible implementation of the second aspect, in a fourth possible implementation of the data processing apparatus, the updating of the dynamic index according to the data belonging to the selected partition includes: updating the compactness of the selected partition according to the data belonging to the selected partition; when the updated compactness is determined to be greater than a first threshold, splitting the selected partition into a plurality of partitions in the dynamic index; and updating the core and the compactness of each partition in the dynamic index according to the data belonging to it.
In a fifth possible implementation of the data processing apparatus according to the fourth possible implementation of the second aspect, the updating of the dynamic index according to the data belonging to the selected partition further includes: when the total number of partitions in the dynamic index is determined to be greater than a second threshold, selecting a plurality of partitions in the dynamic index to merge; and updating the core and the compactness of the merged partition according to the data belonging to the merged partition.
According to the fifth possible implementation of the second aspect, in a sixth possible implementation of the data processing apparatus, selecting a plurality of partitions in the dynamic index to merge includes: determining the proximity of each pair of partitions in the dynamic index; and merging the two closest partitions into one partition.
In a seventh possible implementation of the data processing apparatus according to the second aspect or any one of the above possible implementations of the second aspect, the apparatus further includes:
a timeliness determining module, arranged on the distribution guide node and configured to determine whether the data attributed to the partitions of the dynamic index is still timely;
and an index processing module, arranged on the distribution guide node and configured to delete the dynamic index when part or all of the data loses its timeliness, or to update the dynamic index according to the data that is still timely.
In a third aspect, an embodiment of the present application provides a data processing apparatus, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the data processing method of the first aspect or of one or more of its possible implementations.
In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium on which computer program instructions are stored, where the computer program instructions, when executed by a processor, implement the data processing method of the first aspect or of one or more of its possible implementations.
In a fifth aspect, an embodiment of the present application provides a computer program product including computer-readable code, or a non-transitory computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the data processing method of the first aspect or of one or more of its possible implementations.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
Fig. 1 shows a schematic diagram of the clustering method of prior art one.
Fig. 2 shows a schematic diagram of the clustering method of prior art two.
Fig. 3 shows a schematic diagram of the clustering method of prior art three.
Fig. 4 illustrates an exemplary application scenario according to an embodiment of the present application.
Fig. 5 illustrates an example of a data processing method according to an embodiment of the present application.
Fig. 6 illustrates an example of a dynamic index according to an embodiment of the present application.
Fig. 7 illustrates an example of the degree of association of data with each partition in a dynamic index according to an embodiment of the present application.
Fig. 8 illustrates an example of updating the core and the compactness of a selected partition according to an embodiment of the present application.
Fig. 9 shows an example of a partition splitting manner according to an embodiment of the present application.
Fig. 10 shows an example of the Euclidean distances between the cores of each pair of partitions after partition splitting according to an embodiment of the present application.
Fig. 11 illustrates an example of partition merging according to an embodiment of the present application.
Fig. 12 illustrates an example of a data processing method according to an embodiment of the present application.
Fig. 13 illustrates an example of a time window according to an embodiment of the present application.
Fig. 14 shows an exemplary flowchart of a data processing method according to an embodiment of the present application.
Fig. 15 shows an exemplary structural diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 16 shows an exemplary structural diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
The definitions of the terms appearing herein are given below.
Clustering: also known as cluster analysis, a technique for statistical data analysis widely used in machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Clustering divides objects with similar attributes into groups or subsets by static classification, so that all members of the same group/subset share similar attributes, usually expressed as a shorter spatial distance in a coordinate system; such a group or subset can serve as a community. Clustering is generally regarded as unsupervised learning: the data carries no labels, and in engineering it is usually presented as a batch statistical analysis of offline data.
Streaming clustering: to improve the efficiency of extracting and mining knowledge from data and to ensure the timeliness of that knowledge, the clustering method is applied to real-time streaming data, completing cluster analysis of continuously arriving data with second-level or minute-level timeliness.
Several solutions to the "cluster splitting problem" in the prior art are described below.
Fig. 1 shows a schematic diagram of the clustering method of prior art one. As shown in fig. 1, the streaming data may include a user group whose users have similar attributes. Prior art one proposes letting the streaming data flow randomly to different nodes of a distributed system and performing local clustering on each node; after all nodes have finished clustering, the data (one or more users) on each node can be considered to correspond to one or more communities, on the basis of which a "secondary global aggregation" is performed, i.e., similar communities across nodes are clustered into one community containing the whole user group, preventing the user group from being physically isolated across different nodes.
The disadvantage of prior art one is that the secondary global aggregation must wait for the last distributed node to finish its local clustering, and the waiting time depends on the node that finishes latest; prior art one thus suffers a "shortest plank in the barrel" effect in time performance, reducing the real-time performance of streaming clustering. Moreover, the volume of data the secondary global aggregation must process is very large, which greatly increases computation time and reduces the computational efficiency of streaming clustering. In addition, the secondary global aggregation requires complex application skills, since secondary clustering must be performed on community features after each node's local clustering, raising the application complexity of streaming clustering.
Fig. 2 shows a schematic diagram of the clustering method of prior art two. As shown in fig. 2, the streaming data may include a user group whose users have similar attributes. Prior art two proposes "data pre-partitioning" before the streaming data flows to the different nodes of the distributed system, using partitioning strategies such as locality-sensitive hashing (LSH) and hashing so that the user group belongs to the same partition as far as possible; data belonging to the same partition flows to the same node, and the data on each node is clustered locally, reducing the probability that data of the same community is physically isolated across different nodes.
The disadvantage of prior art two is that the partitioning mechanism of strategies such as locality-sensitive hashing and hashing is fixed and unchangeable. If the mechanism partitions a certain user incorrectly, that user cannot belong to the same partition as the other users of the user group and therefore cannot flow to the same node as them, and the next time another user with the same attributes is partitioned, the result will be wrong again. The clustering effect thus depends entirely on a fixed, unchangeable partitioning mechanism, and the probability that the user group is physically isolated across different nodes remains high.
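The fixed nature of such a mechanism can be seen in a toy hash-based router (a deliberately crude stand-in for the hash/LSH strategies discussed above; the key scheme and modulus are assumptions): two users with near-identical keys can land on different partitions, and because the mapping is a pure function of the key, the error repeats for every later user that hashes the same way.

```python
def hash_partition(user_key: str, num_nodes: int) -> int:
    """Fixed partitioning: the node is a pure function of the key, so a
    wrong assignment can never be corrected by later data."""
    return sum(ord(c) for c in user_key) % num_nodes
```

No amount of subsequently received data changes where `"alice"` is routed, which is exactly the inability to correct errors that the dynamic index of the present application is designed to avoid.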
Fig. 3 shows a schematic diagram of the clustering method of prior art three. As shown in fig. 3, prior art three combines the solutions of prior art one and prior art two, but still fails to remedy their disadvantages.
To solve this technical problem, the present application provides a data processing method, apparatus, storage medium, and program product. The data processing method according to the embodiments of the present application can increase the probability that data belonging to the same community flows into the same node, reduce the adverse effects of the "cluster splitting problem", and improve the streaming clustering effect.
The data processing method according to the embodiments of the present application may be applied to a distributed system, and fig. 4 illustrates an exemplary application scenario according to an embodiment of the present application. A receiving device and a plurality of cluster nodes (the 1st to Nth cluster nodes) are arranged in the distributed system; the receiving device can serve as the master device of the distributed system, and the cluster nodes as its sub-devices. The receiving device, the sending device, and the cluster nodes may be any type of device, including but not limited to a smartphone, a personal computer, a tablet computer, etc. The receiving device may store a dynamic index, which may include a plurality of index partitions and may indicate the different index partitions and the attribute information corresponding to each. The dynamic index can be used to determine the index partition to which received data belongs and to send data belonging to the same index partition to the same sub-device for clustering; the dynamic index can be updated as data is received, to increase the probability that data belonging to the same community flows into the same node. The manner in which the dynamic index is created and updated is described later.
Table 1 shows an example of a data structure of a dynamic index according to an embodiment of the present application.
TABLE 1
Partition    1        2        ...    Q
Attribute    M1;D1    M2;D2    ...    MQ;DQ
As shown in Table 1, the dynamic index includes index partitions and attributes, one index partition corresponds to only one cluster node, and one cluster node may correspond to multiple partitions. For example, when there are N cluster nodes (N ≧ 1 and integer), and the maximum number of index partitions is W (W ≧ N and integer), partition 1 may represent the first index partition, which may be set to correspond to the 1 st cluster node, for example; partition 2 may represent a second index partition, which may be set, for example, to correspond to cluster node 2; partition Q may represent a Qth index partition, where N ≦ Q ≦ W and is an integer, and may be set, for example, to correspond to the Nth cluster node.
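As an illustrative sketch (all names are hypothetical, not from the patent), the data structure of Table 1 could be held as follows, where each index partition records its core M, its compactness D, and the cluster node it corresponds to:

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    core: tuple        # core M: the "center position" of the index partition
    compactness: float # compactness D: how tightly the partition's data aggregate
    node: int          # cluster node this partition corresponds to

@dataclass
class DynamicIndex:
    partitions: dict = field(default_factory=dict)  # partition id -> Partition

# Two partitions as in Table 1: partition 1 -> cluster node 1, partition 2 -> node 2.
index = DynamicIndex()
index.partitions[1] = Partition(core=(0.0, 0.0), compactness=1.0, node=1)
index.partitions[2] = Partition(core=(5.0, 5.0), compactness=0.5, node=2)
```

Note that, as stated above, several partition ids may map to the same `node` value, but each partition maps to exactly one node.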
The attributes may include, for example, a core M that describes the "center position" of the index partition, and a compactness D that describes the degree of aggregation, or tightness, of all data within the index partition. For example, when the data belonging to a certain partition is shaped roughly like a circle in the coordinate system, the center of the circle may indicate the core M, and the radius of the circle may indicate the compactness D. Under this definition, the compactness value is inversely related to the degree of tightness. For example, when the quantity of data is unchanged, the smaller the radius (the smaller the compactness value), the greater the aggregation of all data within the index partition, and the tighter the data in the index partition, meaning the data are more closely related to each other, e.g., closer to each other; the larger the radius (the larger the compactness value), the less the aggregation of all data within the index partition, and the less tight the data, meaning the data are less closely related to each other, e.g., farther from each other. Each index partition includes a combination of a core and a compactness, such as the combination M1;D1, the combination M2;D2, and the combination MQ;DQ shown in Table 1.
In the application scenario of fig. 4, for an index partition composed of several pieces of data, one possible definition of the core M is the per-dimension average over all pieces of data, and one possible definition of the compactness D is the average of the distances from each piece of data to the core M.
Specifically, assume that partition 1 includes two pieces of data, each including two dimensions a and b. In a two-dimensional coordinate system with dimensions a and b as coordinate axes, the position of the 1st piece of data may be, for example, (a1, b1), and the position of the 2nd piece of data may be, for example, (a2, b2). Then the average value A of dimension a is A = (a1 + a2)/2, the average value B of dimension b is B = (b1 + b2)/2, and the core M1 of partition 1 may be, for example, M1 = (A, B). The distance of the 1st piece of data to the core M1 may be, for example, d1 = sqrt((a1 - A)^2 + (b1 - B)^2), and the distance of the 2nd piece of data to the core M1 may be, for example, d2 = sqrt((a2 - A)^2 + (b2 - B)^2).
The compactness D1 of partition 1 may then be, for example, D1 = (d1 + d2)/2. Under this definition, the compactness value of a partition is positively correlated with the sum of the distances of all data belonging to the partition, and negatively correlated with the tightness of the partition, i.e., a smaller compactness value indicates that the data within the partition are spatially tighter.
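Under the definitions above (core as per-dimension average, compactness as average distance to the core), the calculation can be sketched as follows; the data values are illustrative:

```python
import math

def core(points):
    # Core M: the per-dimension average over all pieces of data in the partition.
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))

def compactness(points, m):
    # Compactness D: the average Euclidean distance from each piece of data to M.
    return sum(math.dist(p, m) for p in points) / len(points)

# Partition 1 with two pieces of two-dimensional data (a, b), as in the text.
data = [(1.0, 3.0), (3.0, 5.0)]
m1 = core(data)                 # M1 = ((1+3)/2, (3+5)/2) = (2.0, 4.0)
d1 = compactness(data, m1)      # both points lie sqrt(2) away from M1
```

`math.dist` (Python 3.8+) computes the Euclidean distance used in the example; other distance metrics could be substituted without changing the structure.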
It should be understood by those skilled in the art that the above core and compactness are merely one example of partition attributes in a dynamic index. The core M, the compactness D, the distance between data and an index partition, and the like may be quantified in other ways; for example, the core M may also be determined as a weighted average of the positions of the pieces of data. The present application is not limited thereto.
As shown in fig. 4, in one exemplary application scenario, a sending device may send streaming data to a receiving device. The receiving device may be configured to receive streaming data, and when each piece of data is received, may calculate association degrees of the data with different partitions of the dynamic index, and output the data to a cluster node corresponding to a partition with the strongest association degree, where a calculation manner of the association degree may refer to fig. 7 and related description below.
After determining the partition with the strongest association degree with the data, the piece of data can be attributed to the partition with the strongest association degree. The dynamic index is updated as the data for the partition changes. The dynamic index may be updated as described below with reference to fig. 8, 9, and 11. The receiving device may continue to determine the most strongly associated partition with the next piece of data based on the most recent dynamic index. So that the next piece of data can be output to the corresponding cluster node. By analogy, each piece of data in the streaming data can be output to a cluster node corresponding to the partition with the strongest association degree.
Each cluster node performs local clustering on the streaming data it receives. If a certain cluster node corresponds to a plurality of partitions, the data belonging to different partitions on the cluster node are locally clustered separately. The clustering algorithm used by each node for local clustering may be the same as the clustering algorithm corresponding to the aggregation measure used when generating the index partitions in the dynamic index.
The receiving device and the cluster node may be different computing units disposed on the same computer, that is, one computer may include the receiving device and/or at least one cluster node. The receiving device may also implement the functionality of a cluster node when its own computing power is sufficient. The setting mode of the receiving device and the cluster node is not limited.
An exemplary workflow of the data processing method according to the embodiment of the present application is described below with reference to the above exemplary application scenario. The workflow of the data processing method may be different according to whether the dynamic index exists on the receiving device. An example of a data processing method according to an embodiment of the present application when a dynamic index exists on a receiving device is described below with reference to fig. 5 to 11.
Fig. 5 illustrates an example of a data processing method according to an embodiment of the present application. As shown in fig. 5, in a possible implementation manner, a data processing method of the embodiment of the present application is executed by a receiving device, and includes:
S21, receiving data from the sending device.
For example, the data may be a piece of data in streaming data; the streaming data may be, for example, video, audio, images, or text, and a received piece of data may be one or more videos, one or more audio frames, one image or a group of images, etc. The sending device may, for example, transmit the streaming data to the receiving device.
S22, determining whether a dynamic index already exists on the receiving device.
And S23, under the condition that the dynamic index exists, selecting the partition to which the data belongs according to the dynamic index.
For example, the data received in step S21 may be the Xth piece of data received after the streaming data starts to arrive. Before this, according to the 1st piece of data to the (X-1)th piece of data, the receiving device has obtained a dynamic index (for an example manner of obtaining the dynamic index from the 1st piece of data to the (X-1)th piece of data, refer to fig. 12, fig. 5, and the related description thereof), and the dynamic index may include, for example, 4 partitions. FIG. 6 illustrates an example of a dynamic index according to an embodiment of the present application.
For example, the partition to which the data belongs may be a partition with the strongest association with the data, and the association degree between the data and the partition may be determined through aggregation measurement. The algorithm of the aggregation measurement may be, for example, the same as the algorithm used for local clustering on each cluster node. By the method, data which are to be clustered into the same community can be attributed to the same partition roughly, so that the data enter the same cluster node, and the problem of cluster splitting caused by the fact that the data which are to be clustered into the same community enter different cluster nodes is avoided.
In one possible implementation, the attributes of the respective partitions of the dynamic index include, for example, the core and the compactness, and the aggregation measure may determine the proximity of the data (the Xth piece of data) received in step S21 to the core of each partition. The proximity can be used as a quantitative expression of the "degree of association", for example, the Euclidean distance between the position of the Xth piece of data and the core in the two-dimensional coordinate system, or the Doppler distance between the Xth piece of data and the core, and the like. The aggregation measure may, for example, yield a number of distance measurement results equal to the number of partitions of the dynamic index, where each result value indicates the degree of association between the Xth piece of data and the corresponding partition. That is, the larger the value, the greater the distance between the position of the Xth piece of data and the core of the current partition, and the weaker the association between the Xth piece of data and the current partition; the smaller the value, the smaller that distance, and the stronger the association.
FIG. 7 illustrates an example of the degree of association of data with the various partitions in a dynamic index according to an embodiment of the application. In the example of fig. 7, among the plurality of distance measurement results, the result with the smallest value is obtained when the position of the Xth piece of data is measured against the core M1 of partition 1, indicating that the association between the Xth piece of data and partition 1 is the strongest. In this case, partition 1 may be selected as the partition to which the data belongs. If a plurality of measurement results share the smallest value, one of the corresponding partitions may be selected as the partition to which the Xth piece of data belongs.
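A minimal sketch of this selection step, assuming Euclidean distance as the aggregation measure (the partition cores and the data point are invented for illustration):

```python
import math

def select_partition(x, cores):
    # Aggregation measure: Euclidean distance from the data x to each
    # partition core; the smallest distance marks the strongest
    # association. min() resolves ties to the first partition id seen.
    return min(cores, key=lambda pid: math.dist(x, cores[pid]))

# Four partition cores, echoing a 4-partition dynamic index.
cores = {1: (2.0, 4.0), 2: (10.0, 10.0), 3: (-5.0, 0.0), 4: (8.0, -3.0)}
x = (2.5, 3.5)                      # the Xth piece of data
selected = select_partition(x, cores)
```

The selected partition id is then looked up in the partition-to-node correspondence to decide which cluster node receives the data.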
And S24, outputting the data to the cluster node corresponding to the selected partition according to the corresponding relation between the selected partition and the cluster node.
For example, partition 1 may, for example, correspond to a1 st cluster node, partition 2 may, for example, correspond to a2 nd cluster node, partition 3 may, for example, correspond to a 3 rd cluster node, and partition 4 may, for example, also correspond to a 3 rd cluster node. When the partition to which the selected data belongs is partition 1, the reception apparatus may output the data received in step S21 to the 1 st cluster node.
And S25, updating the partition attribute in the dynamic index according to the data, or updating the partition mode and the partition attribute in the dynamic index.
In one possible implementation, after the partition to which the data belongs is selected, in step S25, the core and the compactness of the selected partition in the dynamic index may be updated.
FIG. 8 illustrates an example of updating the core and the compactness of a selected partition according to an embodiment of the present application. Taking the application scenario of fig. 8 as an example, where the data attributed to the multiple partitions of the dynamic index may be, for example, two-dimensional data with dimensions a and b, before the Xth piece of data arrives, several pieces of data (e.g., F pieces) may already be attributed to partition 1, and the core M1 and the compactness D1 of the partition are determined according to the F pieces of data. After the partition (partition 1) to which the Xth piece of data belongs is selected, the number of pieces of data belonging to partition 1 increases to F+1, in which case the core M1 and the compactness D1 of partition 1 may change. Based on the F+1 pieces of data, the core M1' and the compactness D1' of partition 1 may be recalculated. For the specific calculation manner, refer to the corresponding description of fig. 4 above.
After the computation of the core M1' and the compactness D1' is completed, the dynamic index may be updated based on the core M1' and the compactness D1'. The updated dynamic index has the same number of partitions as the original dynamic index. In the updated dynamic index, the core and the compactness corresponding to partition 1 are the core M1' and the compactness D1' calculated after the Xth piece of data is assigned to partition 1; the data assigned to the other partitions (partitions 2, 3, and 4) are unchanged, so the cores and compactness values of the other partitions are the same as those of the corresponding partitions in the original dynamic index.
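A sketch of this recalculation under the fig. 4 definitions, where F pieces of data become F+1 after the Xth piece of data is attributed to the partition (all data values are illustrative):

```python
import math

def update_partition(points, x):
    # Recompute the core M1' and compactness D1' after the Xth piece of
    # data joins the partition, i.e., F pieces of data become F + 1.
    points = points + [x]
    dims = len(x)
    m = tuple(sum(p[i] for p in points) / len(points) for i in range(dims))
    d = sum(math.dist(p, m) for p in points) / len(points)
    return m, d

f_pieces = [(1.0, 3.0), (3.0, 5.0)]          # F = 2 pieces already in partition 1
m1_new, d1_new = update_partition(f_pieces, (2.0, 4.0))  # Xth datum at the old core
```

Because the new datum here lands exactly on the old core, the core is unchanged while the compactness shrinks, matching the intuition that a well-placed datum makes the partition tighter.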
However, if the data differs greatly from the other data in the selected partition, for example, if the data is far away from the other data in the two-dimensional coordinate system, the aggregation of the data in the partition may decrease, so that after the partition to which the data belongs is selected, the compactness value of the selected partition becomes larger and the data become less tight. When the compactness value of a partition is too large, the accuracy of distributing data by means of the dynamic index may be affected.
The embodiment of the present application further sets a threshold condition for the attribute, judges the attribute of the partition to which the selected data belongs, and further processes the dynamic index when the attribute does not meet the requirement of the threshold condition, so as to ensure the accuracy of distributing data by means of the dynamic index.
In one possible implementation, after the partition to which the data belongs is selected, in step S25, the closeness of the selected partition in the dynamic index may be updated.
The manner of updating the compactness of the selected partition can refer to the example of fig. 8, where the data attributed to the multiple partitions of the dynamic index may be, for example, two-dimensional data with dimensions a and b. Before the Xth piece of data arrives, several pieces of data (for example, F pieces) may already be attributed to partition 1, and the core M1 and the compactness D1 of the partition are determined according to the F pieces of data. After the partition (partition 1) to which the Xth piece of data belongs is selected, the number of pieces of data belonging to partition 1 increases to F+1, in which case the core M1 and the compactness D1 of partition 1 may change. Based on the F+1 pieces of data, the compactness D1' of partition 1 may be recalculated. For the specific calculation manner, refer to the corresponding description of fig. 4 above.
In a possible implementation manner, after the computation of the compactness D1' is completed, step S25 further judges the compactness value. According to the definition of the compactness in the related description of fig. 4, the smaller the compactness value, the tighter the data in the partition. When the computed compactness is less than or equal to a first threshold, the accuracy of distributing data according to the current dynamic index is considered high, and the partition manner of the dynamic index does not need to be changed. In this case, the core of the partition to which the selected data belongs may be further updated to complete the update of the dynamic index.
For example, the first threshold U may be preset, and when the compactness is less than or equal to the first threshold U, it may be considered that after the Xth piece of data is assigned to partition 1, all data assigned to the partition are close enough and the association between the data is strong enough. In this case, attributing the Xth piece of data to partition 1 is considered feasible, and the current dynamic index is highly accurate. After the partition to which the data belongs is selected, the core M1' of the partition may be calculated according to the data belonging to the selected partition; for the calculation manner, refer to the example of fig. 8 and the related description of fig. 4. The compactness D1' of the selected partition has already been calculated, so after the calculation of the core M1' is completed, the dynamic index can be updated based on the core M1' and the compactness D1'. The updated dynamic index has the same number of partitions as the original dynamic index. In the updated dynamic index, the core and the compactness corresponding to partition 1 are the core M1' and the compactness D1' calculated after the Xth piece of data is assigned to partition 1; the data assigned to the other partitions (partitions 2, 3, and 4) are unchanged, so the cores and compactness values of the other partitions are the same as those of the corresponding partitions in the original dynamic index.
In a possible implementation manner, after the judgment on the compactness value is completed and the compactness is determined to be greater than the first threshold U, the current dynamic index is considered to be of lower accuracy. In this case, step S25 further includes steps S31 to S32 below to process the dynamic index and update the partition manner and the partition attributes in the dynamic index.
And S31, splitting the selected partition in the dynamic index into a plurality of partitions.
Taking splitting into two partitions as an example, fig. 9 shows an example of a partition splitting manner according to an embodiment of the present application. As shown in fig. 9, when the selected partition is partition 1, any two pieces of data may be used as possible partition cores according to the F pieces of data and the X-th piece of data in the selected partition (partition 1). The two partition cores are firstly respectively allocated to two partitions (for example, the partition 1-1 and the partition 1-2), and then the rest data are respectively allocated to the two partitions, and the partition cores are changed along with the increase of the data quantity in the two partitions. Until the distribution of all the F pieces of data and the X piece of data is completed. The partition splitting may be implemented according to the prior art, and the partition splitting manner is not limited in the present application, for example, the partition splitting manner is split into two partitions, or into more than two partitions, and the like.
And S32, respectively calculating the core and the compactness of each divided partition, and updating the dynamic index.
The manner in which the core and the compactness are calculated can be referred to in the description of fig. 4 above. After the core and compactness calculations are completed, the dynamic index may be updated based on them. Taking the example that the selected partition (partition 1) is split into 2 partitions (partition 1-1 and partition 1-2), the updated dynamic index has one more partition than the original dynamic index. In this case, partition 1-1 obtained after partition 1 of the dynamic index is split may be used as partition 1 in the updated dynamic index; partitions 2, 3, and 4 of the dynamic index may be used as partitions 2, 3, and 4 in the updated dynamic index, respectively; and partition 1-2 resulting from the splitting of partition 1 may serve as partition 5 in the updated dynamic index. The core and the compactness corresponding to partition 1 are the calculated core and compactness of partition 1-1, and the core and the compactness corresponding to partition 5 are the calculated core and compactness of partition 1-2; the data belonging to the other partitions (partitions 2, 3, and 4) are unchanged, so the cores and compactness values of the other partitions are the same as those of the corresponding partitions in the original dynamic index.
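The patent leaves the splitting algorithm open (it "may be implemented according to the prior art"). One possible sketch, not the patent's prescribed method, seeds the two new partitions with the two farthest pieces of data, then attributes each remaining piece to the nearer seed while updating that side's core as its data quantity grows:

```python
import math

def split_partition(points):
    # Minimal two-way split sketch: take the two farthest pieces of data
    # as seed cores of partitions 1-1 and 1-2, then attribute each
    # remaining piece to the nearer core, recomputing that side's core
    # as the data quantity in it increases.
    s0, s1 = max(((p, q) for p in points for q in points),
                 key=lambda pq: math.dist(*pq))
    groups = {0: [s0], 1: [s1]}
    cores = {0: s0, 1: s1}
    for p in points:
        if p == s0 or p == s1:
            continue
        side = min((0, 1), key=lambda s: math.dist(p, cores[s]))
        groups[side].append(p)
        g = groups[side]
        cores[side] = tuple(sum(q[i] for q in g) / len(g) for i in range(len(p)))
    return groups[0], groups[1]

pts = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (9.5, 9.8)]
g1, g2 = split_partition(pts)   # the two clumps of data separate cleanly
```

After the split, the core and compactness of each resulting partition are computed as in fig. 4 to populate the updated dynamic index.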
The partition 5 is a new partition of the dynamic index, and the correspondence between the partition 5 and the cluster node may be preset, for example, the 5 th partition (i.e., the partition 5) appearing in time sequence in the dynamic index is preset to correspond to a specific cluster node, for example, the 3 rd cluster node. The partition 5 and the cluster node may correspond to each other in a manner that a cluster node with a smaller number of corresponding partitions is preferentially selected, for example, if the 3 rd cluster node does not have a partition corresponding to the cluster node, the 3 rd cluster node may be preferentially selected as the cluster node corresponding to the partition 5. Those skilled in the art should understand that there are many setting ways for the corresponding way of the partition and the cluster node, and the application does not limit this.
After the receiving device selects the partition to which the data belongs according to a certain piece of data, the data is distributed to the cluster nodes corresponding to the partition, and the data of the same partition is locally clustered once on the cluster nodes, that is, the number of times of local clustering is associated with the total number of the partitions. If the number of the partitions is too large, the clustering frequency on the cluster nodes is greatly increased, and the clustering frequency may exceed the computing capability of the cluster nodes to influence the clustering effect.
The embodiment of the application further provides a threshold condition for setting the number of the partitions, judges the total number of the partitions after the partitions are split, and further processes the dynamic index when the total number of the partitions does not meet the requirement of the threshold condition, so that the number of the partitions is reduced, and the clustering effect is ensured.
In a possible implementation manner, after the partition splitting and the dynamic index updating are completed in steps S31 to S32, step S25 further judges the number of partitions after the splitting. When the total number of partitions after the splitting does not exceed the partition number threshold, it is considered that after data distribution is completed according to the partition manner of the current dynamic index, the cluster nodes can complete clustering with a good effect, and no further processing of the dynamic index is needed.
For example, the partition number threshold W may be preset according to the computing capability of the cluster node, for example, W =5 is set, before the partition splitting is completed in step S31, the number of partitions may be, for example, 4, and after the partition splitting, the total number of partitions is increased by one (5), in this case, the number of partitions is less than or equal to the partition number threshold W, it may be considered that the partition manner after the splitting is feasible, and the partition manner of the current dynamic index meets the requirement. In addition, in the partition mode of the current dynamic index, the dynamic index is already updated (step S32), so that the current dynamic index does not need to be processed again before the next piece of data is received.
In a possible implementation manner, after the determination on the total partition number after partition splitting is completed and it is determined that the total partition number is greater than the partition number threshold W, for example, when the partition number threshold W is set to 4, before partition splitting is completed in step S31, the total partition number may be, for example, 4, and after partition splitting, the total partition number is increased by one (5), in which case, the total partition number in the dynamic index is greater than the partition number threshold W, it may be considered that the current partition manner is not feasible, and the current dynamic index is less efficient. Step S25 further includes the following steps S33 to S35 to process the dynamic index and update the partition mode and the partition attribute in the dynamic index.
And S33, determining the proximity of every two partitions after the splitting according to the cores of the partitions of the dynamic index after the splitting.
The proximity may be, for example, a euclidean distance between kernels, a doppler distance, or the like. The proximity may also be, for example, a similarity between two partitions, such as a magnitude of a difference in the same dimension, or a characteristic common to both partitions. The calculation of the proximity may be implemented based on the prior art, and fig. 10 shows an example of the euclidean distance between the cores of each two partitions after partition splitting according to the embodiment of the present application.
It should be understood by those skilled in the art that the proximity degree is not limited to this, as long as the proximity degree of the two partitions can be determined by the related information of the two partitions, including but not limited to the attributes of the core, and the data belonging to the two partitions, and the determination manner of the proximity degree is not limited in the present application.
And S34, merging the two closest partitions into one partition.
For example, the two closest partitions may be the two partitions whose cores have the smallest Euclidean distance between them. FIG. 11 illustrates an example of partition merging according to an embodiment of the present application. As shown in fig. 11, taking splitting into two partitions as an example, the dynamic index before splitting includes partitions 1, 2, 3, and 4, and the dynamic index after splitting includes partitions 1, 2, 3, 4, and 5. After calculation, the Euclidean distance between the cores of partition 2 and partition 3 is found to be the smallest, that is, partition 2 and partition 3 are the closest, in which case partition 2 and partition 3 may be merged. The original partition 2 and partition 3 are merged into, for example, a new partition 2; the original partitions 1 and 4 remain partitions 1 and 4, and the original partition 5 becomes partition 3. In this case, the total number of partitions is equal to the partition number threshold, so the merged partition manner is considered feasible.
S35, calculating the core and the compactness of the merged partition, and updating the dynamic index.
For example, after the merging is completed, the core and the compactness of the merged partition may be calculated, and the calculated core and compactness may be used to update the attributes of the corresponding partition (partition 2) in the dynamic index. The attributes of the partition 3 and the partition 1 (i.e., the partition 5 and the partition 1 obtained by splitting) in the dynamic index are updated in step S32, and the data belonging to the partition 4 is not changed so that the attribute of the partition 4 is not changed, thereby completing the update of the dynamic index. In this case, the dynamic index includes 4 partitions (partitions 1, 2, 3, 4) and the corresponding attributes (kernel and compactness) of the 4 partitions. The data structure of the dynamic index can be referred to table 1 and the related description above.
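A sketch of the proximity measurement and merge selection, assuming Euclidean distance between cores as the proximity (the core values are illustrative, loosely matching the fig. 11 scenario in which partitions 2 and 3 merge):

```python
import math
from itertools import combinations

def closest_pair(cores):
    # Steps S33-S34 sketch: proximity as the Euclidean distance between
    # cores; return the pair of partition ids whose cores are closest.
    return min(combinations(cores, 2),
               key=lambda pair: math.dist(cores[pair[0]], cores[pair[1]]))

# Illustrative cores for the five partitions present after a split.
cores = {1: (0.0, 0.0), 2: (4.0, 4.0), 3: (4.5, 4.2), 4: (9.0, 1.0), 5: (0.0, 8.0)}
to_merge = closest_pair(cores)
```

The data of the two selected partitions are then pooled, and the merged partition's core and compactness are recomputed as in fig. 4 to complete the dynamic index update of step S35.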
In the above example, in the case of an existing dynamic index, each piece of received data is attributed to an existing partition, followed by the judgment of whether to split the partition and the judgment of whether to merge the split partitions. That is, if the partition with the strongest association with the data is split into two partitions in step S31, then after a certain piece of data is received and partition splitting is performed, the total number of partitions may be exactly equal to the partition number threshold. If the next piece of data is received and it is determined according to the data processing method of the embodiment of the present application that a partition can be split, the total number of partitions after splitting exceeds the partition number threshold by exactly one. Therefore, after the two closest partitions are merged into one partition in step S34, the total number of partitions again equals the partition number threshold, and there is no need to compare the merged total number of partitions with the threshold.
In a possible implementation manner, after the partition with the strongest association with the data in the dynamic index is split into multiple partitions in step S31, the core and the compactness of each partition under the current partition manner may not be calculated first; instead, the total number of partitions after the splitting is calculated first, and whether the partitions in the dynamic index need to be merged is determined according to the relationship between the total number of partitions and the preset partition number threshold. In the case that no merging is needed, for example, the total number of partitions is less than or equal to the partition number threshold, the cores and the compactness values of the split partitions are then calculated, completing the update of the dynamic index. In the case that merging is needed, for example, the total number of partitions is greater than the partition number threshold, the cores of the split partitions may be calculated first, steps S33 to S34 are performed to determine the proximity of every two partitions after the splitting and to merge partitions, and then the core and the compactness of the merged partition are calculated, completing the update of the dynamic index.
It should be understood by those skilled in the art that, on one hand, the core and the compactness of a partition may be calculated in real time as the partition manner changes. On the other hand, after new data is received and the partition with the strongest association with the data is determined, changes of partition manner such as partition splitting and partition merging may occur. In the case that it is not yet determined whether the changed partition manner is suitable (for example, whether the partition compactness is less than or equal to the first threshold and/or the total number of partitions is less than or equal to the partition number threshold), the core or the compactness under the changed partition manner need not be calculated immediately, as long as a suitable partition manner can be determined and the core and the compactness of the partitions are calculated after the suitable partition manner is determined.
In a possible implementation manner, if the partition to which the selected data belongs is split into multiple partitions in step S31, the total number of partitions after merging may be determined after step S35. If the total number still exceeds the partition number threshold, the operations of measuring proximity and merging partitions may be repeated until the number of partitions after merging is less than or equal to the partition number threshold. The partition merging manner can be set to correspond to the partition splitting manner, and the specific manner of partition merging is not limited in this application.
It should be understood by those skilled in the art that, in addition to being set according to the computing capability of the cluster node, the partition number threshold may also be set according to other parameters, such as the computing capability of the receiving device, as long as the partition number threshold can be set in advance before the receiving device receives the streaming data, and the setting manner of the partition number threshold is not limited in the present application.
In this way, when a dynamic index exists on the receiving device, data distribution and the updating of the dynamic index can be completed with reference to the dynamic index, so that the dynamic index can participate in the distribution of subsequent data.
An example of a data processing method according to an embodiment of the present application when no dynamic index exists on the receiving device is described below with reference to fig. 12.
Fig. 12 shows an example of a data processing method according to an embodiment of the present application. As shown in fig. 12, in a possible implementation manner, the data processing method according to the embodiment of the present application is executed by a receiving device, where the receiving device may be a terminal device or a server. The method can comprise the following steps:
S11, receiving data from the sending device. For a specific implementation manner, refer to step S21 above.
S12, determining whether the dynamic index exists on the receiving device.
S13, in a case that no dynamic index exists, creating a dynamic index.
For example, the data received in step S11 may be the 1st piece of data received after streaming data begins to arrive, in which case no dynamic index exists yet on the receiving device. A dynamic index may then be created, in which only one partition (partition 1) is included. The manner of setting the correspondence between partition 1 and a cluster node may refer to the setting of the correspondence between partition 5 and a cluster node described above; that is, partition 1 may be designated in advance to correspond to a specific cluster node, or partition 1 may correspond to the cluster node with the smallest number of corresponding partitions, and if none of the plurality of cluster nodes has a corresponding partition, any one of the plurality of cluster nodes may correspond to partition 1.
S14, outputting the data to the cluster node corresponding to the partition according to the correspondence between the partition and the cluster node.
For example, partition 1 may correspond to cluster node 1. Since the dynamic index includes only one partition, the receiving device may output the data received in step S11 to cluster node 1 corresponding to partition 1.
S15, updating the partition attributes in the dynamic index according to the data.
For example, since the dynamic index includes only one partition, that partition, for example partition 1, may be directly selected as the partition to which the data belongs. In this case, there is one and only one piece of data (the 1st piece of data) belonging to partition 1, and the attributes of partition 1 can be determined from that piece of data. Taking as an example that each piece of the streaming data includes two dimensions a and b, and the 1st piece of data is (a1, b1), the position of the 1st piece of data in the two-dimensional coordinate system with dimensions a and b as coordinate axes is also (a1, b1). Since there is only one piece of data in partition 1, the core M1 of partition 1 may be, for example, the position (a1, b1) of the 1st piece of data, and the compactness D1 of partition 1 may be, for example, 0. After the core M1 and the compactness D1 are determined, they may be stored in correspondence with partition 1, completing the update of the dynamic index. In this case, the dynamic index includes one partition (partition 1) and the corresponding attributes (core M1, compactness D1) of that partition. For the data structure of the dynamic index, refer to table 1 and the related description above.
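The creation in steps S13 to S15 can be sketched as follows. The `DynamicIndex` class and its dict-based layout are illustrative assumptions for this sketch, not the structure of table 1; only the rule stated above is taken from the description (with a single piece of data, the core is the data's position and the compactness is 0).

```python
# A minimal sketch of steps S13-S15: creating a dynamic index when the first
# piece of streaming data arrives. The class name and dict layout are
# assumptions made for illustration.

class DynamicIndex:
    def __init__(self):
        # partition id -> {"core": tuple, "compactness": float, "node": int}
        self.partitions = {}

    def create_first_partition(self, point, node):
        # With a single piece of data, the core M1 is the data's position
        # and the compactness D1 is 0 (see the description above).
        self.partitions[1] = {"core": point, "compactness": 0.0, "node": node}

index = DynamicIndex()
index.create_first_partition((0.3, 0.7), node=1)   # 1st piece of data (a1, b1)
print(index.partitions[1])
```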
In this way, when no dynamic index exists on the receiving device, the distribution of the data and the creation and update of the dynamic index can be completed, so that the dynamic index can participate in the distribution of subsequent data.
In one possible implementation, after steps S24 and S14, the data may be clustered locally on each cluster node. This process can be implemented with an existing technique such as the K-means clustering algorithm. The clustering algorithm adopted by each cluster node can be preset, and the receiving device performs its aggregation measurement for determining partition attributes based on the same preset clustering algorithm, so that the aggregation measurement on the receiving device and the local clustering on the cluster nodes use the same algorithm.
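The local clustering step can be sketched as follows. A deliberately tiny K-means (k = 2, naive initialization, a fixed number of iterations) stands in for whatever preset clustering algorithm the node actually uses; `sklearn.cluster.KMeans` would be a typical choice in practice. All names here are assumptions for illustration.

```python
# A hedged sketch of the local clustering a cluster node performs on one
# batch of data (the data of one time window). This is a toy K-means, not
# the patented method.

import math

def kmeans(points, k=2, iters=10):
    """Tiny K-means; centers are initialized to the first k points."""
    centers = points[:k]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # recompute each center as the mean of its cluster (keep old center
        # if a cluster happens to be empty)
        centers = [
            tuple(sum(p[d] for p in cl) / len(cl) for d in range(len(points[0])))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

centers, clusters = kmeans([(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)])
print(sorted(len(c) for c in clusters))   # two clusters of two points each
```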
Local clustering is typically performed on a batch of data, which may be the data from the receiving device that a cluster node receives within a time window. That is, for a certain piece of data flowing into a cluster node, when the cluster node clusters that piece of data, it may combine the data that flowed in before and after the inflow time of that piece of data to jointly complete the local clustering. The data targeted for local clustering may include, for example, all data flowing into the cluster node between the start and the end of a time window, and local clustering may be performed, for example, when the end of the time window is reached. Once obtained, the clustering result can be reported to a management node (not shown) or to other devices in the distributed system, such as the receiving device.
Fig. 13 illustrates an example of a time window according to an embodiment of the application. Common window forms are the rolling window (window 1 to window 4 to the right of "scrolling window" in fig. 13) and the sliding window (window 1 to window 3 to the right of "sliding window" in fig. 13). With a rolling window, the data (event stream) flowing into a cluster node in each time window does not overlap in inflow time with the data flowing into the same cluster node in other time windows; with a sliding window, the data flowing into a cluster node in each time window partially overlaps in inflow time with the data of the previous and next time windows. When a rolling window is used, after the data clustering of a time window is finished, the data of that window can be emptied, reducing the data storage cost of the cluster node. When a sliding window is used, any piece of data is clustered together with the data before and after it, which reduces the probability that data of the same community is clustered separately in two time windows and therefore fails to belong to the same partition and flow to the same cluster node, so the accuracy of the clustering result can be improved.
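The two window forms of fig. 13 can be contrasted in a short sketch: the same event stream is grouped into non-overlapping batches by a rolling (tumbling) window and into overlapping batches by a sliding window. The window size and step used here are illustrative assumptions.

```python
# A sketch of the rolling vs. sliding windows of fig. 13, applied to one
# event stream ordered by inflow time.

def tumbling_windows(events, size):
    """Rolling window: non-overlapping batches; each event is in one window."""
    return [events[i:i + size] for i in range(0, len(events), size)]

def sliding_windows(events, size, step):
    """Sliding window: adjacent windows share (size - step) events."""
    return [events[i:i + size] for i in range(0, len(events) - size + 1, step)]

stream = list(range(8))                 # events ordered by inflow time
print(tumbling_windows(stream, 4))      # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(sliding_windows(stream, 4, 2))    # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]
```

With the sliding window, events 2 and 3 appear in two windows, which is what reduces the chance that a community straddling a window boundary is clustered in two separate batches.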
In a possible implementation manner, streaming data continuously enters the receiving device, so the dynamic index on the receiving device is continuously updated, and newly arriving data is distributed based on the updated dynamic index. As a result, data that enters the receiving device earlier affects how data that enters later is distributed. If the earlier data is still timely relative to the later data, for example when the time difference between the two is small, the dynamic index determined from the earlier data can be considered to guide the distribution of the later data with high accuracy. If the earlier data is no longer timely relative to the later data, for example when the time difference between the two is large, the dynamic index determined from the earlier data guides the distribution of the later data with low accuracy; in that case, the dynamic index can be processed to improve the accuracy with which it guides data distribution.
For example, if part or all of the data used in the creation or update of the dynamic index is too old relative to the data received by the receiving device at the current time, that overly old data should not affect the current data distribution. By processing the dynamic index, a dynamic index that is no longer influenced by the old data can be obtained.
In addition to the partitions and attributes described above in table 1, the dynamic index may, for example, indicate the time of receipt of the data attributed to each partition. The processing of the dynamic index may be triggered by a user instruction or initiated actively by the receiving device. When triggered by a user, the user's instruction may, for example, instruct the receiving device to perform index reconstruction and update. According to the instruction, the receiving device may determine, as a time starting point, the time point that precedes the time when data was last received by a preset time threshold, and use the time when data was last received as a time ending point. It may then recreate the dynamic index from the first piece of data received after the time starting point, and update the dynamic index according to each piece of data received by the receiving device from the time starting point to the time ending point, one update per piece of data; for the specific creation and update manner, refer to fig. 12 and fig. 5. Alternatively, the data received by the receiving device from the time starting point to the time ending point may be clustered, the clustering result may be used as the partitions, and the parameters of the partitions may be determined in the exemplary manner described above. In either case, the reconstructed and updated dynamic index is one from which the influence of old data has been eliminated, and when this index is used for distribution, the distribution of data is not influenced by overly old data.
When the processing of the dynamic index is actively initiated by the receiving device, the receiving device may trigger the processing of the dynamic index at a certain frequency. The specific processing manner may be the same as that triggered by the user instruction.
Those skilled in the art will appreciate that there are many options for the way to remove the effect of old data in the dynamic index, and the application is not limited thereto.
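One of the many options mentioned above, rebuilding from a recent time window, can be sketched as follows: take the last-received time as the time ending point, subtract a preset time threshold to obtain the time starting point, and keep only the data in that interval for recreating the index. The `(timestamp, point)` record layout is a hypothetical choice for this sketch.

```python
# A sketch, under assumptions, of selecting the data used for index
# reconstruction: only data received between the time starting point and the
# time ending point influences the rebuilt dynamic index.

def data_for_rebuild(records, threshold):
    """records: list of (timestamp, point) tuples sorted by timestamp."""
    if not records:
        return []
    end = records[-1][0]      # time when data was last received (ending point)
    start = end - threshold   # preset threshold earlier (starting point)
    return [r for r in records if start <= r[0] <= end]

records = [(t, (float(t), float(t))) for t in range(10)]
recent = data_for_rebuild(records, threshold=3)
print([t for t, _ in recent])   # only the last 4 timestamps survive
```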
In a possible implementation manner, when the type of the streaming data to be received is known, and the receiving device has previously received streaming data of the same type and completed its distribution, or can obtain non-streaming data of the same type, a dynamic index may be created from that previously received data before the new streaming data is received. For example, suppose the streaming data to be received is the information of students enrolled in 2021, and the receiving device has previously received the information of students enrolled in 2020 and created a dynamic index for it. In this case, the dynamic index created from the 2020 information may be used as the initial dynamic index when the 2021 information is received, and the distribution of the newly received data and the update of the dynamic index are completed according to the newly received data. Alternatively, the initial dynamic index may be created from the clustering results of the 2020 information. Whether the old dynamic index or the old data can be used in the distribution of new data can be determined according to the timeliness of the data; they can be used on the premise that they still have timeliness. In this way, the accuracy with which the receiving device distributes data using the dynamic index can be improved.
In a possible implementation manner, whether the old dynamic index or the old data has timeliness may be determined actively by the receiving device, or the determination may be triggered by a user instruction. For a specific implementation of determining timeliness, refer to the description above; details are not repeated here.
According to the data processing method of the embodiment of the present application, which distributed node data flows to is determined by judging the partition to which the data belongs, so that data gathered in the same community flows to the same node. This alleviates the cluster splitting problem in a distributed system, improves the probability that data of the same community flows into the same node, and thus optimizes the clustering effect. The amount of calculation needed to judge the partition to which a piece of data belongs is far smaller than the amount of calculation needed to cluster the data, which reduces the data operation cost. Meanwhile, the scheme is simple: no additional feature design or rule design is needed, which reduces the complexity of data distribution. The judgment mechanism (dynamic index) for the partition to which data belongs is updated in real time along with the data, so it adapts well to different application scenarios; dynamically updating the dynamic index can improve the accuracy of partitioning and solves the problem that errors in a fixed partition mechanism cannot be corrected.
Fig. 14 shows an exemplary flowchart of a data processing method according to an embodiment of the present application. The method is applied to a distributed system, where the distributed system comprises an offloading guidance node and a plurality of processing nodes, the offloading guidance node stores a dynamic index, the dynamic index comprises a plurality of partitions, and each partition corresponds to one processing node. The method comprises the following steps:
S1401, the offloading guidance node receives data, and selects, from the plurality of partitions, a partition to which the data belongs according to the parameters of each partition in the dynamic index;
S1402, the offloading guidance node outputs the data to the processing node corresponding to the selected partition;
S1403, the offloading guidance node updates the dynamic index according to the data belonging to the selected partition.
The data processing method according to the embodiment of the present application may refer to fig. 5 and the related description; the offloading guidance node that executes the data processing method may correspond to the receiving device in the related description, and a processing node may correspond to a cluster node in the related description. Optionally, the offloading guidance node executing the data processing method may itself be one of the cluster nodes.
For an example of step S1401, refer to the description above and steps S21, S22 and S23 in fig. 5; for an example of step S1402, refer to the description above and step S24 in fig. 5. The dynamic index may be as in the example of fig. 6, the selected partition may be, for example, partition 1 in fig. 7, and the parameters of the partitions in the dynamic index may, for example, include the partition attributes in the related description of step S23 above, namely the core and the compactness in fig. 6. For an example of step S1403, refer to the description above and step S25 in fig. 5.
According to the data processing method of the embodiment of the present application, the partition to which data belongs can be selected from the plurality of partitions of the dynamic index according to the parameters of those partitions. Thus, when the data is output to the processing node corresponding to the selected partition according to the correspondence between partitions and processing nodes, data already belonging to the selected partition exists on that processing node, and data belonging to the same partition is distributed to the same node. After the partition to which the data belongs is selected, the dynamic index can be updated according to the data belonging to the selected partition, so that when subsequently received data is output to a processing node according to the updated dynamic index, the accuracy of data distribution is further improved. This solves the problem that errors in a fixed partition mechanism cannot be corrected, increases the probability that data belonging to the same community flows into the same node, reduces the adverse effect of the cluster splitting problem, and improves the streaming clustering effect and the adaptability to different application scenarios.
In one possible implementation, the parameters of each partition include a core indicating the center position of the partition and a compactness indicating the degree of closeness between the data attributed to the partition. For examples of the core and the compactness, refer to the core M and the compactness D in the related description of fig. 4 above.
By the method, the partition parameters can be used for reflecting the relevant information of the data of the partition, and the accuracy of selecting the partition to which the data belongs according to the partition parameters can be ensured.
In a possible implementation manner, the selecting, according to a parameter of each partition in the dynamic index, a partition to which the data belongs from the plurality of partitions includes: determining a proximity of the data to a core of each of the plurality of partitions; and selecting one partition with the core closest to the data as a partition to which the data belongs.
One example of determining the proximity may refer to fig. 7 and the related description above.
In this way, the partition attribution of the received data may be determined. The closer the data is to the core of a partition, the more accurately it belongs to that partition; moreover, measuring the proximity between a partition core and the data is simple, which reduces the complexity of selecting the partition to which the data belongs.
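The selection above can be sketched in a few lines: compute the proximity of the incoming data to each partition's core and pick the nearest. Euclidean distance is used here as an assumption; the description does not fix the proximity measure.

```python
# A minimal sketch of selecting the partition to which a piece of data
# belongs: the partition whose core is closest to the data wins.

import math

def select_partition(point, cores):
    """cores: dict mapping partition id -> core position (tuple)."""
    # smaller distance = closer proximity to the partition's core
    return min(cores, key=lambda pid: math.dist(point, cores[pid]))

cores = {1: (0.0, 0.0), 2: (10.0, 10.0)}
print(select_partition((1.0, 2.0), cores))   # partition 1 is nearest
```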
In a possible implementation manner, the updating the dynamic index according to the data belonging to the selected partition includes: updating the core and the compactness of said selected partition according to data attributed to said selected partition.
An exemplary implementation of which can be found in relation to the description above and in relation to fig. 8.
In this way, the parameters of the partition can be updated without changing the partition mode, making the update of the dynamic index simpler. After the partition to which the data belongs is selected, the set of data belonging to the selected partition changes, and the core and the compactness of the selected partition are updated according to the changed data. When new data is subsequently received, the partition to which the new data belongs can then be selected according to the updated core and compactness; that is, whether the new data has a strong proximity to the selected partition is judged according to the updated core and compactness, improving the accuracy of data partitioning.
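The parameter update can be sketched as follows. Computing the core as the mean position of the partition's data and the compactness as the mean distance to the core are illustrative assumptions; the description only requires that both be recomputed from the data attributed to the partition.

```python
# A hedged sketch of updating a partition's core and compactness after a new
# piece of data is attributed to it. The formulas are assumptions for
# illustration, not the patented definitions.

import math

def update_partition(points):
    """points: all data currently attributed to the partition."""
    dims = len(points[0])
    # core: mean position of the partition's data
    core = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    # compactness: mean distance of the data to the core
    compactness = sum(math.dist(p, core) for p in points) / len(points)
    return core, compactness

core, compactness = update_partition([(0.0, 0.0), (2.0, 0.0)])
print(core, compactness)   # core midway between the two points
```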
In a possible implementation manner, the updating the dynamic index according to the data belonging to the selected partition includes: updating the compactness of said selected partition according to data belonging to said selected partition; determining that the updated closeness is greater than a first threshold, splitting the selected partition into a plurality of partitions in the dynamic index; and updating the core and the compactness of each partition according to the data belonging to each partition in the dynamic index.
For an example of the updated compactness, refer to the compactness D' above; for an example of the first threshold, refer to the first threshold U above. For determining that the updated compactness is greater than the first threshold and splitting the selected partition in the dynamic index into multiple partitions, refer to step S31 above and fig. 9, where the selected partition may be, for example, partition 1 in fig. 9, and the multiple partitions may be, for example, partitions 1-1 and 1-2 in fig. 9. For an example of updating the core and the compactness of each partition according to the data attributed to each partition in the dynamic index, refer to step S32 above and fig. 9.
Changing the partition mode through partition splitting changes the partition parameters accordingly, completing the update of the dynamic index. Because splitting is performed when the compactness does not meet the threshold condition, the compactness of the partitions of the dynamic index obtained after a splitting-based update is better, so the accuracy of selecting the partition to which data belongs according to the updated dynamic index is higher.
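The split can be sketched as follows: when the updated compactness exceeds the first threshold U, the partition is divided into two (cf. partitions 1-1 and 1-2 in fig. 9). Splitting at the median of the most spread-out dimension is an assumption made for this sketch; the description does not fix the splitting manner.

```python
# A sketch, under assumptions, of splitting a partition whose updated
# compactness exceeds the first threshold.

def maybe_split(points, compactness, threshold):
    """Return the partition's data unchanged, or split into two groups."""
    if compactness <= threshold:
        return [points]                 # partition mode unchanged
    # choose the dimension with the largest spread and split at its median
    dims = len(points[0])
    d = max(range(dims),
            key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    ordered = sorted(points, key=lambda p: p[d])
    mid = len(ordered) // 2
    return [ordered[:mid], ordered[mid:]]

parts = maybe_split([(0.0, 0.0), (0.1, 0.0), (9.0, 0.0), (9.2, 0.0)],
                    compactness=4.5, threshold=1.0)
print(len(parts))   # compactness exceeded the threshold, so two partitions
```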
In a possible implementation manner, the updating the dynamic index according to the data belonging to the selected partition further includes: determining that the total number of the partitions in the dynamic index is greater than a second threshold, and selecting a plurality of partitions in the dynamic index to merge; and updating the core and the compactness of the merged partition according to the data belonging to the merged partition.
An example of the second threshold may refer to the partition number threshold W in the above, determine that the total number of partitions in the dynamic index is greater than the second threshold, and select a plurality of partitions in the dynamic index to merge, which may refer to step S34 in the above and fig. 11; an example of updating the core and the compactness of the merged partition according to the data attributed to the merged partition may be seen above and in step S35 in fig. 11.
In this way, the partition mode and the partition parameters are changed, completing the update of the dynamic index. Partition merging is performed when the total number of partitions does not meet the threshold condition, so the number of partitions of the dynamic index obtained after a merging-based update is more optimal. The total number of partitions determines the number of partition parameters in the dynamic index, so when the number of partitions is more optimal, the time cost and the operation cost of selecting the partition to which data belongs according to those parameters are also more optimal. Thus, when the partition to which data belongs is selected according to the updated dynamic index, higher calculation efficiency can be achieved while keeping the accuracy sufficiently high.
In one possible implementation, selecting a plurality of partitions in the dynamic index to merge includes: determining the proximity of every two partitions in the dynamic index; the two partitions closest to each other are selected and merged into one partition.
For an example of determining the proximity of every two partitions in the dynamic index, refer to step S33 in fig. 10 and the description above. The merging manner of merging the two closest partitions into one partition may be used, for example, when the selected partition was split into two partitions.
In this way, partition merging may be accomplished such that the total number of partitions after merging is reduced to more closely approximate the requirements of the threshold condition.
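The merge can be sketched as follows: find the pair of partitions whose cores are closest, pool their data into one partition, and recompute that partition's parameters afterwards. Using core distance as the partition-to-partition proximity is an assumption consistent with step S33; the names below are illustrative.

```python
# A sketch, under assumptions, of merging the two closest partitions when
# the total number of partitions exceeds the second threshold.

import math
from itertools import combinations

def merge_nearest(partitions):
    """partitions: dict mapping partition id -> list of data points."""
    def core(points):
        dims = len(points[0])
        return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    # find the pair of partitions whose cores are closest to each other
    a, b = min(combinations(partitions, 2),
               key=lambda pair: math.dist(core(partitions[pair[0]]),
                                          core(partitions[pair[1]])))
    # pool the two partitions' data into one partition
    partitions[a] = partitions[a] + partitions.pop(b)
    return partitions

parts = {1: [(0.0, 0.0)], 2: [(1.0, 0.0)], 3: [(9.0, 9.0)]}
print(sorted(merge_nearest(parts)))   # partitions 1 and 2 were merged
```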
In one possible implementation, the method further includes:
determining whether the data attributed to the partitions of the dynamic index has timeliness; and when part or all of the data loses timeliness, deleting the dynamic index, or updating the dynamic index according to the data that has timeliness.
For example, the timeliness of the data and the updating of the dynamic index according to the timeliness data may refer to the related description after fig. 13.
By the method, adverse effects of old data on the dynamic index can be removed, and the accuracy of selecting the partition to which the data belongs according to the dynamic index is further improved.
Fig. 15 shows an exemplary structural diagram of a data processing apparatus according to an embodiment of the present application.
An embodiment of the present application provides a data processing apparatus. The apparatus is applied to a distributed system, where the distributed system includes an offloading guidance node and a plurality of processing nodes, the offloading guidance node stores a dynamic index, the dynamic index includes a plurality of partitions, and each partition corresponds to one processing node. As shown in fig. 15, the apparatus includes:
a partition determining module 101, disposed on the offloading guidance node, configured to receive data, and select a partition to which the data belongs from the plurality of partitions according to the parameters of each partition in the dynamic index;
a data output module 102, disposed on the offloading guidance node, and configured to output the data to a processing node corresponding to the selected partition;
and an index updating module 103, disposed on the offloading guidance node, configured to update the dynamic index according to the data belonging to the selected partition.
In one possible implementation, the parameters of each partition include a core indicating the center position of the partition and a compactness indicating the degree of closeness between the data attributed to the partition.
In a possible implementation manner, the selecting, according to the parameter of each partition in the dynamic index, a partition to which the data belongs from the plurality of partitions includes: determining a proximity of the data to a core of each of the plurality of partitions; and selecting one partition with the core closest to the data as a partition to which the data belongs.
In a possible implementation manner, the updating the dynamic index according to the data belonging to the selected partition includes: updating the core and the compactness of the selected partition according to the data attributed to the selected partition.
In a possible implementation manner, the updating the dynamic index according to the data belonging to the selected partition includes: updating the compactness of the selected partition according to the data belonging to the selected partition; determining that the updated compactness is greater than a first threshold, and splitting the selected partition in the dynamic index into a plurality of partitions; and updating the core and the compactness of each partition according to the data belonging to each partition in the dynamic index.
In a possible implementation manner, the updating the dynamic index according to the data belonging to the selected partition further includes: determining that the total number of the partitions in the dynamic index is greater than a second threshold, and selecting a plurality of partitions in the dynamic index to merge; and updating the core and the compactness of the merged partition according to the data belonging to the merged partition.
In one possible implementation, selecting a plurality of partitions in the dynamic index to merge includes: determining the proximity of every two partitions in the dynamic index; the two partitions closest to each other are selected and merged into one partition.
In one possible implementation, the apparatus further includes:
a timeliness determining module, disposed on the offloading guidance node, configured to determine whether the data attributed to the partitions of the dynamic index has timeliness;
and an index processing module, disposed on the offloading guidance node, configured to delete the dynamic index when part or all of the data loses timeliness, or update the dynamic index according to the data that has timeliness.
An embodiment of the present application provides a data processing apparatus, including: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above method when executing the instructions.
Embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
Embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
Fig. 16 shows an exemplary structural diagram of a data processing apparatus according to an embodiment of the present application.
The data processing apparatus may include at least one of a mobile phone, a foldable electronic device, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a Personal Digital Assistant (PDA), an Augmented Reality (AR) device, a Virtual Reality (VR) device, an Artificial Intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, a smart city device, and a server device. The embodiment of the present application does not particularly limit the specific type of the data processing apparatus.
The data processing apparatus may include a processor 110, a memory 121, and a communication module 160. It is to be understood that the illustrated structure of the embodiments of the present application does not constitute a specific limitation to the data processing apparatus. In other embodiments of the present application, the data processing apparatus may include more or fewer components than those shown, or some components may be combined, some components may be split, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processor (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), among others. The different processing units may be separate devices or may be integrated into one or more processors.
The processor may generate an operation control signal according to an instruction operation code and a timing signal, so as to control instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 may be a cache memory. The memory may hold instructions or data that the processor 110 has just used or uses frequently. If the processor 110 needs to use the instructions or data again, it can call them directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 110, and thereby increases the efficiency of the system.
The memory 121 may be used to store computer-executable program code, which includes instructions. The memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a data acquisition function and a data transmission function), and the like. The data storage area may store data (such as the dynamic index) created during use of the data processing apparatus, and the like. In addition, the memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 performs various functional methods of the data processing apparatus, or the data processing method described above, by executing instructions stored in the memory 121 and/or instructions stored in the memory provided in the processor.
The communication module 160 may be configured to receive streaming data from other apparatuses or devices (e.g., the sending device in this embodiment) through wireless or wired communication, and to output the streaming data to other apparatuses or devices (e.g., a cluster node in this embodiment). For example, it may provide solutions for wireless communication including WLAN (such as a Wi-Fi network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing.
The computer readable program instructions or code described herein may be downloaded to the respective computing/processing device from a computer readable storage medium, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present application may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, so as to implement aspects of the present application.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It is also noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by hardware (e.g., a Circuit or an ASIC) for performing the corresponding function or action, or by combinations of hardware and software, such as firmware.
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (19)

1. A data processing method, applied to a distributed system, wherein the distributed system comprises a distribution guide node and a plurality of processing nodes, the distribution guide node stores a dynamic index, the dynamic index comprises a plurality of partitions, and each partition corresponds to one processing node, and the method comprises:
the distribution guide node receives data, and selects the partition to which the data belongs from the plurality of partitions according to parameters of the partitions in the dynamic index;
the distribution guide node outputs the data to the processing node corresponding to the selected partition; and
the distribution guide node updates the dynamic index according to the data attributed to the selected partition.
2. The method of claim 1, wherein the parameters of each partition comprise a core and a closeness, the core indicating a center position of the partition, and the closeness indicating a degree of closeness between data attributed to the partition.
3. The method according to claim 2, wherein the selecting the partition to which the data belongs from the plurality of partitions according to the parameters of the partitions in the dynamic index comprises:
determining the distance between the data and the core of each of the plurality of partitions; and
selecting the partition whose core is closest to the data as the partition to which the data belongs.
4. The method according to any one of claims 1-3, wherein the updating the dynamic index according to the data attributed to the selected partition comprises:
updating the core and the closeness of the selected partition according to the data attributed to the selected partition.
5. The method according to any one of claims 1-3, wherein the updating the dynamic index according to the data attributed to the selected partition comprises:
updating the closeness of the selected partition according to the data attributed to the selected partition;
determining that the updated closeness is greater than a first threshold, and splitting the selected partition into a plurality of partitions in the dynamic index; and
updating the core and the closeness of each partition according to the data attributed to each partition in the dynamic index.
6. The method of claim 5, wherein the updating the dynamic index according to the data attributed to the selected partition further comprises:
determining that the total number of partitions in the dynamic index is greater than a second threshold, and selecting a plurality of partitions in the dynamic index to be merged; and
updating the core and the closeness of the merged partition according to the data attributed to the merged partition.
7. The method of claim 6, wherein the selecting the plurality of partitions in the dynamic index to be merged comprises:
determining the distance between every two partitions in the dynamic index; and
selecting the two closest partitions and merging them into one partition.
8. The method according to any one of claims 1-7, further comprising:
determining whether the data attributed to the partitions of the dynamic index has timeliness; and
when part or all of the data loses timeliness, deleting the dynamic index, or updating the dynamic index according to the data that still has timeliness.
9. A data processing apparatus, applied to a distributed system, wherein the distributed system comprises a distribution guide node and a plurality of processing nodes, the distribution guide node stores a dynamic index, the dynamic index comprises a plurality of partitions, and each partition corresponds to one processing node, and the apparatus comprises:
a partition determining module, disposed on the distribution guide node, configured to receive data and select the partition to which the data belongs from the plurality of partitions according to parameters of the partitions in the dynamic index;
a data output module, disposed on the distribution guide node, configured to output the data to the processing node corresponding to the selected partition; and
an index updating module, disposed on the distribution guide node, configured to update the dynamic index according to the data attributed to the selected partition.
10. The apparatus of claim 9, wherein the parameters of each partition comprise a core and a closeness, the core indicating a center position of the partition, and the closeness indicating a degree of closeness between data attributed to the partition.
11. The apparatus of claim 10, wherein the selecting the partition to which the data belongs from the plurality of partitions according to the parameters of the partitions in the dynamic index comprises:
determining the distance between the data and the core of each of the plurality of partitions; and
selecting the partition whose core is closest to the data as the partition to which the data belongs.
12. The apparatus according to any one of claims 9-11, wherein the updating the dynamic index according to the data attributed to the selected partition comprises:
updating the core and the closeness of the selected partition according to the data attributed to the selected partition.
13. The apparatus according to any one of claims 9-11, wherein the updating the dynamic index according to the data attributed to the selected partition comprises:
updating the closeness of the selected partition according to the data attributed to the selected partition;
determining that the updated closeness is greater than a first threshold, and splitting the selected partition into a plurality of partitions in the dynamic index; and
updating the core and the closeness of each partition according to the data attributed to each partition in the dynamic index.
14. The apparatus of claim 13, wherein the updating the dynamic index according to the data attributed to the selected partition further comprises:
determining that the total number of partitions in the dynamic index is greater than a second threshold, and selecting a plurality of partitions in the dynamic index to be merged; and
updating the core and the closeness of the merged partition according to the data attributed to the merged partition.
15. The apparatus of claim 14, wherein the selecting the plurality of partitions in the dynamic index to be merged comprises:
determining the distance between every two partitions in the dynamic index; and
selecting the two closest partitions and merging them into one partition.
16. The apparatus according to any one of claims 9-15, wherein the apparatus further comprises:
a timeliness determining module, disposed on the distribution guide node, configured to determine whether the data attributed to the partitions of the dynamic index has timeliness; and
an index processing module, disposed on the distribution guide node, configured to delete the dynamic index when part or all of the data loses timeliness, or to update the dynamic index according to the data that still has timeliness.
17. A data processing apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1-8 when executing the instructions.
18. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1-8.
19. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which, when run in an electronic device, causes a processor in the electronic device to perform the method of any one of claims 1-8.
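The method of claims 1-8 can be sketched as a small routine: each partition keeps a core (a center position) and a closeness value, an incoming datum is routed to the partition with the nearest core, the selected partition's parameters are updated, a partition whose closeness exceeds a first threshold is split, and the two nearest partitions are merged when the partition count exceeds a second threshold. This is an illustrative sketch only: the class and method names, the use of Euclidean distance, and the incremental centroid/closeness update formulas are assumptions, not the patent's actual algorithm.

```python
import math


class Partition:
    """One partition of the dynamic index: a core (running centroid)
    plus a closeness value (mean distance of its data to the core)."""

    def __init__(self, point):
        self.core = list(point)
        self.count = 1
        self.closeness = 0.0  # a single point is maximally close to its core


def _dist(a, b):
    # Euclidean distance (an assumed metric; the patent does not fix one).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class DynamicIndex:
    """Hypothetical sketch of the claimed dynamic index."""

    def __init__(self, split_threshold=5.0, max_partitions=8):
        self.partitions = []
        self.split_threshold = split_threshold  # "first threshold" (claim 5)
        self.max_partitions = max_partitions    # "second threshold" (claim 6)

    def route(self, point):
        """Select the partition the datum belongs to, update the index,
        and return the partition's position in the list."""
        if not self.partitions:
            self.partitions.append(Partition(point))
            return 0
        # Claim 3: pick the partition whose core is closest to the datum.
        idx = min(range(len(self.partitions)),
                  key=lambda i: _dist(point, self.partitions[i].core))
        p = self.partitions[idx]
        d = _dist(point, p.core)
        # Claim 4: incrementally update the core and closeness.
        p.count += 1
        p.core = [c + (x - c) / p.count for c, x in zip(p.core, point)]
        p.closeness += (d - p.closeness) / p.count
        # Claim 5: split a partition that has grown too loose.
        if p.closeness > self.split_threshold:
            self._split(idx)
        # Claim 6: cap the total number of partitions by merging.
        if len(self.partitions) > self.max_partitions:
            self._merge_closest()
        return idx

    def _split(self, idx):
        # Crude split: spawn a sibling offset from the core; a real system
        # would re-cluster the partition's data (claim 5).
        p = self.partitions[idx]
        self.partitions.append(Partition([c + p.closeness for c in p.core]))
        p.closeness /= 2

    def _merge_closest(self):
        # Claim 7: merge the two partitions whose cores are nearest.
        best = None
        for i in range(len(self.partitions)):
            for j in range(i + 1, len(self.partitions)):
                d = _dist(self.partitions[i].core, self.partitions[j].core)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        a, b = self.partitions[i], self.partitions.pop(j)
        total = a.count + b.count
        a.core = [(ca * a.count + cb * b.count) / total
                  for ca, cb in zip(a.core, b.core)]
        a.count = total
```

In a deployment matching claim 1, the returned partition index would select the processing node that receives the datum; the sketch omits the actual output to processing nodes and the timeliness-based deletion of claim 8.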
CN202110749898.2A 2021-07-02 2021-07-02 Data processing method, data processing apparatus, storage medium, and program product Pending CN115563097A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110749898.2A CN115563097A (en) 2021-07-02 2021-07-02 Data processing method, data processing apparatus, storage medium, and program product
PCT/CN2022/095268 WO2023273727A1 (en) 2021-07-02 2022-05-26 Data processing method and apparatus, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110749898.2A CN115563097A (en) 2021-07-02 2021-07-02 Data processing method, data processing apparatus, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN115563097A (en) 2023-01-03

Family

ID=84692499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110749898.2A Pending CN115563097A (en) 2021-07-02 2021-07-02 Data processing method, data processing apparatus, storage medium, and program product

Country Status (2)

Country Link
CN (1) CN115563097A (en)
WO (1) WO2023273727A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8886649B2 (en) * 2012-03-19 2014-11-11 Microsoft Corporation Multi-center canopy clustering
JP6972119B2 (en) * 2016-09-15 2021-11-24 オラクル・インターナショナル・コーポレイション Spatial change detector in stream data
CN106570104B (en) * 2016-11-01 2020-04-07 南京理工大学 Multi-partition clustering preprocessing method for stream data
CN108804556B (en) * 2018-05-22 2020-10-20 上海交通大学 Distributed processing framework system based on time travel and temporal aggregation query
US10983954B2 (en) * 2019-05-24 2021-04-20 Hydrolix Inc. High density time-series data indexing and compression

Also Published As

Publication number Publication date
WO2023273727A1 (en) 2023-01-05

Similar Documents

Publication Publication Date Title
US11514063B2 (en) Method and apparatus of recommending information based on fused relationship network, and device and medium
CN107305637B (en) Data clustering method and device based on K-Means algorithm
US9286312B2 (en) Data coreset compression
CN108959370B (en) Community discovery method and device based on entity similarity in knowledge graph
US11503033B2 (en) Using one or more networks to assess one or more metrics about an entity
US20220058503A1 (en) Accurate and interpretable rules for user segmentation
US10956976B2 (en) Recommending shared products
CN111460234A (en) Graph query method and device, electronic equipment and computer readable storage medium
CN112580733A (en) Method, device and equipment for training classification model and storage medium
CN110633717A (en) Training method and device for target detection model
CN111221827B (en) Database table connection method and device based on graphic processor, computer equipment and storage medium
CN112989170A (en) Keyword matching method applied to information search, information search method and device
CN115563097A (en) Data processing method, data processing apparatus, storage medium, and program product
CN111291092A (en) Data processing method, device, server and storage medium
CN112967044B (en) Payment service processing method and device
CN115293252A (en) Method, apparatus, device and medium for information classification
US11593014B2 (en) System and method for approximating replication completion time
US11599583B2 (en) Deep pagination system
Xu et al. Dm-KDE: dynamical kernel density estimation by sequences of KDE estimators with fixed number of components over data streams
CN115878989A (en) Model training method, device and storage medium
CN112364258A (en) Map-based recommendation method, system, storage medium and electronic device
CN113609378B (en) Information recommendation method and device, electronic equipment and storage medium
CN110298679A (en) A kind of method and apparatus calculating the distance between sample data
CN114239608B (en) Translation method, model training method, device, electronic equipment and storage medium
CN114547448B (en) Data processing method, model training method, device, equipment, storage medium and program

Legal Events

Date Code Title Description
PB01 Publication