CN113469242A

CN113469242A - Multithreading-based clustering data processing method and data processing equipment

Info

Publication number: CN113469242A
Application number: CN202110732165.8A
Authority: CN
Inventors: 吴昆临; 许秋子
Original assignee: Shenzhen Realis Multimedia Technology Co Ltd
Current assignee: Shenzhen Realis Multimedia Technology Co Ltd
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-10-01

Abstract

The application provides a multithreading-based clustering data processing method and data processing equipment, wherein the multithreading-based clustering data processing method comprises the following steps: configuring preset parameters, wherein the preset parameters are n groups of parameters, n is a positive integer greater than or equal to 2, and n threads respectively correspond to the n groups of parameters; the n threads respectively process the n groups of parameters to obtain n adjacent point groups; and the n threads respectively process the n adjacent point groups to finish clustering processing. The clustering data processing method and the data processing equipment based on multithreading improve the calculation speed of clustering data processing according to multithreading processing.

Description

Multithreading-based clustering data processing method and data processing equipment

Technical Field

The invention relates to the field of computer communication, in particular to a multithreading-based clustering data processing method and data processing equipment.

Background

Clustering analysis, also known as cluster analysis, is a statistical analysis method for studying classification problems (of samples or indices) and is also an important data mining technique. A cluster is composed of several patterns, each of which is typically a vector of measurements or a point in a multidimensional space. Cluster analysis is based on similarity: patterns in the same cluster are more similar to one another than to patterns in other clusters.

Cluster analysis divides data into clusters so that data with similar parameters (e.g., 3D points with similar positions) fall into the same cluster, and it is widely applied, for example in data mining and machine learning. There are many cluster analysis algorithms, some of which are based on the density of the data, such as the density-based clustering algorithm DBSCAN.

The computational complexity of the existing DBSCAN algorithm is O(N²), i.e., the square of the number of points N, so it becomes very time-consuming as the number of points increases.

Disclosure of Invention

The application provides a clustering data processing method and data processing equipment based on multithreading, which improve the calculation speed of clustering data processing according to multithreading processing.

Because the computational complexity of the existing DBSCAN algorithm is O(N²), i.e., the square of the number of points, and becomes very time-consuming as the number of points increases, the present application provides, in a first aspect, a multithreading-based clustering data processing method, which includes: configuring preset parameters, wherein the preset parameters are n groups of parameters, n is a positive integer greater than or equal to 2, and n threads respectively correspond to the n groups of parameters; the n threads respectively process the n groups of parameters to obtain n neighbor point groups; and the n threads respectively process the n neighbor point groups to complete the clustering processing. By using multithreaded processing, the method and the data processing equipment improve the calculation speed of clustering data processing.

Based on the first aspect of the embodiment of the present application, in a first implementation manner of the first aspect of the embodiment of the present application, the processing, by the n threads, the n neighboring point groups respectively to complete clustering includes: the n threads calculate n clusters according to the n neighbor point groups; and the n threads finish clustering processing according to the n clusters.

Based on the first implementation manner of the first aspect of the embodiment of the present application, in the second implementation manner of the first aspect of the embodiment of the present application, the completing, by the n threads, clustering according to the n clusters includes: when adjacent clusters are equal, the n threads complete the clustering processing.

Based on any one implementation manner of the first aspect of the embodiment of the present application to the second implementation manner of the first aspect of the embodiment of the present application, in a third implementation manner of the first aspect of the embodiment of the present application, the preset parameter includes: neighbor distance and minimum number per cluster.

A second aspect of the present application provides a data processing apparatus comprising: the configuration unit is used for configuring preset parameters, the preset parameters are n groups of parameters, n is a positive integer greater than or equal to 2, and n threads correspond to the n groups of parameters respectively; the first processing unit comprises n threads which respectively process the n groups of parameters to obtain n neighbor point groups; and the second processing unit comprises the n threads which respectively process the n adjacent point groups to finish clustering processing. The clustering data processing method and the data processing equipment based on multithreading improve the calculation speed of clustering data processing according to multithreading processing.

Based on the second aspect of the embodiment of the present application, in the first implementation manner of the second aspect of the embodiment of the present application, the second processing unit is specifically configured to calculate, by the n threads, n clusters according to the n neighboring point groups; and the n threads finish clustering processing according to the n clusters.

Based on the first implementation manner of the second aspect of the embodiment of the present application, in the second implementation manner of the second aspect of the embodiment of the present application, the second processing unit is specifically configured to, when there is equivalence between adjacent clusters, calculate that the clustering process is completed by the n threads.

Based on any one implementation manner of the second aspect of the embodiment of the present application to the second implementation manner of the second aspect of the embodiment of the present application, in a third implementation manner of the second aspect of the embodiment of the present application, the preset parameters include: neighbor distance and minimum number per cluster.

A third aspect of the embodiments of the present application provides a clustered data processing system based on multiple threads, where the clustered data processing system based on multiple threads includes: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor invokes the instructions in the memory to cause the multithreading-based clustering data processing system to execute the multithreading-based clustering data processing method as described in any one of the possible implementations of the first aspect of the present application.

A fourth aspect of the embodiments of the present application provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are executed by one or more processors, the computer-executable instructions may cause the one or more processors to perform the above-mentioned clustering data processing method based on multiple threads.

A fifth aspect of embodiments of the present application provides a computer program product, which is characterized in that the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, and the computer program includes program instructions, when the program instructions are executed by a processor, the program instructions cause the processor to execute the above clustering data processing method based on multiple threads.

Drawings

FIG. 1 is a schematic flow chart of a method for processing clustered data based on multiple threads in an embodiment of the present application;

FIG. 2 is a schematic diagram of a clustering data processing method based on multithreading according to an embodiment of the present application;

FIG. 3 is a functional block diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a clustering data processing apparatus based on multithreading in an embodiment of the present application.

Detailed Description

The embodiment of the application provides a multithreading-based clustering data processing method and related equipment, and aims to improve the calculation speed of clustering data processing according to multithreading processing.

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, or c may represent: a, b, c, a and b, a and c, b and c, or a and b and c, wherein a, b and c can be single or multiple. It is to be noted that "at least one item" may also be interpreted as "one or more item(s)".

It is noted that, in the present application, words such as "exemplary" or "for example" are used to mean exemplary, illustrative, or descriptive. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.

The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent the order or the particular limitation of the number of the devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.

Cluster analysis divides data into clusters so that data with similar parameters (e.g., 3D points with similar positions) fall into the same cluster, and it is widely applied, for example in data mining and machine learning. There are many cluster analysis algorithms, some of which are based on the density of the data, such as the density-based clustering algorithm DBSCAN. Compared with the K-means method, DBSCAN does not need to know in advance the number of clusters to be formed, can find clusters of any shape, and can identify noise points. DBSCAN is not sensitive to the order of the samples in the database, i.e., the input order of the patterns has little effect on the result. However, for boundary samples lying between clusters, the assignment may vary depending on which cluster is detected first.

In view of the above problem, the present application provides a multithreading-based clustering data processing method. Referring to FIG. 1, the method includes:

S101, configuring preset parameters, wherein the preset parameters are n groups of parameters, n is a positive integer greater than or equal to 2, and n threads correspond to the n groups of parameters respectively;

S102, the n threads respectively process the n groups of parameters to obtain n neighbor point groups;

S103, the n threads calculate n clusters according to the n neighbor point groups;

S104, the n threads complete the clustering processing according to the n clusters.

In this embodiment, the data processing device may be a terminal device, which is also referred to as a User Equipment (UE), a Mobile Station (MS), a Mobile Terminal (MT), and the like, and is a device that provides voice and/or data connectivity to a user, such as a handheld device or a vehicle-mounted device having a wireless connection function. The terminal device may also be referred to simply as a terminal. Currently, some examples of terminals are: a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a Mobile Internet Device (MID), a wearable device, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a wireless terminal in industrial control, a wireless terminal in self driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, a sensor, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), a handheld device with a wireless communication function, a computing device or other processing device connected to a wireless modem, an in-vehicle device, a wearable device, a terminal device in a 5G network, or a terminal device in a future evolved Public Land Mobile Network (PLMN), etc.

The following will explain the steps separately:

The technical solution provided by the embodiment of the application can be applied to various communication systems, such as: a Long Term Evolution (LTE) system, a fifth generation (5G) mobile communication system, a sixth generation (6G) mobile communication system, a wireless fidelity (WiFi) system, a short-range communication system, a satellite communication system, an internet of vehicles communication system, a non-terrestrial communication system, a future communication system, or a system in which multiple communication systems are integrated, which is not limited in the embodiment of the present application. Among them, 5G may also be referred to as New Radio (NR).

A neighbor distance eps and a minimum number MinPts per cluster are defined. Further defined are:

$$V = \{V_1, V_2, \ldots, V_n\}, \quad E = \{E_1, E_2, \ldots, E_n\}, \quad C = \{C_1, C_2, \ldots, C_n\}, \quad F = \{F_1, F_2, \ldots, F_n\}, \quad X = \{X_1, X_2, \ldots, X_n\}$$

where V is the set of points in the data (n points in total); E holds the list of neighbor points of each point, where Ei (i ∈ [1, n]) is the list for point Vi and contains $m_i$ points; C holds the cluster number of each point; F and X are state values needed by the subsequent calculation; and K is a further variable needed by the subsequent calculation, with an initial value of 0.

Multithreading refers to a technique for implementing concurrent execution of multiple threads in software or hardware. A computer with multithreading capability has hardware support for executing more than one thread at the same time, which improves overall processing performance. Systems with this capability include symmetric multiprocessors, multi-core processors, and chip-level multiprocessing or simultaneous multithreading processors. In a program, independently running program fragments are called "threads", and programming with them is called "multithreading".

Each thread i calculates the distance between its point Vi and every point in V and, if the distance is less than eps, adds the corresponding point to Ei. If Ei contains at least MinPts points, Vi is a core point: let Ci equal i and Fi equal 1. If Vi is not a core point, let Ci equal infinity and Fi equal 0. Xi is initialized to 0 for every point.
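The neighbor search and the initialization of C, F and X can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `neighbors_of` and `init_state`, the `workers` parameter, and 0-based indexing are assumptions, and each point is assumed to count itself as a neighbor because its distance to itself (0) is less than eps. In CPython, threads mainly illustrate the structure; real speedups would need a GPU or native threads.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def neighbors_of(i, V, eps):
    """Indices j whose point V[j] lies within eps of V[i] (V[i] itself included)."""
    return [j for j, q in enumerate(V) if math.dist(V[i], q) < eps]

def init_state(V, eps, min_pts, workers=4):
    n = len(V)
    # One task per point; each task produces only its own E[i], so no locking is needed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        E = list(pool.map(lambda i: neighbors_of(i, V, eps), range(n)))
    C = [math.inf] * n  # cluster number; infinity marks "not a core point"
    F = [0] * n         # 1 for points that take part in the first pass of S103
    X = [0] * n         # 0 for every point at the start
    for i in range(n):
        if len(E[i]) >= min_pts:  # core point
            C[i] = i
            F[i] = 1
    return E, C, F, X

V = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (10.0, 10.0)]
E, C, F, X = init_state(V, eps=0.6, min_pts=2)
print(E)  # [[0, 1], [0, 1, 2], [1, 2], [3]]
print(C)  # [0, 1, 2, inf]
print(F)  # [1, 1, 1, 0]
```

The isolated point at (10, 10) has only itself in its neighbor list, so it is not a core point and keeps an infinite cluster number.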

In computer programming, a basic concept is to control multiple tasks simultaneously. Many programming problems require that the program be able to suspend the work at hand, handle another problem, and then return to the main process. This can be achieved in a number of ways. Initially, programmers working in low-level machine languages wrote "interrupt service routines", and the main process was suspended by hardware-level interrupts. While this is a useful approach, the resulting programs are difficult to port, which creates another kind of costly problem. Interrupts are necessary for tasks with strict real-time requirements. However, for many other problems, it is only necessary to divide the problem into independently running program segments so that the entire program can respond to the user's request more quickly.

If Fi is equal to 1, the cluster number Ci of Vi is compared with the cluster number Cj of each neighbor point in Ei. If Ci is not equal to Cj, both Ci and Cj are set to the smaller of the two values, Xi is set to 1, and K is also set to 1.

If K is 1 after step S103, the values of F and X are interchanged, K is reset to 0, and step S103 is repeated. If K is 0 after step S103, clustering is complete: Vi belongs to the cluster corresponding to Ci, and if the value of Ci is infinite, Vi is noise and is omitted.
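The propagation pass of step S103 and the repeat-until-K-is-0 loop can be sketched as below. This is a literal transcription of the wording above under stated assumptions: the names `propagate` and `cluster` are hypothetical, indices are 0-based, and a sequential loop stands in for the n threads. The text does not say how concurrent writes to the same Cj are handled; on real parallel hardware the paired update would presumably need an atomic minimum or similar protection.

```python
import math

def propagate(E, C, F, X):
    """One pass of step S103; returns K (1 if any cluster number changed)."""
    K = 0
    for i in range(len(C)):  # in the text, one thread per point i
        if F[i] != 1:
            continue
        for j in E[i]:
            if C[i] != C[j]:
                C[i] = C[j] = min(C[i], C[j])  # both take the smaller number
                X[i] = 1
                K = 1
    return K

def cluster(E, C, F, X):
    """Repeat S103 until K stays 0; None marks noise (infinite cluster number)."""
    while True:
        K = propagate(E, C, F, X)
        if K == 0:
            break
        F, X = X, F  # interchange F and X, then repeat S103 with K reset to 0
    return [None if c == math.inf else c for c in C]

# Hand-built neighbor lists: points 0-2 form a chain, 3 is isolated, 4-5 are a pair.
E = [[0, 1], [0, 1, 2], [1, 2], [3], [4, 5], [4, 5]]
C = [0, 1, 2, math.inf, 4, 5]
F = [1, 1, 1, 0, 1, 1]
X = [0] * 6
print(cluster(E, C, F, X))  # [0, 0, 0, None, 4, 4]
```

In this example the first pass already equalizes every cluster number, and the second pass finds no difference, so K stays 0 and the loop ends.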

The algorithm flow of DBSCAN is as follows: first, a neighbor distance eps and a minimum number MinPts per cluster are defined; the adjacent points of each point (the points within distance eps) are found, and each point having at least MinPts adjacent points is marked as a core point; core points that are adjacent to one another are merged into one cluster; every other point belongs to the cluster of a core point if any of its adjacent points is a core point, and is marked as noise and omitted if none of its adjacent points is a core point.
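For comparison, the conventional single-threaded flow just described can be sketched as follows. The function name `dbscan`, the `NOISE` marker, and the breadth-first expansion are illustrative choices; the sketch assumes a strict "less than eps" test and a neighbor count that includes the point itself.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    n = len(points)
    # Brute-force neighbor search: N * N distance evaluations.
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) < eps]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in nbrs]
    labels = [NOISE] * n
    next_label = 0
    for s in range(n):
        if not core[s] or labels[s] != NOISE:
            continue  # only unlabeled core points start a new cluster
        labels[s] = next_label
        queue = deque([s])
        while queue:
            i = queue.popleft()
            for j in nbrs[i]:
                if labels[j] == NOISE:
                    labels[j] = next_label
                    if core[j]:  # only core points keep expanding the cluster
                        queue.append(j)
        next_label += 1
    return labels

pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (10.0, 10.0), (5.0, 5.0), (5.4, 5.0)]
print(dbscan(pts, eps=0.6, min_pts=2))  # [0, 0, 0, -1, 1, 1]
```

Points that are never reached from a core point keep the `NOISE` label, which corresponds to the "marked as noise and omitted" case.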

The disadvantage of the existing DBSCAN algorithm is that its computational complexity is O(N²), i.e., the square of the number of points, which becomes very time-consuming as the number of points increases.
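As a rough worked example of this growth, assume a brute-force neighbor search that evaluates the distance from every point to every point. The symbols W (total distance evaluations), p (number of threads) and T_p (evaluations per thread) are introduced here for illustration only, and ideal load balance is assumed:

```latex
\begin{aligned}
W(N) &= N \cdot N = N^2 \\
W(10^4) &= 10^8, \qquad W(10^5) = 10^{10} \\
T_p(N) &\approx \frac{N^2}{p}, \qquad T_N(N) \approx N
\end{aligned}
```

Ten times as many points means one hundred times as many distance evaluations. Spreading the work over threads does not reduce the total W(N); it only reduces the share each thread has to perform.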

The invention uses a multithreaded processor (such as a GPU or a multi-core CPU) to process all points in the data simultaneously and thereby accelerate the calculation. Referring to FIG. 2, each thread i calculates the distance between its point Vi and every point in V and, if the distance is less than eps, adds the corresponding point to Ei. If Ei contains at least MinPts points, Vi is a core point: let Ci equal i and Fi equal 1. If Vi is not a core point, let Ci equal infinity and Fi equal 0. Xi is initialized to 0. If Fi is equal to 1, the cluster number Ci of Vi is compared with the cluster number Cj of each neighbor point in Ei; if Ci is not equal to Cj, both Ci and Cj are set to the smaller of the two values, Xi is set to 1, and K is also set to 1. When K is 0, clustering is complete: Vi belongs to the cluster corresponding to Ci, and if the value of Ci is infinite, Vi is noise and is omitted.
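When the points really are processed concurrently, two threads may try to write the same cluster number at once. One way to sidestep this is a double-buffered variant, sketched below. Note that this is a swapped-in technique and not the in-place update described above: each task only reads the current numbers and returns a new value for its own point, the F and X flags are dropped, and "no change in a round" plays the role of K being 0. The name `cluster_points` and the choice to give a non-core point the smallest cluster number among its core neighbors are assumptions.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def cluster_points(V, eps, min_pts, workers=4):
    n = len(V)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Neighbor lists: task i reads V and produces only E[i].
        E = list(pool.map(
            lambda i: [j for j in range(n) if math.dist(V[i], V[j]) < eps],
            range(n)))
        core = [len(E[i]) >= min_pts for i in range(n)]
        C = [i if core[i] else math.inf for i in range(n)]

        def pull(i):
            # New number for point i: the smallest among itself and its core neighbors.
            if not core[i]:
                return C[i]
            return min(C[j] for j in E[i] if core[j])

        while True:
            new_C = list(pool.map(pull, range(n)))  # reads C, builds a fresh list
            if new_C == C:  # nothing changed in this round
                break
            C = new_C

    labels = []
    for i in range(n):
        if core[i]:
            labels.append(C[i])
        else:
            # Non-core point: join a neighboring core point's cluster, else noise.
            best = min((C[j] for j in E[i] if core[j]), default=math.inf)
            labels.append(None if best == math.inf else best)
    return labels

pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (10.0, 10.0), (5.0, 5.0), (5.4, 5.0)]
print(cluster_points(pts, eps=0.6, min_pts=2))  # [0, 0, 0, None, 4, 4]
```

The trade-off is that a cluster number travels only one neighbor hop per round, so long chains of points need more rounds than the in-place update, in exchange for reads and writes that never collide.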

The method in the embodiment of the present application has been described above; the apparatus in the embodiment of the present application is described below. Referring to FIG. 3, an embodiment of the data processing apparatus in the embodiment of the present application includes:

a configuration unit 301, configured to configure preset parameters, where the preset parameters are n sets of parameters, n is a positive integer greater than or equal to 2, and n threads correspond to the n sets of parameters respectively;

a first processing unit 302, comprising the n threads, which respectively process the n groups of parameters to obtain n neighbor point groups;

a second processing unit 303, comprising the n threads, which respectively process the n neighbor point groups to complete the clustering processing.

Optionally, the second processing unit 303 is specifically configured to calculate, by the n threads, n clusters according to the n neighboring point groups; and the n threads finish clustering processing according to the n clusters.

Optionally, the second processing unit is specifically configured to, when there is equality between adjacent clusters, complete clustering processing by the n threads.

The units can execute the multithreading-based clustering data processing method shown in any embodiment of FIG. 1, which is not described again here; the multithreaded processing increases the calculation speed of clustering data processing.

FIG. 3 above describes the data processing apparatus in the embodiment of the present application in detail from the perspective of modular functional entities; the apparatus for multithreading-based clustering data processing in the embodiment of the present application is described in detail below from the perspective of hardware processing.

FIG. 4 is a schematic structural diagram of an apparatus for multithreading-based clustering data processing according to an embodiment of the present application. The apparatus 400 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 410, a memory 420, and one or more storage media 430 (e.g., one or more mass storage devices) storing applications 433 or data 432. The memory 420 and the storage medium 430 may be transient or persistent storage. The program stored on the storage medium 430 may include one or more modules (not shown), each of which may include a series of instruction operations for the apparatus 400. Still further, the processor 410 may be configured to communicate with the storage medium 430 and to execute, on the apparatus 400, the series of instruction operations stored in the storage medium 430.

The apparatus 400 for multithreading-based clustering data processing may also include one or more power supplies 440, one or more wired or wireless network interfaces 450, one or more input-output interfaces 460, and/or one or more operating systems 431, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the apparatus architecture illustrated in FIG. 4 does not constitute a limitation on the apparatus for multithreading-based clustering data processing, which may include more or fewer components than those illustrated, combine some components, or arrange the components differently.

Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer-executable instructions for execution by one or more processors, for example, to perform the methods and steps of any of the embodiments of fig. 1 or 2 described above.

By way of example, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The disclosed memory components or memory of the operating environment described herein are intended to comprise one or more of these and/or any other suitable types of memory.

Another embodiment of the invention provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions which, when executed by a processor, cause the processor to perform the multithreading-based clustering data processing method of the above method embodiments, for example the methods and steps described above in either of the embodiments of FIG. 1 or FIG. 2.

Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform, and may also be implemented by hardware. With this in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer electronic device (which may be a personal computer, a server, or a network electronic device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.

The embodiments of the present application also provide a communication device, which includes one or more processors, one or more memories, and one or more transceivers (each including a transmitter Tx and a receiver Rx) connected via a bus. One or more transceivers are connected to one or more antennas. The one or more memories include computer program code. The transceiver may perform the functions of the receiving unit or the transmitting unit, and the transceiver may be a separate receiver and transmitter.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

In the examples provided herein, it is to be understood that the disclosed methods may be practiced otherwise than as specifically described without departing from the spirit and scope of the present application. The present embodiment is an exemplary example only, and should not be taken as limiting, and the specific disclosure should not be taken as limiting the purpose of the application. For example, some features may be omitted, or not performed.

The technical means disclosed in the present application is not limited to the technical means disclosed in the above embodiments, and includes technical means formed by any combination of the above technical features. It should be noted that, for those skilled in the art, without departing from the principle of the present application, several improvements and modifications can be made, and these improvements and modifications are also considered to be within the scope of the present application.

The method, the apparatus, the device and the storage medium for processing the clustering data based on multiple threads provided by the embodiment of the present application are introduced in detail, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understanding the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A multithreading-based clustering data processing method is characterized by comprising the following steps:

configuring preset parameters, wherein the preset parameters are n groups of parameters, n is a positive integer greater than or equal to 2, and n threads respectively correspond to the n groups of parameters;

the n threads respectively process the n groups of parameters to obtain n adjacent point groups;

and the n threads respectively process the n adjacent point groups to finish clustering processing.

2. The multithreading-based clustering data processing method of claim 1, wherein the n threads respectively process the n neighboring point groups to complete clustering processing comprises:

the n threads calculate n clusters according to the n neighbor point groups;

and the n threads finish clustering processing according to the n clusters.

3. The multithreading-based clustering data processing method of claim 2, wherein the n threads completing clustering processing according to the n clusters comprises:

and when the adjacent clusters are equal, the n threads calculate to finish the clustering processing.

4. The multithreading-based clustering data processing method according to any one of claims 1 to 3, wherein the preset parameters include: neighbor distance and minimum number per cluster.

5. A data processing apparatus, characterized by comprising:

the configuration unit is used for configuring preset parameters, the preset parameters are n groups of parameters, n is a positive integer greater than or equal to 2, and n threads correspond to the n groups of parameters respectively;

the first processing unit comprises n threads which respectively process the n groups of parameters to obtain n neighbor point groups;

and the second processing unit comprises the n threads which respectively process the n adjacent point groups to finish clustering processing.

6. The data processing apparatus according to claim 5, wherein the second processing unit is specifically configured to compute, by the n threads, n clusters from the n neighbor sets;

and the n threads finish clustering processing according to the n clusters.

7. The data processing apparatus according to claim 6, wherein the second processing unit is specifically configured to complete the clustering processing by the n threads when adjacent clusters are equal.

8. A multithreading-based clustered data processing system, the multithreading-based clustered data processing system comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;

the at least one processor invokes the instructions in the memory to cause the apparatus for multithreading-based clustering data processing to perform the method for multithreading-based clustering data processing as recited in any one of claims 1-4.

9. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1-4.

10. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 4.