CN115481199A - Data processing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115481199A
Authority
CN
China
Prior art keywords
data
processed
data processing
group
groups
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211276872.1A
Other languages
Chinese (zh)
Inventor
张春烽
张俊锋
刘伟业
冯闪
李登高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lianren Healthcare Big Data Technology Co Ltd
Original Assignee
Lianren Healthcare Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lianren Healthcare Big Data Technology Co Ltd filed Critical Lianren Healthcare Big Data Technology Co Ltd
Priority to CN202211276872.1A priority Critical patent/CN115481199A/en
Publication of CN115481199A publication Critical patent/CN115481199A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data

Abstract

The invention discloses a data processing method, a data processing device, data processing equipment and a storage medium. The method comprises the following steps: receiving two groups of data to be processed, and respectively determining a data processing group corresponding to each group of data to be processed; for each group of data to be processed, respectively sending the data processing group corresponding to the current data to be processed to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group; and determining whether to reserve the two groups of data to be processed based on the data processing result corresponding to each group of data to be processed and the number of the sub-data to be processed in each group of data to be processed. According to the embodiment of the invention, by applying the distributed system to the analysis of the difference between two populations, the total memory available to the computing system is enlarged, the analysis and processing of mass data are realized, and the data processing efficiency is improved.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and storage medium.
Background
With the increasing degree of social informatization, big data is gradually becoming an indispensable strategic resource. Every industry generates a large amount of data every day, so the processing and analysis of big data have become a competitive hot spot among industries.
Although there are statistical analysis tools for processing and analyzing big data, in the analysis of the significance of the difference between two populations, the existing method is usually carried out on one computer: the data are first read into the computer memory, and the relevant calculation steps are then completed in that memory.
The above method is usually limited to a single computer and to the memory capacity of that single computer; in the process of processing a large data set, memory overflow often occurs, which makes the data processing process unsmooth and inefficient.
Disclosure of Invention
The invention provides a data processing method, a device, equipment and a storage medium, which enlarge the data processing capacity and, by applying a distributed system to data analysis and processing, enable the calculation work on a large data set to be completed efficiently and smoothly.
In a first aspect, an embodiment of the present invention provides a data processing method, which is applied to a distributed system, where the distributed system includes multiple distributed nodes, and the method includes:
receiving two groups of data to be processed, and respectively determining a data processing group corresponding to each group of data to be processed; the data to be processed comprises a plurality of subdata to be processed, and the number of data processing groups corresponding to each group of data to be processed is the same;
for each group of data to be processed, respectively sending the data processing group corresponding to the current data to be processed to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group;
and determining whether to reserve two groups of data to be processed or not based on the data processing result corresponding to each group of data to be processed and the number of the subdata to be processed in each group of data to be processed.
In a second aspect, an embodiment of the present invention further provides a data processing apparatus, which is applied in data processing, and the data processing apparatus includes:
the first data processing module is used for receiving two groups of data to be processed and respectively determining a data processing group corresponding to each group of data to be processed;
the second data processing module is used for respectively sending the data processing groups corresponding to the current data to be processed to the corresponding distributed nodes for each group of data to be processed, so that the distributed nodes determine the data processing results of the corresponding data processing groups;
and the third data processing module is used for determining whether to reserve two groups of data to be processed or not based on the data processing result corresponding to each group of data to be processed and the number of the subdata to be processed in each group of data to be processed.
In a third aspect, an embodiment of the present invention further provides a data processing apparatus, where the apparatus includes: one or more processors; and a memory communicatively coupled to the one or more processors; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of the embodiments of the present invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer instructions, and the computer instructions are configured to enable a processor to implement any one of the data processing methods in the embodiment of the present invention when the computer instructions are executed.
According to the technical scheme of the embodiment of the invention, a distributed system is used for receiving two groups of data to be processed and respectively determining a data processing group corresponding to each group of data to be processed; for each group of data to be processed, the data processing group corresponding to the current data to be processed is respectively sent to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group; and whether to reserve the two groups of data to be processed is determined based on the data processing result corresponding to each group of data to be processed and the number of the sub-data to be processed in each group of data to be processed. By applying the distributed system to the analysis of the difference between two populations, the total memory available to the computing system is expanded, the analysis and processing of mass data are realized, and the data processing efficiency is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only some embodiments of the present invention, and are only used for explaining the present invention, but not for limiting the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention;
fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device implementing the data processing method according to the embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "target," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Before the technical solution of the embodiment of the present invention is introduced, an application scenario is first described in detail. The embodiment of the invention uses a distributed system to process two groups of data generated under different conditions and to analyze whether the different conditions have a significantly different influence on the data, for example whether antihypertensive drugs of different brands differ significantly in their effect on patients' blood pressure control, or whether different teaching methods have a significant influence on the grades of two classes. In determining whether two groups of data differ significantly, the method used is the T test, a commonly used method for testing whether the difference between two populations is significant by means of the T distribution curve. The T distribution curve is close to the normal distribution curve; its abscissa is the T statistic, which measures the deviation of the sample means obtained by repeatedly and randomly sampling the two populations. In the T test, a hypothesis-testing method is generally used: it is assumed that there is no difference between the two mutually independent populations, i.e. the original hypothesis is H_0: μ_1 = μ_2, and the alternative hypothesis is H_1: μ_1 ≠ μ_2, where μ_1 and μ_2 are the means of the two populations. Since the population data are too large, the sample means x̄_1 and x̄_2 are used instead of the population means for the computational analysis.
The probability corresponding to the T statistic is the P value. In the T distribution curve, the area between the curve and the abscissa is 1, and each T statistic corresponds to a probability, representing the probability that such a T value occurs when the original hypothesis holds.
With P = α as the boundary: when P < α, the P value is within the preset range, and the T value obtained by this sampling corresponds to a small-probability event or an extreme event. Since an extreme event has occurred on the premise that the original hypothesis is true, the original hypothesis H_0 is rejected and the alternative hypothesis H_1 is accepted, i.e. there is a significant difference between the two populations. Otherwise, the original hypothesis H_0 is accepted. Optionally, the size of α is set according to the actual situation, and this embodiment is not limited herein.
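As a concrete illustration of the hypothesis-testing logic described above, the following sketch in the R language (referenced later in this specification) compares two simulated samples with R's built-in t.test function and applies the P-value rule; the sample sizes, the simulated blood-pressure values and α = 0.05 are invented for illustration only and are not part of the embodiment.

    # Minimal R sketch of the T test / P value decision rule described above.
    # All data here are simulated and purely illustrative.
    set.seed(1)
    sample_a <- rnorm(100, mean = 135, sd = 10)  # blood pressure, drug A (hypothetical)
    sample_b <- rnorm(200, mean = 138, sd = 10)  # blood pressure, drug B (hypothetical)

    alpha <- 0.05
    res <- t.test(sample_a, sample_b)            # two-sided test of H_0: mu_1 == mu_2

    if (res$p.value < alpha) {
      cat("Reject H_0: the two populations differ significantly\n")
    } else {
      cat("Accept H_0: no significant difference detected\n")
    }

These two simulated vectors are reused in the sketches that follow as a running example.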
Example one
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention, where the present embodiment is applicable to a case where a large data set is analyzed and processed by using a distributed system, the method may be executed by a data processing apparatus, the apparatus may be implemented in a form of hardware and/or software, and the apparatus may be configured in a computer.
As shown in fig. 1, the method includes:
s110, receiving two groups of data to be processed, and respectively determining a data processing group corresponding to each group of data to be processed.
The data to be processed comprises a plurality of subdata to be processed, and the number of data processing groups corresponding to each group of data to be processed is the same.
The data to be processed refers to sample statistical data obtained by random sampling from the population. Further, there is no relationship between the subjects of the two populations. Illustratively, the blood pressure values of several patients are obtained by random sampling from the population taking antihypertensive drug A and are recorded as sample a, and the same operation is performed on the population taking antihypertensive drug B to obtain sample b. That is, the data to be processed are the blood pressure values in the two sets of samples; the subjects are the patients in the samples, and the subjects in the two samples are independent of each other; for example, a patient taking antihypertensive drug A cannot be taking antihypertensive drug B at the same time.
The data processing groups are obtained by grouping each group of data to be processed, and the two groups of data to be processed are divided into the same number of data processing groups. The number of data processing groups is set according to actual conditions, and this embodiment is not limited herein. Based on the above example, sample a and sample b are divided into the same number of data processing groups; taking sample a for explanation, sample a is divided into five data processing groups a1, a2, a3, a4 and a5. The number of sub-data to be processed in each processing group may be equal or unequal, and this embodiment is not limited herein.
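The grouping of S110 can be pictured with the following R sketch, which continues the running example above and splits each sample into the same number of data processing groups; the group count of five and the round-robin assignment are assumptions made only for illustration.

    # Split each sample into the same number of data processing groups (here 5),
    # mirroring groups a1..a5 of sample a in the example above. Round-robin
    # assignment is an arbitrary choice; group sizes need not be equal.
    n_groups <- 5
    make_groups <- function(x, k) split(x, rep(seq_len(k), length.out = length(x)))

    groups_a <- make_groups(sample_a, n_groups)  # list of 5 numeric vectors
    groups_b <- make_groups(sample_b, n_groups)
    sapply(groups_a, length)                     # sizes of groups a1..a5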
And S120, for each group of data to be processed, respectively sending the data processing group corresponding to the current data to be processed to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group.
The distributed nodes refer to computer servers used for analyzing and processing data.
Specifically, the data processing groups corresponding to the two groups of data to be processed are allocated to the corresponding distributed nodes, and a formula is set as required to calculate the corresponding result. On the basis of the above example, sample a and sample b are processed in the same manner; taking one data processing group of sample a as an example, a plurality of distributed nodes are preset, the data processing group a1 is sent to one of the nodes, the average of the sub-data to be processed in group a1 is calculated at that node, and finally the average obtained at that node is summarized together with the averages obtained at the other nodes.
Optionally, the distributed node processes at least one data processing group.
Specifically, one distributed node processes the data in at least one data processing group, which has the advantages of fully utilizing server resources, improving data processing efficiency and enlarging the data processing capacity. Optionally, one distributed node may also process a plurality of data processing groups; on the basis of the above example, groups a1 and a2 of the five data processing groups corresponding to sample a are sent to one distributed node for processing, so as to obtain the averages of the two data processing groups respectively.
In this arrangement, each node processes its corresponding data processing group, division of labor and cooperation among the nodes is realized, and the data processing efficiency of the distributed system is effectively improved.
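One minimal way to emulate S120 on a single machine is to let each worker of R's parallel package stand in for a distributed node; in a real deployment each data processing group would be dispatched to a separate server. The cluster size of two workers is an illustrative assumption, and both the group means and the group sums are computed so that either variant of the data processing result is available later in the running example.

    # Emulate distributed nodes with a local worker cluster (parallel package).
    # Each "node" receives one or more data processing groups and returns the
    # per-group result (first mean and/or accumulated value).
    library(parallel)

    cl <- makeCluster(2)                            # two local workers stand in for nodes
    first_means_a <- parSapply(cl, groups_a, mean)  # first mean of each group of sample a
    first_means_b <- parSapply(cl, groups_b, mean)
    group_sums_a  <- parSapply(cl, groups_a, sum)   # accumulated value of each group
    group_sums_b  <- parSapply(cl, groups_b, sum)
    group_sizes_a <- sapply(groups_a, length)
    group_sizes_b <- sapply(groups_b, length)
    stopCluster(cl)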
S130, determining whether two groups of data to be processed are reserved or not based on the data processing result corresponding to each group of data to be processed and the number of the sub data to be processed in each group of data to be processed.
Wherein, the data processing result is obtained according to a preset formula.
Specifically, a corresponding statistical result is calculated based on the data processing result obtained by each node and the number of sub-data to be processed in each group of data to be processed. The significance of the difference between the two groups of data to be processed is analyzed according to the statistical result, and it is finally determined whether to reserve the two groups of data to be processed. Illustratively, on the basis of the above embodiment, taking sample a as an example, the average value of each data processing group of sample a is sent from each node to one of the nodes, and the average value of the sample a population is calculated at that node. Finally, the variance, the T statistic and the P value are calculated at that node based on the sample capacities of sample a and sample b and the mean values of the two samples, and the significance of the difference between the samples is analyzed in combination with the theoretical knowledge of hypothesis testing to determine whether to keep the two groups of data to be processed.
According to the technical scheme provided by the embodiment of the invention, two groups of data to be processed are received by using a distributed system, and data processing groups corresponding to each group of data to be processed are respectively determined; for each group of data to be processed, respectively sending the data processing group corresponding to the current data to be processed to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group; and determining whether to reserve two groups of data to be processed or not based on the data processing result corresponding to each group of data to be processed and the number of the subdata to be processed in each group of data to be processed. The problem that the processing and analysis of the large data set are limited by a single computer and memory overflow occurs is solved, the calculation work of the large data set is efficiently and smoothly completed, and a reference method is provided for the processing and analysis of the related large data set in the future.
Example two
Fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention. On the basis of the foregoing embodiment, this embodiment further refines the step of sending, for each group of data to be processed, the data processing group corresponding to the current data to be processed to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group.
As shown in fig. 2, the method includes:
s210, receiving two groups of data to be processed, and respectively determining a data processing group corresponding to each group of data to be processed.
The data to be processed comprises a plurality of subdata to be processed, and the number of data processing groups corresponding to each group of data to be processed is the same.
S220, for each group of data to be processed, respectively sending the data processing group corresponding to the current data to be processed to the corresponding distributed node, so that the distributed node performs mean processing on each sub data to be processed in the data processing group to obtain a first mean value; or summing each sub-data to be processed in the data processing group to obtain an accumulated value of the data processing group.
If the above steps perform mean processing on each to-be-processed subdata of the data processing group, then S230 is executed;
if the accumulated value processing is performed on each sub-data to be processed in the data processing group, S240 is executed.
And S230, for each group of data to be processed, acquiring a first average value corresponding to the current group of data to be processed, and determining a target average value.
The first mean value refers to the mean value corresponding to each data processing group. On the basis of the above example, since the distributed nodes process each data processing group in the same manner, one data processing group is taken as an example: the sub-data to be processed in group a1 are summed at the corresponding distributed node and then averaged, giving the first mean value x̄_{11} corresponding to group a1; in the same way, the first mean value corresponding to the i-th data processing group of sample a is recorded as x̄_{1i}, where i indexes the data processing groups. The target mean value is the value obtained by summing the first mean values of all the data processing groups corresponding to a sample and then averaging. For example, since the distributed system processes the two samples in the same manner, taking sample a as an example, the first mean values corresponding to all data processing groups in sample a are summed and averaged to obtain the target mean value corresponding to sample a, recorded as x̄_1; similarly, the target mean value corresponding to sample b is recorded as x̄_2.
Optionally, the target mean value formula is:
x̄_1 = (Σ_i x̄_{1i}) / n and x̄_2 = (Σ_i x̄_{2i}) / n,
where n is the number of data processing groups corresponding to a group of data to be processed; Σ_i x̄_{1i} is the sum of the first mean values of the n data processing groups corresponding to the first group of data to be processed; and Σ_i x̄_{2i} is the sum of the first mean values of the n data processing groups corresponding to the second group of data to be processed.
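Continuing the running sketch, the target mean value under the formula above is simply the average of the n first means; note that this coincides with the ordinary sample mean only when all data processing groups have the same size, which holds for the illustrative round-robin grouping used here.

    # Target mean as the average of the n group first means (formula above).
    n <- length(first_means_a)              # number of data processing groups per sample
    target_mean_a <- sum(first_means_a) / n
    target_mean_b <- sum(first_means_b) / n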
S240, for each group of data to be processed, acquiring an accumulated value corresponding to the current group of data to be processed, and determining a target mean value based on the accumulated value.
The accumulated value is obtained by summing the sub-data to be processed in each data processing group. Based on the above example, since the distributed node processes each data processing group in the same manner, one group is taken as an example: the sub-data to be processed in group a1 are summed to obtain the accumulated value corresponding to group a1, recorded as Δx_{11}; in the same way, the accumulated value corresponding to each data processing group in sample a is recorded as Δx_{1i}, and the accumulated value corresponding to each data processing group in sample b is recorded as Δx_{2i}. The target mean value is the result obtained by summing the accumulated values corresponding to all data processing groups in a sample and then averaging based on the number of sub-data to be processed in that sample. For example, since the distributed node processes the two samples in the same manner, taking sample a as an example, the accumulated values corresponding to all data processing groups in sample a are summed and then divided by the number of sub-data in sample a to obtain the target mean value corresponding to sample a, recorded as x̄_1; similarly, the target mean value corresponding to sample b is recorded as x̄_2.
Optionally, the target mean value formula is:
x̄_1 = (Σ_i Δx_{1i}) / n_1 and x̄_2 = (Σ_i Δx_{2i}) / n_2,
where n_1 is the number of sub-data to be processed in the first group of data to be processed; n_2 is the number of sub-data to be processed in the second group of data to be processed; Σ_i Δx_{1i} is the sum of the accumulated values of the n data processing groups corresponding to the first group of data to be processed; and Σ_i Δx_{2i} is the sum of the accumulated values of the n data processing groups corresponding to the second group of data to be processed.
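The accumulated-value variant can be sketched in the same way; dividing the summed group values by the total number of sub-data recovers the ordinary sample mean regardless of the group sizes.

    # Target mean from the accumulated (summed) group values, divided by the
    # number of sub-data to be processed in each sample.
    n1 <- sum(group_sizes_a)                # sub-data count of the first sample
    n2 <- sum(group_sizes_b)                # sub-data count of the second sample
    target_mean_a <- sum(group_sums_a) / n1
    target_mean_b <- sum(group_sums_b) / n2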
And S250, determining the mean variance based on the target mean.
The mean variance is a parameter reflecting the degree to which the samples deviate from the means of the two populations, and is commonly denoted S². Based on the above example, the processing manner of each group of data to be processed is the same; taking sample a as an example, the mean variance of sample a reflects the degree of dispersion of each blood pressure value in sample a relative to the overall blood pressure mean of sample a and is recorded as S_1², and the mean variance of sample b is recorded as S_2².
The mean variance formula used in this embodiment is:
[formula image in the original: expressions for S_1² and S_2²]
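Since the specification gives the mean variance formula only as an image, the sketch below substitutes the ordinary unbiased sample variance as an assumption for illustration; it is computed here directly from the raw samples and the target means of the running example.

    # ASSUMPTION: the embodiment's mean variance formula is shown only as an
    # image, so the ordinary unbiased sample variance is used here instead,
    # purely for illustration.
    mean_variance <- function(x, target_mean) sum((x - target_mean)^2) / (length(x) - 1)
    S1_sq <- mean_variance(sample_a, target_mean_a)   # close to var(sample_a)
    S2_sq <- mean_variance(sample_b, target_mean_b)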
and S260, determining the target degree of freedom based on the number of the sub-data to be processed.
The target degree of freedom refers to the number of data values that can vary freely when the two groups of data to be processed are taken as a whole, and is represented by v. The target degree of freedom affects the degree of concentration of the T distribution curve: the greater the degree of freedom, the more concentrated the T distribution curve; the smaller the degree of freedom, the more dispersed the T distribution curve.
Specifically, in a sample, if the sample capacity is m, the corresponding degree of freedom of the sample is m-1. Illustratively, the sample capacity of the sample a is 100, the sample capacity of the sample b is 200, the degree of freedom of the sample a is 99, the degree of freedom of the sample b is 199, and further, the target degree of freedom corresponding to the two samples as a whole is 100-1+200-1=298.
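The target degree of freedom of the worked example, and its effect on the concentration of the T distribution curve, can be checked with R's qt quantile function: as v grows, the 97.5% quantile of the T distribution approaches the normal value of about 1.96.

    # Target degree of freedom for the two samples taken as a whole.
    v <- (n1 - 1) + (n2 - 1)          # 100 - 1 + 200 - 1 = 298 in the worked example

    # Larger v -> more concentrated T distribution curve.
    qt(0.975, df = c(5, 30, 298))     # roughly 2.57, 2.04, 1.97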
S270, determining a statistical result based on the target mean value, the mean variance, the target degree of freedom and the target function.
The target function is a function, in the R language, for calculating the statistical result P value from the T value and the target degree of freedom v. In this embodiment, the functions that can be used are:
Function 1: P = pt(abs(T), v) and/or Function 2: P = 2*pt(abs(T), v).
Function 1 is used for the single-tailed test, i.e. a one-sided test of the T distribution. Generally, when there is, based on common sense or other factors, good reason to assume a magnitude relation between the two population means, a one-sided test of the T distribution is carried out, and whether the two samples differ significantly can be obtained from the magnitude relation between the P value and α. Based on the above example, if analysis of the drug components by professionals shows that the effect of antihypertensive drug A is slightly better than that of antihypertensive drug B, there is good reason to assume that the relation between the two population means is μ_1 > μ_2. Owing to the randomness of sampling, the extreme case cannot be ruled out, namely that x̄_1 < x̄_2 and that the deviation between x̄_1 and x̄_2 is extremely large. Under this condition the part of the T distribution to the left of its axis of symmetry is inspected using the single-tailed test, i.e. a left one-sided test; otherwise, the right one-sided test is used.
Function 2 is used for the two-tailed test, i.e. both sides of the T distribution are inspected simultaneously. It is generally used when the magnitude relation between the two population means is unknown. After the statistical result P value is obtained, whether the two independent samples differ significantly can be concluded from the magnitude relation between the P value and α, and when the amount of data is large enough, the relative magnitude of the two population means can be judged from the sign of the T statistic.
Specifically, the distributed system performs distributed calculation, the results are collected at a distributed node, the sample capacities n_1 and n_2 of the two samples and the target degree of freedom v are calculated at that node, the T statistic is calculated based on the processing results and the mean variances, and a suitable function is selected and combined with the target degree of freedom v to determine the statistical result P value, so as to analyze the significance of the difference between the two sample means.
It should be noted that, since the extreme cases at both ends of the T distribution are analyzed, the α value used in the two-tailed test is half of that used in the single-tailed test.
Illustratively, on the basis of the above example, the value of α is preset to 0.05, and it is assumed that there is no significant difference between sample a and sample b, i.e. the original hypothesis is H_0: μ_1 = μ_2 and the alternative hypothesis is H_1: μ_1 ≠ μ_2. The two groups of data are processed by the distributed system to obtain the P value. The two test methods follow similar logic in analyzing the result; taking the two-tailed test as an example, if P < 0.025, the original hypothesis H_0 is rejected and the alternative hypothesis H_1: μ_1 ≠ μ_2 is accepted, that is, there is a significant difference between sample a and sample b, indicating that the effects of the two antihypertensive drugs differ significantly; which of the two antihypertensive drugs is better can then be determined from the sign of the T statistic. If P > 0.025, the original hypothesis H_0 is accepted, sample a and sample b have no significant difference, and the two antihypertensive drugs have essentially the same effect.
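Putting the pieces together, the following sketch computes the T statistic and the P value at the summarizing node of the running example and applies the decision rule described above. The pooled-variance form of the T statistic and the use of the lower-tail probability pt(-abs(T), v) (equivalently 1 - pt(abs(T), v)) are illustrative assumptions of this sketch, not the exact formulas of the embodiment.

    # Summarizing-node sketch: T statistic and P value from the aggregated results.
    pooled_var <- ((n1 - 1) * S1_sq + (n2 - 1) * S2_sq) / (n1 + n2 - 2)
    T_stat <- (target_mean_a - target_mean_b) / sqrt(pooled_var * (1 / n1 + 1 / n2))

    p_one_tailed <- pt(-abs(T_stat), df = v)       # counterpart of function 1
    p_two_tailed <- 2 * pt(-abs(T_stat), df = v)   # counterpart of function 2

    alpha <- 0.05
    # P < alpha: the statistical result falls within the preset range, the
    # difference is significant, and the two groups of data to be processed are
    # retained; the sign of T_stat then indicates which population mean is larger.
    keep_both_groups <- p_two_tailed < alpha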
According to the technical scheme provided by this embodiment, a distributed system is used to receive two groups of data to be processed and respectively determine the data processing group corresponding to each group of data to be processed; for each group of data to be processed, the data processing group corresponding to the current data to be processed is respectively sent to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group; and whether to reserve the two groups of data to be processed is determined based on the data processing result corresponding to each group of data to be processed and the number of sub-data to be processed in each group of data to be processed. This solves the problem that a single computer cannot smoothly complete the calculation work on a large data set, enlarges the data processing capacity, improves the data processing efficiency, and provides a reference method for the processing and analysis of related large data sets in the future.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes:
and the first data processing module is used for receiving the two groups of data to be processed and respectively determining a data processing group corresponding to each group of data to be processed. The second data processing module: and the distributed node is used for respectively sending the data processing groups corresponding to the current data to be processed to the corresponding distributed nodes so as to ensure that the distributed nodes determine the data processing results of the corresponding data processing groups. And the third data processing module is used for determining whether to reserve the two groups of data to be processed or not based on the data processing result corresponding to each group of data to be processed and the number of subdata to be processed in each group of data to be processed.
On the basis of the above technical solutions, the second data processing module includes,
the first mean value calculating unit is used for carrying out mean value processing on each to-be-processed subdata in the data processing group to obtain a first mean value; or the accumulated value calculating unit is used for summing each sub-data to be processed in the data processing group to obtain the accumulated value of the data processing group.
On the basis of the above technical solutions, the third data processing module further comprises,
the target mean value calculating unit is used for acquiring a first mean value corresponding to the current group of data to be processed and determining a target mean value; or the target mean value calculating unit is used for acquiring an accumulated value corresponding to the current group of data to be processed and determining a target mean value based on the accumulated value;
a mean variance calculation unit for determining a mean variance based on the target mean;
the target degree of freedom calculation unit is used for determining the target degree of freedom based on the number of the sub data to be processed;
and the statistical result calculating unit is used for determining a statistical result based on the target mean value and the mean variance of each group of data to be processed and the number of the sub-data to be processed.
According to the technical scheme of this embodiment, a distributed system is used to receive two groups of data to be processed and respectively determine the data processing group corresponding to each group of data to be processed; for each group of data to be processed, the data processing group corresponding to the current data to be processed is respectively sent to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group; and whether to reserve the two groups of data to be processed is determined based on the data processing result corresponding to each group of data to be processed and the number of sub-data to be processed in each group of data to be processed. This solves the problem that the processing and analysis of a large data set are limited by a single computer and that memory overflow occurs, enables the calculation work on the large data set to be completed efficiently and smoothly, and provides a reference method for the processing and analysis of related large data sets in the future.
Example four
Fig. 4 is a schematic structural diagram of a data processing apparatus according to a fourth embodiment of the present invention.
The data processing device 40 may be an electronic device intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 40 includes at least one processor 41, and a memory communicatively connected to the at least one processor 41, such as a Read Only Memory (ROM) 42, a Random Access Memory (RAM) 43, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 41 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 42 or the computer program loaded from a storage unit 48 into the Random Access Memory (RAM) 43. In the RAM 43, various programs and data necessary for the operation of the electronic apparatus 40 can also be stored. The processor 41, the ROM 42, and the RAM 43 are connected to each other via a bus 44. An input/output (I/O) interface 45 is also connected to bus 44.
A number of components in the electronic device 40 are connected to the I/O interface 45, including: an input unit 46 such as a keyboard, a mouse, etc.; an output unit 47 such as various types of displays, speakers, and the like; a storage unit 48 such as a magnetic disk, an optical disk, or the like; and a communication unit 49 such as a network card, modem, wireless communication transceiver, etc. The communication unit 49 allows the electronic device 40 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 41 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 41 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 41 performs the various methods and processes described above, such as data processing methods.
In some embodiments, the data processing method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 40 via the ROM 42 and/or the communication unit 49. When the computer program is loaded into the RAM 43 and executed by the processor 41, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the processor 41 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS services. It should be understood that the various forms of processes shown above may be used, with steps reordered, added, or removed. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and this is not limited herein as long as the desired result of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A data processing method applied to a distributed system, the distributed system comprising a plurality of distributed nodes, the method comprising:
receiving two groups of data to be processed, and respectively determining a data processing group corresponding to each group of data to be processed; the data to be processed comprises a plurality of subdata to be processed, and the number of data processing groups corresponding to each group of data to be processed is the same;
for each group of data to be processed, respectively sending the data processing group corresponding to the current data to be processed to the corresponding distributed node, so that the distributed node determines the data processing result of the corresponding data processing group;
and determining whether to reserve the two groups of data to be processed or not based on the data processing result corresponding to each group of data to be processed and the number of the subdata to be processed in each group of data to be processed.
2. The method of claim 1, wherein the distributed nodes process at least one data processing group.
3. The method of claim 1, wherein the distributed node determines data processing results for respective data processing groups, comprising:
carrying out mean value processing on each subdata to be processed in the data processing group to obtain a first mean value; or,
summing the sub-data to be processed in the data processing group to obtain an accumulated value of the data processing group.
4. The method of claim 1, wherein the data processing result includes a first average value, and the determining whether to retain the two sets of data to be processed based on the data processing result corresponding to each set of data to be processed and the number of sub-data to be processed in each set of data to be processed comprises:
for each group of data to be processed, acquiring a first mean value corresponding to the current group of data to be processed, determining a target mean value, and determining a mean value variance based on the target mean value;
determining a statistical result based on the target mean value and the mean variance of each group of data to be processed and the number of the sub-data to be processed;
and if the statistical result is within a preset range corresponding to the T distribution, the two groups of data to be processed are reserved.
5. The method of claim 1, wherein the data processing results include accumulated values, and the determining whether to retain the two sets of data to be processed based on the data processing results corresponding to the sets of data to be processed and the number of sub-data to be processed in each set of data to be processed comprises:
for each group of data to be processed, acquiring an accumulated value corresponding to the current group of data to be processed, and determining a target mean value based on the accumulated value;
determining a mean variance based on the target mean;
and determining a statistical result based on the target mean value and the mean variance of each group of data to be processed and the number of the sub-data to be processed, so as to determine whether to reserve the two groups of data to be processed based on the statistical result.
6. The method of claim 4 or 5, wherein the determining a statistical result based on the target mean, the mean variance and the number of the sub-data to be processed comprises:
determining a target degree of freedom based on the number of the sub-data to be processed;
and determining the statistical result based on the target mean value, the mean variance, the target degree of freedom and the target function.
7. A data processing apparatus for use in a distributed system, the distributed system including a plurality of distributed nodes, the apparatus comprising:
the first data processing module: the data processing device is used for receiving two groups of data to be processed and respectively determining a data processing group corresponding to each group of data to be processed;
the second data processing module: the distributed node processing system comprises a distributed node, a data processing group and a data processing group, wherein the distributed node is used for sending the data processing group corresponding to current data to be processed to the corresponding distributed node respectively so as to enable the distributed node to determine the data processing result of the corresponding data processing group;
a third data processing module: and the method is used for determining whether to reserve the two groups of data to be processed or not based on the data processing result corresponding to each group of data to be processed and the number of the subdata to be processed in each group of data to be processed.
8. The apparatus of claim 7, wherein the distributed node processes at least one data processing group.
9. A data processing apparatus, characterized in that the apparatus comprises one or more processors; and a memory communicatively coupled to the one or more processors; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of claims 1-6.
10. A computer-readable storage medium, characterized in that it stores computer instructions for causing a processor to implement the data processing method of any of claims 1-6 when executed.
CN202211276872.1A 2022-10-18 2022-10-18 Data processing method, device, equipment and storage medium Pending CN115481199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211276872.1A CN115481199A (en) 2022-10-18 2022-10-18 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211276872.1A CN115481199A (en) 2022-10-18 2022-10-18 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115481199A true CN115481199A (en) 2022-12-16

Family

ID=84395214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211276872.1A Pending CN115481199A (en) 2022-10-18 2022-10-18 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115481199A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination