CN110659276A

CN110659276A - Computer data statistical system and statistical classification method thereof

Info

Publication number: CN110659276A
Application number: CN201910910589.1A
Authority: CN
Inventors: 张琪; 宋仪轩; 刘苗
Original assignee: Jiangsu Healthcare Big Data Protection And Development Co Ltd
Current assignee: Jiangsu Healthcare Big Data Protection And Development Co Ltd
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2020-01-07

Abstract

The invention relates to the technical field of data statistics, in particular to a computer data statistics system and a statistical classification method thereof. The system comprises a data acquisition unit, a data cleaning unit, a data classification unit and a data storage unit, wherein the data acquisition unit acquires front-end data, the data cleaning unit performs data cleaning on the acquired data, the data classification unit classifies the data according to data type, and the data storage unit stores the classified data. In this system and method, the data acquisition unit ensures the integrity of data acquisition and prevents acquired data from being lost, the data classification unit classifies the data according to data type, and the data storage unit stores the classified data by class, which facilitates data retrieval and searching.

Description

Computer data statistical system and statistical classification method thereof

Technical Field

The invention relates to the technical field of data statistics, in particular to a computer data statistics system and a statistical classification method thereof.

Background

With the arrival of the big data era, the quality of data statistics has become increasingly important, and data statistics is typically implemented on a distributed architecture. However, existing systems cannot preprocess data when it is collected at the front end, so data types are inconsistent and data may even be lost, which compromises the integrity of data collection. In addition, when the data is stored, it cannot be classified and stored according to the relevance between data items, which makes later searching inconvenient.

Disclosure of Invention

The present invention is directed to a computer data statistics system and a statistical classification method thereof, so as to solve the problems in the background art.

In order to achieve the above object, in one aspect, the present invention provides a computer data statistics system, including a data acquisition unit, a data cleaning unit, a data classification unit and a data storage unit, where the data acquisition unit is configured to acquire front-end data, the data cleaning unit is configured to perform data cleaning operation on the acquired data, the data classification unit is configured to classify the data according to data type, and the data storage unit is configured to store the classified data, and the data statistics system has the following processes:

S1, collecting front-end data through a collection node;

S2, carrying out data cleaning processing on the acquired data;

S3, classifying the cleaned data;

S4, storing the classified data.

Preferably, the data acquisition unit acquires data according to the following flow:

S11, front-end data acquisition, wherein the front-end data is acquired through an acquisition node;

S12, data signal conditioning, wherein the analog output of each acquisition node is converted so that it meets the input-signal requirements of the analog/digital converter;

S13, sample storage, wherein the continuous signal is converted into a discrete sampled signal, and the sampled signal is converted back into a continuous signal;

S14, converting the analog signal into a digital signal;

S15, processing the sampled digital signal by digital signal processing.

Preferably, the sampled signal is stored using a unit pulse sequence function, which is expressed by the following formula:

Preferably, the data cleaning unit comprises the following modules:

Module 1: the error correcting module, which corrects erroneous forms of data;

Module 2: the duplicate deleting module, which deletes duplicate records or duplicate fields present in the data;

Module 3: the unified specification module, which unifies data specifications and abstracts out consistent content;

Module 4: the correction logic module, which determines the logic, conditions and statistical criteria (caliber) of each source system and corrects the acquisition logic of abnormal source systems;

Module 5: the conversion construction module, which standardizes the data;

Module 6: the data compression module, which maintains the integrity and accuracy of the original data set and reorganizes the data according to a certain algorithm and mode without losing useful information;

Module 7: the data supplementing module, which fills in incomplete data;

Module 8: the data discarding module, which deletes abnormal data.

Preferably, the data storage unit flow is as follows:

S41, establishing a cloud environment storage system, wherein a large-scale cloud environment data storage system is built from the related storage nodes;

S42, task decomposition, wherein the data processing tasks in the cloud environment data storage system are decomposed into small tasks, and a large data region is decomposed into small regions;

S43, data parallel processing, wherein a plurality of processing tasks are processed in parallel.
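Steps S42 and S43 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `split` and `process_parallel` are hypothetical, and a local thread pool merely stands in for the storage nodes of the cloud environment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split(data: Sequence, chunk_size: int) -> List[Sequence]:
    """S42 (assumed form): decompose a large data region into small regions."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_parallel(data: Sequence, chunk_size: int,
                     task: Callable[[Sequence], object],
                     workers: int = 4) -> List[object]:
    """S43 (assumed form): run the same small task on every region in parallel.

    Threads are only a local stand-in for storage nodes; results come back
    in the same order as the regions.
    """
    chunks = split(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, chunks))


if __name__ == "__main__":
    # Sum each region of four values independently.
    print(process_parallel(list(range(10)), 4, sum))
```

Here each region is reduced independently, so the per-region results can be combined afterwards without any coordination between workers.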

Preferably, the data parallel processing formula is as follows:

Suppose R is a large amount of data to be stored, having k attributes, with A₁, A₂, A_i, A_k representing the attributes of the data and A_i being the data stored on the m-th node;

wherein the large amount of data R is represented as:

A method for statistical classification of computer data, using any one of the above computer data statistical systems, comprising the following steps:

S31, source data preprocessing, which provides management of the algorithm learning samples and selection of an optimal algorithm;

S32, distributed data processing and analysis, wherein resources are allocated according to the processing capacity of the different processors;

S33, classification result integration, wherein the results processed by the different processors are integrated, using the following classification integration formula:

where P_c is the accuracy and N is the number of processors.
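Step S32 says only that resources are distributed according to processor capacity. One possible reading, shown here as a hedged sketch, is to split the workload in proportion to a capacity score. The function `allocate`, the capacity scores and the tie-breaking rule are all assumptions for illustration; the integration formula of S33 is not covered.

```python
from typing import Dict


def allocate(total_items: int, capacity: Dict[str, float]) -> Dict[str, int]:
    """Split total_items across processors in proportion to their capacity.

    Assumed rule: each processor first gets the rounded-down proportional
    share; items lost to rounding go to the highest-capacity processors.
    """
    total_capacity = sum(capacity.values())
    if total_capacity <= 0:
        raise ValueError("capacity scores must sum to a positive value")
    shares = {name: int(total_items * c / total_capacity)
              for name, c in capacity.items()}
    leftover = total_items - sum(shares.values())
    for name in sorted(capacity, key=capacity.get, reverse=True)[:leftover]:
        shares[name] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical processors: p2 is three times as capable as p1.
    print(allocate(100, {"p1": 1.0, "p2": 3.0}))
```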

Preferably, in S31, the source data preprocessing specifically includes the following steps:

S311, source data filtering and extraction, wherein source data information is filtered and extracted;

S312, learning sample selection, wherein the data is randomly sampled so that the learning samples fully reflect the overall distribution of the data to be classified, and the learning samples are drawn from the different sources according to the distribution of the source data;

S313, sample result comparison, wherein the samples are classified by processors running different algorithms in the distributed system, the classification results are compared, and the accuracy of the different algorithms on the same sample is counted and recorded as result data;

S314, optimal algorithm selection, wherein the accuracies of the different algorithms are compared in detail and the best-performing algorithm is selected as the main algorithm for classifying the data.
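Steps S313 and S314 amount to scoring several candidate algorithms on the same labelled sample and keeping the most accurate one. The sketch below illustrates that idea under stated assumptions: the classifiers are plain Python functions, and the names `accuracy` and `select_best` and the threshold rules are hypothetical rather than taken from the patent.

```python
from typing import Callable, Dict, Sequence, Tuple

Classifier = Callable[[float], str]
Sample = Tuple[float, str]  # (feature value, true label)


def accuracy(clf: Classifier, samples: Sequence[Sample]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for x, label in samples if clf(x) == label)
    return correct / len(samples)


def select_best(classifiers: Dict[str, Classifier],
                samples: Sequence[Sample]) -> Tuple[str, Dict[str, float]]:
    """S313: score every algorithm on the same sample; S314: pick the best."""
    scores = {name: accuracy(clf, samples) for name, clf in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    samples = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
    candidates = {
        "threshold_5": lambda x: "high" if x > 5 else "low",
        "threshold_1_5": lambda x: "high" if x > 1.5 else "low",
    }
    print(select_best(candidates, samples))
```

In a distributed setting each candidate would run on its own processor; only the accuracy figures need to be gathered for the comparison.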

Compared with the prior art, the invention has the beneficial effects that:

1. In the computer data statistical system and the statistical classification method thereof, a data acquisition unit is arranged to acquire front-end data through an acquisition node and to perform signal conditioning, sample storage, analog-to-digital conversion and digital signal processing on the acquired data, so that the integrity of data acquisition is ensured and acquired data is prevented from being lost.

2. In the computer data statistical system and the statistical classification method thereof, a data cleaning unit is arranged, and data is cleaned through an error correcting module, a repeated item deleting module, a unified specification module, a correction logic module, a conversion construction module, a data compression module, a data supplementing module and a data discarding module, so that the data error rate is reduced, and meanwhile, the data occupation amount is reduced.

3. In the computer data statistical system and the statistical classification method thereof, the data classification unit is arranged to classify data according to data types, and the classified data is stored through the data storage unit, so that the data is classified and stored, and the data is convenient to call and search.

Drawings

FIG. 1 is a block diagram of a data statistics system unit of the present invention;

FIG. 2 is a flow chart of a data statistics system of the present invention;

FIG. 3 is a flow chart of a data acquisition unit of the present invention;

FIG. 4 is a schematic diagram of the data signal conditioning operation of the present invention;

FIG. 5 is a block diagram of a data cleansing unit according to the present invention;

FIG. 6 is a flow chart of a data storage unit of the present invention;

FIG. 7 is a flow chart of a data classification unit according to the present invention;

FIG. 8 is a flow chart of source data preprocessing of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

Referring to FIGS. 1 to 6, the present invention provides a computer data statistics system, which includes a data acquisition unit, a data cleaning unit, a data classification unit and a data storage unit, wherein the data acquisition unit is configured to acquire front-end data, the data cleaning unit is configured to perform a data cleaning operation on the acquired data, the data classification unit is configured to classify the data according to data type, and the data storage unit is configured to store the classified data. The flow of the data statistics system is as follows:

S1, collecting front-end data through a collection node;

S2, carrying out data cleaning processing on the acquired data;

S3, classifying the cleaned data;

S4, storing the classified data.
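The four-step flow S1 to S4 can be pictured as a simple pipeline. The following is a minimal sketch, not the patented system: collection nodes are modelled as hypothetical callables, cleaning is reduced to dropping empty and duplicate values, classification groups records by their Python type name, and storage is an in-memory dictionary. Records are assumed to be hashable.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List


def collect(nodes: Iterable[Callable[[], Any]]) -> List[Any]:
    """S1: read one value from each (hypothetical) collection node."""
    return [read() for read in nodes]


def clean(records: Iterable[Any]) -> List[Any]:
    """S2: placeholder cleaning that drops None and repeated values."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (type(rec).__name__, rec)  # keeps 1 and 1.0 distinct
        if rec is None or key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def classify(records: Iterable[Any]) -> Dict[str, List[Any]]:
    """S3: group records by data type (here, the Python type name)."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for rec in records:
        groups[type(rec).__name__].append(rec)
    return dict(groups)


def store(groups: Dict[str, List[Any]],
          storage: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """S4: append each class of data to its own slot in the storage."""
    for data_type, items in groups.items():
        storage.setdefault(data_type, []).extend(items)
    return storage


def run_pipeline(nodes: Iterable[Callable[[], Any]],
                 storage: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    return store(classify(clean(collect(nodes))), storage)


if __name__ == "__main__":
    nodes = [lambda: 21.5, lambda: None, lambda: "ok", lambda: 21.5, lambda: 7]
    print(run_pipeline(nodes, {}))
```

Keeping each step as a separate function mirrors the unit structure of the system, so any one stage can be replaced without touching the others.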

In this embodiment, the data acquisition process of the data acquisition unit is as follows:

S14, converting the analog signal into a digital signal;

S15, processing the sampled digital signal by digital signal processing.

The collection nodes are sensors, including temperature sensors, humidity sensors, image sensors, sound sensors and the like, so that a suitable collection node can be selected according to the type of front end to collect the data required.

Further, data signal conditioning performs signal conversion on the analog output of each sensor so that it meets the input-signal requirements of the analog/digital converter. The functions of the signal conditioning module generally include static processing such as signal switching, signal conversion, signal amplification, calibration, linearization and compensation. The principle is shown in FIG. 4: a sensor signal enters at the J-IN port, switches S1-S5 are set according to the signal type, and the conditioned signal is taken from the J-OUT port and sent to the A/D conversion module. In the figure, DGND and VDD denote the sensor-side digital power supply, and AGND, V+5 and V-5 denote the sensor-side analog power supplies. The equivalent circuit equations in the standard signal mode are as follows:

$V_{O-} = 0 \quad (3)$

In the formulas, R_x is the on-resistance of the MAX383 and R_w is that of the AQW21X; when R₇ = R₈ and R_w < R₂, formulas (1) and (2) give:

$V_{O+} - V_{O-} = \frac{1}{2} \cdot \frac{R_2}{R_1 + R_2} \cdot V_{A+} \quad (4)$
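As a quick numeric check of formula (4), the sketch below evaluates the differential output for given resistances. The function name and the component values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the patent.

```python
def differential_output(r1: float, r2: float, v_a_plus: float) -> float:
    """Evaluate formula (4): V_O+ - V_O- = (1/2) * R2 / (R1 + R2) * V_A+."""
    if r1 + r2 == 0:
        raise ValueError("R1 + R2 must be non-zero")
    return 0.5 * r2 / (r1 + r2) * v_a_plus


if __name__ == "__main__":
    # Hypothetical values: R1 = R2 = 1 kOhm, V_A+ = 5 V.
    print(differential_output(1000.0, 1000.0, 5.0))
```

With equal resistors the divider halves the input, and the leading factor of one half halves it again, so the differential output is a quarter of V_A+.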

Specifically, the sampled signal is stored by using a unit pulse sequence function, and the formula is as follows:

In the step of converting the sampled signal into a continuous signal, a zero-order hold is used to convert the sampled signal into a signal that keeps a constant value between two consecutive sampling instants; that is, in the interval t ∈ [nT, (n+1)T), the output of the zero-order hold remains x(nT).
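The sample-and-hold behaviour described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the circuit of the invention: sampling starts at t = 0, the sampling period T is passed as `period`, and the names `sample` and `zero_order_hold` are hypothetical.

```python
import math
from typing import Callable, List


def sample(x: Callable[[float], float], period: float, count: int) -> List[float]:
    """Take x(nT) for n = 0 .. count-1, i.e. the value at each sampling pulse."""
    return [x(n * period) for n in range(count)]


def zero_order_hold(samples: List[float], period: float, t: float) -> float:
    """Return x(nT) for the n that satisfies nT <= t < (n+1)T."""
    n = math.floor(t / period)
    if not 0 <= n < len(samples):
        raise ValueError("t is outside the sampled interval")
    return samples[n]


if __name__ == "__main__":
    values = sample(lambda t: 2.0 * t, period=0.5, count=4)
    # Between t = 0.5 and t = 1.0 the hold keeps the sample taken at t = 0.5.
    print(values, zero_order_hold(values, 0.5, 0.75))
```

The held output is a staircase: it changes only at the sampling instants and stays flat in between, which is exactly the constant-value property stated for the zero-order hold.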

In a further aspect, the data cleansing unit includes the following modules:

Module 1: the error correcting module, which corrects data value errors, data type errors, data encoding errors, data format errors, data anomaly errors, dependency conflicts and multi-value errors;

Module 2: the duplicate deleting module, which deletes duplicate records or duplicate fields in the data. The basic idea for detecting duplicates is "sort and merge": the records in the database are first sorted according to a certain rule, and duplicates are then detected by comparing whether adjacent records are similar;

Module 5: the conversion construction module, which standardizes the data, including data type conversion, data semantic conversion, data granularity conversion, table/data splitting, row-column conversion, data discretization, data standardization, new field extraction and attribute construction;

Module 6: the data compression module, which maintains the integrity and accuracy of the original data set and reorganizes the data according to a certain algorithm and mode without losing useful information. Complex analysis and calculation on large-scale data generally consume a large amount of time, so the data is reduced and compressed beforehand to decrease its scale; this also supports interactive data mining, in which the data before and after mining is compared and fed back. Data mining on the reduced data set is therefore markedly more efficient, and its results are basically the same as those obtained from the original data set;

Module 7: the data supplementing module, which fills in incomplete data. Supplementation covers missing values and null values: a missing value refers to data that must exist but is actually absent, and a null value refers to data that may or may not actually exist;

Module 8: the data discarding module, which deletes abnormal data. Discarding takes two forms, whole deletion and variable deletion. Whole deletion removes every sample that contains a missing value. Variable deletion may be considered when a variable has many invalid or missing values and is not particularly important to the problem under study; the variable is removed, which reduces the number of variables available for analysis without changing the sample size.
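Two of the cleaning operations above lend themselves to a short illustration: the "sort and merge" duplicate check of Module 2 and the two discarding modes of Module 8. The sketch below is an assumption-laden simplification. Records are plain dictionaries, "similar" is narrowed to equality after trimming and lower-casing, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def normalize(rec: Record, keys: List[str]) -> Tuple[str, ...]:
    """Comparison key: trimmed, lower-cased values of the chosen fields."""
    return tuple((rec.get(k) or "").strip().lower() for k in keys)


def sort_and_merge(records: List[Record], keys: List[str]) -> List[Record]:
    """Sort by the key, then drop any record that matches its predecessor."""
    ordered = sorted(records, key=lambda r: normalize(r, keys))
    kept: List[Record] = []
    for rec in ordered:
        if kept and normalize(kept[-1], keys) == normalize(rec, keys):
            continue  # adjacent duplicate after sorting
        kept.append(rec)
    return kept


def whole_deletion(records: List[Record], keys: List[str]) -> List[Record]:
    """Remove every sample that has a missing value in any listed field."""
    return [r for r in records if all(r.get(k) is not None for k in keys)]


def variable_deletion(records: List[Record], variable: str) -> List[Record]:
    """Remove one variable from every sample; the sample count is unchanged."""
    return [{k: v for k, v in r.items() if k != variable} for r in records]


if __name__ == "__main__":
    rows = [
        {"name": "Ann ", "city": "Nanjing"},
        {"name": "bob", "city": None},
        {"name": "ann", "city": "nanjing"},
    ]
    deduped = sort_and_merge(rows, ["name", "city"])
    print(deduped)
    print(whole_deletion(deduped, ["name", "city"]))
    print(variable_deletion(deduped, "city"))
```

Sorting first means only neighbouring records need to be compared, which is the point of the sort-and-merge approach; a real system would likely use a fuzzier similarity test than the exact match shown here.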

It should be noted that the data storage unit flow is as follows:

The data parallel processing formula is as follows:

wherein the large amount of data R is represented as:

Example 2

Referring to FIGS. 7 and 8, the present invention provides a computer data statistical classification method, using any one of the above computer data statistical systems, including the following steps:

where P_c is the accuracy and N is the number of processors.

In S31, the source data preprocessing specifically includes the following steps:

In S312, for learning sample selection, assume that the total number of source records is N and that the K source subsets have sizes {N₁, N₂, …}. For a set total learning-sample size M, M × N_k/N records are randomly selected from the k-th subset, and the selections are combined to obtain the sample set required for machine learning.
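The selection rule in S312, drawing M × N_k/N records from each subset, is proportional stratified sampling. A minimal sketch follows; the function name, the fixed random seed and the rounding of each quota to the nearest integer are assumptions, and because of rounding the combined sample may differ slightly from M when the quotas are not whole numbers.

```python
import random
from typing import Dict, List, Sequence


def proportional_sample(sources: Dict[str, Sequence], m: int,
                        seed: int = 0) -> List:
    """Draw about m records in total, M * N_k / N from each source subset."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = sum(len(data) for data in sources.values())
    if n == 0:
        return []
    picked: List = []
    for data in sources.values():
        quota = round(m * len(data) / n)  # assumed rounding rule
        picked.extend(rng.sample(list(data), min(quota, len(data))))
    return picked


if __name__ == "__main__":
    sources = {"a": range(60), "b": range(100, 140)}
    chosen = proportional_sample(sources, m=10)
    # 60% of the records come from "a", so 6 of the 10 samples do as well.
    print(len(chosen), sum(1 for v in chosen if v < 60))
```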

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A computer data statistics system, comprising a data acquisition unit, a data cleaning unit, a data classification unit and a data storage unit, characterized in that: the data acquisition unit is used for acquiring front-end data, the data cleaning unit is used for performing a data cleaning operation on the acquired data, the data classification unit is used for classifying the data according to data type, the data storage unit is used for storing the classified data, and the flow of the data statistics system is as follows:

S1, collecting front-end data through a collection node;

S2, carrying out data cleaning processing on the acquired data;

S3, classifying the cleaned data;

S4, storing the classified data.

2. The computer data statistics system of claim 1, wherein: the data acquisition unit acquires data in the following flow:

S14, converting the analog signal into a digital signal;

S15, processing the sampled digital signal by digital signal processing.

3. The computer data statistics system of claim 2, wherein: the sampled signal is stored and described by a unit pulse sequence function, and the formula is as follows:

4. The computer data statistics system of claim 1, wherein: the data cleaning unit comprises the following modules:

Module 1: the error correcting module, which corrects erroneous forms of data;

5. The computer data statistics system of claim 1, wherein: the data storage unit flow is as follows:

6. The computer data statistics system of claim 5, wherein: the data parallel processing formula is as follows:

wherein the large amount of data R is represented as:

7. A method of statistical classification of computer data using the computer data statistics system of any of claims 1-6, comprising the steps of:

where P_c is the accuracy and N is the number of processors.

8. The statistical classification method of computer data according to claim 7, characterized in that: in S31, the source data preprocessing specifically includes the following steps: