CN110457293A - Data processing method based on flag bit - Google Patents
- Publication number
- CN110457293A (application number CN201910566238.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a data processing method based on flag bits, comprising: Step 1000, obtaining a data set $Z = \{z_i \mid i \in [1, n]\}$, where $n$ is the number of records in $Z$ and each record $z_i$ contains $m$ data fields, a flag bit for each data field, a data type $c_i$, and a timestamp $t_i$ at which the measurement data in $z_i$ were acquired; the flag bit of a field is 0 when the field is missing and 1 when the field is complete. Step 2000, calculating the missing degree $s$ of the data from the flag bits. Step 3000, processing the data according to the missing degree and the data type $c_i$.
Description
Technical field
The present invention relates to a data processing method based on flag bits.
Background
After many years of building information systems, many organizations now have fairly mature systems that serve the information needs of their various fields. Some large organizations in particular have completed large data centers, achieved unified data sharing and data fusion, accumulated massive amounts of production and operations data, and are pushing ahead with shared data resource pools, which creates favorable conditions for the centralized sharing, analysis, and use of data. As informatization and its applications deepen, the data produced by information systems have become a valuable asset for every organization. Cleaning the data generated by each information system, improving data quality, and mining the value of data resources has therefore become one of the important tasks in organizational information engineering.
Of these tasks, data cleaning comes first in time. As the name suggests, data cleaning means "washing away" the "dirt": it is the final procedure for finding and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values. The data in a data warehouse are a subject-oriented collection extracted from multiple business systems and include historical data, so some of the data are inevitably wrong and some conflict with each other. Such erroneous or conflicting data are plainly unwanted and are called "dirty data". Washing away the dirty data according to defined rules is data cleaning.
During data cleaning, to improve cleaning efficiency and prevent the loss of important data, data must also be treated differently according to their attributes. Unimportant data can simply be discarded. For data with a time attribute, such as data measured and acquired repeatedly in succession, how to fill in a missing measurement using time-related data is a technical problem worth studying. For data with inherent relationships, such as continuously varying temperature or humidity, how to find boundary values through historical data and logical reasoning and eliminate possible errors, so that the data can be cleaned better, is likewise an open problem in the field of data cleaning.
Summary of the invention
To solve the above technical problems, the invention proposes a data processing method based on flag bits, comprising:
Step 1000: obtain data that include flag bits, where each flag bit marks whether data are missing.
Step 2000: obtain the missing degree of the data from the flag bits of the data.
Step 3000: clean the data according to the missing degree and the data type of the data.
The invention cleans data according to their missing degree and data type, which improves the robustness and reliability of the data and lays a solid foundation for subsequent data analysis and data mining.
Brief description of the drawings
Fig. 1 is a flowchart of the first embodiment of the invention.
Fig. 2 is a schematic diagram of the first measurement acquisition data of the invention.
Fig. 3 is a schematic diagram of the second measurement acquisition data of the invention.
Fig. 4 is a flowchart of the second embodiment of the invention.
Detailed description
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings. The description presents, by way of example and not limitation, specific implementations consistent with the principles of the invention. These embodiments are described in enough detail for those skilled in the art to practice the invention; other embodiments may be used, and the structure of individual elements may be changed or replaced, without departing from the scope and spirit of the invention. The following detailed description should therefore not be read in a limiting sense.
As shown in Fig. 1, according to a first aspect of the invention, a data processing method based on flag bits is proposed, comprising:
Step 1000: obtain data that include flag bits, where each flag bit marks whether the corresponding field of the data is missing.
In one embodiment, a flag bit uses 0 or 1 to indicate whether the corresponding field of the data is missing. For example, when the third flag bit of a record is 0, the third field of the record is missing; when the third flag bit is 1, the third field is complete. The flag bit for each field can be set, for example, by means of a metadata description.
In another embodiment, a data set $Z = \{z_i \mid i \in [1, n]\}$ that includes flag bits is obtained, where $n$ is the number of records in $Z$. Each record $z_i$ contains $m$ data fields, one flag bit per data field, a data type $c_i$, and a timestamp $t_i$ at which the measurement data in $z_i$ were acquired. The flag bit of a field is 0 when that field is missing and 1 when it is complete; for example, a third flag bit of 0 means that the third field of $z_i$ is missing, and a third flag bit of 1 means that the third field of $z_i$ is complete.
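The record layout described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `Record`, the use of `None` to stand in for a missing field, and the data type label `"first"` are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """One record z_i: m data fields, one flag bit per field, c_i and t_i."""

    fields: List[Optional[float]]  # None stands in for a missing field (assumption)
    data_type: str                 # c_i; the label values are made up for this sketch
    timestamp: float               # t_i, time at which the measurement was acquired
    flags: List[int] = field(init=False)

    def __post_init__(self) -> None:
        # Flag bit is 0 when the field is missing and 1 when it is complete.
        self.flags = [0 if value is None else 1 for value in self.fields]


record = Record(fields=[21.5, 40.0, None, 3.0], data_type="first", timestamp=1000.0)
print(record.flags)  # the third flag bit is 0 because the third field is missing
```

Deriving the flag bits from the field values keeps the two in sync; a system that receives flag bits from a metadata description would store them directly instead.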
According to the invention, the data include measurement acquisition data, which are measured and collected manually or by automated equipment. Measurement acquisition data comprise first measurement acquisition data and second measurement acquisition data. As shown in Fig. 2, first measurement acquisition data are records in which only some of the data fields hold periodic measurement data. As shown in Fig. 3, second measurement acquisition data are records made up of periodic measurement data. Periodic measurement data are data measured and acquired at fixed time intervals, such as daily humidity, hourly temperature, or per-minute network connection heartbeat counts. Both first and second measurement acquisition data also include a timestamp that records when the data were measured and acquired; in particular, the timestamp can also be used to identify the acquisition period of the periodic measurement data.
Step 2000: obtain the missing degree of the data from the flag bits of the data.
Further, the missing degree comprises a first missing degree $s_1$ and a second missing degree $s_2$. The first missing degree is $s_1 = m_0$; the second missing degree $s_2$ is given by a formula [not reproduced in the source text]. Here $m_0$ is the number of flag bits of the record whose value is 0, and $m_1$ is the number of flag bits of the record whose value is 1.
Measuring missingness with both the first and the second missing degree examines it from two angles, the number of missing fields and the ratio of the number of missing fields to the total number of fields. This determines the severity of the missing data more accurately and guides the next data processing step.
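The flag bit counts can be computed as in the following sketch. The first missing degree follows the text directly ($s_1 = m_0$). The formula for the second missing degree is not available here, so `missing_ratio` uses the ratio of missing fields to all fields purely as an assumption based on the prose description; the function names are hypothetical.

```python
from typing import List, Tuple


def count_flags(flags: List[int]) -> Tuple[int, int]:
    """Return (m0, m1): the number of flag bits equal to 0 and equal to 1."""
    m0 = sum(1 for bit in flags if bit == 0)
    m1 = sum(1 for bit in flags if bit == 1)
    return m0, m1


def first_missing_degree(flags: List[int]) -> int:
    """s1 = m0, the number of missing fields, as stated in the text."""
    m0, _ = count_flags(flags)
    return m0


def missing_ratio(flags: List[int]) -> float:
    """ASSUMPTION: missing fields divided by all fields, m0 / (m0 + m1).

    The source formula for s2 is not reproduced, so this is a stand-in,
    not the patent's definition.
    """
    m0, m1 = count_flags(flags)
    total = m0 + m1
    return m0 / total if total else 0.0


flags = [1, 1, 0, 1, 0]
print(count_flags(flags), first_missing_degree(flags), missing_ratio(flags))
```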
Step 3000: clean the data according to the missing degree and the data type of the data.
Further, if $s_1$ is greater than a first threshold $k_1$ or $s_2$ is greater than a second threshold $k_2$, the record is discarded outright.
If $s_1 \le k_1$ and $s_2 \le k_2$, and $z_i$ is first measurement acquisition data, then the missing field $v$ is filled from $u$ values $v_w$ [formula not reproduced in the source text]. Here $v$ is the missing measurement acquisition data field in $z_i$; each $v_w$ comes from a record whose timestamp is less than $t_i$ and lies in the same field as $v$; and $u$ is the number of randomly selected first measurement acquisition data records, $u \in \mathbb{N}$, preferably $u \ge 3$, more preferably 5.
Here $k_1 \in \mathbb{N}^+$ and $k_1 \ge 2$; preferably $k_1$ is a function of the number of data fields $m$, more preferably one that uses the ceiling (round-up) operator [formula not reproduced in the source text]. $k_2$ is a positive real number, $k_2 \in \mathbb{R}^+$, $k_2 \ge 1$, more preferably 2.
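The discard rule and the fill from earlier records can be sketched as follows. The fill formula itself is not available in this text; the sketch takes the mean of the $u$ selected values because the description refers to filling with the mean of related measurement acquisition data. The function names, the `(timestamp, value)` history layout, and the use of `None` for a missing value are assumptions made for illustration.

```python
import random
from statistics import mean
from typing import Optional, Sequence, Tuple


def should_discard(s1: float, s2: float, k1: float, k2: float) -> bool:
    """Discard the record if s1 > k1 or s2 > k2."""
    return s1 > k1 or s2 > k2


def fill_from_earlier(
    history: Sequence[Tuple[float, Optional[float]]],
    t_i: float,
    u: int = 5,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Fill a missing field with the mean of up to u randomly chosen earlier values.

    history holds (timestamp, value) pairs taken from the same field of other
    first-measurement records; None marks a value that is itself missing.
    """
    rng = rng or random.Random()
    # Only records acquired before t_i with a complete value are eligible.
    candidates = [v for t, v in history if t < t_i and v is not None]
    if not candidates:
        return None  # nothing to fill from
    chosen = rng.sample(candidates, min(u, len(candidates)))
    return mean(chosen)


history = [(1, 10.0), (2, 20.0), (3, 30.0), (4, None), (9, 99.0)]
print(fill_from_earlier(history, t_i=5, u=3))
```

With three eligible values and `u=3`, every eligible value is selected, so the result does not depend on the random draw; with more candidates than `u`, passing a seeded `random.Random` makes the fill reproducible.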
In another embodiment, if $s_1 \le k_1$ and $s_2 \le k_2$, and $z_i$ is first measurement acquisition data with missing measurement acquisition data field $v$, then $v = f(v_1, v_2)$, i.e. $v$ is a function of $v_1$ and $v_2$. The values $v$, $v_1$, and $v_2$ lie in different records but in the same field, and $|t_1 - t_i| < |t_j - t_i|$ and $|t_2 - t_i| < |t_j - t_i|$ for all $j \ne 1, 2, i$ with $j \in [1, n_1]$, where $n_1$ is the number of first measurement data records in the data set, $t_1$ and $t_2$ are the timestamps of the records containing $v_1$ and $v_2$, and $t_j$ is the timestamp of the $j$-th first measurement data record in the data set. In other words, $v_1$ and $v_2$ are taken from the two records closest in time to $z_i$.
Filling missing data with the mean of related measurement acquisition data allows the reliability of the fill to be improved by adjusting the number of randomly selected records. Filling missing data directly from related measurement acquisition data improves the efficiency of the fill and reduces the complexity of data processing, which makes the data processing method practical in big-data environments.
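The variant $v = f(v_1, v_2)$ can be sketched by picking the two records closest in time to $z_i$. The text does not say what $f$ is, so the default below (the arithmetic mean of the two values) is an assumption, and the function name and the `(timestamp, value)` history layout are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def fill_from_nearest(
    history: Sequence[Tuple[float, Optional[float]]],
    t_i: float,
    f: Callable[[float, float], float] = lambda v1, v2: (v1 + v2) / 2,
) -> Optional[float]:
    """Fill a missing field from the two values closest in time to t_i.

    history holds (timestamp, value) pairs from the same field of the other
    first-measurement records (z_i itself excluded); None marks a missing value.
    The default f (the mean of the two values) is an assumption.
    """
    # Pair each complete value with its time distance to the record being filled.
    candidates = [(abs(t - t_i), v) for t, v in history if v is not None]
    if len(candidates) < 2:
        return None  # fewer than two usable neighbours
    candidates.sort(key=lambda pair: pair[0])
    (_, v1), (_, v2) = candidates[:2]
    return f(v1, v2)


history = [(1, 10.0), (4, 18.0), (6, 22.0), (20, 50.0)]
print(fill_from_nearest(history, t_i=5))
```

Unlike the random-sample fill, this needs no search beyond the time neighbours of $z_i$, which matches the efficiency argument in the text; any other two-argument function, such as time-weighted interpolation, can be passed as `f`.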
As shown in Fig. 4, according to a second aspect of the invention, a data processing method based on flag bits is also provided, in which step 1000 is the same as in the first aspect and steps 2000 and 3000 of the first aspect are replaced with:
Step 2000: obtain the missing degree of the data from the flag bits of the data.
Further, the missing degree $s$ is given by a weighted formula [not reproduced in the source text], where $m_1$ is the number of flag bits of the record whose value is 1, $m_0$ is the number of flag bits of the record whose value is 0, and $\alpha$, $\beta$ are weighting coefficients with $\alpha + \beta = 1$, preferably $\alpha = \beta = 0.5$.
Measuring missingness with the missing degree $s$ allows the missing-data situation to be judged quickly and accurately and guides the next data processing step. The weighting coefficients can also be adjusted to suit different data, which improves the adaptability of the method and broadens its range of application.
Step 3000: clean the data according to the missing degree and the data type of the data.
Further, if $s$ is greater than a third threshold $k_3$, the record is discarded.
If $s \le k_3$ and $z_i$ is first measurement acquisition data with missing measurement acquisition data field $v$, then $v = f(v_1, v_2)$, i.e. $v$ is a function of $v_1$ and $v_2$. The values $v$, $v_1$, and $v_2$ lie in different records but in the same field, and $|t_1 - t_i| < |t_j - t_i|$ and $|t_2 - t_i| < |t_j - t_i|$ for all $j \ne 1, 2, i$ with $j \in [1, n_1]$, where $n_1$ is the number of first measurement data records in the data set, $t_1$ and $t_2$ are the timestamps of the records containing $v_1$ and $v_2$, and $t_j$ is the timestamp of the $j$-th first measurement data record in the data set.
If $s \le k_3$ and $z_i$ is second measurement acquisition data, $z_i$ is cleaned as follows:
[Two conditional rules and a rule for a missing field follow in the source; their formulas are not reproduced in the source text. The rule for a missing field carries the constraint $i \ne j$.]
$k$ is a natural number, $k \in [1, 10]$, preferably 3. $k_3$ is a positive real number, $k_3 \ge 0.8$, more preferably 1.5.
The advantage of this data cleaning method is that more than 99% of abnormal data can be rejected, which ensures the validity of the data, while missing data can be filled without searching the data, which improves the efficiency of data processing.
In addition, other implementations of the invention will be apparent to those skilled in the art from the disclosed specification. The aspects of the embodiments may be used in the system and method of the invention individually or in any combination. The specification and examples should be regarded as exemplary only; the true scope and spirit of the invention are indicated by the appended claims.
Claims (9)
1. A data processing method based on flag bits, comprising:
Step 1000: obtaining a data set $Z = \{z_i \mid i \in [1, n]\}$, where $n$ is the number of records in $Z$ and each record $z_i$ contains $m$ data fields, a flag bit for each data field, a data type $c_i$, and a timestamp $t_i$ of the measurement acquisition data in $z_i$; the flag bit of a field is 0 when the field is missing and 1 when the field is complete;
Step 2000: obtaining the missing degree of the data from the flag bits of the data;
Step 3000: processing the data according to the missing degree of the data and the data type $c_i$.
2. The data processing method based on flag bits according to claim 1, characterized in that the missing degree $s$ is given by a weighted formula [not reproduced in the source text], where $m_1$ is the number of flag bits whose value is 1, $m_0$ is the number of flag bits whose value is 0, and $\alpha$, $\beta$ are weighting coefficients with $\alpha + \beta = 1$.
3. The data processing method based on flag bits according to claim 2, characterized in that step 3000 comprises: if $s$ is greater than a third threshold $k_3$, discarding the data.
4. The data processing method based on flag bits according to claim 3, characterized in that step 3000 further comprises: if $s \le k_3$ and $c_i$ indicates that $z_i$ is first measurement acquisition data, then $v = f(v_1, v_2)$, i.e. $v$ is a function of $v_1$ and $v_2$, where $v$ is the missing measurement acquisition data field in $z_i$; $v$, $v_1$, and $v_2$ lie in different records but in the same field, and $|t_1 - t_i| < |t_j - t_i|$ and $|t_2 - t_i| < |t_j - t_i|$ for all $j \ne 1, 2, i$ with $j \in [1, n_1]$, where $n_1$ is the number of first measurement data records in the data set, $t_1$ and $t_2$ are the timestamps of the records containing $v_1$ and $v_2$, and $t_j$ is the timestamp of the $j$-th first measurement data record in the data set.
5. The data processing method based on flag bits according to claim 4, characterized in that step 3000 further comprises: if $s \le k_3$ and $c_i$ indicates that $z_i$ is second measurement acquisition data, cleaning $z_i$ as follows:
[Two conditional rules and a rule for a missing field follow in the source; their formulas are not reproduced in the source text. The rule for a missing field carries the constraint $i \ne j$.]
6. The data processing method based on flag bits according to any one of claims 3-5, characterized in that $k_3$ is a positive real number, $k_3 \ge 0.8$.
7. The data processing method based on flag bits according to claim 5, characterized in that $k$ is a natural number, $k \in [1, 10]$.
8. The data processing method based on flag bits according to claim 6, characterized in that $k_3 = 1.5$.
9. The data processing method based on flag bits according to claim 7, characterized in that $k = 3$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910566238.3A CN110457293A (en) | 2019-06-27 | 2019-06-27 | Data processing method based on flag bit |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910566238.3A CN110457293A (en) | 2019-06-27 | 2019-06-27 | Data processing method based on flag bit |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110457293A true CN110457293A (en) | 2019-11-15 |
Family
ID=68481214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910566238.3A Pending CN110457293A (en) | 2019-06-27 | 2019-06-27 | Data processing method based on flag bit |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110457293A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909256A (en) * | 2019-11-20 | 2020-03-24 | 华育昌(肇庆)智能科技研究有限公司 | Artificial intelligence information filtering system for computer |
- 2019-06-27: CN application CN201910566238.3A (publication CN110457293A), status Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||