CN110457293A - Data processing method based on flag bit - Google Patents

Info

Publication number
CN110457293A
Authority
CN
China
Prior art keywords
data
flag bit
processing method
missing
measurement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910566238.3A
Other languages
Chinese (zh)
Inventor
沈佳
邹岳琳
王天军
何伟
马斌
王晓磊
尼加提
张建业
卿松
尹蕊
刘昆
张龙军
明涛
郭江涛
李雅洁
李豫芹
李凯
李坤源
胡美慧
王巧莉
罗义旺
李金湖
陈强
潘建笠
陈奎印
郎超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Network Xinjiang Electric Power Co Ltd
National Network Xinjiang Electric Power Co Ltd Information And Communication Co
State Grid Information and Telecommunication Co Ltd
State Grid Xinjiang Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Xinjiang Electric Power Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Original Assignee
National Network Xinjiang Electric Power Co Ltd
National Network Xinjiang Electric Power Co Ltd Information And Communication Co
State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Network Xinjiang Electric Power Co Ltd, National Network Xinjiang Electric Power Co Ltd Information And Communication Co, State Grid Information and Telecommunication Co Ltd
Priority to CN201910566238.3A
Publication of CN110457293A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a data processing method based on flag bits, comprising: Step 1000, obtaining a data set Z = {z_i | i ∈ [1, n]}, where n is the number of data items in Z; each z_i contains m data fields, a flag bit for each data field, a data type c_i, and a timestamp t_i at which the measurement data in z_i were acquired; the flag bit of a field is 0 when the field is missing and 1 when the field is complete; Step 2000, calculating the missing degree s of the data from its flag bits; Step 3000, processing the data according to its missing degree and data type c_i.

Description

Data processing method based on flag bit
Technical field
The present invention relates to a data processing method based on flag bits.
Background art
After many years of information system construction, many organizations have built fairly complete information systems to meet the differing information needs of their various fields. Some large organizations in particular have completed large data centers, achieved unified data sharing and data fusion, accumulated massive amounts of production and operations data, and are pushing ahead with building a common data resource pool, which creates favorable conditions for centralized data sharing, analysis, and use. As informatization and its applications deepen, the data generated by information systems have become a valuable asset for every organization. Cleaning the data generated by each information system, improving data quality, and mining the value of data resources has therefore become one of the important tasks in information engineering at major organizations.
Of these tasks, data cleaning comes first in time. As the name suggests, data cleaning means "washing away" the "dirt": it is the final procedure for finding and correcting identifiable errors in data files, and includes checking data consistency and handling invalid and missing values. Because the data in a data warehouse are a collection of data oriented toward a particular subject, extracted from multiple business systems and including historical data, it is unavoidable that some of the data are wrong and some conflict with one another. Such erroneous or conflicting data are clearly unwanted and are called "dirty data". Washing away the dirty data according to certain rules is data cleaning.
During data cleaning, to improve cleaning efficiency and prevent the loss of important data, data must also be treated differently according to their attributes. Some unimportant data can simply be discarded. For data with a time attribute, such as data acquired by repeated successive measurements, how to fill in a missing measurement using time-related data is a technical problem worth studying. For data with inherent relationships, such as continuously varying temperature and humidity, how to find boundary values from historical data and logical reasoning and eliminate possible errors, so that the data can be cleaned better, is likewise a current problem in the field of data cleaning.
Summary of the invention
To solve the above technical problems, the invention proposes a data processing method based on flag bits, comprising:
Step 1000, obtaining data that include flag bits, where each flag bit indicates whether data are missing.
Step 2000, obtaining the missing degree of the data from its flag bits.
Step 3000, cleaning the data according to its missing degree and data type.
The invention can clean data according to the missing degree and data type of the data, which improves the robustness and reliability of the data and lays a solid foundation for subsequent data analysis and data mining.
Brief description of the drawings
Fig. 1 is a flow chart of the first embodiment of the invention.
Fig. 2 is a schematic diagram of the first measurement acquisition data of the invention.
Fig. 3 is a schematic diagram of the second measurement acquisition data of the invention.
Fig. 4 is a flow chart of the second embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings. The description presents specific implementations consistent with the principles of the invention by way of example and not limitation. The embodiments are described in sufficient detail to enable those skilled in the art to practice the invention; other embodiments may be used, and the structure of individual elements may be changed and/or replaced, without departing from the scope and spirit of the invention. The following detailed description should therefore not be understood in a limiting sense.
As shown in Fig. 1, according to a first aspect of the invention, a data processing method based on flag bits is proposed, comprising:
Step 1000, obtaining data that include flag bits, where each flag bit indicates whether the corresponding field of the data is missing.
In one embodiment, a flag bit uses 0 or 1 to indicate whether the corresponding field of the data is missing. For example, when the third flag bit of a data item is 0, the third field of the data item is missing; when the third flag bit is 1, the third field is complete. A metadata description, for example, can be used to set the corresponding flag bit for each field of the data.
In another embodiment, a data set Z = {z_i | i ∈ [1, n]} including flag bits is obtained, where n is the number of data items in Z. Each z_i contains m data fields and, for the j-th data field, a flag bit that is 0 when the field is missing and 1 when the field is complete; c_i is the data type, and t_i is the timestamp at which the measurement data in z_i were acquired. For example, when the third flag bit of z_i is 0, the third field of z_i is missing; when it is 1, the third field of z_i is complete.
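As an illustration only, a record z_i with its flag bits could be represented as in the following Python sketch. The names `Record` and `make_record`, the use of `None` for a missing field, and the string label for the data type are assumptions made for this example and do not appear in the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Record:
    """One data item z_i (hypothetical layout, for illustration)."""
    fields: List[Optional[float]]  # the m data fields; None marks a missing field
    flags: List[int]               # one flag bit per field: 0 = missing, 1 = complete
    data_type: str                 # c_i, e.g. "first" or "second" measurement data
    timestamp: float               # t_i, time at which the data were acquired


def make_record(fields: Sequence[Optional[float]],
                data_type: str,
                timestamp: float) -> Record:
    """Build a record and set the flag bit of every field."""
    flags = [0 if value is None else 1 for value in fields]
    return Record(list(fields), flags, data_type, timestamp)


z = make_record([21.5, 40.0, None], "first", 1561600000.0)
print(z.flags)  # [1, 1, 0]: the third field is missing
```

Storing the flag bits next to the fields means later steps can count missing fields without inspecting the field values themselves.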
According to the invention, the data include measurement acquisition data, which are collected manually or by automated equipment. The measurement acquisition data include first measurement acquisition data and second measurement acquisition data. As shown in Fig. 2, first measurement acquisition data are records in which only some of the data fields hold periodically acquired measurements; as shown in Fig. 3, second measurement acquisition data are records made up of periodically acquired measurements. Periodic measurement data are data measured at a fixed time interval, such as daily humidity, hourly temperature, or network connection heartbeats per minute. Both the first and the second measurement acquisition data also include a timestamp, which records the time at which the data were acquired; in particular, the timestamp can also be used to identify the acquisition period of periodic measurement data.
Step 2000, obtaining the missing degree of the data from its flag bits.
Further, the missing degree includes a first missing degree s_1 and a second missing degree s_2. The first missing degree is s_1 = m_0; the second missing degree s_2 is determined from m_0 and m_1, where m_0 is the number of flag bits of the data whose value is 0 and m_1 is the number of flag bits of the data whose value is 1.
Measuring data loss with both the first and the second missing degree examines it from two aspects: the number of missing fields, and the ratio of the number of missing fields to the total number of fields. This determines the severity of the data loss more accurately and guides the next data processing step.
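The counting behind the first missing degree can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and only s_1 = m_0 is computed, since that is the only formula stated explicitly here.

```python
from typing import Sequence, Tuple


def flag_counts(flags: Sequence[int]) -> Tuple[int, int]:
    """Return (m0, m1): the number of flag bits equal to 0 and equal to 1."""
    m0 = sum(1 for f in flags if f == 0)
    m1 = sum(1 for f in flags if f == 1)
    return m0, m1


def first_missing_degree(flags: Sequence[int]) -> int:
    """s_1 = m_0, the number of missing fields."""
    m0, _ = flag_counts(flags)
    return m0


print(flag_counts([1, 1, 0, 0, 1]))           # (2, 3)
print(first_missing_degree([1, 1, 0, 0, 1]))  # 2
```

Because m_0 + m_1 equals the number of fields m, any second measure built from m_0 and m_1 can be derived from the same pair of counts.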
Step 3000, cleaning the data according to its missing degree and data type.
Further, if s_1 is greater than a first threshold k_1 or s_2 is greater than a second threshold k_2, the data item is discarded directly.
If s_1 is less than or equal to the first threshold k_1 and s_2 is less than or equal to the second threshold k_2, and z_i is first measurement acquisition data, then the missing measurement acquisition data field v in z_i is filled with a value computed from u values v_w, where each v_w comes from a record whose timestamp is less than t_i and lies in the same field as v; u is the number of randomly selected first measurement acquisition data, u ∈ N, preferably u ≥ 3, more preferably u = 5.
Here k_1 ∈ N+, k_1 ≥ 2; preferably k_1 is a function of the number of data fields m, more preferably one obtained by rounding up (⌈ ⌉ denotes rounding up); k_2 is a positive real number, k_2 ∈ R+, k_2 ≥ 1, more preferably k_2 = 2.
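The discard rule and the fill from u randomly selected earlier values can be sketched in Python as below. This is a sketch under stated assumptions: the fill value is taken to be the arithmetic mean of the selected values (the description refers to filling with the mean of related measurement data), and the function names, the `history` layout of (timestamp, value) pairs, and the fallback when fewer than u earlier values exist are all hypothetical.

```python
import random
from statistics import mean
from typing import Optional, Sequence, Tuple


def should_discard(s1: float, s2: float, k1: float, k2: float) -> bool:
    """Discard the data item if either missing degree exceeds its threshold."""
    return s1 > k1 or s2 > k2


def mean_fill(history: Sequence[Tuple[float, Optional[float]]],
              t_i: float,
              u: int = 5,
              rng: Optional[random.Random] = None) -> Optional[float]:
    """Fill a missing field from earlier values of the same field.

    history holds (timestamp, value) pairs taken from other records;
    only values recorded before t_i are eligible. The mean is an assumption.
    """
    rng = rng or random.Random()
    candidates = [v for t, v in history if t < t_i and v is not None]
    if not candidates:
        return None  # nothing earlier to draw from
    sample = rng.sample(candidates, min(u, len(candidates)))
    return mean(sample)


history = [(1.0, 10.0), (2.0, 20.0), (3.0, None), (9.0, 99.0)]
print(should_discard(3, 0.5, 2, 2.0))  # True: s_1 exceeds k_1
print(mean_fill(history, t_i=5.0))     # 15.0: mean of the two earlier values
```

Raising u averages over more earlier values, which is the adjustment the description associates with a more reliable fill.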
In another embodiment, if s_1 is less than or equal to the first threshold k_1 and s_2 is less than or equal to the second threshold k_2, z_i is first measurement acquisition data, and v is the missing measurement acquisition data field in z_i, then v = f(v_1, v_2), i.e. v is a function of v_1 and v_2. Here v, v_1 and v_2 lie in different records but in the same field, and |t_1 - t_i| < |t_j - t_i| and |t_2 - t_i| < |t_j - t_i|, where j ≠ 1, 2, i, j ∈ [1, n_1]; n_1 is the number of first measurement data in the data set, t_1 and t_2 are the timestamps of the records containing v_1 and v_2, and t_j is the timestamp of the j-th first measurement data in the data set. In other words, v_1 and v_2 come from the two records closest in time to z_i.
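The selection of v_1 and v_2 from the two records nearest in time can be sketched as follows. The function f is not specified here, so the default (the average of v_1 and v_2) is purely an assumption, as are the function name and the `history` layout of (timestamp, value) pairs taken from records other than z_i.

```python
from typing import Callable, Optional, Sequence, Tuple


def nearest_two_fill(history: Sequence[Tuple[float, Optional[float]]],
                     t_i: float,
                     f: Callable[[float, float], float] = lambda a, b: (a + b) / 2,
                     ) -> Optional[float]:
    """Fill a missing field as f(v1, v2) from the two records nearest in time.

    history must not contain the record being filled. The default f
    (simple average) is an assumption; any two-argument function works.
    """
    candidates = [(abs(t - t_i), v) for t, v in history if v is not None]
    if len(candidates) < 2:
        return None  # not enough neighbours to apply f
    candidates.sort(key=lambda pair: pair[0])  # smallest |t - t_i| first
    (_, v1), (_, v2) = candidates[:2]
    return f(v1, v2)


history = [(1.0, 10.0), (4.0, 16.0), (6.0, 20.0), (20.0, 50.0)]
print(nearest_two_fill(history, t_i=5.0))  # 18.0: average of 16.0 and 20.0
```

Unlike the random-sample fill, this variant is deterministic for a given data set and needs only the two closest neighbours of the record.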
Filling missing data with the mean of related measurement acquisition data allows the reliability of the fill to be improved by adjusting the number of randomly selected data. Filling missing data directly from related measurement acquisition data improves the efficiency of data filling and reduces the complexity of data processing, which makes it feasible to apply the data processing method in big-data environments.
As shown in Fig. 4, according to a second aspect of the invention, a data processing method based on flag bits is also provided, in which step 1000 is the same as in the first aspect and steps 2000 and 3000 of the first aspect are replaced by:
Step 2000, obtaining the missing degree of the data from its flag bits.
Further, the missing degree s is computed from m_1 and m_0 with weighting coefficients α and β, where m_1 is the number of flag bits of the data whose value is 1, m_0 is the number of flag bits of the data whose value is 0, and α + β = 1, preferably α = β = 0.5.
Measuring data loss with the missing degree s allows the loss to be judged quickly and accurately and guides the next data processing step. The weighting coefficients can also be adjusted to suit different data, which improves the adaptability of the data processing method and widens its range of application.
Step 3000, cleaning the data according to its missing degree and data type.
Further, if s is greater than a third threshold k_3, the data item is discarded.
If s is less than or equal to the third threshold k_3, z_i is first measurement acquisition data, and v is the missing measurement acquisition data field in z_i, then v = f(v_1, v_2), i.e. v is a function of v_1 and v_2. Here v, v_1 and v_2 lie in different records but in the same field, and |t_1 - t_i| < |t_j - t_i| and |t_2 - t_i| < |t_j - t_i|, where j ≠ 1, 2, i, j ∈ [1, n_1]; n_1 is the number of first measurement data in the data set, t_1 and t_2 are the timestamps of the records containing v_1 and v_2, and t_j is the timestamp of the j-th first measurement data in the data set.
If s is less than or equal to the third threshold k_3 and z_i is second measurement acquisition data, z_i is cleaned using the following method:
The method distinguishes three cases, the third of which applies when a field is missing, subject to i ≠ j. Here k is a natural number, k ∈ [1, 10], preferably k = 3; k_3 is a positive real number, k_3 ≥ 0.8, more preferably k_3 = 1.5.
The advantage of this data cleaning method is that more than 99% of abnormal data can be rejected, which ensures the validity of the data; at the same time, missing data can be filled without a data search, which improves the efficiency of data processing.
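Because k is described as a small natural number (preferably 3) and the method is said to reject abnormal data while filling missing values without a search, one generic technique in the same spirit is k-sigma clipping. The Python sketch below is an assumption and not the claimed rule: it clips values lying more than k standard deviations from the mean of the series and fills missing values with that mean. The function name and the use of the population standard deviation are choices made for this example.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def clean_periodic(values: Sequence[Optional[float]], k: int = 3) -> List[Optional[float]]:
    """Generic k-sigma cleaning of one periodic measurement series.

    Illustrative stand-in only: values above mean + k*sigma or below
    mean - k*sigma are clipped to that bound, and missing values (None)
    are replaced by the mean of the values that are present.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate from
    mu = mean(present)
    sigma = pstdev(present)
    low, high = mu - k * sigma, mu + k * sigma
    cleaned: List[Optional[float]] = []
    for v in values:
        if v is None:
            cleaned.append(mu)                       # fill the gap with the mean
        else:
            cleaned.append(min(max(v, low), high))   # clip outliers to the bounds
    return cleaned


print(clean_periodic([10.0, None, 10.0]))  # [10.0, 10.0, 10.0]
```

With k = 3 and roughly normally distributed measurements, about 99.7% of values fall inside the bounds, so only rare extremes are altered.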
In addition, other implementations of the invention will be apparent to those skilled in the art from the disclosed specification. The aspects of the embodiments and/or the embodiments may be used in the systems and methods of the invention individually or in any combination. The specification and the examples in it should be regarded as exemplary only; the true scope and spirit of the invention are indicated by the appended claims.

Claims (9)

1. A data processing method based on flag bits, comprising:
Step 1000, obtaining a data set Z = {z_i | i ∈ [1, n]}, where n is the number of data items in Z; each z_i contains m data fields, a flag bit for each data field, a data type c_i and a timestamp t_i; the flag bit of the j-th data field is 0 when the field is missing and 1 when the field is complete; t_i is the timestamp at which the measurement data in z_i were acquired;
Step 2000, obtaining the missing degree of the data from its flag bits;
Step 3000, processing the data according to its missing degree and data type c_i.
2. The data processing method based on flag bits according to claim 1, characterized in that the missing degree s is computed from m_1 and m_0 with weighting coefficients α and β, where m_1 is the number of flag bits whose value is 1, m_0 is the number of flag bits whose value is 0, and α + β = 1.
3. The data processing method based on flag bits according to claim 2, characterized in that step 3000 comprises: if s is greater than a third threshold k_3, discarding the data.
4. The data processing method based on flag bits according to claim 3, characterized in that step 3000 further comprises:
If s is less than or equal to the third threshold k_3 and c_i indicates that z_i is first measurement acquisition data, then v = f(v_1, v_2), i.e. v is a function of v_1 and v_2, where v is the missing measurement acquisition data field in z_i; v, v_1 and v_2 lie in different records but in the same field, and |t_1 - t_i| < |t_j - t_i| and |t_2 - t_i| < |t_j - t_i|, where j ≠ 1, 2, i, j ∈ [1, n_1]; n_1 is the number of first measurement data in the data set, t_1 and t_2 are the timestamps of the records containing v_1 and v_2, and t_j is the timestamp of the j-th first measurement data in the data set.
5. The data processing method based on flag bits according to claim 4, characterized in that step 3000 further comprises:
If s is less than or equal to the third threshold k_3 and c_i indicates that z_i is second measurement acquisition data, z_i is cleaned using the following method:
The method distinguishes three cases, the third of which applies when a field is missing, subject to i ≠ j.
6. The data processing method based on flag bits according to any one of claims 3-5, characterized in that k_3 is a positive real number, k_3 ≥ 0.8.
7. The data processing method based on flag bits according to claim 5, characterized in that k is a natural number, k ∈ [1, 10].
8. The data processing method based on flag bits according to claim 6, characterized in that k_3 = 1.5.
9. The data processing method based on flag bits according to claim 7, characterized in that k = 3.
CN201910566238.3A 2019-06-27 2019-06-27 Data processing method based on flag bit Pending CN110457293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910566238.3A CN110457293A (en) 2019-06-27 2019-06-27 Data processing method based on flag bit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910566238.3A CN110457293A (en) 2019-06-27 2019-06-27 Data processing method based on flag bit

Publications (1)

Publication Number Publication Date
CN110457293A true CN110457293A (en) 2019-11-15

Family

ID=68481214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910566238.3A Pending CN110457293A (en) 2019-06-27 2019-06-27 Data processing method based on flag bit

Country Status (1)

Country Link
CN (1) CN110457293A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909256A (en) * 2019-11-20 2020-03-24 华育昌(肇庆)智能科技研究有限公司 Artificial intelligence information filtering system for computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination