CN102486790A - System and method for filling data missing value - Google Patents


Publication number
CN102486790A
Authority
CN
China
Prior art keywords
data
rows
missing value
field
specific
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105799328A
Other languages
Chinese (zh)
Inventor
曾新穆
谢百恩
苏家辉
许芝华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute for Information Industry
Original Assignee
Institute for Information Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute for Information Industry
Priority to CN2010105799328A
Publication of CN102486790A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a system and a method for filling data missing values, applicable to a data array. The system comprises a storage unit, which holds the data array, and a computing device. The computing device identifies the complete data rows and the missing-value data rows in the data array, finds among the complete data rows at least one target data row that approximates each missing-value data row, and takes the known data at the corresponding positions of the target data row to compute estimated data that replace the unknown data of the missing-value data row. Next, a revision data row containing estimated data is selected from the missing-value data rows, a rough set for the selected estimated data is found by placing rows with identical data in the same group, and the values related to the estimated data are used to compute fill data, which are written into the field that originally held the estimated data.

Description

System and method for filling data missing values
Technical field
The present invention relates to a data filling system and method, and in particular to a system and method for filling in missing data values.
Background
Data collected and processed for biological and medical applications are often gathered at remote or scattered sites and then consolidated for organization or analysis. Gene data collection, for example, relies on a chip or detection device that tests biological tissue or captures physiological signals from cells, body fluids, or living animals and plants, producing many kinds of gene expression data. These gene expression data are recorded in a data array in the storage unit of the chip or detection device.
When gene expression data are collected for medical analysis as described above, however, gene expression values are often missing. At present, gene expression data with missing values cannot be used in analysis, so in many cases the data rows containing missing values are treated as invalid and deleted. When too many data rows are deleted, the analysis becomes inaccurate or impossible, and the most common remedy is to collect the gene expression data again with the same or a different chip or detection device. Whether the collection is repeated or another chip or detection device is used, precious medical data are clearly wasted. Existing techniques for filling missing data, on the other hand, mostly rely on linear regression, neural networks, and KNN (k-nearest neighbors). Linear regression and neural networks are difficult to apply to categorical data, and if different filling techniques are applied to related data arrays, the analysis results will be called into question. KNN, for its part, is unsuitable for large data arrays: searching the data takes too long, so its range of application is too narrow.
Therefore, how to provide a filling method that is applicable to all kinds of data arrays, does not require long processing time, and has a low error rate is a problem manufacturers should consider.
It is thus clear that existing data collection and processing still suffer from inconvenience and defects in method, product structure, and use, and urgently need improvement. Manufacturers have long sought a solution, but no suitable design has been developed, and conventional methods and products offer no appropriate method or structure to solve the problem. Creating a new system and method for filling data missing values has therefore become an important goal for the industry.
Summary of the invention
The object of the present invention is to overcome the defects of existing data collection and processing by providing a new system and method for filling data missing values. The technical problem to be solved is to use highly similar data rows as an aid to pairing, obtain the relevant estimated data, and fill the unknown data fields, which makes the invention well suited to practical use.
The object of the invention and the solution to its technical problem are achieved by the following technical scheme. A system for filling data missing values proposed according to the present invention comprises: a storage unit storing a data array, the data array comprising a plurality of data rows and a plurality of fields, the data rows comprising a plurality of complete data rows and a plurality of missing-value data rows, each missing-value data row comprising at least one unknown data; and a computing device comprising an analysis program and a processor that reads the data array and analyzes it with the analysis program. The processor finds among the complete data rows at least one target data row that approximates each missing-value data row, takes at least one known data from it to compute an estimated data, and replaces each corresponding unknown data with its estimated data, the replaced values serving as a plurality of data to be revised. The processor then picks a specific data to be revised from the data to be revised; according to the data variation trend of the field containing the specific data to be revised, selects from the fields a first specific field and a second specific field, in order of how closely their data variation trends approximate it; and, according to the data of the row containing the specific data to be revised, finds a data row group by placing rows with identical data in the same group. According to the field combination of the data row group and the second specific field, the processor again places rows with identical data in the same group to divide the data rows into a plurality of subgroups, and finds among them at least one target group whose data match the data row group. It uses the data of the target group in the field of the specific data to be revised to compute a fill data, which is written into the field of the specific data to be revised. Finally, the processor judges whether the row containing the specific data to be revised has other data to be revised, in order to decide whether to designate another specific data to be revised.
The object of the invention and the solution to its technical problem may be further achieved by the following technical measures.
In the aforesaid system for filling data missing values, the processor builds a complete-data curve for each complete data row and a missing-value data curve for each missing-value data row; compares the similarity between each missing-value data curve and the complete-data curves to find, among the complete-data curves, at least one approximate target data curve for each missing-value data curve; and, according to the pairing of each missing-value data curve with its target data curve, finds the at least one most approximate target data row for each missing-value data row.
In the aforesaid system for filling data missing values, when the data rows of a particular group among the subgroups match any data row of the data row group, the processor judges that particular group to be the target group and designates the field to be revised as the specific field.
In the aforesaid system for filling data missing values, the data of the data rows are numeric data, and the fill data is the mean of the values in the specific field of the at least one target group.
In the aforesaid system for filling data missing values, the data of the data rows are categorical data, and the estimated data written in advance into an unknown data field of a missing-value data row is one of the at least one known data of the corresponding at least one target data row.
The object of the invention and the solution to its technical problem may be further achieved by the following technical measures.
In the aforesaid method for filling data missing values, the step of taking, from the complete data rows, at least one approximate target data row for each missing-value data row comprises: building a complete-data curve for each complete data row; building a missing-value data curve for each missing-value data row; comparing the similarity between each missing-value data curve and the complete-data curves to find, among the complete-data curves, at least one approximate target data curve for each missing-value data curve; and, according to the pairing of each missing-value data curve with its target data curve, finding the at least one most approximate target data row for each missing-value data row.
In the aforesaid method for filling data missing values, the step of finding, among the subgroups, at least one target group whose data match the data row group comprises: when the data rows of a particular group among the subgroups match any data row of the data row group, judging that particular group to be the target group; and designating the field to be revised as the specific field.
In the aforesaid method for filling data missing values, the data of the data rows are numeric data, and the fill data is the mean of the values in the specific field of the at least one target group.
In the aforesaid method for filling data missing values, the data of the data rows are categorical data, and the estimated data written in advance into an unknown data field of a missing-value data row is one of the at least one known data of the corresponding at least one target data row.
Compared with the prior art, the present invention has clear advantages and beneficial effects. As the above technical scheme shows, the main technical content of the invention is as follows. A system for filling data missing values comprises a storage unit and a computing device. The storage unit stores a data array; the data array comprises a plurality of data rows and a plurality of fields, the data rows comprise a plurality of complete data rows and a plurality of missing-value data rows, and each missing-value data row comprises at least one unknown data. The computing device includes an analysis program and a processor, and the processor reads the data array and analyzes it with the analysis program. The processor finds among all the complete data rows at least one target data row that approximates each missing-value data row, takes at least one known data from it to compute an estimated data, and replaces each corresponding unknown data with its estimated data, the replaced values serving as a plurality of data to be revised. It then picks a specific data to be revised from all the data to be revised; according to the data variation trend of the field containing the specific data to be revised, selects from all the fields a first specific field and a second specific field, in order of how closely their data variation trends approximate it; and, according to the data of the row containing the specific data to be revised, finds a data row group by placing rows with identical data in the same group. According to the field combination of the data row group and the second specific field, it again places rows with identical data in the same group to divide the data rows into a plurality of subgroups, finds among them at least one target group whose data match the data row group, and uses the data of the target group in the field of the specific data to be revised to compute a fill data, which is written into the field of the specific data to be revised. Finally, it judges whether the row containing the specific data to be revised has other data to be revised, in order to decide whether to designate another specific data to be revised.
To solve the above problem with a method, the present invention also discloses a method for filling data missing values, applicable to a data array comprising a plurality of data rows and a plurality of fields. The method comprises: identifying in the data array a plurality of complete data rows and a plurality of missing-value data rows, each missing-value data row comprising at least one unknown data; taking from the complete data rows at least one approximate target data row for each missing-value data row; according to the field position of each unknown data in its missing-value data row, obtaining at least one known data from the target data row corresponding to that missing-value data row, and using the known data to compute an estimated data; replacing each corresponding unknown data with its estimated data, the replaced values serving as a plurality of data to be revised; designating a specific data to be revised among the data to be revised, the row containing it being a revision data row; according to the data variation trend of the field containing the specific data to be revised, selecting from the fields a first specific field whose data variation trend approximates it, and, according to the data of the row containing the specific data to be revised, finding a data row group that includes the revision data row by placing rows with identical data in the same group; selecting from the fields a second specific field whose data variation trend is the second closest to that of the field containing the specific data to be revised; according to the field combination of the field containing the specific data to be revised and the second specific field, dividing the data rows into a plurality of subgroups by placing rows with identical data in that field combination in the same group; finding among the subgroups at least one target group whose data match the data row group, and using the data of the target group in the field of the specific data to be revised to compute a fill data, which is written into the field of the specific data to be revised; and judging whether the row containing the specific data to be revised has other data to be revised, in order to decide whether to designate another specific data to be revised.
With the above technical scheme, the system and method for filling data missing values of the present invention have at least the following advantages and beneficial effects:
By combining the Pearson correlation coefficient with rough sets, a two-stage filling technique is adopted: after high-accuracy estimated data are filled in, the filled data are revised again, which helps improve the accuracy and validity of the analysis. Moreover, the technique can fill data that have missing values, so more data are retained and the filled data can be used in further analysis rather than being discarded lightly. This avoids repeating the collection of gene expression data, conserves medical resources, and saves manpower and technical cost.
The above is only an overview of the technical scheme of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages are easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of drawings
Fig. 1A is a system block diagram of an embodiment of the invention;
Fig. 1B is a flowchart of the method for filling data missing values according to an embodiment of the invention;
Fig. 1C and Fig. 1D are detailed flowcharts of steps in the method of Fig. 1B;
Fig. 2 shows a first example data array of an embodiment of the invention;
Fig. 3 is a schematic diagram of estimated values filled into the data array of an embodiment of the invention;
Fig. 4 is a schematic diagram of designating the specific data to be revised in the data array of an embodiment of the invention;
Fig. 5A is a schematic diagram of selecting the first specific field of the data array of an embodiment of the invention;
Fig. 5B is a schematic diagram of dividing the data array of an embodiment of the invention into data row groups;
Fig. 6A is a schematic diagram of another division of the data array of an embodiment of the invention into data row groups;
Fig. 6B is a schematic diagram of dividing the data array of an embodiment of the invention into subgroups;
Fig. 7 is a schematic diagram of group correspondence in the data array of an embodiment of the invention;
Fig. 8 shows a second example data array of an embodiment of the invention;
Fig. 9 is a schematic diagram of estimated values filled into the second data array of an embodiment of the invention; and
Fig. 10 is a schematic diagram of fill data written into the second data array of an embodiment of the invention.
10: storage unit
11: data array
11a: numeric data array
11b: categorical data array
20: computing device
21: processor
22: analysis program
23: data acquisition device
24: data storage unit
71: unknown data of the numeric data array
71': unknown data of the categorical data array
72, 72': estimated data
81: revision data row not yet revised
82: specific data to be revised
83: first specific field
83': second specific field
84: data row group
85: fill data
94: 4th subgroup
97: 7th subgroup
Embodiment
To further explain the technical means the present invention adopts to achieve its intended purpose, and their effects, the embodiments, methods, steps, structures, features, and effects of the system and method for filling data missing values proposed according to the invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
Fig. 1A is a system block diagram of an embodiment of the invention. The system comprises a computing device 20 and a storage unit 10. The storage unit 10 stores a data array 11, and the computing device 20 has a processor 21, a data acquisition device 23, and an analysis program 22. The data acquisition device 23 obtains the data array 11 from the storage unit 10, and the processor 21 uses the analysis program 22 to analyze it. Alternatively, the data array 11 may be acquired in advance and stored in the data storage unit 24 of the computing device 20, so that the processor 21 reads the data array 11 directly from the data storage unit 24 to carry out the missing-value filling described below.
The computing device 20 can be any electronic device with data-processing capability, such as a computer of any type, a PC, a notebook computer, a server, a workstation, or a PDA. The storage unit 10 can be an element or device with storage capability, such as a chip, memory, a hard disk, or a USB flash drive. It can also be placed in or integrated with another device, such as a detection device (which produces detection data after testing a biological specimen), a health-care box (which collects physiological signals of the human body), or a signal collection device (which collects signals of various kinds).
Refer to Fig. 1A together with Fig. 1B, a flowchart of the method for filling data missing values according to an embodiment of the invention, which applies to filling missing values in a data array. For ease of understanding, refer also to Fig. 1C and Fig. 1D, detailed flowcharts of steps in the method of Fig. 1B, and to Fig. 2 through Fig. 7: Fig. 2 shows a first example data array, Fig. 3 shows estimated values filled into the data array, Fig. 4 shows the designation of the specific data to be revised, Fig. 5A shows the selection of the first specific field, Fig. 5B shows the division into data row groups, Fig. 6A shows another division into data row groups, Fig. 6B shows the division into subgroups, and Fig. 7 shows the group correspondence.
As shown in Fig. 1A, the method comprises two stages. The first uses the Pearson correlation coefficient (PCC) to fill the unknown data fields with preliminary estimated data. The second uses rough sets to find an approximate value for each missing value and revise the original estimated data. The method flow is as follows:
First, a plurality of complete data rows and a plurality of missing-value data rows are identified in the data array, each missing-value data row comprising at least one unknown data (step S110). As shown in Fig. 2, taking the numeric data array 11a as an example, the data array 11a comprises a plurality of data rows and a plurality of fields.
Suppose the data array 11a comprises 10 data rows, of which the 4th, 5th, and 9th are complete data rows and the 1st, 2nd, 3rd, 6th, 7th, 8th, and 10th are missing-value data rows. Each missing-value data row comprises at least one unknown data 71 (represented by 0 in the figure): the unknown data field of the 1st data row is the 3rd field, that of the 2nd data row is the 1st field, that of the 3rd data row is the 4th field, those of the 6th data row are the 2nd and 3rd fields, and so on.
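Step S110 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_rows` and the small array are hypothetical and are not the data of Fig. 2; the only detail taken from the text is that 0 stands for an unknown data.

```python
# Minimal sketch of step S110 (hypothetical names and toy data).
# As in the figures, 0 is assumed to mark an unknown data.
UNKNOWN = 0

def split_rows(array):
    """Return (complete, missing) as lists of 0-based row indices."""
    complete, missing = [], []
    for idx, row in enumerate(array):
        if UNKNOWN in row:
            missing.append(idx)   # at least one unknown data in this row
        else:
            complete.append(idx)  # every field of this row is known
    return complete, missing

# Toy array, not the array of Fig. 2.
array = [
    [1, 2, 0, 4],
    [0, 3, 1, 2],
    [2, 2, 4, 2],
    [1, 3, 3, 1],
]
complete, missing = split_rows(array)
print(complete, missing)  # [2, 3] [0, 1]
```

Using 0 as the marker means a legitimate value of 0 would be misread as unknown; a real data set would likely need a dedicated marker such as `None`.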
Next, at least one approximate target data row is taken from the complete data rows for each missing-value data row (step S120). For this step, refer also to Fig. 1C, a flowchart of the data row curve comparison in an embodiment of the invention. The steps are as follows:
A complete-data curve is built for each complete data row (step S121), and a missing-value data curve is built for each missing-value data row (step S122).
Here, each complete data row is analyzed first, and its data are mapped onto two-dimensional data axes to obtain the complete-data curve corresponding to that row. Likewise, each missing-value data row is analyzed and, with its unknown data ignored, its data are mapped onto two-dimensional data axes to obtain the corresponding missing-value data curve.
The similarity between each missing-value data curve and the complete-data curves is compared, to find among all the complete-data curves the at least one most approximate target data curve for each missing-value data curve (step S123). Each missing-value data curve is compared with every complete-data curve one by one, which yields a similarity rate between each complete-data curve and that missing-value data curve. According to these similarity rates, each missing-value data curve is matched with at least one approximate target data curve.
Then, according to the pairing of each missing-value data curve with its target data curve, the at least one most approximate target data row is found for each missing-value data row (step S124). Each target data curve is produced by mapping a target data row onto the two-dimensional coordinate axes, so the pairing of a missing-value data curve with a target data curve can be traced back to the pairing of a missing-value data row with a target data row.
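The comparison in steps S121 to S124 can be sketched as follows. The text does not spell out how the similarity rate between two curves is computed; this sketch assumes the Pearson correlation coefficient that the method names for its first stage, evaluated only over the fields that are known in the missing-value row. The names `pearson` and `find_target_rows` and the toy rows are hypothetical.

```python
import math

UNKNOWN = 0  # assumed marker for an unknown data, as in the figures

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def find_target_rows(missing_row, complete_rows):
    """Return the ids of the complete rows most similar to missing_row.

    complete_rows maps a row id to its values. Unknown fields of
    missing_row are ignored, as the text prescribes for its curve.
    """
    known = [k for k, v in enumerate(missing_row) if v != UNKNOWN]
    xs = [missing_row[k] for k in known]
    scores = {rid: pearson(xs, [row[k] for k in known])
              for rid, row in complete_rows.items()}
    best = max(scores.values())
    # Ties are kept: a missing-value row may have several target rows.
    return sorted(rid for rid, s in scores.items() if math.isclose(s, best))

# Toy rows, not the data of Fig. 2.
missing_row = [1, 2, 0, 4]
complete_rows = {3: [2, 4, 5, 8], 4: [4, 3, 2, 1], 8: [1, 2, 9, 4]}
targets = find_target_rows(missing_row, complete_rows)
print(targets)  # [3, 8]
```

Rows 3 and 8 both follow the same trend as the missing-value row over its known fields, so both are returned, which mirrors the case in the text where a row has two target data rows.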
Alternatively, step S120 can compare the values in the same field positions with one another to measure the degree of difference between a missing-value data row and each complete data row, and from that the similarity between them, in order to obtain the pairing of each missing-value data row with the complete data row of highest similarity. This method is known to those of ordinary skill in the field of data comparison and is not described here.
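A minimal sketch of this alternative is given below. The text leaves the difference measure to the skilled reader, so the sum of absolute differences used here is an assumption, as are the names `difference` and `closest_row` and the toy rows.

```python
UNKNOWN = 0  # assumed marker for an unknown data, as in the figures

def difference(missing_row, complete_row):
    """Sum of absolute differences over the known fields of missing_row.

    The choice of this measure is an assumption; the text only says that
    values in the same positions are compared.
    """
    return sum(abs(a - b)
               for a, b in zip(missing_row, complete_row)
               if a != UNKNOWN)

def closest_row(missing_row, complete_rows):
    """Return the id of the complete row with the smallest difference."""
    return min(complete_rows,
               key=lambda rid: difference(missing_row, complete_rows[rid]))

# Toy rows, not the data of Fig. 2.
missing_row = [1, 2, 0, 4]
complete_rows = {3: [2, 4, 5, 8], 8: [1, 2, 9, 4]}
print(closest_row(missing_row, complete_rows))  # 8
```

Unlike a correlation-based comparison, this measure is sensitive to scale: row 3 has the same trend as the missing-value row but differs in magnitude, so only row 8 is chosen.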
According to the field position of each unknown data in its missing-value data row, at least one known data is obtained from the target data row corresponding to that missing-value data row, and the known data is used to compute an estimated data (step S130). Each estimated data then replaces its corresponding unknown data, and the replaced values serve as a plurality of data to be revised (step S140).
In this step, the estimated data written in advance into an unknown data field of a missing-value data row is the mean of the known data in the same field of its corresponding target data rows. For example, in Fig. 2 and Fig. 3 the data of the data rows are numeric. The 1st data row has unknown data 71 in the 3rd field, and the complete data row most similar to the 1st data row is the 5th data row, so the 3rd field of the 1st data row takes 3 (3/1 = 3) as estimated data 72. Likewise, the 2nd data row has unknown data 71 in the 1st field, and the complete data row closest to the 2nd data row is the 4th data row, so the 1st field of the 2nd data row takes 4 (4/1 = 4) as estimated data 72. Likewise, the 3rd data row has unknown data 71 in the 4th field, and the complete data rows closest to the 3rd data row are the 4th and 9th data rows, so the 4th field of the 3rd data row takes 2 ((2+2)/2 = 2) as estimated data 72. In the same way, every unknown data 71 is replaced with its estimated data 72, which completes the first-stage filling of the unknown data. The values filled in are treated as the data to be revised that the subsequent steps use, as shown in Fig. 3.
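The averaging in steps S130 and S140 can be sketched as below. The function name `estimate` is hypothetical. In the two rows shown, only the values the text states for the 4th and 9th data rows are taken from the source (1st field of row 4 is 4, 3rd field of row 4 is 4, 4th field of rows 4 and 9 is 2); every other value is a placeholder chosen for illustration.

```python
def estimate(field, target_rows):
    """Mean of the known data that the target rows hold in `field` (0-based)."""
    values = [row[field] for row in target_rows]
    return sum(values) / len(values)

# Values not stated in the text are hypothetical placeholders.
row4 = [4, 1, 4, 2, 3]
row9 = [3, 1, 2, 2, 1]

# 3rd data row, 4th field: target rows are rows 4 and 9 -> (2 + 2) / 2
print(estimate(3, [row4, row9]))  # 2.0
# 2nd data row, 1st field: target row is row 4 -> 4 / 1
print(estimate(0, [row4]))        # 4.0
```

The mean applies to numeric data only; for categorical data the text instead takes a value from the known data of the target rows.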
Next, the estimated data are revised. As shown in Fig. 1B, after step S140 a specific data to be revised is designated among all the data to be revised (step S150), and the row containing it is a revision data row. Refer also to Fig. 4: from all the data to be revised that were filled in as estimated data, one is selected as the specific data to be revised that is currently being corrected, and the row containing it is treated as a revision data row. Below, the 1st data row is taken as the not-yet-revised revision data row 81 and the 3rd field of the 1st data row as the specific data to be revised 82, which is here replaced by 0 again.
Then, according to the data variation trend of the field containing the specific data to be revised, a first specific field whose data variation trend approximates it is selected from all the fields, and, according to the data of the row containing the specific data to be revised, a data row group that includes the revision data row is found by placing rows with identical data in the same group (step S160). How closely a field approximates the data variation trend of the field containing the specific data to be revised is measured by the data benefit value of that field. Refer also to Fig. 1D, a flowchart for finding the data row group in an embodiment of the invention. The steps are as follows: the data benefit value of each field is calculated first (step S161), and the field with the highest data benefit value is selected as the first specific field (step S162). The data benefit value of each field is calculated as follows:
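$$\mathrm{Cor}(i,j)=\frac{\sum_{k=1}^{m}\left(v_{i,k}-\frac{\sum_{l=1}^{m}v_{i,l}}{m}\right)\left(v_{j,k}-\frac{\sum_{l=1}^{m}v_{j,l}}{m}\right)}{\sqrt{\sum_{k=1}^{m}\left(v_{i,k}-\frac{\sum_{l=1}^{m}v_{i,l}}{m}\right)^{2}\sum_{k=1}^{m}\left(v_{j,k}-\frac{\sum_{l=1}^{m}v_{j,l}}{m}\right)^{2}}}$$ (Formula 1)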
Thus {cor(1, field number of the unknown data of the revision data row), cor(2, field number of the unknown data of the revision data row), cor(4, field number of the unknown data of the revision data row), cor(5, field number of the unknown data of the revision data row)} = {0.867, -0.419, -0.062, 0.600}, where the field number of the unknown data of revision data row 81 is 3. In this example the 1st field has the highest data benefit value, so the 1st field is taken as the first specific field 83. All the data rows are then grouped on the data of the 1st field, with rows holding identical data placed in the same group. As shown in Fig. 5A and Fig. 5B, according to the data in the 1st field of each data row (the aforesaid 1st field, which is also the first specific field 83), all the data rows are divided into four groups, and the 1st, 2nd, 3rd, and 4th data rows fall into the same data row group 84.
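Formula 1 and steps S161 and S162 can be sketched as follows, reading Formula 1 as the Pearson correlation between two fields taken over all m data rows. The names `cor`, `rank_fields`, and `group_rows` and the toy array are hypothetical, and the sketch does not model the detail that the specific data to be revised is temporarily reset to 0.

```python
import math
from collections import defaultdict

def cor(array, i, j):
    """Formula 1: correlation of fields i and j over all rows (0-based)."""
    m = len(array)
    mean_i = sum(row[i] for row in array) / m
    mean_j = sum(row[j] for row in array) / m
    num = sum((row[i] - mean_i) * (row[j] - mean_j) for row in array)
    den = math.sqrt(sum((row[i] - mean_i) ** 2 for row in array)
                    * sum((row[j] - mean_j) ** 2 for row in array))
    return num / den if den else 0.0

def rank_fields(array, target):
    """Other fields ordered from highest to lowest value against `target`."""
    others = [i for i in range(len(array[0])) if i != target]
    return sorted(others, key=lambda i: cor(array, i, target), reverse=True)

def group_rows(array, field):
    """Place rows holding identical data in `field` in the same group."""
    groups = defaultdict(list)
    for idx, row in enumerate(array):
        groups[row[field]].append(idx)
    return dict(groups)

# Toy array, not the data of Fig. 5A.
array = [
    [1, 5, 1, 2],
    [1, 4, 2, 1],
    [2, 3, 3, 2],
    [2, 1, 4, 1],
]
ranking = rank_fields(array, target=2)   # field holding the data to revise
first_field = ranking[0]                 # highest value: first specific field
print(ranking)                           # [0, 3, 1]
print(group_rows(array, first_field))    # {1: [0, 1], 2: [2, 3]}
```

The second entry of the ranking would serve as the second specific field in the next step.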
From these fields, a second specific data field is selected, namely the field whose data variation trend is the second most similar to that of the field containing the specific data to be revised. The field containing the specific data to be revised and the second specific data field form a field combination, and according to this field combination the data rows are divided into a plurality of subgroups, with rows holding identical data placed in the same subgroup (step S170).
In this step, to reduce the complexity of the data comparison, all data rows may first be grouped by the field to which the specific data to be revised 82 of the revised data row belongs, with rows holding identical data placed in the same group. As shown in Fig. 6A and Fig. 6B, the specific data to be revised of the revised data row lies in the 3rd field, so the data rows are divided into 4 groups according to identical data in the 3rd field. However, the specific data to be revised 82 of the revised data row is 0, so whether that row forms a group of its own has no effect on the subsequent computation, and the revised data row is ignored here.
Referring to Fig. 5A, the 4th field has the second-highest data benefit value, so the 4th field is taken as the second specific data field 83'. The 3rd field and the 4th field of the 1st data row are therefore used as the reference field combination and compared with the data formed by the 3rd and 4th fields of each data row, so that the 4 groups previously divided can be further divided into 8 subgroups. The 3rd data row and the 4th data row have the same data combination in the 3rd and 4th fields (both are 4, 2; see the boxed selection in the figure), so the 3rd and 4th data rows are placed in the same subgroup (the 7th subgroup 97 in the figure). Likewise, the specific data to be revised 82 of the revised data row 81 is 0, so whether that row forms a group of its own has no effect on the subsequent computation, and the revised data row 81 is ignored here.
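The subgroup division of step S170 amounts to partitioning the rows by the tuple of values they hold in the field combination. The sketch below assumes 0-based indices and a hypothetical helper `split_into_subgroups`. The sample rows are invented for illustration, with only rows 2 and 3 made to share the (4, 2) combination described above.

```python
from collections import defaultdict


def split_into_subgroups(rows, fields, skip=()):
    """Partition row indices by the tuple of values they hold in `fields`."""
    subgroups = defaultdict(list)
    for idx, row in enumerate(rows):
        if idx in skip:  # e.g. the revised data row, which is ignored here
            continue
        subgroups[tuple(row[f] for f in fields)].append(idx)
    return dict(subgroups)


# Hypothetical rows; index 0 stands in for the revised data row.
rows = [
    [1, 2, 0, 3, 5],
    [1, 4, 3, 1, 2],
    [1, 3, 4, 2, 4],
    [1, 5, 4, 2, 1],
    [2, 1, 4, 1, 3],
]
# Field combination: field to be revised (index 2) plus second specific field (index 3).
subgroups = split_into_subgroups(rows, fields=(2, 3), skip={0})
print(subgroups)
```

Keying on the whole tuple means rows that agree in the field to be revised but differ in the second specific data field land in different subgroups, which is what refines the 4 groups into finer subgroups.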
At least one target group whose data match the data row group is found from all the subgroups, and the data of all target groups in the field to be revised are used to calculate a fill data value, which is filled into the specific field to be revised (step S180). Specifically, when the data rows of a particular group among the subgroups match any of the data rows in the data row group, that particular group is judged to be a target group, and the field to be revised is designated as the specific data field.
As shown in Figure 7, data rows group 84 comprises the 1st data rows, the 2nd data rows, the 3rd data rows and the 4th data rows.Yet; The 4th subgroup 94 comprises the 2nd data rows, and the 7th subgroup 97 comprises the 3rd data rows and the 4th data rows, with regard to mathematical meaning; The 4th subgroup 94 and the 7th subgroup 97 are comprised by data rows group 84; Promptly the 4th subgroup 94 and the 7th subgroup 97 are above-mentioned particular demographic, and the 3rd hurdle of the numerical value on the 3rd hurdle of the 4th subgroup 94 and the 7th subgroup 97 is above-mentioned specific data hurdle, and its numerical value will be used in by the calculating back and wait to revise in the field.Therefore, the fill data that the field specific to be repaiied of the 1st data rows should be inserted is that the numerical value on the 3rd column number value of the 4th subgroup 94 and the 3rd hurdle of the 7th subgroup 97 promptly is (3+4)/2=3.5 divided by 2.In addition, fill data promptly is " the specific numerical value totalling of waiting to repair field that is selected subgroup/be selected subgroup number ".So it is 3.5 that the field specific to be repaiied of the 1st data rows should be inserted numerical value.
Afterwards, it is judged whether the row containing the specific data to be revised has other data to be revised (step S190). When all data to be revised in that row have been revised, the operation ends; otherwise another specific data to be revised is designated, that is, the flow returns to step S150 and continues through steps S150 to S190 until all specific data to be revised have been revised.
Refer next to Fig. 8 to Fig. 10, which are schematic diagrams of the changes of a second data array and of its data row groups, together with Fig. 1A to Fig. 1D for ease of understanding. Fig. 8 shows a second example data array of an embodiment of the invention, taking a classification-type data array 11b as an example. Assume the data array comprises 9 data rows, of which the 5th, 7th and 9th data rows are missing-value data rows. Each missing-value data row comprises at least one unknown data 71'; for example, the unknown data 71' of the 5th data row is in the 1st field, the unknown data 71' of the 7th data row is in the 2nd field, the unknown data 71' of the 9th data row is in the 1st field, and so on.
Likewise, through steps S110 to S140, all unknown data of the data array shown in Fig. 8 are replaced by the corresponding estimated data, completing the first-stage filling of the unknown data, as illustrated in Fig. 9. For example, the Pearson correlation coefficient formula may be used to calculate the estimated data. Its main idea is to analyze, for similar rows, how the data value in each field varies from the row's average data value, to combine that variation with the average value of the row that has the missing value, and from the result to calculate the estimated data of the missing value.
The Pearson correlation coefficient formula is as follows:
$$\mathrm{sim}(u,v)=\frac{\sum_{i\in I}\left(r_{u,i}-\bar{r}_u\right)\left(r_{v,i}-\bar{r}_v\right)}{\sqrt{\sum_{i\in I}\left(r_{u,i}-\bar{r}_u\right)^{2}}\;\sqrt{\sum_{i\in I}\left(r_{v,i}-\bar{r}_v\right)^{2}}},\qquad(1)$$
where $I = I_u \cap I_v$.
Here $u$ and $v$ denote two data rows, $r_{u,i}$ and $r_{v,i}$ are the values of the $i$-th field of rows $u$ and $v$ respectively, $\bar{r}_x$ is the mean value of row $x$, and $I$ is the set of fields in which both data rows have values. Taking Fig. 2 as an example, the similarity of the 2nd row and the 3rd row is calculated as follows:
[Equation images BSA00000378860100111 and BSA00000378860100112 are not reproduced in this text.]
$$\mathrm{similarity}(\text{2nd row},\text{3rd row})=\frac{(3-2.5)(2-3.25)+(3-2.5)(4-3.25)+(3-2.5)(3-3.25)}{\sqrt{(3-2.5)^{2}+(3-2.5)^{2}+(3-2.5)^{2}}\;\sqrt{(2-3.25)^{2}+(4-3.25)^{2}+(3-3.25)^{2}}}=\frac{0.125}{\sqrt{0.25+0.25+0.25}\;\sqrt{0.5625+0.5625+0.0625}}=0.14$$
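The row-to-row similarity can be sketched in Python as follows, using `None` to mark a missing value. The names `row_mean` and `sim` are illustrative. Taking each row's mean over all of its known values (rather than only over the shared fields) and returning 0.0 for a zero denominator are assumptions of this sketch. The sample rows are hypothetical and are not the data of Fig. 2.

```python
from math import sqrt


def row_mean(row):
    """Mean of the known (non-None) values of a row."""
    known = [x for x in row if x is not None]
    return sum(known) / len(known)


def sim(u, v):
    """Pearson-style similarity of rows u and v over the fields both have values in."""
    common = [i for i in range(len(u)) if u[i] is not None and v[i] is not None]
    mu, mv = row_mean(u), row_mean(v)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = sqrt(sum((u[i] - mu) ** 2 for i in common)) * sqrt(
        sum((v[i] - mv) ** 2 for i in common)
    )
    # Assumption: rows with no variation on the shared fields get similarity 0.
    return num / den if den else 0.0


# Hypothetical rows: the first has a missing value in its last field.
print(sim([1, 2, 3, None], [2, 4, 6, 5]))
```

The complete row with the highest similarity to a missing-value row would then serve as its target data row.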
Next, the result is predicted from the target field values of the similar rows, using the following general formula:
$$P_{u,i}=\bar{r}_u+\frac{\sum_{v\in U}S_{u,v}\left(r_{v,i}-\bar{r}_v\right)}{\sum_{v\in U}\left|S_{u,v}\right|},\qquad(1)$$
where $U$ is the set of all users (data rows) similar to $u$.
Here $P_{u,i}$ is the target field value of the $i$-th field of row $u$, $\bar{r}_u$ is the average field value of row $u$, and $S_{u,v}$ is the similarity of row $u$ and row $v$. Taking Fig. 2 as an example, suppose the value to be predicted is that of the 1st field of the 2nd row. The data rows most relevant to the 2nd row are determined first; in Fig. 2 the 1st row is the most similar to the 2nd row, with a calculated similarity of 0.353, so the final prediction is $P_{2,1} = 2.5 + (0.353 \times (4-3))/0.353 = 3.5$.
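The prediction formula can be checked with a few lines of Python. The function name `predict` and the triple layout are assumptions of this sketch, as is falling back to the row mean when no similar row is available. The numbers in the example call are the ones given in the text (row mean 2.5, one similar row with similarity 0.353, field value 4 and row mean 3).

```python
def predict(mean_u, neighbours):
    """Predict a field value for row u.

    neighbours is a list of (S_uv, r_vi, mean_v) triples, one per similar row v:
    the similarity, v's value in the target field, and v's row mean.
    """
    weight = sum(abs(s) for s, _, _ in neighbours)
    if weight == 0:
        return mean_u  # assumption: no usable neighbour, fall back to the row mean
    return mean_u + sum(s * (r - m) for s, r, m in neighbours) / weight


print(predict(2.5, [(0.353, 4, 3)]))
```

With a single neighbour the similarity cancels out, so the prediction is simply the row mean shifted by the neighbour's deviation from its own mean; the weights only matter when several similar rows are combined.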
However, this example differs from the previous embodiment as follows. In the previous embodiment the data of the data rows are numeric, and the estimated data 72' is the mean of the relevant known data of the target data rows corresponding to the missing-value data row to which the unknown data 71' belongs. In this example the data of the data rows are classification-type data, and the estimated data 72' is the most frequently occurring value among the relevant known data of the target data rows corresponding to the missing-value data row to which the unknown data 71' belongs. For example, suppose the target data rows corresponding to the 5th data row are the 1st to 4th data rows; if the value occurring most often in the 1st field of these rows is L, the 1st field of the 5th data row is estimated to be L.
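The difference between the two data types can be sketched as a single helper: the mean for numeric data, the most frequent value for classification-type data. The name `estimate` and the sample values are hypothetical, and the tie-breaking rule (the value encountered first wins) is simply the behavior of `Counter.most_common`, not something the text specifies.

```python
from collections import Counter
from statistics import mean


def estimate(known_values, categorical):
    """Estimate a missing value from the known data of the target data rows."""
    if categorical:
        # Most frequent value; ties go to the value seen first.
        return Counter(known_values).most_common(1)[0][0]
    return mean(known_values)


# Hypothetical 1st-field values of the 1st to 4th data rows.
print(estimate(["L", "M", "L", "H"], categorical=True))
print(estimate([2, 4, 3, 3], categorical=False))
```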
Similarly, after the preliminary estimated data 72' have been filled into the second data array shown in Fig. 9, steps S150 to S190 are applied to revise the specific data to be revised of each missing-value data row, replacing them with the calculated fill data 85, as shown in Fig. 10.
In this example, steps S150 to S190 may be carried out with reference to known techniques, for example the document T. P. Hong, L. H. Tseng, and S. L. Wang, "Learning rules from incomplete training examples by rough sets," Expert Systems with Applications, Vol. 22, pp. 285, 2002.
The above are only preferred embodiments of the present invention and do not limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications resulting in equivalent embodiments. Any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the invention.

Claims (10)

1. A system for filling missing data values, characterized by comprising:
A storage unit storing a data array, the data array comprising a plurality of data rows and a plurality of fields, the data rows comprising a plurality of complete data rows and a plurality of missing-value data rows, each missing-value data row comprising at least one unknown data; and
A computing device comprising:
An analysis program; and
A processor for reading the analysis program and using it to analyze the data array, wherein the processor finds, from the complete data rows, at least one target data row similar to each missing-value data row, and takes at least one known data from it to calculate an estimated data, which replaces the corresponding unknown data and serves as one of a plurality of data to be revised; designates a specific data to be revised from the data to be revised; according to the data variation trend of the field containing the specific data to be revised, selects from the fields a first specific data field and a second specific data field in order of similarity of data variation trend; according to the data of the row containing the specific data to be revised, finds a data row group by grouping rows with identical data; then, according to the data row group and the field combination involving the second specific data field, divides the data rows into a plurality of subgroups by grouping rows with identical data; finds among the subgroups at least one target group whose data match the data row group, and uses the data of the target group in the field to be revised to calculate a fill data that is filled into the specific field to be revised; and judges whether the row containing the specific data to be revised has other data to be revised, so as to decide whether to designate another specific data to be revised.
2. The system for filling missing data values as claimed in claim 1, characterized in that the processor establishes a complete data curve for each complete data row, establishes a missing-value data curve for each missing-value data row, and compares the similarity of each missing-value data curve with the complete data curves, so as to find from the complete data curves at least one similar target data curve corresponding to each missing-value data curve; and, according to the pairing of the missing-value data curves with the target data curves, finds the at least one most similar target data row for each missing-value data row.
3. The system for filling missing data values as claimed in claim 1, characterized in that when the data rows of a particular group among the subgroups match any of the data rows in the data row group, the processor judges the particular group to be the target group and designates the field to be revised as the specific data field.
4. The system for filling missing data values as claimed in claim 1, characterized in that the data of the data rows are numeric data, and the fill data is the mean of the values in the specific data field of the at least one target group.
5. The system for filling missing data values as claimed in claim 1, characterized in that the data of the data rows are classification-type data, and the estimated data is a data among the at least one known data of the at least one target data row corresponding to the missing-value data row to which the unknown data it replaces belongs.
6. A method for filling missing data values, applicable to a data array comprising a plurality of data rows and a plurality of fields, characterized in that the method comprises:
Finding a plurality of complete data rows and a plurality of missing-value data rows in the data array, each missing-value data row comprising at least one unknown data;
Taking from the complete data rows at least one similar target data row for each missing-value data row;
According to the field position of each unknown data in the missing-value data row to which it belongs, obtaining at least one known data from the at least one target data row corresponding to that missing-value data row, and using the at least one known data to calculate an estimated data;
Replacing each unknown data with its corresponding estimated data, the estimated data serving as a plurality of data to be revised;
Designating a specific data to be revised from the data to be revised, the row containing the specific data to be revised being a revised data row;
According to the data variation trend of the field containing the specific data to be revised, selecting from the fields a first specific data field with a similar data variation trend, and, according to the data of the row containing the specific data to be revised, finding a data row group by grouping rows with identical data;
Selecting from the fields a second specific data field whose data variation trend is the second most similar to that of the field containing the specific data to be revised, and, according to the field combination of the field containing the specific data to be revised and the second specific data field, dividing the data rows into a plurality of subgroups by grouping rows with identical data;
Finding from the subgroups at least one target group whose data match the data row group, and using the data of the at least one target group in the field to be revised to calculate a fill data that is filled into the field of the specific data to be revised; and
Judging whether the row containing the specific data to be revised has other data to be revised, so as to decide whether to designate another specific data to be revised.
7. The method for filling missing data values as claimed in claim 6, characterized in that the step of taking from the complete data rows at least one similar target data row for each missing-value data row comprises:
Establishing a complete data curve for each complete data row;
Establishing a missing-value data curve for each missing-value data row;
Comparing the similarity of each missing-value data curve with the complete data curves, so as to find from the complete data curves at least one similar target data curve corresponding to each missing-value data curve; and
According to the pairing of the missing-value data curves with the target data curves, finding the at least one most similar target data row for each missing-value data row.
8. The method for filling missing data values as claimed in claim 6, characterized in that the step of finding from the subgroups at least one target group whose data match the data row group comprises:
When the data rows of a particular group among the subgroups match any of the data rows in the data row group, judging the particular group to be the target group; and
Designating the field to be revised as the specific data field.
9. The method for filling missing data values as claimed in claim 6, characterized in that the data of the data rows are numeric data, and the fill data is the mean of the values in the specific data field of the at least one target group.
10. The method for filling missing data values as claimed in claim 6, characterized in that the data of the data rows are classification-type data, and the estimated data is a data among the at least one known data of the at least one target data row corresponding to the missing-value data row to which the unknown data it replaces belongs.
CN2010105799328A 2010-12-02 2010-12-02 System and method for filling data missing value Pending CN102486790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105799328A CN102486790A (en) 2010-12-02 2010-12-02 System and method for filling data missing value

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105799328A CN102486790A (en) 2010-12-02 2010-12-02 System and method for filling data missing value

Publications (1)

Publication Number Publication Date
CN102486790A true CN102486790A (en) 2012-06-06

Family

ID=46152283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105799328A Pending CN102486790A (en) 2010-12-02 2010-12-02 System and method for filling data missing value

Country Status (1)

Country Link
CN (1) CN102486790A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015154568A1 (en) * 2014-08-11 2015-10-15 中兴通讯股份有限公司 Data collection optimization method and system, and server
CN107341202A (en) * 2017-06-21 2017-11-10 平安科技(深圳)有限公司 Appraisal procedure, device and the storage medium of business datum table amendment risk factor
WO2020140662A1 (en) * 2019-01-02 2020-07-09 深圳壹账通智能科技有限公司 Data table filling method, apparatus, computer device, and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015154568A1 (en) * 2014-08-11 2015-10-15 中兴通讯股份有限公司 Data collection optimization method and system, and server
CN107341202A (en) * 2017-06-21 2017-11-10 平安科技(深圳)有限公司 Appraisal procedure, device and the storage medium of business datum table amendment risk factor
CN107341202B (en) * 2017-06-21 2018-06-08 平安科技(深圳)有限公司 Business datum table corrects appraisal procedure, device and the storage medium of danger level
WO2018233117A1 (en) * 2017-06-21 2018-12-27 平安科技(深圳)有限公司 Method and device for evaluating correction risk factors of business data tables, and storage medium
WO2020140662A1 (en) * 2019-01-02 2020-07-09 深圳壹账通智能科技有限公司 Data table filling method, apparatus, computer device, and storage medium

Similar Documents

Publication Publication Date Title
Greenacre Compositional data analysis
Eason et al. Evaluating the sustainability of a regional system using Fisher information in the San Luis Basin, Colorado
US20160070950A1 (en) Method and system for automatically assigning class labels to objects
CN106886601A (en) A kind of Cross-modality searching algorithm based on the study of subspace vehicle mixing
CN104504583B (en) The evaluation method of grader
CN102693452A (en) Multiple-model soft-measuring method based on semi-supervised regression learning
Xie et al. A novel method to attribute reduction based on weighted neighborhood probabilistic rough sets
Saed-Moucheshi et al. A review on applied multivariate statistical techniques in agriculture and plant science.
Wang et al. Optimization of the number of components in the mixed model using multi-criteria decision-making
CN108038211A (en) A kind of unsupervised relation data method for detecting abnormality based on context
Skinner Analysis of categorical data for complex surveys
CN102486790A (en) System and method for filling data missing value
CN113724195B (en) Quantitative analysis model and establishment method of protein based on immunofluorescence image
CN114463587A (en) Abnormal data detection method, device, equipment and storage medium
CN102141988B (en) Method, system and device for clustering data in data mining system
Freulon et al. CytOpT: Optimal transport with domain adaptation for interpreting flow cytometry data
CN115600102B (en) Abnormal point detection method and device based on ship data, electronic equipment and medium
TWI599896B (en) Multiple decision attribute selection and data discretization classification method
Li et al. Intelligent product-gene acquisition method based on K-means clustering and mutual information-based feature selection algorithm
Himani et al. A comparative study on machine learning based prediction of citations of articles
RU2586025C2 (en) Method for automatic clustering of objects
CN108286957A (en) A kind of Flatness error evaluation method of fast steady letter
CN112735596A (en) Similar patient determination method and device, electronic equipment and storage medium
CN101710392B (en) Important information acquiring method based on variable boundary support vector machine
Srinivasarao et al. Introduction to data science: Review, challenges, and opportunities

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120606