CN108038131A - Data quality analysis preprocessing method and device, storage medium, and terminal
- Publication number
- CN108038131A (application CN201711146673.8A)
- Authority
- CN
- China
- Prior art keywords
- network
- scoring
- label value
- node
- incidence relation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A data quality analysis preprocessing method and device, a storage medium, and a terminal. The data quality analysis preprocessing method includes: extracting the label value sources of data provided by multiple suppliers; taking the label value sources and the corresponding label values as nodes of a network, and determining the association relationship between every two nodes according to preset association relationships between the label value sources, where the association relationship includes a parent-child relationship and an association strength; calculating a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network; and selecting a sub-network as the Bayesian network based on the likelihood score and the complexity score, for use in assessing the accuracy rate of data to be assessed. The technical solution of the present invention can build a Bayesian network for data quality analysis, so as to improve the accuracy of data quality analysis.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a data quality analysis preprocessing method and device, a storage medium, and a terminal.
Background
Data quality analysis relies on comparison with the true values of the data, but in the big data field the true values are often difficult to obtain.
Existing methods mainly determine data accuracy by letting the different sources of the data vote, that is, by tallying the rough judgments that different information providers make about the data. For example, for the user of a mobile device, a mobile phone operator may judge from the applications (apps) the user has downloaded that the user is male, while a dating website may conclude from the information the user filled in that the user is female.
The prior art cannot distinguish differences in the quality of the data suppliers themselves when assessing data, which makes the assessment of data quality inaccurate.
Summary of the invention
The technical problem solved by the present invention is how to build a Bayesian network from the judgment bases of the label values of the data provided by suppliers, so as to improve the accuracy of data quality analysis.
To solve the above technical problem, an embodiment of the present invention provides a data quality analysis preprocessing method, which includes: extracting the label value sources of data provided by multiple suppliers; taking the label value sources and the corresponding label values as nodes of a network, and determining the association relationship between every two nodes according to preset association relationships between the label value sources, where the association relationship includes a parent-child relationship and an association strength; calculating a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network; and selecting a sub-network as the Bayesian network based on the likelihood score and the complexity score, for use in assessing the accuracy rate of data to be assessed.
Optionally, selecting a sub-network as the Bayesian network based on the likelihood score and the complexity score includes: selecting, as the Bayesian network, the sub-network for which the difference between the likelihood score and the complexity score is largest; or selecting, as the Bayesian network, one of the sub-networks for which the difference between the likelihood score and the complexity score exceeds a set threshold.
Optionally, calculating the likelihood score and the complexity score of the sub-networks of the network includes: calculating the likelihood score according to the degree of fit between the constructed Bayesian network and actual samples; and calculating the complexity score according to the number of nodes in the sub-network that have the parent-child relationship and the total number of nodes.
Optionally, determining the association relationship between every two nodes according to the preset association relationships between the label value sources includes: determining the preset association relationships between the label value sources by looking up a preset association relationship list, where a preset association relationship includes the association strength and the ordering between the label value sources; and integrating the preset association relationships between the label value sources to form the association relationships between the nodes.
Optionally, the likelihood score and the complexity score of the sub-networks of the network are calculated using the Bayesian information criterion.
Optionally, the data quality analysis preprocessing method further includes: matching the label value sources in the data to be assessed against the nodes in the Bayesian network, and calculating the accuracy rate of the data to be assessed according to the association relationship between every two matched nodes in the matching result.
An embodiment of the present invention also discloses a data quality analysis preprocessing device, which includes: a label value source extraction module, adapted to extract the label value sources of data provided by multiple suppliers; a node determining module, adapted to take the label value sources and the corresponding label values as nodes of a network, and to determine the association relationship between every two nodes according to preset association relationships between the label value sources, where the association relationship includes a parent-child relationship and an association strength; a computing module, adapted to calculate a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network; and a Bayesian network determining module, adapted to select a sub-network as the Bayesian network based on the likelihood score and the complexity score, for use in assessing the accuracy rate of data to be assessed.
Optionally, the Bayesian network determining module includes: a first selection unit, adapted to select, as the Bayesian network, the sub-network for which the difference between the likelihood score and the complexity score is largest; and a second selection unit, adapted to select, as the Bayesian network, one of the sub-networks for which the difference between the likelihood score and the complexity score exceeds a set threshold.
Optionally, the computing module includes: a likelihood score computing unit, adapted to calculate the likelihood score according to the degree of fit between the constructed Bayesian network and actual test samples; and a complexity score computing unit, adapted to calculate the complexity score according to the number of nodes in the sub-network that have the parent-child relationship and the total number of nodes.
Optionally, the node determining module includes: a lookup unit, adapted to determine the preset association relationships between the label value sources by looking up a preset association relationship list, where a preset association relationship includes the association strength and the ordering between the label value sources; and an integration unit, adapted to integrate the preset association relationships between the label value sources to form the association relationships between the nodes.
Optionally, the computing module calculates the likelihood score and the complexity score of the sub-networks of the network using the Bayesian information criterion.
Optionally, the data quality analysis preprocessing device further includes: an evaluation module, adapted to match the label value sources in the data to be assessed against the nodes in the Bayesian network, and to calculate the accuracy rate of the data to be assessed according to the association relationship between every two matched nodes in the matching result.
An embodiment of the present invention also discloses a storage medium on which computer instructions are stored; when run, the computer instructions perform the steps of the data quality analysis preprocessing method.
An embodiment of the present invention also discloses a terminal, including a memory and a processor. The memory stores computer instructions that can run on the processor, and the processor performs the steps of the data quality analysis preprocessing method when running the computer instructions.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
The technical solution of the present invention extracts the label value sources of data provided by multiple suppliers; takes the label value sources and the corresponding label values as nodes of a network, and determines the association relationship between every two nodes according to preset association relationships between the label value sources, where the association relationship includes a parent-child relationship and an association strength; calculates a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network; and selects a sub-network as the Bayesian network based on the likelihood score and the complexity score, for use in assessing the accuracy rate of data to be assessed. The technical solution forms a network from the label value sources and the preset association relationships between them, and establishes a directed acyclic Bayesian network by applying likelihood scoring and complexity scoring to that network. A Bayesian network established in this way can effectively assess the quality of the data to be assessed and improves the accuracy of the assessment. Moreover, because the complexity of the network is taken into account while the Bayesian network is built, the amount of searching during data quality analysis is smaller than with existing Bayesian networks. In addition, each branch of the Bayesian network used by the technical solution is a naive Bayes model, which can handle the multi-level influence relationships that arise in practice; compared with the naive Bayes method, the Bayesian network is markedly more accurate in assessment.
Further, the likelihood score is calculated according to the degree of fit between the constructed Bayesian network and actual samples, and the complexity score is calculated according to the number of nodes in the sub-network that have the parent-child relationship and the total number of nodes. Because the likelihood score reflects how well the constructed Bayesian network fits actual samples, using it to select the Bayesian network yields a more effective network and further improves the accuracy of subsequent data quality analysis. Because the complexity score is calculated from the number of nodes that have the parent-child relationship and the total number of nodes, using it to select the Bayesian network yields a network of lower complexity and thereby improves the efficiency of subsequent data quality analysis.
Further, the preset association relationships between the label value sources are determined by looking up a preset association relationship list, where a preset association relationship includes the association strength and the ordering between the label value sources; the preset association relationships between the label value sources are then integrated to form the association relationships between the nodes. In the technical solution of the present invention, the preset association relationships can be prior information established in advance. Establishing the network from prior information means that less has to be learned from samples, which addresses the problem of insufficient samples in practice.
Brief description of the drawings
Fig. 1 is a flow chart of a data quality analysis preprocessing method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a network according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a Bayesian network according to an embodiment of the present invention;
Fig. 4 is a structural diagram of a data quality analysis preprocessing device according to an embodiment of the present invention;
Fig. 5 is a structural diagram of one embodiment of the Bayesian network determining module 404 shown in Fig. 4;
Fig. 6 is a structural diagram of one embodiment of the computing module 403 shown in Fig. 4;
Fig. 7 is a structural diagram of one embodiment of the node determining module 402 shown in Fig. 4.
Detailed description
As described in the background section, the prior art cannot distinguish differences in the quality of the data suppliers themselves when assessing data, which makes the assessment of data quality inaccurate.
The technical solution of the present invention forms a network from the label value sources and the preset association relationships between them, and establishes a directed acyclic Bayesian network by applying likelihood scoring and complexity scoring to that network. A Bayesian network established in this way can effectively assess the quality of the data to be assessed and improves the accuracy of the assessment. Moreover, because the complexity of the network is taken into account while the Bayesian network is built, the amount of searching during data quality analysis is smaller than with existing Bayesian networks.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a data quality analysis preprocessing method according to an embodiment of the present invention.
The data quality analysis preprocessing method may include the following steps:
Step S101: extract the label value sources of data provided by multiple suppliers;
Step S102: take the label value sources and the corresponding label values as nodes of a network, and determine the association relationship between every two nodes according to preset association relationships between the label value sources, where the association relationship includes a parent-child relationship and an association strength;
Step S103: calculate a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network;
Step S104: select a sub-network as the Bayesian network based on the likelihood score and the complexity score, for use in assessing the accuracy rate of data to be assessed.
In this embodiment, the data provided by a supplier may include key values and their label values. Each label value has at least one label value source. A label value source can represent the basis on which the label value was judged. For example, for the key value "gender", the label value sources of its label value "male" or "female" may include: whether the user has a stable job, whether the user is interested in racing, how frequently beauty and makeup applications (apps) are opened, how frequently sports apps are opened, how frequently military apps are opened, and so on.
To build the Bayesian network used for data quality analysis, the specific implementation of step S101 extracts the label value sources of the data provided by multiple suppliers, and the specific implementation of step S102 builds a network from the label value sources and the label values. The network includes multiple nodes, and association relationships can exist between nodes. An association relationship includes a parent-child relationship and an association strength. Specifically, the parent-child relationship of nodes can represent the ordering between the nodes; the association strength between nodes can be expressed as a conditional probability or as a proportion, and the embodiments of the present invention place no limitation on this.
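The node and edge structure described above can be sketched as a small data structure. This is only an illustrative sketch: the class name `AssociationNetwork` and its methods are hypothetical and do not come from the patent, and association strengths are assumed to be stored as numbers in [0, 1].

```python
from dataclasses import dataclass, field


@dataclass
class AssociationNetwork:
    """Hypothetical container for the candidate network built in step S102."""
    nodes: set = field(default_factory=set)
    # Maps (parent, child) to an association strength in [0, 1].
    edges: dict = field(default_factory=dict)

    def add_relation(self, parent, child, strength):
        """Record a parent -> child relation with its association strength."""
        if not 0.0 <= strength <= 1.0:
            raise ValueError("strength must be in [0, 1]")
        self.nodes.update((parent, child))
        self.edges[(parent, child)] = strength

    def parents(self, node):
        """Return the sorted parent nodes of a node."""
        return sorted(p for (p, c) in self.edges if c == node)

    def bidirectional_pairs(self):
        """Return node pairs connected in both directions."""
        return sorted(
            {tuple(sorted((p, c))) for (p, c) in self.edges if (c, p) in self.edges}
        )


# Example values taken from the relations the description gives for Fig. 2.
net = AssociationNetwork()
net.add_relation("X2", "X4", 0.10)
net.add_relation("X2", "X5", 0.60)
net.add_relation("X4", "X5", 0.40)
net.add_relation("X5", "X4", 0.30)
print(net.parents("X5"), net.bidirectional_pairs())
```

Keeping both directions of a relation as separate entries makes it easy to detect the bidirectional pairs that must later be resolved before the network can serve as a Bayesian network.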
Referring also to Fig. 2, in the network structure shown in Fig. 2, node Y represents the label value "male or female"; node X1 represents the label value source "whether the user has a stable job"; node X2 represents the label value source "whether the user is interested in racing"; node X3 represents the label value source "opening frequency of beauty and makeup apps"; node X4 represents the label value source "opening frequency of sports apps"; and node X5 represents the label value source "opening frequency of military apps".
Because preset association relationships exist between the label value sources, the association relationships between the nodes can be determined. For example, 10% of men use makeup apps and 50% of women use makeup apps; only 10% of women open military apps frequently, whereas the opening rate of military apps among men is as high as 60%. Thus, when node Y is male, node X2 is the parent node of node X4 with an association strength of 10%, and, as shown in Fig. 2, the connecting line points from node X2 to node X4; node X2 is the parent node of node X5 with an association strength of 60%, and, as shown in Fig. 2, the connecting line points from node X2 to node X5. When node Y is female, the parent-child relationships and association strengths between the nodes follow by analogy and are not repeated here.
Further, the association relationships between nodes are determined according to the preset association relationships between the label value sources. Taking the label value sources "opening frequency of sports apps" and "opening frequency of military apps" as an example, the preset association relationship indicates that with 60% probability the relationship between the two is uncertain, with 40% probability "opening frequency of sports apps" influences "opening frequency of military apps", and with 30% probability "opening frequency of military apps" influences "opening frequency of sports apps". Referring again to Fig. 2, the parent-child relationship between node X4 and node X5 is bidirectional: the two nodes influence each other.
The network built in step S102 may contain cyclic structures and bidirectional relationships, and so does not satisfy the directed acyclic property of a Bayesian network. The network is therefore corrected in the specific implementation of steps S103 and S104 to determine the Bayesian network.
Specifically, the network can be decomposed into multiple directed acyclic sub-networks, and a likelihood score and a complexity score are calculated for each sub-network. The likelihood score characterizes the degree of fit between the constructed Bayesian network and actual samples. The complexity score characterizes the structural complexity of the sub-network, which affects the efficiency of subsequent data assessment. More specifically, the higher the likelihood score, the better the model fits the actual samples; the higher the complexity score, the higher the structural complexity of the sub-network.
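One way to picture the decomposition is to enumerate the edge subsets of the candidate network and keep those that contain no cycle. The sketch below is an assumption about how this could be done for a small network, not the patent's procedure; the function names are hypothetical, and the exhaustive enumeration grows exponentially with the number of edges.

```python
from itertools import combinations


def is_acyclic(nodes, edges):
    """Kahn's algorithm: return True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    # Every node is visited exactly when no cycle blocks the ordering.
    return visited == len(nodes)


def acyclic_subnetworks(nodes, edges):
    """Yield every edge subset that forms a directed acyclic sub-network."""
    ordered = sorted(edges)
    for size in range(len(ordered), -1, -1):
        for subset in combinations(ordered, size):
            if is_acyclic(nodes, subset):
                yield set(subset)


# A bidirectional pair such as X4 <-> X5 is a two-node cycle, so only
# sub-networks that keep at most one of its directions survive.
for sub in acyclic_subnetworks({"X4", "X5"}, {("X4", "X5"), ("X5", "X4")}):
    print(sorted(sub))
```

In practice a search heuristic would replace the exhaustive enumeration; the point here is only that each candidate passed to scoring is guaranteed to be directed and acyclic.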
Preferably, the network can be corrected and the Bayesian network determined using the Bayesian information criterion (BIC) algorithm. The advantage of the BIC algorithm is that it finds the dependency structure of the network by means of conditional independence tests. The BIC algorithm has two parts: one part scores the likelihood, while the other deducts points for complexity. If the network were selected on the likelihood score alone, the most complex, fully connected Bayesian network would be chosen, which causes overfitting. Model selection based on the BIC score therefore favors a model that both fits the data and is relatively simple. In this way, the bidirectional relationships and cyclic parts of the network can be removed, ensuring that the Bayesian network is usable.
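The two parts of a BIC-style score can be illustrated for discrete data as follows. This is a sketch using the standard textbook form (maximized log-likelihood minus half the parameter count times the log of the sample size); the patent does not spell out its exact formula, and the function name `bic_components` and the data layout are assumptions.

```python
import math
from collections import Counter


def bic_components(data, parents):
    """Return (log-likelihood, complexity penalty) for a discrete network.

    data: list of dicts mapping variable name -> observed value.
    parents: dict mapping every variable -> tuple of its parent variables.
    """
    n = len(data)
    arity = {v: len({row[v] for row in data}) for v in parents}
    log_lik = 0.0
    n_params = 0
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        # Maximum-likelihood log-likelihood of var given its parents.
        log_lik += sum(c * math.log(c / marginal[cfg]) for (cfg, _), c in joint.items())
        # Free parameters: (states - 1) per parent configuration.
        n_params += (arity[var] - 1) * math.prod(arity[p] for p in pa)
    penalty = 0.5 * math.log(n) * n_params
    return log_lik, penalty


rows = [{"Y": 0, "X": 0}] * 4 + [{"Y": 1, "X": 1}] * 4
with_edge = bic_components(rows, {"Y": (), "X": ("Y",)})
no_edge = bic_components(rows, {"Y": (), "X": ()})
print(with_edge[0] - with_edge[1], no_edge[0] - no_edge[1])
```

In this toy data set X is fully determined by Y, so the structure with the edge pays a larger penalty but gains enough likelihood to score higher overall, which is the trade-off the description refers to.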
The Cooper and Herskovits (CH) algorithm can also be used to correct the network and determine the Bayesian network. Those skilled in the art will understand that any other practicable existing algorithm can also be used for this purpose; the embodiments of the present invention place no limitation on this.
Specifically, refer to Fig. 2 and Fig. 3 together. After step S103 and step S104, the determined Bayesian network is as shown in Fig. 3. Compared with the network shown in Fig. 2, the Bayesian network is a directed acyclic graph (DAG). After correction, node X5 and node X3 have no association relationship; node X5 is a child node of node X4; and node X4 is a child node of node X3.
The Bayesian network determined by the embodiment of the present invention can include nodes and the directed edges that connect them. A node can represent a random variable, and a directed edge between nodes can represent the association relationship between them (pointing from the parent node to its child node). The association strength is expressed as a conditional probability; for a node without a parent node, it is expressed as a prior probability. A node can be an abstraction of any problem, such as test values, observed phenomena, or solicited opinions. Bayesian networks are suited to expressing and analyzing uncertain and probabilistic events, can be applied to decisions that depend conditionally on multiple controlling factors, and can reason from incomplete, imprecise, or uncertain knowledge or information.
The embodiment of the present invention forms a network from the label value sources and the preset association relationships between them, and establishes a directed acyclic Bayesian network by applying likelihood scoring and complexity scoring to that network. A Bayesian network established in this way can effectively assess the quality of the data to be assessed and improves the accuracy of the assessment. Moreover, because the complexity of the network is taken into account while the Bayesian network is built, the amount of searching during data quality analysis is smaller than with existing Bayesian networks.
After the Bayesian network is established, this embodiment can judge the accuracy of the data to be assessed from the association relationships between the nodes of the Bayesian network.
Preferably, step S104 may include the following steps: selecting, as the Bayesian network, the sub-network for which the difference between the likelihood score and the complexity score is largest; or selecting, as the Bayesian network, one of the sub-networks for which the difference between the likelihood score and the complexity score exceeds a set threshold.
In this embodiment, the Bayesian network can be selected from the multiple sub-networks by combining the likelihood score and the complexity score. As described above, the likelihood score characterizes the degree of fit between the constructed Bayesian network and actual samples, and the association strengths between the nodes affect the accuracy of subsequent data assessment; the complexity score characterizes the structural complexity of the sub-network, which affects the efficiency of subsequent data assessment. Therefore, to balance the association strength and the complexity of the Bayesian network, the sub-network for which the difference between the likelihood score and the complexity score is largest can be selected as the Bayesian network, or one of the sub-networks for which that difference exceeds a set threshold can be selected as the Bayesian network.
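The two selection rules can be sketched as a single function. The function name, the tuple layout, and the choice to return the first sub-network that clears the threshold are illustrative assumptions; the description only requires that one such sub-network be chosen.

```python
def select_bayesian_network(scored, threshold=None):
    """Pick a sub-network from (name, likelihood_score, complexity_score) tuples.

    With no threshold, return the name with the largest difference
    likelihood_score - complexity_score. With a threshold, return the first
    name whose difference exceeds it, or None if no sub-network qualifies.
    """
    if not scored:
        raise ValueError("no candidate sub-networks")
    if threshold is None:
        return max(scored, key=lambda s: s[1] - s[2])[0]
    for name, likelihood, complexity in scored:
        if likelihood - complexity > threshold:
            return name
    return None


# Hypothetical scores for three candidate sub-networks.
candidates = [("A", -8.0, 3.0), ("B", -5.5, 3.1), ("C", -11.0, 2.0)]
print(select_bayesian_network(candidates))
print(select_bayesian_network(candidates, threshold=-12.0))
```

The maximum-difference rule needs every candidate to be scored first, while the threshold rule can stop at the first acceptable candidate, which may matter when the number of sub-networks is large.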
Preferably, step S103 may include the following steps: calculating the likelihood score according to the degree of sample fit; and calculating the complexity score according to the number of nodes in the sub-network that have the parent-child relationship and the total number of nodes.
In this embodiment, the better the constructed network fits the actual test samples, the higher the likelihood score; the larger the number of nodes in the sub-network that have the parent-child relationship and/or the total number of nodes, the higher the complexity score.
Further, the likelihood score and the complexity score of the sub-networks of the network can be calculated using the Bayesian information criterion.
Preferably, step S102 may include the following steps: determining the preset association relationships between the label value sources by looking up a preset association relationship list, where a preset association relationship includes the association strength and the ordering between the label value sources; and integrating the preset association relationships between the label value sources to form the association relationships between the nodes.
In the embodiment of the present invention, the preset association relationships can be prior information established in advance. Establishing the network from prior information means that less has to be learned from samples, which addresses the problem of insufficient samples in practice.
In one specific application scenario of the present invention, the preset association relationships can be formed from assessments of the label value sources by multiple experts. Taking the label value source "opening frequency of sports apps" (hereinafter, the former) and the label value source "opening frequency of military apps" (hereinafter, the latter) as an example, the preset association relationships may be the following: with 60% probability the relationship between the two is uncertain, and with 40% probability the former influences the latter; with 70% probability the relationship between the two is uncertain, and with 30% probability the latter influences the former; with 40% probability the relationship between the two is uncertain, with 30% probability the two are unrelated, and with 30% probability the former influences the latter.
The association relationship between two nodes can be determined using Dempster-Shafer (D-S) evidence theory. Specifically, the evidence relationship between every two nodes is determined according to the D-S evidence combination formula, and the knowledge fusion result for every two nodes is listed, so as to form the network.
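Dempster's rule of combination can be sketched for the three expert opinions given above. The encoding is an assumption made for illustration: the frame of discernment is taken to be {former influences latter, latter influences former, unrelated}, and "relationship uncertain" is modeled as mass on the whole frame. The function name `combine` is hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: fuse two mass functions keyed by frozenset hypotheses."""
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("evidence is totally conflicting")
    # Renormalize so the remaining masses sum to 1.
    return {h: w / (1.0 - conflict) for h, w in fused.items()}


FORMER = frozenset({"former->latter"})
LATTER = frozenset({"latter->former"})
UNRELATED = frozenset({"unrelated"})
UNCERTAIN = FORMER | LATTER | UNRELATED  # whole frame: relationship unknown

expert1 = {UNCERTAIN: 0.6, FORMER: 0.4}
expert2 = {UNCERTAIN: 0.7, LATTER: 0.3}
expert3 = {UNCERTAIN: 0.4, UNRELATED: 0.3, FORMER: 0.3}

fused = combine(combine(expert1, expert2), expert3)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 4))
```

Under this encoding the fused result places the most mass on "former influences latter", which would orient the edge from the sports-app node to the military-app node; how the remaining uncertain mass is handled is left open by the description.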
In a preferred embodiment of the present invention, the data quality analysis preprocessing method shown in Fig. 1 can also include the following step: matching the label value sources in the data to be assessed against the nodes in the Bayesian network, and calculating the accuracy rate of the data to be assessed according to the association relationship between every two matched nodes in the matching result.
In this embodiment, for each item of data to be assessed, its label value sources are compared with the nodes in the Bayesian network; the conditional probability of the label value node given the matched nodes is calculated to determine the accurate label value. If the label value of the item is identical to the accurate label value, the item is accurate.
For multiple items of data to be assessed, the accuracy rate can be calculated from whether each item is accurate.
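The assessment step can be sketched under strong simplifying assumptions: the matched label value sources are treated as binary and conditionally independent given the label value (a single naive Bayes branch rather than full Bayesian network inference), and the prior is assumed to be uniform. The function names and the probability table layout are hypothetical; the probabilities reuse the makeup and military app figures from the description.

```python
def posterior(evidence, prior, cpt):
    """Posterior over label values given observed binary label value sources.

    evidence: dict source -> bool (whether the source is observed as true).
    prior: dict label -> prior probability.
    cpt: dict source -> dict label -> P(source is true | label).
    """
    scores = {}
    for label, p in prior.items():
        for source, value in evidence.items():
            p_true = cpt[source][label]
            p *= p_true if value else 1.0 - p_true
        scores[label] = p
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def accuracy_rate(items, prior, cpt):
    """Fraction of items whose claimed label equals the most probable label."""
    hits = 0
    for evidence, claimed in items:
        post = posterior(evidence, prior, cpt)
        if max(post, key=post.get) == claimed:
            hits += 1
    return hits / len(items)


PRIOR = {"male": 0.5, "female": 0.5}  # assumed uniform prior
CPT = {
    "beauty_app": {"male": 0.1, "female": 0.5},
    "military_app": {"male": 0.6, "female": 0.1},
}
items = [({"military_app": True}, "male"), ({"beauty_app": True}, "male")]
print(accuracy_rate(items, PRIOR, CPT))
```

A full implementation would propagate evidence through the selected Bayesian network instead of multiplying independent terms; the sketch only shows how a per-item verdict turns into an accuracy rate.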
In another specific application scenario of the present invention, when the Bayesian network is established, the preset association relationships are first formed by analyzing the relationships between the label value sources in advance; the evidence relationship between every two nodes is then determined using the D-S evidence combination formula, and the knowledge fusion result for every two nodes is listed. In addition, causal relationships that have no practical meaning can be removed to obtain an initial model. Where the association relationship between two nodes is uncertain, a bidirectional edge can be added between them. Bayesian network structure learning is then carried out algorithmically on the original data provided by the multiple suppliers, and the final Bayesian network is obtained through evaluation with the BIC algorithm. Compared with the search algorithms commonly used for existing Bayesian networks, the amount of searching is smaller.
The inventors verified the effect of the present application experimentally. The true gender of 200 mobile phone users was surveyed offline. Of these records, 100 were used to train (tune the parameters of) the naive Bayes model and the Bayesian network respectively, and the remaining 100 were used for online testing to obtain the test results. The test results show a misjudgment rate of 9% with the naive Bayes method and only 4% with the Bayesian network. Thus, compared with the naive Bayes method, the Bayesian network is markedly more accurate in assessment.
Referring to Fig. 4, the data quality analysis preprocessing device 40 can include:
a label value source extraction module 401, adapted to extract the label value sources of data provided by multiple suppliers;
a node determining module 402, adapted to take the label value sources and the corresponding label values as nodes of a network, and to determine the association relationship between every two nodes according to preset association relationships between the label value sources, where the association relationship includes a parent-child relationship and an association strength;
a computing module 403, adapted to calculate a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network;
a Bayesian network determining module 404, adapted to select a sub-network as the Bayesian network based on the likelihood score and the complexity score, for use in assessing the accuracy rate of data to be assessed.
The embodiment of the present invention forms a network from the label value sources and the preset association relationships between them, and establishes a directed acyclic Bayesian network by computing likelihood scores and complexity scores for that network. A Bayesian network established in this way can effectively assess the data quality of the data to be assessed and improve the accuracy of the assessment. In addition, because the complexity of the network is taken into account when the Bayesian network is constructed, the search volume during data quality analysis is smaller than with existing Bayesian networks.
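Because only directed acyclic sub-networks are candidates for the Bayesian network, a candidate has to be checked for cycles before it is scored. The text does not say how this check is done; the sketch below uses Kahn's topological-sort algorithm as one common choice. The function name and the edge representation (a list of `(parent, child)` pairs) are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Hashable, Iterable, List, Tuple


def is_directed_acyclic(
    nodes: Iterable[Hashable],
    edges: List[Tuple[Hashable, Hashable]],
) -> bool:
    """Return True if the (parent, child) edges contain no directed cycle.

    Assumes every edge endpoint is listed in `nodes`.
    """
    node_list = list(nodes)
    indegree = {node: 0 for node in node_list}
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Kahn's algorithm: repeatedly remove nodes that have no remaining parents.
    queue = deque(node for node in node_list if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # If a cycle exists, the nodes on it are never removed.
    return removed == len(node_list)


chain = [("a", "b"), ("b", "c")]
print(is_directed_acyclic(["a", "b", "c"], chain))
print(is_directed_acyclic(["a", "b", "c"], chain + [("c", "a")]))
```

Sub-networks that fail this check would simply be skipped before any likelihood or complexity score is computed.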
Preferably, referring to Fig. 5, the Bayesian network determination module 404 may include a first selection unit 4041, adapted to select the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network; and a second selection unit 4042, adapted to select one of the sub-networks whose difference between the likelihood score and the complexity score is greater than a set threshold as the Bayesian network.
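The two selection strategies can be sketched as follows. This is a minimal illustration, assuming each candidate sub-network has already been given a likelihood score and a complexity score; the candidate names, the score values, and both function names are hypothetical. For the threshold strategy the text only requires "one of" the qualifying sub-networks, so returning the first one encountered is an arbitrary choice made here.

```python
from typing import Dict, Optional, Tuple

# candidate name -> (likelihood score, complexity score)
Scores = Dict[str, Tuple[float, float]]


def select_max_difference(scores: Scores) -> str:
    """First strategy: the candidate maximizing likelihood minus complexity."""
    return max(scores, key=lambda name: scores[name][0] - scores[name][1])


def select_above_threshold(scores: Scores, threshold: float) -> Optional[str]:
    """Second strategy: any candidate whose difference exceeds the threshold.

    Returns the first qualifying candidate in insertion order, or None.
    """
    for name, (likelihood, complexity) in scores.items():
        if likelihood - complexity > threshold:
            return name
    return None


# Hypothetical scores; differences are -130, -125 and -117 respectively.
candidates: Scores = {
    "g1": (-120.0, 10.0),
    "g2": (-100.0, 25.0),
    "g3": (-105.0, 12.0),
}
best = select_max_difference(candidates)
first_ok = select_above_threshold(candidates, threshold=-126.0)
print(best, first_ok)
```

The threshold strategy can stop as soon as one acceptable sub-network is found, which may reduce the search effort, while the maximum-difference strategy has to score every candidate.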
Preferably, referring to Fig. 6, the computing module 403 may include a likelihood score computing unit 4031, adapted to calculate the likelihood score according to the association strength between every two nodes; and a complexity score computing unit 4032, adapted to calculate the complexity score according to the number of nodes having the parent-child relationship in the sub-network and the total number of nodes.
Further, the computing module 403 may use the Bayesian information criterion to calculate the likelihood score and the complexity score of the sub-networks of the network.
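For orientation, the Bayesian information criterion in its usual structure-scoring form is the log-likelihood of the data under the fitted network minus a penalty of (k / 2) · ln(N), where k is the number of free parameters and N is the number of samples. The sketch below computes these two terms separately so that they line up with the likelihood score and the complexity score. This is the textbook form, not necessarily the exact formula used in this document, and the input values are made up for illustration.

```python
import math
from typing import Tuple


def bic_terms(
    log_likelihood: float, num_params: int, num_samples: int
) -> Tuple[float, float]:
    """Return (likelihood term, complexity penalty) of the textbook BIC score."""
    complexity_penalty = 0.5 * num_params * math.log(num_samples)
    return log_likelihood, complexity_penalty


def bic_score(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """BIC score as likelihood term minus complexity penalty (higher is better)."""
    likelihood_term, complexity_penalty = bic_terms(
        log_likelihood, num_params, num_samples
    )
    return likelihood_term - complexity_penalty


# Hypothetical inputs: log-likelihood -50, 4 free parameters, 100 samples.
likelihood_term, complexity_penalty = bic_terms(-50.0, 4, 100)
score = bic_score(-50.0, 4, 100)
print(likelihood_term, complexity_penalty, score)
```

Written this way, the difference between the likelihood score and the complexity score is the quantity the selection step compares across sub-networks.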
Preferably, referring to Fig. 7, the node determination module 402 may include a lookup unit 4021, adapted to determine the preset association relationships between the label value sources by looking up a preset association relationship list, the preset association relationship including the association strength and the order between the label value sources; and an integration unit 4022, adapted to integrate the preset association relationships between the label value sources into the association relationships between nodes.
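The lookup and integration steps can be sketched as follows. The layout of the preset association relationship list is not given in the text; this sketch assumes a dictionary keyed by an ordered `(earlier source, later source)` pair, so the key order encodes the parent-child direction and the value is the association strength. The source names and strengths are invented for illustration.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

# Hypothetical preset list: (parent source, child source) -> association strength
PresetList = Dict[Tuple[str, str], float]
Edge = Tuple[str, str, float]  # (parent, child, strength)

PRESET: PresetList = {
    ("app_install", "gender"): 0.8,
    ("purchase", "gender"): 0.6,
}


def look_up(source_a: str, source_b: str, preset: PresetList) -> Optional[Edge]:
    """Lookup step: find the preset relationship for a pair of sources, if any."""
    if (source_a, source_b) in preset:
        return (source_a, source_b, preset[(source_a, source_b)])
    if (source_b, source_a) in preset:
        return (source_b, source_a, preset[(source_b, source_a)])
    return None


def integrate(sources: List[str], preset: PresetList) -> List[Edge]:
    """Integration step: collect the relationships of all source pairs as edges."""
    edges = []
    for source_a, source_b in combinations(sources, 2):
        relation = look_up(source_a, source_b, preset)
        if relation is not None:
            edges.append(relation)
    return edges


edges = integrate(["gender", "app_install", "purchase"], PRESET)
print(edges)
```

Pairs of sources that have no entry in the preset list simply produce no edge, so the resulting network contains only relationships that were defined in advance.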
The data quality analysis preprocessing device 40 shown in Fig. 4 may also include an evaluation module (not shown), adapted to match the label value sources in the data to be assessed against the nodes in the Bayesian network, and to calculate the accuracy of the data to be assessed according to the association relationships between every two matched nodes in the matching result.
For more details on the working principle and working mode of the data quality analysis preprocessing device 40, refer to the related descriptions of Fig. 1 to Fig. 3, which are not repeated here.
The embodiment of the present invention also discloses a storage medium having computer instructions stored thereon; when run, the computer instructions can perform the steps of the data quality analysis preprocessing method shown in Fig. 1. The storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, or the like.
The embodiment of the present invention also discloses a terminal. The terminal may include a memory and a processor, the memory storing computer instructions that can run on the processor. When running the computer instructions, the processor can perform the steps of the data quality analysis preprocessing method shown in Fig. 1. The terminal includes, but is not limited to, terminal devices such as mobile phones, computers, and tablet computers.
Although the present invention is disclosed as above, the present invention is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be subject to the scope defined by the claims.
Claims (14)
- 1. A data quality analysis preprocessing method, characterized by comprising: extracting the label value sources of data provided by multiple suppliers; taking the label value sources and the corresponding label values as nodes of a network, and determining the association relationship between every two nodes according to the preset association relationships between the label value sources, the association relationship including a parent-child relationship and an association strength; calculating the likelihood score and the complexity score of the sub-networks of the network, each sub-network being a directed acyclic network; and selecting a corresponding sub-network as a Bayesian network based on the likelihood score and the complexity score, for assessing the accuracy of data to be assessed.
- 2. The data quality analysis preprocessing method according to claim 1, characterized in that selecting the corresponding sub-network as the Bayesian network based on the likelihood score and the complexity score comprises: selecting the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network; or selecting one of the sub-networks whose difference between the likelihood score and the complexity score is greater than a set threshold as the Bayesian network.
- 3. The data quality analysis preprocessing method according to claim 1, characterized in that calculating the likelihood score and the complexity score of the sub-networks of the network comprises: calculating the likelihood score according to the degree of fit between the network and actual test samples; and calculating the complexity score according to the number of nodes having the parent-child relationship in the sub-network and the total number of nodes.
- 4. The data quality analysis preprocessing method according to claim 1, characterized in that determining the association relationship between every two nodes according to the preset association relationships between the label value sources comprises: determining the preset association relationships between the label value sources by looking up a preset association relationship list, the preset association relationship including the association strength and the order between the label value sources; and integrating the preset association relationships between the label value sources into the association relationships between nodes.
- 5. The data quality analysis preprocessing method according to claim 1, characterized in that the likelihood score and the complexity score of the sub-networks of the network are calculated using the Bayesian information criterion.
- 6. The data quality analysis preprocessing method according to claim 1, characterized by further comprising: matching the label value sources in the data to be assessed against the nodes in the Bayesian network, and calculating the accuracy of the data to be assessed according to the association relationships between every two matched nodes in the matching result.
- 7. A data quality analysis preprocessing device, characterized by comprising: a label value source extraction module, adapted to extract the label value sources of data provided by multiple suppliers; a node determination module, adapted to take the label value sources and the corresponding label values as nodes of a network, and to determine the association relationship between every two nodes according to the preset association relationships between the label value sources, the association relationship including a parent-child relationship and an association strength; a computing module, adapted to calculate the likelihood score and the complexity score of the sub-networks of the network, each sub-network being a directed acyclic network; and a Bayesian network determination module, adapted to select a corresponding sub-network as a Bayesian network based on the likelihood score and the complexity score, for assessing the accuracy of data to be assessed.
- 8. The data quality analysis preprocessing device according to claim 7, characterized in that the Bayesian network determination module comprises: a first selection unit, adapted to select the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network; and a second selection unit, adapted to select one of the sub-networks whose difference between the likelihood score and the complexity score is greater than a set threshold as the Bayesian network.
- 9. The data quality analysis preprocessing device according to claim 7, characterized in that the computing module comprises: a likelihood score computing unit, adapted to calculate the likelihood score according to the association strength between every two nodes; and a complexity score computing unit, adapted to calculate the complexity score according to the number of nodes having the parent-child relationship in the sub-network and the total number of nodes.
- 10. The data quality analysis preprocessing device according to claim 7, characterized in that the node determination module comprises: a lookup unit, adapted to determine the preset association relationships between the label value sources by looking up a preset association relationship list, the preset association relationship including the association strength and the order between the label value sources; and an integration unit, adapted to integrate the preset association relationships between the label value sources into the association relationships between nodes.
- 11. The data quality analysis preprocessing device according to claim 7, characterized in that the computing module calculates the likelihood score and the complexity score of the sub-networks of the network using the Bayesian information criterion.
- 12. The data quality analysis preprocessing device according to claim 7, characterized by further comprising: an evaluation module, adapted to match the label value sources in the data to be assessed against the nodes in the Bayesian network, and to calculate the accuracy of the data to be assessed according to the association relationships between every two matched nodes in the matching result.
- 13. A storage medium having computer instructions stored thereon, characterized in that, when run, the computer instructions perform the steps of the data quality analysis preprocessing method according to any one of claims 1 to 6.
- 14. A terminal, comprising a memory and a processor, the memory storing computer instructions that can run on the processor, characterized in that the processor, when running the computer instructions, performs the steps of the data quality analysis preprocessing method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711146673.8A CN108038131A (en) | 2017-11-17 | 2017-11-17 | Data Quality Analysis preprocess method and device, storage medium, terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711146673.8A CN108038131A (en) | 2017-11-17 | 2017-11-17 | Data Quality Analysis preprocess method and device, storage medium, terminal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108038131A true CN108038131A (en) | 2018-05-15 |
Family
ID=62094069
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711146673.8A Pending CN108038131A (en) | 2017-11-17 | 2017-11-17 | Data Quality Analysis preprocess method and device, storage medium, terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108038131A (en) |
2017
- 2017-11-17 CN CN201711146673.8A patent/CN108038131A/en active Pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108876470A (en) * | 2018-06-29 | 2018-11-23 | 腾讯科技(深圳)有限公司 | Tagging user extended method, computer equipment and storage medium |
CN109308332A (en) * | 2018-08-07 | 2019-02-05 | 腾讯科技(深圳)有限公司 | A kind of target user's acquisition methods, device and server |
CN109308332B (en) * | 2018-08-07 | 2022-05-20 | 腾讯科技(深圳)有限公司 | Target user acquisition method and device and server |
CN110362829A (en) * | 2019-07-16 | 2019-10-22 | 北京百度网讯科技有限公司 | Method for evaluating quality, device and the equipment of structured patient record data |
CN110362829B (en) * | 2019-07-16 | 2023-01-03 | 北京百度网讯科技有限公司 | Quality evaluation method, device and equipment for structured medical record data |
CN113434746A (en) * | 2021-06-23 | 2021-09-24 | 深圳市酷开网络科技股份有限公司 | Data processing method based on user label, terminal equipment and storage medium |
CN113434746B (en) * | 2021-06-23 | 2023-10-13 | 深圳市酷开网络科技股份有限公司 | User tag-based data processing method, terminal equipment and storage medium |
CN113642986A (en) * | 2021-08-02 | 2021-11-12 | 上海示右智能科技有限公司 | Method for constructing digital notarization |
CN113642986B (en) * | 2021-08-02 | 2024-04-16 | 上海示右智能科技有限公司 | Method for constructing digital notarization |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108038131A (en) | Data Quality Analysis preprocess method and device, storage medium, terminal | |
CN110147722A (en) | A kind of method for processing video frequency, video process apparatus and terminal device | |
CN104778173B (en) | Target user determination method, device and equipment | |
CN103761254B (en) | Method for matching and recommending service themes in various fields | |
CN107958317A (en) | A kind of method and apparatus that crowdsourcing participant is chosen in crowdsourcing project | |
CN108898476A (en) | A kind of loan customer credit-graded approach and device | |
CN110688478B (en) | Answer sorting method, device and storage medium | |
CN110096617B (en) | Video classification method and device, electronic equipment and computer-readable storage medium | |
CN111506820B (en) | Recommendation model, recommendation method, recommendation device, recommendation equipment and recommendation storage medium | |
CN105869016A (en) | Method for estimating click through rate based on convolution neural network | |
CN109597493A (en) | A kind of expression recommended method and device | |
CN107944911A (en) | A kind of recommendation method of the commending system based on text analyzing | |
CN109992781A (en) | Processing, device, storage medium and the processor of text feature | |
CN107368526A (en) | A kind of data processing method and device | |
CN110213660B (en) | Program distribution method, system, computer device and storage medium | |
CN111524043A (en) | Method and device for automatically generating litigation risk assessment questionnaire | |
CN108228950A (en) | A kind of information processing method and device | |
CN109543041A (en) | A kind of generation method and device of language model scores | |
CN110162769A (en) | Text subject output method and device, storage medium and electronic device | |
CN110765352B (en) | User interest identification method and device | |
CN112148994A (en) | Information push effect evaluation method and device, electronic equipment and storage medium | |
CN111523604A (en) | User classification method and related device | |
CN114048294B (en) | Similar population extension model training method, similar population extension method and device | |
CN109033078B (en) | The recognition methods of sentence classification and device, storage medium, processor | |
CN112541010A (en) | User gender prediction method based on logistic regression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180515 |