CN108038131A - Data quality analysis preprocessing method and device, storage medium, and terminal - Google Patents

Data quality analysis preprocessing method and device, storage medium, and terminal

Info

Publication number
CN108038131A
Authority
CN
China
Prior art keywords
network
scoring
label value
node
incidence relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711146673.8A
Other languages
Chinese (zh)
Inventor
汤奇峰
王也
蒋宇
蒋宇一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Data Trading Center Ltd
Original Assignee
Shanghai Data Trading Center Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Data Trading Center Ltd
Priority to CN201711146673.8A
Publication of CN108038131A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data quality analysis preprocessing method and device, a storage medium, and a terminal. The data quality analysis preprocessing method includes: extracting the label value sources of data provided by multiple suppliers; taking the label value sources and the corresponding label values as nodes of a network, and determining the association relation between every two nodes according to preset association relations between the label value sources, where the association relation includes a parent-child relationship and an association strength; calculating a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network; and selecting, based on the likelihood score and the complexity score, the corresponding sub-network as a Bayesian network for assessing the accuracy of data to be assessed. The technical solution of the present invention can build a Bayesian network for data quality analysis so as to improve the accuracy of data quality analysis.

Description

Data quality analysis preprocessing method and device, storage medium, and terminal
Technical field
The present invention relates to the technical field of data processing, and in particular to a data quality analysis preprocessing method and device, a storage medium, and a terminal.
Background
Data quality analysis relies on comparison with the true values of the data, but in the big data field the true values are often very difficult to obtain.
Existing methods mainly determine data accuracy by having the different sources of the data vote; that is, the rough judgments made by the different information providers are tallied. For example, for a mobile device user, a mobile phone operator may judge the user to be male based on the downloaded applications (apps), while a dating website may judge the user to be female based on the information the user filled in.
The prior art cannot distinguish differences in the quality of the data providers used for the assessment, which in turn makes the assessment of data quality inaccurate.
Summary of the invention
The technical problem solved by the present invention is how to build a Bayesian network from the judgment basis of the label values of the data provided by suppliers, so as to improve the accuracy of data quality analysis.
To solve the above technical problem, an embodiment of the present invention provides a data quality analysis preprocessing method, including: extracting the label value sources of data provided by multiple suppliers; taking the label value sources and the corresponding label values as nodes of a network, and determining the association relation between every two nodes according to preset association relations between the label value sources, where the association relation includes a parent-child relationship and an association strength; calculating a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network; and selecting, based on the likelihood score and the complexity score, the corresponding sub-network as a Bayesian network for assessing the accuracy of data to be assessed.
Optionally, selecting the corresponding sub-network as the Bayesian network based on the likelihood score and the complexity score includes: selecting the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network; or selecting one of the sub-networks whose difference between the likelihood score and the complexity score is greater than a set threshold as the Bayesian network.
Optionally, calculating the likelihood score and the complexity score of the sub-networks of the network includes: calculating the likelihood score according to the degree of fit between the constructed Bayesian network and actual samples; and calculating the complexity score according to the number of nodes having the parent-child relationship and the total number of nodes in the sub-network.
Optionally, determining the association relation between every two nodes according to the preset association relations between the label value sources includes: determining the preset association relations between the label value sources by looking up a preset association relation list, where a preset association relation includes the association strength and the order between the label value sources; and integrating the preset association relations between the label value sources to form the association relations between the nodes.
Optionally, the likelihood score and the complexity score of the sub-networks of the network are calculated using the Bayesian information criterion.
Optionally, the data quality analysis preprocessing method further includes: matching the label value sources in the data to be assessed with the nodes in the Bayesian network, and calculating the accuracy rate of the data to be assessed according to the association relation between every two matched nodes in the matching result.
An embodiment of the present invention also discloses a data quality analysis preprocessing device, including: a label value source extraction module, adapted to extract the label value sources of data provided by multiple suppliers; a node determination module, adapted to take the label value sources and the corresponding label values as nodes of a network, and to determine the association relation between every two nodes according to preset association relations between the label value sources, where the association relation includes a parent-child relationship and an association strength; a computing module, adapted to calculate a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network; and a Bayesian network determination module, adapted to select, based on the likelihood score and the complexity score, the corresponding sub-network as a Bayesian network for assessing the accuracy of data to be assessed.
Optionally, the Bayesian network determination module includes: a first selection unit, adapted to select the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network; and a second selection unit, adapted to select one of the sub-networks whose difference between the likelihood score and the complexity score is greater than a set threshold as the Bayesian network.
Optionally, the computing module includes: a likelihood score computing unit, adapted to calculate the likelihood score according to the degree of fit between the constructed Bayesian network and actual test samples; and a complexity score computing unit, adapted to calculate the complexity score according to the number of nodes having the parent-child relationship and the total number of nodes in the sub-network.
Optionally, the node determination module includes: a lookup unit, adapted to determine the preset association relations between the label value sources by looking up a preset association relation list, where a preset association relation includes the association strength and the order between the label value sources; and an integration unit, adapted to integrate the preset association relations between the label value sources to form the association relations between the nodes.
Optionally, the computing module calculates the likelihood score and the complexity score of the sub-networks of the network using the Bayesian information criterion.
Optionally, the data quality analysis preprocessing device further includes: an evaluation module, adapted to match the label value sources in the data to be assessed with the nodes in the Bayesian network, and to calculate the accuracy rate of the data to be assessed according to the association relation between every two matched nodes in the matching result.
An embodiment of the present invention also discloses a storage medium storing computer instructions which, when run, perform the steps of the data quality analysis preprocessing method.
An embodiment of the present invention also discloses a terminal, including a memory and a processor, where the memory stores computer instructions that can be run on the processor, and the processor performs the steps of the data quality analysis preprocessing method when running the computer instructions.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following advantages:
The technical solution of the present invention extracts the label value sources of data provided by multiple suppliers; takes the label value sources and the corresponding label values as nodes of a network, and determines the association relation between every two nodes according to preset association relations between the label value sources, where the association relation includes a parent-child relationship and an association strength; calculates a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network; and selects, based on the likelihood score and the complexity score, the corresponding sub-network as a Bayesian network for assessing the accuracy of data to be assessed. The technical solution forms a network from the label value sources and the preset association relations between them, and establishes a directed acyclic Bayesian network by scoring that network for likelihood and complexity. A Bayesian network established in this way can effectively assess the quality of the data to be assessed and improve the accuracy of the assessment. In addition, because the complexity of the network is taken into account when the Bayesian network is built, the search volume during data quality analysis is smaller than with existing Bayesian networks. Furthermore, each branch of the Bayesian network used by the technical solution is a naive Bayes model, which can capture the multi-level influence relations found in practice; compared with the naive Bayes method, the Bayesian network noticeably improves the accuracy of the assessment.
Further, the likelihood score is calculated according to the degree of fit between the constructed Bayesian network and actual samples, and the complexity score is calculated according to the number of nodes having the parent-child relationship and the total number of nodes in the sub-network. Because the likelihood score reflects how well the constructed Bayesian network fits actual samples, using it to select the Bayesian network yields a more effective network and further improves the accuracy of subsequent data quality analysis. Because the complexity score reflects the number of nodes having the parent-child relationship and the total number of nodes, using it to select the Bayesian network yields a network of lower complexity and thus improves the efficiency of subsequent data quality analysis.
Further, the preset association relations between the label value sources are determined by looking up a preset association relation list, where a preset association relation includes the association strength and the order between the label value sources; the preset association relations between the label value sources are then integrated to form the association relations between the nodes. In the technical solution of the present invention, the preset association relations can be prior information established in advance. Building the network from prior information means that less has to be learned from samples, which addresses the problem of insufficient samples in practice.
Brief description of the drawings
Fig. 1 is a flow chart of a data quality analysis preprocessing method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a network according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a Bayesian network according to an embodiment of the present invention;
Fig. 4 is a structural diagram of a data quality analysis preprocessing device according to an embodiment of the present invention;
Fig. 5 is a structural diagram of an embodiment of the Bayesian network determination module 404 shown in Fig. 4;
Fig. 6 is a structural diagram of an embodiment of the computing module 403 shown in Fig. 4;
Fig. 7 is a structural diagram of an embodiment of the node determination module 402 shown in Fig. 4.
Detailed description
As described in the background, the prior art cannot distinguish differences in the quality of the data providers used for the assessment, which in turn makes the assessment of data quality inaccurate.
The technical solution of the present invention forms a network from the label value sources and the preset association relations between them, and establishes a directed acyclic Bayesian network by scoring that network for likelihood and complexity. A Bayesian network established in this way can effectively assess the quality of the data to be assessed and improve the accuracy of the assessment. In addition, because the complexity of the network is taken into account when the Bayesian network is built, the search volume during data quality analysis is smaller than with existing Bayesian networks.
To make the above objects, features, and advantages of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a data quality analysis preprocessing method according to an embodiment of the present invention.
The data quality analysis preprocessing method may include the following steps:
Step S101: extract the label value sources of data provided by multiple suppliers;
Step S102: take the label value sources and the corresponding label values as nodes of a network, and determine the association relation between every two nodes according to preset association relations between the label value sources, where the association relation includes a parent-child relationship and an association strength;
Step S103: calculate a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network;
Step S104: select, based on the likelihood score and the complexity score, the corresponding sub-network as a Bayesian network for assessing the accuracy of data to be assessed.
In this embodiment, the data provided by a supplier may include a key and its label value. Each label value has at least one label value source, which represents the basis on which the label value was judged. For example, for the key "gender", the label value sources of the label value "male" or "female" may include: whether the user has a stable job, whether the user is interested in racing, the opening frequency of beauty and makeup applications (APPs), the opening frequency of sports APPs, the opening frequency of military APPs, and so on.
To build the Bayesian network used for data quality analysis, the label value sources of the data provided by multiple suppliers are extracted in step S101, and in step S102 a network is built from the label value sources and the label values. The network includes multiple nodes, and nodes may have association relations with one another. An association relation includes a parent-child relationship and an association strength. Specifically, the parent-child relationship represents the order between nodes; the association strength between nodes may be expressed as a conditional probability or as a proportion, which is not limited by the embodiments of the present invention.
Referring also to Fig. 2, in the network structure shown in Fig. 2, node Y represents the label value "male or female"; node X1 represents the label value source "whether the user has a stable job"; node X2 represents the label value source "whether the user is interested in racing"; node X3 represents the label value source "opening frequency of beauty and makeup APPs"; node X4 represents the label value source "opening frequency of sports APPs"; and node X5 represents the label value source "opening frequency of military APPs".
Because preset association relations exist between the label value sources, the association relations between the nodes can be determined. For example, 10% of men use makeup APPs and 50% of women use makeup APPs; only 10% of women open military APPs at high frequency, whereas the opening rate of military APPs among men is as high as 60%. Thus, when node Y is male, node X2 is the parent node of node X4 with an association strength of 10%, and as shown in Fig. 2 the connecting line points from node X2 to node X4; node X2 is the parent node of node X5 with an association strength of 60%, and as shown in Fig. 2 the connecting line points from node X2 to node X5. When node Y is female, the parent-child relationships and association strengths between the nodes follow by analogy and are not repeated here.
Further, the association relations between nodes are determined according to the preset association relations between the label value sources. Taking the label value source "opening frequency of sports APPs" and the label value source "opening frequency of military APPs" as an example, the preset association relations indicate that the relation between the two may be unknown (60% probability), that "opening frequency of sports APPs" may influence "opening frequency of military APPs" (40% probability), and that "opening frequency of military APPs" may influence "opening frequency of sports APPs" (30% probability). With continued reference to Fig. 2, the parent-child relationship between node X4 and node X5 is therefore bidirectional: the two nodes influence each other.
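The network of step S102 can be sketched as a list of directed associations. The following is a minimal illustration, not the patented implementation: the `Association` class and the `bidirectional_pairs` helper are hypothetical names, and the strengths are simply the example percentages quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One directed association: parent influences child with a given strength."""
    parent: str
    child: str
    strength: float


# Example values taken from the text above (node names as in Fig. 2).
associations = [
    Association("X2", "X4", 0.10),
    Association("X2", "X5", 0.60),
    Association("X4", "X5", 0.40),  # sports APP frequency influences military APP frequency
    Association("X5", "X4", 0.30),  # military APP frequency influences sports APP frequency
]


def bidirectional_pairs(assocs):
    """Return the node pairs that are connected in both directions."""
    directed = {(a.parent, a.child) for a in assocs}
    return sorted({tuple(sorted(e)) for e in directed if (e[1], e[0]) in directed})


print(bidirectional_pairs(associations))
```

A pair reported by this helper is exactly the kind of two-way relation that prevents the raw network from being used directly as a Bayesian network.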
The network built in step S102 may contain cyclic structures and bidirectional relations, and therefore does not satisfy the directed acyclic property of a Bayesian network. The network is thus corrected in steps S103 and S104 to determine the Bayesian network.
Specifically, the network can be decomposed into multiple directed acyclic sub-networks, and a likelihood score and a complexity score are calculated for each sub-network. The likelihood score characterizes the degree of fit between the constructed Bayesian network and actual samples. The complexity score characterizes the structural complexity of the sub-network, which affects the efficiency of subsequent data assessment. More specifically, the higher the likelihood score, the better the model fits the actual samples; the higher the complexity score, the higher the structural complexity of the sub-network.
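One way to picture the decomposition into directed acyclic sub-networks is to try each orientation of every two-way pair and keep only the acyclic results. The sketch below is an assumption about how this could be done; the text does not prescribe an enumeration strategy, and the function names and the option of dropping a two-way pair entirely are illustrative.

```python
from itertools import product


def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic if every node can be removed in topological order."""
    nodes = {n for e in edges for n in e}
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return removed == len(nodes)


def acyclic_subnetworks(fixed_edges, two_way_pairs):
    """Yield every sub-network obtained by orienting (or dropping) each two-way pair."""
    choices = [((a, b), (b, a), None) for a, b in two_way_pairs]
    for pick in product(*choices):
        edges = list(fixed_edges) + [e for e in pick if e is not None]
        if is_acyclic(edges):
            yield edges


fixed = [("X2", "X4"), ("X2", "X5")]
candidates = list(acyclic_subnetworks(fixed, [("X4", "X5")]))
for edges in candidates:
    print(edges)
```

Each candidate produced this way is a directed acyclic graph that can then be scored.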
Preferably, the Bayesian information criterion (BIC) algorithm can be used to correct the network and determine the Bayesian network. An advantage of the BIC algorithm is that the dependency structure of the network can be found through conditional independence tests. The BIC algorithm has two parts: one part scores the likelihood, and the other part deducts points for complexity. If a network were selected on the likelihood score alone, the most complex, complete Bayesian network would be chosen, leading to overfitting. Model selection based on the BIC score therefore favors a model that both fits the data and is relatively simple. In this way, the bidirectional relations and cyclic parts of the network can be removed, ensuring that the Bayesian network is usable.
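The two-part structure of a BIC score can be illustrated for discrete variables. The sketch below uses the standard textbook form, log-likelihood minus (log N / 2) times the number of free parameters; the text does not spell out its exact formula, so this form, the function name `bic_score`, and the toy data are assumptions.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """Return (log_likelihood, penalty, bic) for a discrete network structure.

    data    : list of dicts mapping variable name -> observed value
    parents : dict mapping every variable name -> tuple of its parent names
    """
    n = len(data)
    states = {v: {row[v] for row in data} for v in parents}
    log_likelihood = 0.0
    free_params = 0
    for node, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        # Likelihood part: maximum-likelihood estimate from the counts.
        for (config, _value), count in joint.items():
            log_likelihood += count * math.log(count / marginal[config])
        # Complexity part: (states - 1) parameters per parent configuration.
        q = math.prod(len(states[p]) for p in pa)
        free_params += (len(states[node]) - 1) * q
    penalty = 0.5 * math.log(n) * free_params
    return log_likelihood, penalty, log_likelihood - penalty


# Toy sample (illustrative only): gender Y and military APP frequency X5.
data = (
    [{"Y": "male", "X5": "high"}] * 3
    + [{"Y": "male", "X5": "low"}]
    + [{"Y": "female", "X5": "low"}] * 4
)
with_edge = bic_score(data, {"Y": (), "X5": ("Y",)})
without_edge = bic_score(data, {"Y": (), "X5": ()})
print(with_edge, without_edge)
```

Adding the edge raises the likelihood part but also raises the penalty, which is the trade-off described above.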
The Cooper and Herskovits (CH) algorithm can also be used to correct the network and determine the Bayesian network. Those skilled in the art will understand that any other practicable existing algorithm can also be used for this purpose, which is not limited by the embodiments of the present invention.
Specifically, refer to Fig. 2 and Fig. 3 together. After steps S103 and S104, the determined Bayesian network is as shown in Fig. 3. In contrast to the network shown in Fig. 2, the Bayesian network is a directed acyclic graph (DAG). After the correction, node X5 and node X3 have no association relation; node X5 is a child node of node X4; and node X4 is a child node of node X3.
The Bayesian network determined by the embodiments of the present invention may include nodes representing variables and directed edges connecting these nodes. A node represents a random variable, and a directed edge between nodes represents the association relation between them (pointing from the parent node to its child node). The association strength is expressed as a conditional probability; for a node without a parent, it is expressed as a prior probability. A node can be an abstraction of any problem, such as a test value, an observed phenomenon, or a solicited opinion. Bayesian networks can be used to express and analyze uncertain and probabilistic events, are applied to decisions that depend conditionally on multiple controlling factors, and can perform reasoning from incomplete, imprecise, or uncertain knowledge or information.
The embodiments of the present invention form a network from the label value sources and the preset association relations between them, and establish a directed acyclic Bayesian network by scoring that network for likelihood and complexity. A Bayesian network established in this way can effectively assess the quality of the data to be assessed and improve the accuracy of the assessment. In addition, because the complexity of the network is taken into account when the Bayesian network is built, the search volume during data quality analysis is smaller than with existing Bayesian networks.
In this embodiment, after the Bayesian network is established, the accuracy of the data to be assessed can be judged from the association relations of the nodes of the Bayesian network.
Preferably, step S104 may include the following steps: selecting the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network; or selecting one of the sub-networks whose difference between the likelihood score and the complexity score is greater than a set threshold as the Bayesian network.
In this embodiment, the Bayesian network can be selected from the multiple sub-networks by combining the likelihood score and the complexity score. As mentioned above, the likelihood score characterizes the degree of fit between the constructed Bayesian network and actual samples, and the association strength between the nodes affects the accuracy of subsequent data assessment; the complexity score characterizes the structural complexity of the sub-network, which affects the efficiency of subsequent data assessment. To balance the association strength and the complexity of the Bayesian network, the sub-network with the largest difference between the likelihood score and the complexity score can be selected as the Bayesian network, or one of the sub-networks whose difference is greater than a set threshold can be selected as the Bayesian network.
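The two selection rules can be sketched as follows. The function names, the sub-network names G1 to G3, the score values, and the threshold are all hypothetical and serve only to show the mechanics.

```python
def select_by_max_difference(scores):
    """scores maps a sub-network name to (likelihood_score, complexity_score)."""
    return max(scores, key=lambda name: scores[name][0] - scores[name][1])


def select_by_threshold(scores, threshold):
    """Return the first sub-network whose score difference exceeds the threshold, or None."""
    for name, (likelihood, complexity) in scores.items():
        if likelihood - complexity > threshold:
            return name
    return None


# Hypothetical scores for three candidate sub-networks.
scores = {"G1": (-7.8, 3.1), "G2": (-10.8, 2.1), "G3": (-7.9, 2.0)}
print(select_by_max_difference(scores))
print(select_by_threshold(scores, -11.0))
```

The threshold rule may return a sub-network that is good enough without scanning every candidate, whereas the maximum rule always inspects all of them.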
Preferably, step S103 may include the following steps: calculating the likelihood score according to the degree of fit to the samples; and calculating the complexity score according to the number of nodes having the parent-child relationship and the total number of nodes in the sub-network.
In this embodiment, the better the constructed network fits the actual test samples, the higher the likelihood score; the larger the number of nodes having the parent-child relationship and/or the total number of nodes in the sub-network, the higher the complexity score.
Further, the likelihood score and the complexity score of the sub-networks of the network can be calculated using the Bayesian information criterion.
Preferably, step S102 may include the following steps: determining the preset association relations between the label value sources by looking up a preset association relation list, where a preset association relation includes the association strength and the order between the label value sources; and integrating the preset association relations between the label value sources to form the association relations between the nodes.
In the embodiments of the present invention, the preset association relations can be prior information established in advance. Building the network from prior information means that less has to be learned from samples, which addresses the problem of insufficient samples in practice.
In one specific application scenario of the present invention, the preset association relations can be formed by multiple experts assessing the label value sources. Taking the label value source "opening frequency of sports APPs" (hereinafter, the former) and the label value source "opening frequency of military APPs" (hereinafter, the latter) as an example, the preset association relations may be as follows: with 60% probability the relation between the two is unknown, and with 40% probability the former influences the latter; with 70% probability the relation between the two is unknown, and with 30% probability the latter influences the former; with 40% probability the relation between the two is unknown, with 30% probability the two are unrelated, and with 30% probability the former influences the latter.
The association relation between two nodes can be determined using D-S evidence theory (Dempster-Shafer evidence theory). Specifically, the evidence relation between every two nodes is determined according to the D-S evidence combination formula, and the knowledge fusion result for every two nodes is listed, so as to form the network.
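The fusion of the three example assessments can be sketched with Dempster's rule of combination. This is an illustration under stated assumptions: the frame of discernment (former influences latter, latter influences former, unrelated) and the mapping of "relation unknown" to mass on the whole frame are modelling choices made here, not details given in the text.

```python
from itertools import product

FORWARD, BACKWARD, UNRELATED = "former->latter", "latter->former", "unrelated"
THETA = frozenset({FORWARD, BACKWARD, UNRELATED})  # whole frame: relation unknown


def combine(m1, m2):
    """Dempster's rule: multiply masses, intersect focal sets, renormalise by 1 - conflict."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# The three expert assessments from the example above.
expert1 = {frozenset({FORWARD}): 0.4, THETA: 0.6}
expert2 = {frozenset({BACKWARD}): 0.3, THETA: 0.7}
expert3 = {frozenset({FORWARD}): 0.3, frozenset({UNRELATED}): 0.3, THETA: 0.4}

fused = combine(combine(expert1, expert2), expert3)
for focal_set, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(focal_set), round(mass, 3))
```

Under these assumptions the hypothesis that the former influences the latter receives the largest fused mass, which would orient the edge from node X4 to node X5.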
In a preferred embodiment of the present invention, the data quality analysis preprocessing method shown in Fig. 1 may further include the following step: matching the label value sources in the data to be assessed with the nodes in the Bayesian network, and calculating the accuracy rate of the data to be assessed according to the association relation between every two matched nodes in the matching result.
In this embodiment, for each piece of data to be assessed, its label value sources are compared with the nodes in the Bayesian network, and the conditional probabilities from the matched nodes to the label value node are calculated to determine the accurate label value. If the label value of the data to be assessed is the same as the accurate label value, the data to be assessed is accurate.
For multiple pieces of data to be assessed, the accuracy rate can be calculated according to whether each piece of data to be assessed is accurate.
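The assessment step can be sketched as follows. For brevity the sketch treats the matched sources as conditionally independent given the label, which is a simplification of inference in a full Bayesian network; the equal priors, the record layout, and the function names are assumptions, while the conditional probabilities reuse the example percentages given earlier.

```python
def predict_label(prior, likelihoods, evidence):
    """Return the label value with the highest (unnormalised) posterior."""
    posterior = {}
    for label, p in prior.items():
        for source, value in evidence.items():
            p *= likelihoods[label][source][value]
        posterior[label] = p
    return max(posterior, key=posterior.get)


def accuracy_rate(records, prior, likelihoods):
    """Fraction of records whose supplied label equals the inferred label."""
    correct = sum(
        1 for r in records
        if predict_label(prior, likelihoods, r["sources"]) == r["label"]
    )
    return correct / len(records)


prior = {"male": 0.5, "female": 0.5}  # assumed equal priors
likelihoods = {
    "male": {"makeup": {True: 0.1, False: 0.9}, "military": {True: 0.6, False: 0.4}},
    "female": {"makeup": {True: 0.5, False: 0.5}, "military": {True: 0.1, False: 0.9}},
}
records = [
    {"sources": {"makeup": False, "military": True}, "label": "male"},
    {"sources": {"makeup": True, "military": False}, "label": "female"},
    {"sources": {"makeup": True, "military": False}, "label": "male"},
    {"sources": {"makeup": False, "military": True}, "label": "male"},
]
print(accuracy_rate(records, prior, likelihoods))
```

In this toy set the third record carries a label that disagrees with the inferred one, so it is counted as inaccurate.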
In another specific application scenario of the present invention, when the Bayesian network is established, the preset association relations are first formed by analyzing the relations between the label value sources in advance; the evidence relation between every two nodes is then determined using the D-S evidence combination formula, and the knowledge fusion result for every two nodes is listed. Causal relations that have no practical meaning can also be removed to obtain an initial model. Where the association relation between two nodes is uncertain, a bidirectional edge can be added between them. Bayesian network structure learning is then carried out using an algorithm on the raw data provided by the multiple suppliers, and the final Bayesian network is obtained through evaluation with the BIC algorithm. Compared with the common search algorithms for existing Bayesian networks, the search volume is smaller.
The inventors of the present application verified the effect of the application by experiment. The true gender data of 200 mobile phone users were surveyed offline; 100 records were used for model training (parameter tuning) of the naive Bayes model and of the Bayesian network respectively, and the remaining 100 records were used for online testing to obtain the test results. The test results show that the misjudgment rate of the naive Bayes method is 9%, while the misjudgment rate of the Bayesian network is only 4%. Thus, compared with the naive Bayes method, the Bayesian network noticeably improves the accuracy of the assessment.
Referring to Fig. 4, the data quality analysis preprocessing device 40 may include:
a label value source extraction module 401, adapted to extract the label value sources of data provided by multiple suppliers;
a node determination module 402, adapted to take the label value sources and the corresponding label values as nodes of a network, and to determine the association relation between every two nodes according to preset association relations between the label value sources, where the association relation includes a parent-child relationship and an association strength;
a computing module 403, adapted to calculate a likelihood score and a complexity score for each sub-network of the network, where each sub-network is a directed acyclic network;
a Bayesian network determination module 404, adapted to select, based on the likelihood score and the complexity score, the corresponding sub-network as a Bayesian network for assessing the accuracy of data to be assessed.
The embodiment of the invention forms a network from the label value sources and the preset association relationships between them, and establishes a directed acyclic Bayesian network by scoring the network for likelihood and complexity. A Bayesian network established in this way can effectively assess the data quality of the data to be assessed and improve the accuracy of the assessment. In addition, because the complexity of the network is taken into account when the Bayesian network is built, the search volume during data quality analysis is smaller than with existing Bayesian networks.
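Because every candidate sub-network must be a directed acyclic network, a candidate that still contains a cycle (for example, an unresolved bidirectional edge) would have to be rejected before scoring. A minimal sketch of such a check, using Kahn's topological ordering, is given below. The function name `is_dag` and the edge representation as (parent, child) pairs are assumptions made for illustration only.

```python
from collections import deque


def is_dag(nodes, edges):
    """Return True if the directed graph given by (parent, child) edges has no cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # Start from nodes without parents and repeatedly remove them from the graph.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes on a cycle never reach indegree zero, so they are never visited.
    return visited == len(nodes)
```

A candidate in which both directions of a bidirectional edge are kept fails this check, so only the candidates that keep one direction (or neither) would go on to be scored.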
Preferably, referring to Fig. 5, the Bayesian network determination module 404 may include: a first selection unit 4041, adapted to select the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network; and a second selection unit 4042, adapted to select one of the sub-networks whose difference between the likelihood score and the complexity score exceeds a set threshold as the Bayesian network.
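The two selection strategies can be illustrated with the short sketch below. It assumes that each candidate sub-network has already been scored and is represented as a dictionary with hypothetical keys `name`, `likelihood` and `complexity`. Returning the first candidate that exceeds the threshold is one possible reading of "one of the sub-networks", not something the text specifies.

```python
def select_max_difference(candidates):
    """First strategy: pick the candidate with the largest likelihood minus complexity."""
    return max(candidates, key=lambda c: c["likelihood"] - c["complexity"])


def select_above_threshold(candidates, threshold):
    """Second strategy: pick one candidate whose score difference exceeds the threshold.

    Here the first such candidate is returned; None means that none qualifies.
    """
    for candidate in candidates:
        if candidate["likelihood"] - candidate["complexity"] > threshold:
            return candidate
    return None
```

The second strategy can stop as soon as an acceptable sub-network is found, which may reduce the search volume at the cost of not necessarily returning the best-scoring candidate.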
Preferably, referring to Fig. 6, the computing module 403 may include: a likelihood score computing unit 4031, adapted to calculate the likelihood score according to the strength of association between every two nodes; and a complexity score computing unit 4032, adapted to calculate the complexity score according to the number of nodes in the sub-network that have the parent-child relationship and the total number of nodes.
Further, the computing module 403 may use the Bayesian information criterion (BIC) to calculate the likelihood score and the complexity score of the sub-network of the network.
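For reference, the textbook form of the Bayesian information criterion for a discrete network scores a structure as the maximized log-likelihood minus a penalty of (d / 2) * log N, where d is the number of free parameters and N is the number of samples. The sketch below computes these two terms for a small discrete network. It is a generic illustration and not the patent's own scoring, which derives the terms from the strength of association and from node counts. The function name `bic_terms` and the data layout are assumptions, and the parameter count uses only the parent configurations observed in the data.

```python
import math
from collections import Counter


def bic_terms(data, parents):
    """Return (log_likelihood, penalty) of a discrete DAG under the textbook BIC.

    data: list of dicts mapping node name -> observed value.
    parents: dict mapping node name -> tuple of parent node names.
    """
    n_samples = len(data)
    log_likelihood = 0.0
    free_params = 0
    for node, node_parents in parents.items():
        parent_counts = Counter(tuple(row[p] for p in node_parents) for row in data)
        joint_counts = Counter(
            (tuple(row[p] for p in node_parents), row[node]) for row in data
        )
        # Maximum-likelihood term: sum of count * log(count / parent-configuration count).
        log_likelihood += sum(
            count * math.log(count / parent_counts[config])
            for (config, _), count in joint_counts.items()
        )
        n_states = len({row[node] for row in data})
        free_params += (n_states - 1) * len(parent_counts)
    penalty = 0.5 * free_params * math.log(n_samples)
    return log_likelihood, penalty
```

The difference `log_likelihood - penalty` then plays the role of the score difference used to choose between candidate sub-networks: a structure with more edges fits the data better but pays a larger penalty.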
Preferably, referring to Fig. 7, the node determination module 402 may include: a lookup unit 4021, adapted to determine the preset association relationships between the label value sources by looking up a preset association relationship list, the preset association relationships including the strength of association and the order of precedence between the label value sources; and an integration unit 4022, adapted to integrate the preset association relationships between the label value sources to form the association relationships between nodes.
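The lookup and integration steps might be sketched as follows. The layout of the preset association relationship list is not specified in the text, so the dictionary format, the `strength` and `first` keys, the source names, and the rule that the source coming first in the order of precedence becomes the parent are all assumptions made for illustration.

```python
# Hypothetical preset association list: (source, source) -> strength and precedence.
PRESET_ASSOCIATIONS = {
    ("supplier_a", "supplier_b"): {"strength": 0.8, "first": "supplier_a"},
    ("supplier_b", "supplier_c"): {"strength": 0.5, "first": "supplier_c"},
}


def build_edges(sources, preset):
    """Look up every pair of sources and return (parent, child, strength) edges."""
    edges = []
    for i, source in enumerate(sources):
        for other in sources[i + 1:]:
            # The list may store the pair in either order.
            relation = preset.get((source, other), preset.get((other, source)))
            if relation is None:
                continue  # no preset relationship between these two sources
            parent = relation["first"]
            child = other if parent == source else source
            edges.append((parent, child, relation["strength"]))
    return edges


edges = build_edges(["supplier_a", "supplier_b", "supplier_c"], PRESET_ASSOCIATIONS)
```

Pairs that do not appear in the list simply produce no edge, so the resulting network contains only relationships that were established by the prior analysis.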
The data quality analysis preprocessing device 40 shown in Fig. 4 may also include an evaluation module (not shown), adapted to match the label value sources in the data to be assessed against the nodes in the Bayesian network, and to calculate the accuracy of the data to be assessed according to the association relationship between every two matched nodes in the matching result.
For more details of the working principle and operation of the data quality analysis preprocessing device 40, refer to the related description of Figs. 1 to 3, which is not repeated here.
An embodiment of the invention also discloses a storage medium on which computer instructions are stored; when run, the computer instructions can perform the steps of the data quality analysis preprocessing method shown in Fig. 1. The storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, or the like.
An embodiment of the invention also discloses a terminal. The terminal may include a memory and a processor, the memory storing computer instructions that can run on the processor. When running the computer instructions, the processor can perform the steps of the data quality analysis preprocessing method shown in Fig. 1. The terminal includes, but is not limited to, terminal devices such as mobile phones, computers and tablet computers.
Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the invention; the scope of protection of the invention shall therefore be that defined by the claims.

Claims (14)

  1. A data quality analysis preprocessing method, characterized by comprising:
    extracting the label value sources of the data provided by multiple suppliers;
    taking the label value sources and the corresponding label values as nodes of a network, and determining the association relationship between every two nodes according to the preset association relationships between the label value sources, the association relationship including a parent-child relationship and a strength of association;
    calculating the likelihood score and the complexity score of a sub-network of the network, the sub-network being a directed acyclic network;
    selecting, based on the likelihood score and the complexity score, the corresponding sub-network as a Bayesian network, for use in assessing the accuracy of data to be assessed.
  2. The data quality analysis preprocessing method according to claim 1, characterized in that selecting the corresponding sub-network as the Bayesian network based on the likelihood score and the complexity score comprises:
    selecting the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network;
    or, selecting one of the sub-networks whose difference between the likelihood score and the complexity score exceeds a set threshold as the Bayesian network.
  3. The data quality analysis preprocessing method according to claim 1, characterized in that calculating the likelihood score and the complexity score of the sub-network of the network comprises:
    calculating the likelihood score according to the degree of fit between the network and actual test samples;
    calculating the complexity score according to the number of nodes in the sub-network that have the parent-child relationship and the total number of nodes.
  4. The data quality analysis preprocessing method according to claim 1, characterized in that determining the association relationship between every two nodes according to the preset association relationships between the label value sources comprises:
    determining the preset association relationships between the label value sources by looking up a preset association relationship list, the preset association relationships including the strength of association and the order of precedence between the label value sources;
    integrating the preset association relationships between the label value sources to form the association relationships between nodes.
  5. The data quality analysis preprocessing method according to claim 1, characterized in that the likelihood score and the complexity score of the sub-network of the network are calculated using the Bayesian information criterion.
  6. The data quality analysis preprocessing method according to claim 1, characterized by further comprising:
    matching the label value sources in the data to be assessed against the nodes in the Bayesian network, and calculating the accuracy of the data to be assessed according to the association relationship between every two matched nodes in the matching result.
  7. A data quality analysis preprocessing device, characterized by comprising:
    a label value source extraction module, adapted to extract the label value sources of the data provided by multiple suppliers;
    a node determination module, adapted to take the label value sources and the corresponding label values as nodes of a network, and to determine the association relationship between every two nodes according to the preset association relationships between the label value sources, the association relationship including a parent-child relationship and a strength of association;
    a computing module, adapted to calculate the likelihood score and the complexity score of a sub-network of the network, the sub-network being a directed acyclic network;
    a Bayesian network determination module, adapted to select, based on the likelihood score and the complexity score, the corresponding sub-network as a Bayesian network, for use in assessing the accuracy of data to be assessed.
  8. The data quality analysis preprocessing device according to claim 7, characterized in that the Bayesian network determination module comprises:
    a first selection unit, adapted to select the sub-network with the largest difference between the likelihood score and the complexity score as the Bayesian network;
    a second selection unit, adapted to select one of the sub-networks whose difference between the likelihood score and the complexity score exceeds a set threshold as the Bayesian network.
  9. The data quality analysis preprocessing device according to claim 7, characterized in that the computing module comprises:
    a likelihood score computing unit, adapted to calculate the likelihood score according to the strength of association between every two nodes;
    a complexity score computing unit, adapted to calculate the complexity score according to the number of nodes in the sub-network that have the parent-child relationship and the total number of nodes.
  10. The data quality analysis preprocessing device according to claim 7, characterized in that the node determination module comprises:
    a lookup unit, adapted to determine the preset association relationships between the label value sources by looking up a preset association relationship list, the preset association relationships including the strength of association and the order of precedence between the label value sources;
    an integration unit, adapted to integrate the preset association relationships between the label value sources to form the association relationships between nodes.
  11. The data quality analysis preprocessing device according to claim 7, characterized in that the computing module uses the Bayesian information criterion to calculate the likelihood score and the complexity score of the sub-network of the network.
  12. The data quality analysis preprocessing device according to claim 7, characterized by further comprising:
    an evaluation module, adapted to match the label value sources in the data to be assessed against the nodes in the Bayesian network, and to calculate the accuracy of the data to be assessed according to the association relationship between every two matched nodes in the matching result.
  13. A storage medium on which computer instructions are stored, characterized in that, when run, the computer instructions perform the steps of the data quality analysis preprocessing method according to any one of claims 1 to 6.
  14. A terminal, comprising a memory and a processor, the memory storing computer instructions that can run on the processor, characterized in that, when running the computer instructions, the processor performs the steps of the data quality analysis preprocessing method according to any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711146673.8A CN108038131A (en) 2017-11-17 2017-11-17 Data Quality Analysis preprocess method and device, storage medium, terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711146673.8A CN108038131A (en) 2017-11-17 2017-11-17 Data Quality Analysis preprocess method and device, storage medium, terminal

Publications (1)

Publication Number Publication Date
CN108038131A true CN108038131A (en) 2018-05-15

Family

ID=62094069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711146673.8A Pending CN108038131A (en) 2017-11-17 2017-11-17 Data Quality Analysis preprocess method and device, storage medium, terminal

Country Status (1)

Country Link
CN (1) CN108038131A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876470A (en) * 2018-06-29 2018-11-23 腾讯科技(深圳)有限公司 Tagging user extended method, computer equipment and storage medium
CN109308332A (en) * 2018-08-07 2019-02-05 腾讯科技(深圳)有限公司 A kind of target user's acquisition methods, device and server
CN109308332B (en) * 2018-08-07 2022-05-20 腾讯科技(深圳)有限公司 Target user acquisition method and device and server
CN110362829A (en) * 2019-07-16 2019-10-22 北京百度网讯科技有限公司 Method for evaluating quality, device and the equipment of structured patient record data
CN110362829B (en) * 2019-07-16 2023-01-03 北京百度网讯科技有限公司 Quality evaluation method, device and equipment for structured medical record data
CN113434746A (en) * 2021-06-23 2021-09-24 深圳市酷开网络科技股份有限公司 Data processing method based on user label, terminal equipment and storage medium
CN113434746B (en) * 2021-06-23 2023-10-13 深圳市酷开网络科技股份有限公司 User tag-based data processing method, terminal equipment and storage medium
CN113642986A (en) * 2021-08-02 2021-11-12 上海示右智能科技有限公司 Method for constructing digital notarization
CN113642986B (en) * 2021-08-02 2024-04-16 上海示右智能科技有限公司 Method for constructing digital notarization

Similar Documents

Publication Publication Date Title
CN108038131A (en) Data Quality Analysis preprocess method and device, storage medium, terminal
CN110147722A (en) A kind of method for processing video frequency, video process apparatus and terminal device
CN104778173B (en) Target user determination method, device and equipment
CN103761254B (en) Method for matching and recommending service themes in various fields
CN107958317A (en) A kind of method and apparatus that crowdsourcing participant is chosen in crowdsourcing project
CN108898476A (en) A kind of loan customer credit-graded approach and device
CN110688478B (en) Answer sorting method, device and storage medium
CN110096617B (en) Video classification method and device, electronic equipment and computer-readable storage medium
CN111506820B (en) Recommendation model, recommendation method, recommendation device, recommendation equipment and recommendation storage medium
CN105869016A (en) Method for estimating click through rate based on convolution neural network
CN109597493A (en) A kind of expression recommended method and device
CN107944911A (en) A kind of recommendation method of the commending system based on text analyzing
CN109992781A (en) Processing, device, storage medium and the processor of text feature
CN107368526A (en) A kind of data processing method and device
CN110213660B (en) Program distribution method, system, computer device and storage medium
CN111524043A (en) Method and device for automatically generating litigation risk assessment questionnaire
CN108228950A (en) A kind of information processing method and device
CN109543041A (en) A kind of generation method and device of language model scores
CN110162769A (en) Text subject output method and device, storage medium and electronic device
CN110765352B (en) User interest identification method and device
CN112148994A (en) Information push effect evaluation method and device, electronic equipment and storage medium
CN111523604A (en) User classification method and related device
CN114048294B (en) Similar population extension model training method, similar population extension method and device
CN109033078B (en) The recognition methods of sentence classification and device, storage medium, processor
CN112541010A (en) User gender prediction method based on logistic regression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180515