CN106844718A - Method and apparatus for determining a data set - Google Patents

Info

Publication number
CN106844718A
CN106844718A CN201710069739.1A CN201710069739A CN106844718A CN 106844718 A CN106844718 A CN 106844718A CN 201710069739 A CN201710069739 A CN 201710069739A CN 106844718 A CN106844718 A CN 106844718A
Authority
CN
China
Prior art keywords
data
attribute
instance
attribute set
acquisition system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710069739.1A
Other languages
Chinese (zh)
Other versions
CN106844718B (en
Inventor
何彬彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201710069739.1A
Publication of CN106844718A
Application granted
Publication of CN106844718B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2425 Iterative querying; Query formulation based on the results of a preceding query
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and apparatus for determining a data set. The method includes: receiving an instruction to obtain a target data set from multiple first data sets, where the data of the target data set is used for data analysis; determining, according to a first attribute set, a target probability for each instance in each first data set, where the target probability is the probability that the instance belongs to a target type and the first attribute set includes attributes that indicate data of the target type; determining acquisition quality information for each first data set based on the target probabilities of all instances in that data set, where the acquisition quality information indicates the quality of the first data set collected according to the target type; and determining, among the multiple first data sets, the one whose acquisition quality information meets a preset quality requirement as the target data set for data analysis. The invention solves the technical problem in the related art that a higher-quality data set cannot be obtained.

Description

Method and apparatus for determining a data set
Technical field
The present invention relates to the field of data analysis, and in particular to a method and apparatus for determining a data set.
Background
Data analysis is the process of analyzing large amounts of collected data with appropriate statistical methods, extracting useful information, forming conclusions, and studying and summarizing the data in detail. Data analysis is usually associated with computer science and achieves these goals through methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), and pattern recognition.
Data analysis has a very wide range of applications. A typical data analysis may include the following steps:
Step 1, data acquisition: multiple copies of data are collected in a set manner, and the one or more copies with higher confidence are then used for data analysis.
Step 2, exploratory data analysis: when data is first obtained it may be disorganized, with no visible pattern. By plotting, tabulating, fitting equations of various forms, and computing characteristic quantities, the analyst explores the possible form of the regularity, that is, in which direction and by what means to look for and reveal the regularity hidden in the data.
Step 3, model selection: on the basis of the exploratory analysis, one or several classes of possible models are proposed, and a model is then selected from them through further analysis.
Step 4, inferential analysis: mathematical statistics is typically used to draw inferences about the reliability and accuracy of the chosen model or estimate.
Step 1 is especially important in the whole data analysis process: only by choosing data with higher confidence is it possible to obtain accurate analysis results.
After data acquisition yields multiple copies of data, choosing the one or more higher-quality copies is vital to the analysis. If data containing a lot of noise is chosen, the analysis will directly produce wrong results. At present, data is chosen mainly at random or by the user based on experience, so lower-quality data may be chosen.
No effective solution has yet been proposed for the problem in the related art that a higher-quality data set cannot be obtained.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for determining a data set, to solve at least the technical problem in the related art that a higher-quality data set cannot be obtained.
According to one aspect of the embodiments of the present invention, a method for determining a data set is provided, including: receiving an instruction to obtain a target data set from multiple first data sets, where each first data set includes at least one instance collected according to a target type, and the data of the target data set is used for data analysis; determining, according to a first attribute set, a target probability for each instance in each first data set, where the target probability is the probability that the instance belongs to the target type, and the first attribute set includes attributes that indicate data of the target type; determining acquisition quality information for each first data set based on the target probabilities of all instances in that data set, where the acquisition quality information indicates the quality of the first data set collected according to the target type; and determining, among the multiple first data sets, the one whose acquisition quality information meets a preset quality requirement as the target data set for data analysis.
According to another aspect of the embodiments of the present invention, an apparatus for determining a data set is also provided, including: a receiving unit, configured to receive an instruction to obtain a target data set from multiple first data sets, where each first data set includes at least one instance collected according to a target type, and the data of the target data set is used for data analysis; a first determining unit, configured to determine, according to a first attribute set, a target probability for each instance in each first data set, where the target probability is the probability that the instance belongs to the target type, and the first attribute set includes attributes that indicate data of the target type; a second determining unit, configured to determine acquisition quality information for each first data set based on the target probabilities of all instances in that data set, where the acquisition quality information indicates the quality of the first data set collected according to the target type; and a third determining unit, configured to determine, among the multiple first data sets, the one whose acquisition quality information meets a preset quality requirement as the target data set for data analysis.
In the embodiments of the present invention, when an instruction to obtain a target data set from multiple first data sets is received, the probability that each instance in a first data set belongs to the target type is determined through the first attribute set. The acquisition quality information of the first data set is then determined from the target probabilities of all its instances, and a target data set that meets the preset quality requirement is selected for data analysis. This solves the technical problem in the related art that a higher-quality data set cannot be obtained, achieves the technical effect of obtaining a higher-quality data set, and ensures the reliability of the analysis results.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the hardware environment of the method for determining a data set according to an embodiment of the present invention;
Fig. 2 is a flowchart of an optional method for determining a data set according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the software modules of the method for determining a data set according to an embodiment of the present invention;
Fig. 4 is a flowchart of an optional method for determining a data set according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of an optional apparatus for determining a data set according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an optional apparatus for determining a data set according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of an optional apparatus for determining a data set according to an embodiment of the present invention; and
Fig. 8 is a structural block diagram of a terminal according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not describe a specific order or sequence. Data used in this way can be interchanged where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
First, some nouns and terms that appear in the description of the embodiments of the present invention are explained as follows:
An ontology is formally defined in one of two ways, as a five-tuple or as a seven-tuple. The five-tuple is $O = (C, R, H^C, Rel, A^o)$, where $C$ is the set of concepts, $R$ is the set of relations, $H^C$ is the hierarchy of concepts, $Rel$ is the relations between concepts, and $A^o$ is the ontology axioms. The seven-tuple is $O = \{C, A^C, R, A^R, H, I, X\}$, where $C$ is the set of concepts, $A^C$ is the set of concept attributes, $R$ is the set of relations, $A^R$ is the set of relation attributes, $H$ is the set of hierarchies, $I$ is the set of instances, and $X$ is the set of axioms.
Concepts, also called classes, are sets of objects in a domain that share the same properties, such as Animal, Person, and Organization. In RDF they are defined with the predefined "rdfs:Class". RDFS (Resource Description Framework Schema) also provides predefined classes for some simple data types, such as integer (xs:Integer) and string (xs:String).
An instance is the materialization of a concept or class. For example, Obama is an instance of the concept "Person".
RDF (Resource Description Framework): in 1999 the World Wide Web Consortium (W3C) proposed RDF as a standard language for describing WWW resources. RDF is a major ontology description language and provides a specification of information description for applications on the internet. RDF describes resources on the Web as triples of the form "<subject, predicate, object>". It has become one of the standards for ontology description and is widely used to describe the Semantic Web and metadata.
Is-a relations: an RDF knowledge base is usually divided into two parts, the TBox and the ABox. The TBox expresses the relations between concepts in the knowledge base, and an is-a relation in the TBox expresses the hypernym-hyponym relation between concepts, that is, the subclass-of relation. For example, subclass-of(Writer, Person) states that "Writer" is a subclass of "Person". The ABox, by contrast, mainly contains relations between instances, and an is-a relation in the ABox states that an instance belongs to a concept, that is, the instance-of relation. For example, that Tom is an instance of Person is usually written instance-of(Tom, Person). The subclass-of relation is abstracted in order to express the hierarchy between concepts formally. The instance-of relation reflects the relation between instance and category and is the basis for connecting the concept layer and the instance layer. Is-a relations are therefore the basis of several key ontology technologies, such as reasoning and consistency checking.
It should be noted that an is-a relation in an ontology reflects the relation between instance and category and is the basis for connecting the concept layer and the instance layer. For example, that Tom is an instance of Person is usually written "Tom is-a Person". Tom is not an instance of Organization, so Tom can be called a counterexample of Organization.
In RDF, "Instance is-a Type" is the usual way of expressing the assertion Type(a) in an ontology. In the unary relation Type(a), a denotes instance information in the knowledge base and Type denotes category or concept information in the knowledge base. We call this unary relation a Type assertion.
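The TBox/ABox split described above can be illustrated with a small sketch. This is only a toy model under stated assumptions: the in-memory sets, the function names `superclasses` and `is_a`, and the instance "Alice" are hypothetical and do not come from the patent or from any RDF library.

```python
# Toy knowledge base (hypothetical data, for illustration only).
SUBCLASS_OF = {("Writer", "Person")}                      # TBox: subclass-of(child, parent)
INSTANCE_OF = {("Tom", "Person"), ("Alice", "Writer")}    # ABox: instance-of(instance, concept)


def superclasses(concept, subclass_of=SUBCLASS_OF):
    """Return the concept plus every concept reachable through subclass-of."""
    seen = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for child, parent in subclass_of:
            if child == current and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(instance, concept, instance_of=INSTANCE_OF, subclass_of=SUBCLASS_OF):
    """True if the Type assertion concept(instance) follows from the toy knowledge base."""
    return any(
        concept in superclasses(asserted, subclass_of)
        for inst, asserted in instance_of
        if inst == instance
    )
```

Here `is_a("Tom", "Organization")` is false, which corresponds to Tom being a counterexample of Organization in the sense used above.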
Embodiment 1
According to an embodiment of the present invention, a method embodiment of a method for determining a data set is provided.
Optionally, in this embodiment, the method for determining a data set can be applied to the hardware environment shown in Fig. 1, which consists of a server 102 and a terminal 104. As shown in Fig. 1, the server 102 is connected to the terminal 104 through a network. The network includes but is not limited to a wide area network, a metropolitan area network, or a local area network, and the terminal 104 includes but is not limited to a PC, a mobile phone, or a tablet computer. The method of the embodiment of the present invention can be performed by the server 102, by the terminal 104, or jointly by the server 102 and the terminal 104. When the terminal 104 performs the method, it can also be performed by a client installed on the terminal.
Fig. 2 is a flowchart of an optional method for determining a data set according to an embodiment of the present invention. As shown in Fig. 2, the method may include the following steps:
Step S202: receive an instruction to obtain a target data set from multiple first data sets, where each first data set includes at least one instance collected according to a target type, and the data of the target data set is used for data analysis;
Step S204: determine, according to a first attribute set, a target probability for each instance in each first data set, where the target probability is the probability that the instance belongs to the target type, and the first attribute set includes attributes that indicate data of the target type;
Step S206: determine acquisition quality information for each first data set based on the target probabilities of all instances in that data set, where the acquisition quality information indicates the quality of the first data set collected according to the target type;
Step S208: determine, among the multiple first data sets, the one whose acquisition quality information meets a preset quality requirement as the target data set for data analysis.
Through steps S202 to S208, when an instruction to obtain a target data set from multiple first data sets is received, the probability that each instance in a first data set belongs to the target type is determined through the first attribute set. The acquisition quality information of the first data set is then determined from the target probabilities of all its instances, and a target data set that meets the preset quality requirement is selected for data analysis. This solves the technical problem in the related art that a higher-quality data set cannot be obtained, achieves the technical effect of obtaining a higher-quality data set, and ensures the reliability of the analysis results.
The data set above consists of instance data that was collected according to the target type (that is, a concept or class) and that satisfies the is-a relation. Instance data can be acquired by automatic extraction or by heterogeneous data integration. For example, the knowledge base DBpedia obtains instance data by extracting pages of Wikipedia.
The data analysis above means discovering and revealing the regularities hidden in data by mining and processing the data.
The attribute information in the first attribute set can be attribute information that describes the target type. From this attribute information it can be judged whether an instance belongs to the target type.
The acquisition quality information can be information describing acquisition quality, such as the accuracy, distribution, and degree of mixing of all instance data in the first data set.
In the embodiments of this application, the method can be used in data processing to select, from multiple copies of data, the data with better acquisition quality. It mainly includes the following steps: mining one or more judgement attribute sets for each concept C through class association rule mining, and computing the confidence that each judgement set belongs to concept C; matching the attributes of each instance against the judgement attribute sets of each concept C to obtain the confidence of each instance's is-a relation; and evaluating the quality of the concept in the ontology with the two proposed measures. The embodiments of this application are described in detail below with reference to Fig. 2:
In the technical solution provided by step S202, a user performing data analysis must first obtain well-collected data. The acquisition can be automatic, that is, the computer can receive the instruction to obtain the target data set from the multiple first data sets.
In the technical solution provided by step S204, before the target probability of each instance in each first data set is determined according to the first attribute set, a second data set is obtained, where every datum in the second data set belongs to the target type. Data mining is performed on the second data set to obtain the first attribute set.
With the technical solution of this application, the quality of the is-a relations (that is, of the target type or concept) in an ontology can be assessed. The main challenge in this process is how to find the instances in the ontology whose is-a relation is wrong. After careful study, the applicant found that each concept C_i has exactly one attribute set P_c = {p_1, p_2, ..., p_n}, where P_c is a subset of the attributes P in the knowledge base, and that there exists at least one subset DP_c of P_c that can be used to describe the concept C_i. DP_c is called a judgement attribute set (that is, a first attribute set). If the attributes of an instance belong to a judgement attribute set of some concept C_i, the instance probably belongs to C_i; otherwise the instance is probably noise data. For example, an instance of Country usually has the attribute Caption (i.e. capital), and an instance of Person usually has the attribute Birthday. Common sense says that a country has a capital and a person has a birthday. If an instance of Country has the attribute Birthday, that instance is very likely noise data.
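The Country/Person example can be sketched as a simple membership check. This is a minimal illustration, assuming one hand-written judgement attribute set per concept; the dictionary `JUDGEMENT_SETS` and the helpers `matches_concept` and `likely_noise` are hypothetical names, and the rule "an instance that satisfies another concept's judgement set is suspicious" is one possible reading of the paragraph above, not the patent's exact criterion.

```python
# Hypothetical judgement attribute sets ("first attribute sets") per concept.
JUDGEMENT_SETS = {
    "Country": [{"Caption"}],    # Caption stands for the capital attribute in the text
    "Person": [{"Birthday"}],
}


def matches_concept(attrs, concept, judgement_sets=JUDGEMENT_SETS):
    """True if the instance carries every attribute of some judgement set of the concept."""
    return any(js <= attrs for js in judgement_sets[concept])


def likely_noise(attrs, concept, judgement_sets=JUDGEMENT_SETS):
    """Flag an instance filed under `concept` that satisfies another concept's judgement set."""
    return any(
        matches_concept(attrs, other, judgement_sets)
        for other in judgement_sets
        if other != concept
    )
```

For instance, an entry filed under Country with attributes `{"Name", "Birthday"}` is flagged, while one with `{"Caption", "Population"}` is not.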
A class association rule mining algorithm can be used to compute the judgement attribute sets of each class. Instances and concepts are then matched with a matching rule, and the confidence of the matched judgement attribute set is taken as the posterior probability that the instance belongs to the concept, that is, the target probability.
Define an instance E(a_1, a_2, ..., a_n), where a_i is an attribute of E. The probability that E belongs to class C_i is then p(C_i | E) = p(C_i | a_1, a_2, ..., a_n), which can be obtained by counting. However, because the data itself is unreliable and contains atypical attributes, direct counting can produce large errors. An atypical attribute is one that occurs with very low frequency and cannot express any class. For this case the application proposes using class association rule mining to find the set of association rules (that is, first attribute sets) that best represent class C_i; these attribute sets are called judgement attribute sets. Then, according to a matching rule, the association rule (first attribute set) closest to the instance, (s_1, s_2, ..., s_n), is found, and its confidence is the value closest to the true p(C_i | E).
Association rule mining algorithms such as Apriori and FP-tree can mine strong association rules (that is, first attribute sets) and their confidence, but they may also compute association rules unrelated to class C_i, which causes redundant information and extra memory overhead. To overcome this problem, the application preferably uses the CAR-Apriori algorithm from class association rule mining, which mines only the association rules related to class C_i and their confidence, and takes the confidence as the probability that an instance belongs to the class.
The CAR-Apriori algorithm mines association rules for a specified category. By adjusting the support threshold, it mines the attribute sets that can represent the category together with their confidence. Although an instance in an ontology does not have many attributes, under the basic idea of CAR-Apriori different attributes form combinations of different frequencies, and the number of combinations grows exponentially. To reduce overhead, part of the data is filtered in actual processing: attributes that occur with high frequency in every category are removed, because such attributes provide no information about which type an instance belongs to. They are similar to stop words in text mining, such as "is" and "the".
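The idea of mining class-specific rules with support and confidence can be sketched as follows. This is a brute-force enumeration over small attribute combinations, not the CAR-Apriori algorithm itself (it has no level-wise candidate pruning); the function name `mine_class_rules`, the parameters `min_support`, `max_size`, and `stop_ratio`, and the sample data are all assumptions made for illustration.

```python
from itertools import combinations


def mine_class_rules(instances, target, min_support=0.3, max_size=2, stop_ratio=0.9):
    """Return {attribute set: confidence} for rules 'attribute set -> target'.

    instances: list of (set of attribute names, class label) pairs.
    Support is the share of all instances that contain the attribute set and
    belong to the target class; confidence is the share of instances containing
    the attribute set that belong to the target class.
    """
    total = len(instances)
    by_class = {}
    for attrs, label in instances:
        by_class.setdefault(label, []).append(attrs)

    # Drop attributes that are frequent in every class (the "stop word" filter).
    all_attrs = set().union(*(attrs for attrs, _ in instances))
    stop_attrs = {
        a for a in all_attrs
        if all(sum(a in attrs for attrs in members) / len(members) >= stop_ratio
               for members in by_class.values())
    }

    candidates = sorted(set().union(*by_class.get(target, [])) - stop_attrs)
    rules = {}
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            itemset = frozenset(combo)
            labels = [label for attrs, label in instances if itemset <= attrs]
            hits = sum(label == target for label in labels)
            if hits == 0 or hits / total < min_support:
                continue
            rules[itemset] = hits / len(labels)
    return rules


# Hypothetical sample: the last Person entry wrongly carries Caption.
INSTANCES = [
    ({"Name", "Caption", "Population"}, "Country"),
    ({"Name", "Caption"}, "Country"),
    ({"Name", "Birthday"}, "Person"),
    ({"Name", "Birthday", "Caption"}, "Person"),
]
RULES = mine_class_rules(INSTANCES, "Country", min_support=0.5)
```

In this sample, "Name" occurs in every instance of both classes and is filtered out, so the only rule that reaches the support threshold is {Caption} -> Country, with a confidence of two thirds.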
Optionally, after the first attribute set is obtained, the target probability of each instance in each first data set can be determined according to the first attribute set as follows: obtain the second attribute set of each instance, where the second attribute set includes the attribute information that the instance has; then determine the target probability of each instance based on the first attribute set and the second attribute set of that instance.
There is at least one first attribute set. Before the target probability of each instance is determined based on the first attribute set and the second attribute set of that instance, the confidence of each first attribute set can be obtained. The confidence indicates the probability that data having all the attribute information in the first attribute set belongs to the target type.
When the target probability of each instance is determined based on the first attribute set and the second attribute set of that instance, the second attribute set of the instance can be matched against each first attribute set to determine the target matching degree of the instance relative to each first attribute set, so that each instance obtains at least one target matching degree. The confidence of the first attribute set that corresponds to the maximum of these target matching degrees is taken as the target probability of the instance.
Optionally, matching the second attribute set of an instance against each first attribute set to determine the target matching degree of the instance relative to each first attribute set can be implemented as follows: determine a first matching degree between the attribute information in the second attribute set of the instance and the attribute information in the first attribute set; determine a second matching degree between the attribute information in the first attribute set and the attribute information in the second attribute set of the instance; and determine the target matching degree of the instance relative to the first attribute set according to the first matching degree and the second matching degree.
Determining the first matching degree includes: determining the number of items of target attribute information in the second attribute set that match attribute information in the first attribute set, for example the number of items in the second attribute set that are identical to attribute information in the first attribute set; and taking the ratio of that number to the number of items of attribute information in the first attribute set as the first matching degree.
Determining the second matching degree includes: determining the number of items of target attribute information in the first attribute set that match attribute information in the second attribute set, for example the number of items in the first attribute set that are identical to attribute information in the second attribute set; and taking the ratio of that number to the number of items of attribute information in the second attribute set as the second matching degree.
Determining the target matching degree of the instance relative to the first attribute set according to the first matching degree and the second matching degree includes: taking the sum of the first matching degree and the second matching degree as the target matching degree, or taking the product of the first matching degree and the second matching degree as the target matching degree.
In the technical solution provided in step S206, determining the acquisition quality information of each first data set based on the target probabilities of all instance data in that data set includes at least one of the following: determining a first average value of the target probabilities of all instance data in the first data set, where the first average value indicates the accuracy of the data collected according to the target type; and determining the entropy of the target probabilities of all instance data in the first data set, where the entropy indicates the degree of mixing of the data collected according to the target type. The acquisition quality information includes the first average value and/or the entropy.
Optionally, determining the entropy of the target probabilities of all instance data in the first data set includes: determining the entropy by applying a logarithm operation to the target probabilities of all instance data in the first data set.
The application proposes two measures for evaluating the quality of is-a relations within a concept. The first uses the average probability Z(Ci) to assess the correctness of the is-a relations; the formula is as follows:
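Because Z(Ci) is described as the average of the target probabilities of the instances collected under class Ci, it can plausibly be written as below. The symbols |C_i| (the number of instances in the class) and p(C_i | e) (the target probability of instance e) are notation assumed here for illustration.

```latex
Z(C_i) = \frac{1}{|C_i|} \sum_{e \in C_i} p(C_i \mid e)
```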
It should be noted that Z(Ci) cannot reflect the distribution of the data; that is, it cannot express the degree of mixing of the instances in the class.
Therefore, the application further proposes using information entropy, denoted M(Ci), to reflect the distribution of the data: the higher the degree of mixing, the larger the entropy. The probabilities that instances belong to a class are partitioned into several intervals; the probability of falling in the i-th interval is denoted qi, the number of intervals is n, and M(Ci) is computed as follows:
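Read as the standard information entropy over the n probability intervals, the measure can be sketched as follows. The interval index is written k here only to avoid clashing with the class index i; the logarithm base is not stated in the text and is an assumption, and terms with q_k = 0 are taken to contribute zero.

```latex
M(C_i) = -\sum_{k=1}^{n} q_k \log q_k
```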
With the rapid development of Semantic Web technology, ontologies have been applied in more and more fields, and ontology evaluation has become an essential part of Semantic Web technology. The application proposes a method for evaluating the quality of is-a relations in RDF data: the probability that an instance belongs to its class is computed based on class association rules, the degree of mixing of a class is expressed by entropy, and the correctness of the is-a relations is expressed by the mean probability that the instances in a class belong to that class. Together, the two measures reflect the quality of is-a relations in RDF data fairly comprehensively and correctly. This gives ontology builders an evaluation reference for finding problems in the knowledge base, and gives ontology users a reference for selecting the "best" ontology.
In the technical solution provided in step S208, determining that a first data set whose acquisition quality information meets the preset quality requirement is the target data set for data analysis includes: taking, as the target data set, a first data set among the multiple first data sets whose first average value reaches a first preset value and/or whose entropy reaches a second preset value.
The first preset value and the second preset value are set in advance according to requirements; with these values, the data sets of better acquisition quality can be selected.
In the above embodiment, the method of the application can select the data sets of better acquisition quality from multiple data sets for data analysis, which helps to obtain correct analysis results.
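The selection in step S208 can be illustrated with a minimal Python sketch. The function name, the threshold values, and the quality figures are hypothetical. The sketch also assumes that a data set "reaches" the second preset value when its entropy does not exceed it, since a lower entropy means less mixing; the text does not state the direction explicitly.

```python
def select_target_sets(quality, min_mean, max_entropy, require_both=True):
    """Return the names of data sets that meet the preset quality requirement.

    quality maps a data set name to (first average value, entropy).
    Assumption: the mean must be at least min_mean and the entropy at most
    max_entropy; require_both switches between the "and" and "or" readings.
    """
    selected = []
    for name, (mean_value, entropy) in quality.items():
        mean_ok = mean_value >= min_mean
        entropy_ok = entropy <= max_entropy
        passed = (mean_ok and entropy_ok) if require_both else (mean_ok or entropy_ok)
        if passed:
            selected.append(name)
    return selected


# Illustrative, made-up quality figures for three first data sets.
QUALITY = {"A": (0.66, 0.50), "B": (0.66, 0.95), "C": (0.40, 0.30)}

print(select_target_sets(QUALITY, min_mean=0.6, max_entropy=0.6))
print(select_target_sets(QUALITY, min_mean=0.6, max_entropy=0.6, require_both=False))
```

With both conditions required, only the data set that is accurate and weakly mixed is kept; with the "or" reading, a data set passes if either measure is acceptable.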
The method of the application can also be used to evaluate the quality of is-a relations in RDF. Specifically, after the acquisition quality information of each first data set is determined based on the target probabilities of all instance data in that data set, a second average value of the first average values of the multiple first data sets is obtained, where the data in the multiple first data sets satisfy a preset relation, the preset relation indicates data and the type to which the data belong, and the second average value indicates the accuracy of the data collected according to the preset relation; and a third average value of the entropies of the multiple first data sets is obtained, where the third average value indicates the degree of mixing of the data collected according to the preset relation.
In embodiments of the application, Z(O) may be used to describe the correctness of the data quality and M(O) to describe the degree of mixing within classes, where O denotes an ontology and Cj denotes the j-th class. The higher the degree of mixing, the larger M(O); the more correct the is-a relations in the data, the larger Z(O). Z(O) and M(O) are computed as follows:
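Since the second and third average values are described as the averages, over the classes of the ontology, of the per-class mean probabilities and per-class entropies, the two ontology-level measures can plausibly be written as below. The symbol m, the number of classes in ontology O, is notation assumed here.

```latex
Z(O) = \frac{1}{m} \sum_{j=1}^{m} Z(C_j), \qquad
M(O) = \frac{1}{m} \sum_{j=1}^{m} M(C_j)
```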
The application proposes a method based on class association rule mining to evaluate the quality of is-a relations in an ontology, and proposes two measures to evaluate the quality of classes and concepts. The aim is to ensure ontology quality and enable effective ontology maintenance, thereby providing a reference for ontology builders and a basis for selection for ontology users.
Embodiments of the application are described in detail below with reference to the implementation shown in Fig. 3. In software, the method of the application can be divided into four modules, as shown in Fig. 3:
Data preprocessing module 32, for extracting data from the knowledge base, building the transaction table for class association rule mining, and mining the judgment attribute sets of each concept C in the transaction table together with their confidences.
Probability calculation module 34, for matching instances against concepts according to the judgment attribute sets of C.
Quality calculation module 36, for computing the two measures of concept quality.
Quality evaluation module 38, for completing the quality evaluation of the Type assertions in each concept and outputting the evaluation data and logs for reference.
Step S402: the data preprocessing module obtains data from a data source (for example, obtains terminology data);
Step S404: obtain the data table and build the entity, data, and class matrix;
Step S406: perform class association rule mining on the matrix to obtain the strong association rule sets of each class and their confidences;
Step S408: match each instance against the strong association rule sets of the class to obtain the target probability of the instance data;
Step S410: obtain the weighted probability sum and the entropy of the class by partitioning the probabilities into intervals;
Step S412: compute the is-a index scores of the data set and output a log so that the user can review the result.
(1) data preprocessing module
The data preprocessing module serves the subsequent association rule mining. The module obtains the attributes and type information of instances from the data source through SPARQL queries and then builds the transaction table T used for class association rule mining. The transaction table T divides each transaction (Transaction) into two parts. The first part is Tp = {tp1, tp2, ..., tpn}, where each element of Tp is an attribute set (i.e., a second attribute set) and tpn denotes the attribute set of the n-th transaction. The second part is Tc = {C1, C2, ..., Cn}, where Cn denotes the concept to which the n-th transaction belongs. The application adopts the closed world assumption (Closed World Assumption): if an instance does not contain a certain attribute, the instance is deemed not to have that attribute. The final transaction table is shown in Table 1:
Table 1
Instance Name    name    Birthday    height    weight    Class
Aaron_Line       1       1           1         1         Person
Washington       1       0           0         0         Place
Bummer           1       1           1         1         Person
Edmond           1       1           1         0         Person
……
In Table 1, name, Birthday, height and weight are attributes, and Class is the class (i.e., data type or concept). A value of 1 indicates that the instance has the attribute and 0 that it does not.
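A minimal Python sketch of how such a transaction table could be assembled is given below. The function name and the input layout (instance name mapped to its attribute list and class) are assumptions made for illustration; the rows reproduce Table 1, and the SPARQL retrieval step is left out.

```python
def build_transaction_table(instances, attributes):
    """Return one (instance name, 0/1 flags, class) row per instance."""
    table = []
    for name, (attrs, cls) in instances.items():
        present = set(attrs)
        # Closed world assumption: an attribute that is not listed is absent (0).
        flags = [1 if attribute in present else 0 for attribute in attributes]
        table.append((name, flags, cls))
    return table


ATTRIBUTES = ["name", "Birthday", "height", "weight"]

# Hypothetical input that mirrors the rows of Table 1.
INSTANCES = {
    "Aaron_Line": (["name", "Birthday", "height", "weight"], "Person"),
    "Washington": (["name"], "Place"),
    "Bummer": (["name", "Birthday", "height", "weight"], "Person"),
    "Edmond": (["name", "Birthday", "height"], "Person"),
}

TABLE = build_transaction_table(INSTANCES, ATTRIBUTES)
for row in TABLE:
    print(row)
```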
(2) probability evaluation entity
The application uses a class association rule mining algorithm to compute the judgment attribute sets of each class, then matches instances against concepts according to a matching rule, and takes the confidence of the matched judgment attribute set as the posterior probability that the instance belongs to the concept. The specific matching rule is as follows:
After the association rules (i.e., first attribute sets) representing each class are obtained, a matching strategy is needed to find the judgment set most similar to the instance; the confidence of that judgment set then represents the probability that the instance belongs to the class. The matching strategy proposed by the application matches the attribute set E of an instance against a judgment set Ni. The ratio of the number of attributes in the matching set, |S|, to |Ni| expresses the accuracy of the match, and the ratio of |S| to |E| serves as its contribution value. The candidates are ranked by the product of the two, and the confidence of the best-matching candidate is taken as the probability that instance E belongs to the class. The specific formula is as follows:
Here, N is the set of candidate sets Ni, S is the matching item set (the attributes shared by E and Ni), and E is the attribute set of the instance.
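The matching strategy described above can be written compactly as follows. The names score and conf, and the subscripted matching set S_i, are notation assumed here for illustration; the product form follows the description of ranking candidates by the product of the two ratios.

```latex
S_i = E \cap N_i, \qquad
\mathrm{score}(E, N_i) = \frac{|S_i|}{|N_i|} \cdot \frac{|S_i|}{|E|}, \qquad
p(C \mid E) = \mathrm{conf}\Bigl(\arg\max_{N_i \in N} \mathrm{score}(E, N_i)\Bigr)
```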
To further illustrate how the probability that an instance belongs to a class is computed, a concrete example follows. Suppose that mining for the Person class yields the following two association rule sets (i.e., first attribute sets):
[[birthday, name, age, address] -> person; confidence = 0.9];
[[birthday, gender, name, graduation, email, tell, blogAddress] -> person; confidence = 0.8].
In these attribute sets, birthday, name, age, address, gender, graduation, email, tell and blogAddress are the judgment attribute information of person, and confidence denotes the confidence.
Given an instance E = [birthday, name, gender, age], the matching strategy gives a matching score of 3/4 * 3/4 with the first candidate and 3/7 * 3/4 with the second candidate. The candidate that best matches E is therefore [[birthday, name, age, address] -> person; confidence = 0.9], so the probability that the instance belongs to class Person is 0.9.
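The worked example above can be reproduced with a short Python sketch. The function names and the representation of a rule as an (attribute list, confidence) pair are assumptions made for illustration; the product of the two ratios is used as the score, as in the example.

```python
def match_score(instance_attrs, rule_attrs):
    """Product of |S|/|Ni| (accuracy) and |S|/|E| (contribution)."""
    e, n = set(instance_attrs), set(rule_attrs)
    if not e or not n:
        return 0.0
    s = e & n  # matching item set S
    return (len(s) / len(n)) * (len(s) / len(e))


def target_probability(instance_attrs, rules):
    """Confidence of the rule whose attribute set best matches the instance."""
    best_attrs, best_confidence = max(
        rules, key=lambda rule: match_score(instance_attrs, rule[0])
    )
    return best_confidence


# The two Person rules and the instance E from the example.
RULES = [
    (["birthday", "name", "age", "address"], 0.9),
    (["birthday", "gender", "name", "graduation", "email", "tell", "blogAddress"], 0.8),
]
E = ["birthday", "name", "gender", "age"]

print(match_score(E, RULES[0][0]))   # 3/4 * 3/4 = 0.5625
print(target_probability(E, RULES))  # 0.9
```

Ties between candidates are resolved here in favor of the first rule listed, which is a choice of this sketch and not something the text specifies.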
(3) Quality Calculation Module
The application evaluates the quality of is-a relations within a concept with two measures. The first uses the average probability Z(Ci) to assess the correctness of the is-a relations.
However, Z(Ci) cannot reflect the distribution of the data; that is, it cannot express the degree of mixing of the instances in the class. For example, suppose the probability distributions of two groups of instances A and B are A: {0.1, 0.8, 0.8, 0.8, 0.8} and B: {0.2, 0.4, 0.9, 0.9, 0.9}; both have an average of 0.66. In fact, group A contains essentially only one noise item, whereas group B probably contains two, so the data quality of group A should be higher, yet the two scores are equal. This happens because Z(Ci) ignores the distribution of the data, so the application also uses information entropy, denoted M(Ci), to reflect the data distribution. The higher the degree of mixing, the larger the entropy.
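The contrast between groups A and B can be checked with a small Python sketch. The function names are hypothetical, and two details are assumptions because the text does not fix them: the probabilities are split into five equal-width intervals on [0, 1], and the natural logarithm is used.

```python
import math


def mean_probability(probs):
    """Z: average of the target probabilities of the instances in a class."""
    return sum(probs) / len(probs)


def interval_entropy(probs, n_intervals=5):
    """M: entropy of the share of instances falling in each probability interval."""
    counts = [0] * n_intervals
    for p in probs:
        # Equal-width intervals on [0, 1]; p == 1.0 goes into the last interval.
        index = min(int(p * n_intervals), n_intervals - 1)
        counts[index] += 1
    total = len(probs)
    # Empty intervals contribute nothing to the sum.
    return -sum((c / total) * math.log(c / total) for c in counts if c)


A = [0.1, 0.8, 0.8, 0.8, 0.8]
B = [0.2, 0.4, 0.9, 0.9, 0.9]

print(mean_probability(A), mean_probability(B))
print(interval_entropy(A), interval_entropy(B))
```

Under these assumptions both groups have the same mean, while group A occupies two intervals and group B three, so the entropy of A is lower than that of B, in line with A being the cleaner group.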
In addition, to verify the correctness of the method proposed by the application, the noise data in each set can be spot-checked, for example by executing the following SPARQL statement on DBpedia:
SELECT ?x WHERE {
?x a dbpedia-owl:Person .
?x a dbpedia-owl:Organization .
}
The above statement retrieves from DBpedia the instances that belong to both the Person class and the Organization class; an instance here refers to a specific person or organization. By common sense, Person and Organization are disjoint, that is, an instance of Person cannot also be an instance of Organization. Nevertheless, the SPARQL query returns a series of results, such as Jordanhill_College. This shows that DBpedia contains instances belonging to both classes, so Jordanhill_College can be regarded as noise data. The data in a data set can be verified in this way.
(4) quality assessment modules
The application uses Z(O) to describe the correctness of the data quality and M(O) to describe the degree of mixing within classes, where O denotes an ontology and Cj denotes the j-th class. The higher the degree of mixing, the larger M(O); the more correct the is-a relations in the data, the larger Z(O).
The quality evaluation of the Type assertions in each concept is completed with the above formulas, and the evaluation data and logs are output for reference.
The is-a relation in an ontology reflects the relation between instances and classes. Instances are the basis of the other axioms in an ontology, and the instances in most ontologies are obtained by automatic extraction or heterogeneous data integration, so the instance layer contains a large amount of noise data. Such noise data can cause ontology-based applications to obtain wrong data and information. The method of the application provides a way to evaluate the quality of is-a relations in RDF data: the probability that an instance belongs to its class is computed based on class association rules, the degree of mixing of a class is expressed by entropy, and the correctness of the is-a relations is expressed by the mean probability that the instances in a class belong to that class. Together, the two measures reflect the quality of is-a relations in RDF data fairly comprehensively and correctly.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the invention is not limited by the described order of actions, because according to the invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the invention.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the invention.
Embodiment 2
According to an embodiment of the invention, an apparatus for determining a data set, used to implement the above method for determining a data set, is also provided. Fig. 5 is a schematic diagram of an optional apparatus for determining a data set according to an embodiment of the invention. As shown in Fig. 5, the apparatus may include a receiving unit 52, a first determining unit 54, a second determining unit 56 and a third determining unit 58.
Receiving unit 52, for receiving an instruction indicating that a target data set is to be obtained from multiple first data sets, where each first data set includes at least one piece of instance data collected according to a target type, and the data in the target data set are used for data analysis;
First determining unit 54, for determining the target probability of each piece of instance data in each first data set according to a first attribute set, where the target probability is the probability that the instance data belong to the target type, and the first attribute set includes attributes indicating data of the target type;
Second determining unit 56, for determining the acquisition quality information of each first data set based on the target probabilities of all instance data in that data set, where the acquisition quality information indicates the quality of the first data set collected according to the target type;
Third determining unit 58, for determining that a first data set among the multiple first data sets whose acquisition quality information meets a preset quality requirement is the target data set for data analysis.
It should be noted that the receiving unit 52 in this embodiment may be used to perform step S202 in Embodiment 1 of the application, the first determining unit 54 may be used to perform step S204, the second determining unit 56 may be used to perform step S206, and the third determining unit 58 may be used to perform step S208.
It should be noted here that the above modules implement the same examples and application scenarios as their corresponding steps, but are not limited to what is disclosed in Embodiment 1. It should also be noted that the above modules, as part of the apparatus, may run in the hardware environment shown in Fig. 1 and may be implemented in software or in hardware.
With the above modules, when an instruction indicating that a target data set is to be obtained from multiple first data sets is received, the probability that the instance data in each first data set belong to the target type is determined using the first attribute set; the acquisition quality information of each first data set is then determined based on the target probabilities of all instance data in it; and a target data set meeting the preset quality requirement is selected for data analysis. This solves the technical problem in the related art that data sets of higher quality cannot be obtained, achieves the technical effect of obtaining data sets of higher quality, and ensures the reliability of data analysis results.
The above data set consists of instance data that are collected according to a target type (i.e., concept or class) and satisfy the is-a relation. The instance data may be obtained by automatic extraction or by heterogeneous data integration; for example, the knowledge base DBpedia obtains instance data by extracting pages of Wikipedia.
The above data analysis refers to discovering and revealing the rules hidden in the data by mining and processing the data.
The attribute information included in the first attribute set is attribute information that can be used to describe the target type; from this attribute information it can be judged whether instance data belong to the target type.
The acquisition quality information may be information describing the acquisition quality of all instance data in the first data set, such as accuracy, distribution and degree of mixing.
Optionally, the first determining unit is further configured to obtain a second data set before determining the target probability of each piece of instance data in each first data set according to the first attribute set, where each piece of data in the second data set belongs to the target type, and to obtain the first attribute set by performing data mining on the second data set.
With the technical solution of the application, the quality of is-a relations (i.e., target types or concepts) in an ontology can be evaluated. The main challenge in this process is how to find the instances with erroneous is-a relations in the ontology. After careful study, the applicant found that each concept Ci has exactly one attribute set Pc = {p1, p2, ..., pn}, where Pc is a subset of the attributes P in the knowledge base. There then exists at least one subset DPc of Pc that can be used to describe concept Ci, and DPc may be called a judgment attribute set (i.e., a first attribute set). If the attributes of an instance belong to the judgment attribute set of some concept Ci, the instance is likely to belong to Ci; otherwise, the instance is likely to be noise data. For example, an instance of Country usually has a Capital (i.e., capital city) attribute, and an instance of Person usually has a Birthday attribute. By common sense, a country has a capital and a person has a birthday; if an instance of Country has a Birthday attribute, the instance is very likely to be noise data.
A class association rule mining algorithm may be used to compute the judgment attribute sets of each class; instances are then matched against concepts using the matching rule, and the confidence of the matched judgment attribute set is taken as the posterior probability that the instance belongs to the concept, i.e., the target probability.
Define an instance E(a1, a2, ..., an), where ai is an attribute of instance E; the probability that E belongs to class Ci (i.e., the target type) is then p(Ci|E) = p(Ci|a1, a2, ..., an). The value p(Ci|a1, a2, ..., an) can be obtained by statistics. However, because the data themselves are unreliable and contain atypical attributes, direct statistics would produce large errors. An atypical attribute is one that occurs very rarely and cannot express any class. For this situation, the application proposes using class association rules to find the association rule sets (i.e., first attribute sets) that best represent class Ci; these attribute sets are called judgment attribute sets. The association rule (first attribute set) closest to the instance, (s1, s2, ..., sn), is then found according to a matching rule, and its confidence is the closest to the true p(Ci|E).
Association rule mining algorithms include the Apriori algorithm and FP-tree. Both can mine strong association rules (i.e., first attribute sets) and their confidences, but they may also compute association rules unrelated to class Ci, causing redundant information and extra memory overhead. To overcome this problem, the application preferably uses the CAR-Apriori algorithm, a class association rule mining algorithm, which mines only the association rules related to class Ci and their confidences; the confidence is taken as the probability that an instance belongs to the class.
The CAR-Apriori algorithm can mine the association rules of a specified class; by adjusting the support, the attribute sets that can represent the class and their confidences are mined. Although the number of attributes of an instance in an ontology is not large, under the basic idea of the CAR-Apriori algorithm different attributes can form combinations of different frequencies, and the number of combinations grows exponentially. To reduce overhead, part of the data is filtered in actual processing: attributes that occur with high frequency in every class are filtered out, because such attributes provide no information about which type an instance belongs to, similar to stop words in text mining such as "is".
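To make the notions of support and confidence for class rules concrete, the following Python sketch enumerates attribute combinations by brute force. It is not an implementation of CAR-Apriori: it omits the Apriori-style candidate pruning and the filtering of attributes that are frequent in every class. The function name, the parameters and the sample transactions (taken from Table 1) are assumptions made for illustration.

```python
from itertools import combinations


def mine_class_rules(transactions, target_class, min_support=0.5, max_size=2):
    """Map each attribute combination to the confidence of 'combination -> target_class'.

    transactions is a list of (attribute set, class label) pairs.
    Support: share of all transactions that contain the combination and
    carry the target class. Confidence: share of the transactions that
    contain the combination which carry the target class.
    """
    total = len(transactions)
    attributes = sorted({a for attrs, _ in transactions for a in attrs})
    rules = {}
    for size in range(1, max_size + 1):
        for combo in combinations(attributes, size):
            wanted = set(combo)
            covered = [cls for attrs, cls in transactions if wanted <= set(attrs)]
            if not covered:
                continue
            hits = sum(1 for cls in covered if cls == target_class)
            if hits / total >= min_support:
                rules[combo] = hits / len(covered)
    return rules


# Transactions mirroring Table 1.
TRANSACTIONS = [
    ({"name", "Birthday", "height", "weight"}, "Person"),
    ({"name"}, "Place"),
    ({"name", "Birthday", "height", "weight"}, "Person"),
    ({"name", "Birthday", "height"}, "Person"),
]

PERSON_RULES = mine_class_rules(TRANSACTIONS, "Person")
for attrs, confidence in sorted(PERSON_RULES.items()):
    print(attrs, confidence)
```

On this data the attribute name alone gives a confidence of 0.75 for Person, because the Place instance also has a name, while Birthday gives a confidence of 1.0; this is the kind of attribute a judgment attribute set is meant to capture.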
Optionally, as shown in Fig. 6, the first determining unit 54 includes: an acquisition module 542, for obtaining the second attribute set of each piece of instance data, where the second attribute set includes the attribute information that the instance data have; and a first determining module 544, for determining the target probability of each piece of instance data based on the first attribute set and the second attribute set of each piece of instance data.
It should be noted that there is at least one first attribute set, and the first determining module includes: an acquisition submodule, for obtaining the confidence of each first attribute set, where the confidence indicates the probability that data having all the attribute information in the first attribute set belong to the target type; a determining submodule, for determining the target matching degree of the instance data relative to each first attribute set by matching the second attribute set of the instance data against each first attribute set; and a processing submodule, for taking the confidence of the first attribute set corresponding to the largest of the at least one target matching degree as the target probability of the instance data.
The determining submodule is further configured to: determine a first matching degree between the attribute information in the second attribute set of the instance data and the attribute information in the first attribute set; determine a second matching degree between the attribute information in the first attribute set and the attribute information in the second attribute set of the instance data; and determine the target matching degree of the instance data relative to the first attribute set according to the first matching degree and the second matching degree.
Specifically, the determining submodule determines the first matching degree as follows: it determines the number of pieces of target attribute information in the second attribute set that match attribute information in the first attribute set, for example the number of pieces of attribute information in the second attribute set that are identical to attribute information in the first attribute set, and takes the ratio of the number of pieces of target attribute information to the number of pieces of attribute information in the first attribute set as the first matching degree.
The determining submodule determines the second matching degree as follows: it determines the number of pieces of target attribute information in the first attribute set that match attribute information in the second attribute set, for example the number of pieces of attribute information in the first attribute set that are identical to attribute information in the second attribute set, and takes the ratio of the number of pieces of target attribute information to the number of pieces of attribute information in the second attribute set as the second matching degree.
Optionally, as shown in Fig. 7, the second determining unit 56 includes: a second determining module 562, for determining a first average value of the target probabilities of all instance data in the first data set, where the first average value indicates the accuracy of the data collected according to the target type; and a third determining module 564, for determining the entropy of the target probabilities of all instance data in the first data set, where the entropy indicates the degree of mixing of the data collected according to the target type. The acquisition quality information includes the first average value and/or the entropy.
The application proposes two measures for evaluating the quality of is-a relations within a concept. The first uses the average probability Z(Ci) to assess the correctness of the is-a relations, computed as in Embodiment 1.
It should be noted that Z(Ci) cannot reflect the distribution of the data; that is, it cannot express the degree of mixing of the instances in the class.
Therefore, the application further proposes using information entropy, denoted M(Ci), to reflect the distribution of the data: the higher the degree of mixing, the larger the entropy. The probabilities that instances belong to a class are partitioned into several intervals; the probability of falling in the i-th interval is denoted qi, the number of intervals is n, and M(Ci) is computed from these values as in Embodiment 1.
With the rapid development of Semantic Web technology, ontologies have been applied in more and more fields, and ontology evaluation has become an essential part of Semantic Web technology. The application proposes a method for evaluating the quality of is-a relations in RDF data: the probability that an instance belongs to its class is computed based on class association rules, the degree of mixing of a class is expressed by entropy, and the correctness of the is-a relations is expressed by the mean probability that the instances in a class belong to that class. Together, the two measures reflect the quality of is-a relations in RDF data fairly comprehensively and correctly. This gives ontology builders an evaluation reference for finding problems in the knowledge base, and gives ontology users a reference for selecting the "best" ontology.
Optionally, the third determining unit is further configured to take, as the target data set, a first data set among the multiple first data sets whose first average value reaches the first preset value and/or whose entropy reaches the second preset value.
The method of the application can also be used to evaluate the quality of is-a relations in RDF. Specifically, after the acquisition quality information of each first data set is determined based on the target probabilities of all instance data in that data set, a second average value of the first average values of the multiple first data sets is obtained, where the data in the multiple first data sets satisfy a preset relation, the preset relation indicates data and the type to which the data belong, and the second average value indicates the accuracy of the data collected according to the preset relation; and a third average value of the entropies of the multiple first data sets is obtained, where the third average value indicates the degree of mixing of the data collected according to the preset relation.
The application proposes a method based on class association rule mining to evaluate the quality of is-a relations in an ontology, and proposes two measures to evaluate the quality of classes and concepts. The aim is to ensure ontology quality and enable effective ontology maintenance, thereby providing a reference for ontology builders and a basis for selection for ontology users.
Herein it should be noted that above-mentioned module is identical with example and application scenarios that the step of correspondence is realized, but not It is limited to the disclosure of that of above-described embodiment 1.It should be noted that above-mentioned module as a part for device may operate in as In hardware environment shown in Fig. 1, can be realized by software, it is also possible to realized by hardware, wherein, hardware environment includes network Environment.
Embodiment 3
According to an embodiment of the present invention, a server or terminal for implementing the above method for determining a data set is further provided.
Fig. 8 is a structural block diagram of a terminal according to an embodiment of the present invention. As shown in Fig. 8, the terminal may include one or more processors 801 (only one is shown in the figure), a memory 803, and a transmitting device 805 (such as the sending device in the above embodiments), and may further include an input/output device 807.
The memory 803 may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and device in the embodiments of the present invention. The processor 801 runs the software programs and modules stored in the memory 803 to perform various functional applications and data processing, that is, to implement the above method. The memory 803 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 803 may further include memory located remotely from the processor 801, and such remote memory may be connected to the terminal through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmitting device 805 is used to receive or send data via a network, and may also be used for data transmission between the processor and the memory. Specific examples of the network include wired networks and wireless networks. In one example, the transmitting device 805 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices and a router via a network cable so as to communicate with the Internet or a local area network. In another example, the transmitting device 805 is a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
Specifically, the memory 803 is used to store an application program.
The processor 801 can call, through the transmitting device 805, the application program stored in the memory 803 to perform the following steps: receiving an instruction for instructing acquisition of a target data set from multiple first data sets, where each first data set includes at least one piece of instance data collected according to a target type, and the data of the target data set is used for data analysis; determining, according to a first attribute set, a target probability of each piece of instance data in each first data set, where the target probability is the probability that the instance data belongs to the target type, and the first attribute set includes attributes that indicate data of the target type; determining collection quality information of each first data set based on the target probabilities of all instance data in that first data set, where the collection quality information indicates the quality of the first data set collected according to the target type; and determining a first data set, among the multiple first data sets, whose collection quality information meets a preset quality requirement as the target data set used for data analysis.
The processor 801 is further configured to perform the following steps: obtaining a second attribute set of each piece of instance data, where the second attribute set includes the attribute information that the instance data has; and determining the target probability of each piece of instance data based on the first attribute set and the second attribute set of that instance data.
With this embodiment of the present invention, when an instruction for instructing acquisition of a target data set from multiple first data sets is received, the probability that the instance data in each first data set belongs to the target type is determined by means of the first attribute set. The collection quality information of each first data set is then determined based on the target probabilities of all instance data in that first data set, and a target data set that meets the preset quality requirement is selected from the first data sets for use in data analysis. This solves the technical problem in the related art that a data set of relatively high quality cannot be obtained, achieves the technical effect of obtaining a data set of relatively high quality, and ensures the reliability of the data analysis results.
Optionally, for specific examples in this embodiment, refer to the examples described in Embodiment 1 and Embodiment 2 above; they are not repeated here.
Those of ordinary skill in the art will understand that the structure shown in Fig. 8 is only illustrative. The terminal may be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. Fig. 8 does not limit the structure of the above electronic device. For example, the terminal may include more or fewer components (such as a network interface or a display device) than shown in Fig. 8, or have a configuration different from that shown in Fig. 8.
Those of ordinary skill in the art will understand that all or part of the steps in the methods of the above embodiments can be completed by a program instructing hardware related to the terminal device. The program can be stored in a computer-readable storage medium, and the storage medium may include a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
Embodiment 4
An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store program code for executing the method for determining a data set.
Optionally, in this embodiment, the storage medium may be located on at least one of multiple network devices in the network shown in the above embodiments.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
S11, receiving an instruction for instructing acquisition of a target data set from multiple first data sets, where each first data set includes at least one piece of instance data collected according to a target type, and the data of the target data set is used for data analysis;
S12, determining, according to a first attribute set, a target probability of each piece of instance data in each first data set, where the target probability is the probability that the instance data belongs to the target type, and the first attribute set includes attributes that indicate data of the target type;
S13, determining collection quality information of each first data set based on the target probabilities of all instance data in that first data set, where the collection quality information indicates the quality of the first data set collected according to the target type;
S14, determining a first data set, among the multiple first data sets, whose collection quality information meets a preset quality requirement as the target data set used for data analysis.
Optionally, the storage medium is further configured to store program code for performing the following steps:
S21, obtaining a second attribute set of each piece of instance data, where the second attribute set includes the attribute information that the instance data has;
S22, determining the target probability of each piece of instance data based on the first attribute set and the second attribute set of that instance data.
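Steps S21 and S22 can be sketched as follows, drawing on the claims: each first attribute set carries a confidence, the instance's second attribute set is matched against every first attribute set, and the confidence of the best-matching first attribute set becomes the target probability. The first matching degree follows the ratio stated in the claims. The second matching degree (ratio to the size of the instance's own attribute set) and the combination of the two by multiplication are assumptions made for illustration, as are all names and the sample attributes.

```python
from typing import FrozenSet, Iterable, Tuple

# A mined rule: (first attribute set, confidence that matching data has the target type).
Rule = Tuple[FrozenSet[str], float]


def matching_degree(instance_attrs: FrozenSet[str], rule_attrs: FrozenSet[str]) -> float:
    """Target matching degree of an instance relative to one first attribute set."""
    if not instance_attrs or not rule_attrs:
        return 0.0
    common = len(instance_attrs & rule_attrs)
    first = common / len(rule_attrs)       # ratio stated in the claims
    second = common / len(instance_attrs)  # assumed definition
    return first * second                  # assumed way of combining the two degrees


def target_probability(instance_attrs: FrozenSet[str], rules: Iterable[Rule]) -> float:
    """Confidence of the first attribute set with the largest matching degree."""
    best_degree, best_confidence = -1.0, 0.0
    for rule_attrs, confidence in rules:
        degree = matching_degree(instance_attrs, rule_attrs)
        if degree > best_degree:  # on ties, the earlier rule is kept
            best_degree, best_confidence = degree, confidence
    return best_confidence


# Usage with hypothetical attribute names:
rules = [
    (frozenset({"name", "birthDate"}), 0.9),
    (frozenset({"name", "population"}), 0.2),
]
person_like = frozenset({"name", "birthDate", "spouse"})
print(target_probability(person_like, rules))
```

With no rules at all, this sketch returns 0.0; whether that default is appropriate depends on how the first attribute sets are mined.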
Optionally, for specific examples in this embodiment, refer to the examples described in Embodiment 1 and Embodiment 2 above; they are not repeated here.
Optionally, in this embodiment, the storage medium may include, but is not limited to, any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
If the integrated unit in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, refer to the related description of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection between units or modules through some interfaces, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (16)

1. A method for determining a data set, characterized by comprising:
receiving an instruction for instructing acquisition of a target data set from multiple first data sets, wherein each first data set includes at least one piece of instance data collected according to a target type, and the data of the target data set is used for data analysis;
determining, according to a first attribute set, a target probability of each piece of instance data in each first data set, wherein the target probability is the probability that the instance data belongs to the target type, and the first attribute set includes attributes that indicate data of the target type;
determining collection quality information of each first data set based on the target probabilities of all instance data in that first data set, wherein the collection quality information indicates the quality of the first data set collected according to the target type;
determining a first data set, among the multiple first data sets, whose collection quality information meets a preset quality requirement as the target data set used for data analysis.
2. The method according to claim 1, characterized in that determining, according to the first attribute set, the target probability of each piece of instance data in each first data set comprises:
obtaining a second attribute set of each piece of instance data, wherein the second attribute set includes the attribute information that the instance data has;
determining the target probability of each piece of instance data based on the first attribute set and the second attribute set of that instance data.
3. The method according to claim 2, characterized in that there is at least one first attribute set,
before determining the target probability of each piece of instance data based on the first attribute set and the second attribute set of that instance data, the method further comprises: obtaining a confidence of each first attribute set, wherein the confidence indicates the probability that data having all the attribute information in the first attribute set belongs to the target type;
determining the target probability of each piece of instance data based on the first attribute set and the second attribute set of that instance data comprises: matching the second attribute set of the instance data against each first attribute set to determine a target matching degree of the instance data relative to each first attribute set; and taking the confidence of the first attribute set corresponding to the largest of the at least one target matching degree as the target probability of the instance data.
4. The method according to claim 3, characterized in that matching the second attribute set of the instance data against each first attribute set to determine the target matching degree of the instance data relative to each first attribute set comprises:
determining a first matching degree of the attribute information in the second attribute set of the instance data with the attribute information in the first attribute set;
determining a second matching degree of the attribute information in the first attribute set with the attribute information in the second attribute set of the instance data;
determining the target matching degree of the instance data relative to the first attribute set according to the first matching degree and the second matching degree.
5. The method according to claim 4, characterized in that determining the first matching degree of the attribute information in the second attribute set of the instance data with the attribute information in the first attribute set comprises:
determining the quantity of target attribute information in the second attribute set that matches attribute information in the first attribute set;
taking the ratio of the quantity of the target attribute information to the quantity of attribute information in the first attribute set as the first matching degree.
6. The method according to claim 1, characterized in that determining the collection quality information of each first data set based on the target probabilities of all instance data in that first data set comprises at least one of the following:
determining a first average value of the target probabilities of all instance data in the first data set, wherein the first average value indicates the accuracy of the data collected according to the target type;
determining an entropy of the target probabilities of all instance data in the first data set, wherein the entropy indicates the impurity of the data collected according to the target type, and the collection quality information includes the first average value and/or the entropy.
7. The method according to claim 6, characterized in that determining the entropy of the target probabilities of all instance data in the first data set comprises:
determining the entropy by performing a logarithm operation on the target probabilities of all instance data in the first data set.
8. The method according to claim 6, characterized in that determining a first data set, among the multiple first data sets, whose collection quality information meets the preset quality requirement as the target data set used for data analysis comprises:
taking, as the target data set, a first data set among the multiple first data sets whose first average value reaches a first preset value and/or whose entropy reaches a second preset value.
9. The method according to claim 6, characterized in that, after determining the collection quality information of each first data set based on the target probabilities of all instance data in that first data set, the method further comprises:
obtaining a second average value of the first average values of the multiple first data sets, wherein the data in the multiple first data sets satisfy a preset relation, the preset relation indicates data and the type to which the data belongs, and the second average value indicates the accuracy of the data collected according to the preset relation;
obtaining a third average value of the entropies of the multiple first data sets, wherein the third average value indicates the impurity of the data collected according to the preset relation.
10. The method according to claim 1, characterized in that, before determining, according to the first attribute set, the target probability of each piece of instance data in each first data set, the method further comprises:
obtaining a second data set, wherein each piece of data in the second data set belongs to the target type;
performing data mining on the second data set to obtain the first attribute set.
11. A device for determining a data set, characterized by comprising:
a receiving unit, configured to receive an instruction for instructing acquisition of a target data set from multiple first data sets, wherein each first data set includes at least one piece of instance data collected according to a target type, and the data of the target data set is used for data analysis;
a first determining unit, configured to determine, according to a first attribute set, a target probability of each piece of instance data in each first data set, wherein the target probability is the probability that the instance data belongs to the target type, and the first attribute set includes attributes that indicate data of the target type;
a second determining unit, configured to determine collection quality information of each first data set based on the target probabilities of all instance data in that first data set, wherein the collection quality information indicates the quality of the first data set collected according to the target type;
a third determining unit, configured to determine a first data set, among the multiple first data sets, whose collection quality information meets a preset quality requirement as the target data set used for data analysis.
12. The device according to claim 11, characterized in that the first determining unit comprises:
an obtaining module, configured to obtain a second attribute set of each piece of instance data, wherein the second attribute set includes the attribute information that the instance data has;
a first determining module, configured to determine the target probability of each piece of instance data based on the first attribute set and the second attribute set of that instance data.
13. The device according to claim 12, characterized in that there is at least one first attribute set, and the first determining module comprises:
an obtaining submodule, configured to obtain a confidence of each first attribute set, wherein the confidence indicates the probability that data having all the attribute information in the first attribute set belongs to the target type;
a determining submodule, configured to match the second attribute set of the instance data against each first attribute set to determine a target matching degree of the instance data relative to each first attribute set;
a processing submodule, configured to take the confidence of the first attribute set corresponding to the largest of the at least one target matching degree as the target probability of the instance data.
14. The device according to claim 13, characterized in that the determining submodule is further configured to:
determine a first matching degree of the attribute information in the second attribute set of the instance data with the attribute information in the first attribute set;
determine a second matching degree of the attribute information in the first attribute set with the attribute information in the second attribute set of the instance data;
determine the target matching degree of the instance data relative to the first attribute set according to the first matching degree and the second matching degree.
15. The device according to claim 11, characterized in that the second determining unit comprises:
a second determining module, configured to determine a first average value of the target probabilities of all instance data in the first data set, wherein the first average value indicates the accuracy of the data collected according to the target type;
a third determining module, configured to determine an entropy of the target probabilities of all instance data in the first data set, wherein the entropy indicates the impurity of the data collected according to the target type, and the collection quality information includes the first average value and/or the entropy.
16. The device according to claim 15, characterized in that the third determining unit is further configured to take, as the target data set, a first data set among the multiple first data sets whose first average value reaches a first preset value and/or whose entropy reaches a second preset value.
CN201710069739.1A 2017-02-08 2017-02-08 Data set determination method and device Active CN106844718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710069739.1A CN106844718B (en) 2017-02-08 2017-02-08 Data set determination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710069739.1A CN106844718B (en) 2017-02-08 2017-02-08 Data set determination method and device

Publications (2)

Publication Number Publication Date
CN106844718A true CN106844718A (en) 2017-06-13
CN106844718B CN106844718B (en) 2022-04-26

Family

ID=59122086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710069739.1A Active CN106844718B (en) 2017-02-08 2017-02-08 Data set determination method and device

Country Status (1)

Country Link
CN (1) CN106844718B (en)


Also Published As

Publication number Publication date
CN106844718B (en) 2022-04-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant