CN106844718B

CN106844718B - Data set determination method and device

Info

Publication number: CN106844718B
Application number: CN201710069739.1A
Authority: CN
Inventors: 何彬彬
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-02-08
Filing date: 2017-02-08
Publication date: 2022-04-26
Anticipated expiration: 2037-02-08
Also published as: CN106844718A

Abstract

The invention discloses a method and a device for determining a data set. The method comprises the following steps: receiving an instruction for instructing to obtain a target data set from a plurality of first data sets, wherein the data of the target data set is used for data analysis; determining a target probability of each piece of instance data in each first data set according to a first attribute set, wherein the target probability is the probability that the instance data belongs to a target type, and the first attribute set comprises attributes used for indicating that data is of the target type; determining acquisition quality information of each first data set based on the target probabilities of all the instance data in that first data set, wherein the acquisition quality information is used for indicating the quality of the first data set acquired according to the target type; and determining, as the target data set for data analysis, a first data set among the plurality of first data sets whose acquisition quality information meets a preset quality requirement. The invention solves the technical problem that a data set of higher quality cannot be acquired in the related art.

Description

Data set determination method and device

Technical Field

The invention relates to the field of data analysis, in particular to a method and a device for determining a data set.

Background

Data analysis refers to the process of analyzing a large amount of collected data with appropriate statistical methods, extracting useful information, and forming conclusions, so as to study and summarize the data in detail. Data analysis is closely related to computer science and is carried out through many methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems (relying on past rules of thumb), and pattern recognition.

The data analysis has an extremely wide application range. A typical data analysis may include the following steps:

Step 1, data acquisition: a plurality of data sets are acquired in a set manner, and data analysis is then carried out using one or more of the data sets with higher confidence.

Step 2, exploratory data analysis: when data is first obtained it may be disordered, with no visible regularity. Possible forms of regularity are explored by plotting, tabulation, fitting equations of various forms, calculating certain characteristic quantities, and the like, that is, by determining in what direction and in what way to find and reveal the regularity hidden in the data.

Step 3, model selection and analysis: one or more classes of possible models are proposed on the basis of the exploratory analysis, and a model is then selected from them through further analysis.

Step 4, inferential analysis: mathematical statistics is usually used to infer the reliability and accuracy of the determined model or estimate.

Step 1 is particularly important in the whole data analysis process, and accurate data analysis results can be obtained only by selecting data with higher confidence.

After data acquisition is completed, a plurality of data sets are obtained. Selecting one or more higher-quality data sets from them is crucial to data analysis: if data containing a lot of noise is selected, the analysis will directly produce wrong results. Currently, data is mainly selected at random or by the user according to experience, so lower-quality data may be selected.

No effective solution has yet been proposed for the problem that a high-quality data set cannot be acquired in the related art.

Disclosure of Invention

The embodiment of the invention provides a method and a device for determining a data set, which are used for at least solving the technical problem that a data set with higher quality cannot be acquired in the related technology.

According to an aspect of an embodiment of the present invention, there is provided a method for determining a data set, including: receiving an instruction for instructing to acquire a target data set from a plurality of first data sets, wherein each first data set comprises at least one piece of instance data collected according to a target type, and the data of the target data set is used for data analysis; determining a target probability of each piece of instance data in each first data set according to a first attribute set, wherein the target probability is the probability that the instance data belongs to the target type, and the first attribute set comprises attributes used for indicating that data is of the target type; determining acquisition quality information of each first data set based on the target probabilities of all the instance data in that first data set, wherein the acquisition quality information is used for indicating the quality of the first data set acquired according to the target type; and determining, as the target data set for data analysis, a first data set among the plurality of first data sets whose acquisition quality information meets a preset quality requirement.

According to another aspect of the embodiments of the present invention, there is also provided a device for determining a data set, including: a receiving unit, configured to receive an instruction for instructing to acquire a target data set from a plurality of first data sets, wherein each first data set comprises at least one piece of instance data collected according to a target type, and the data of the target data set is used for data analysis; a first determining unit, configured to determine a target probability of each piece of instance data in each first data set according to a first attribute set, wherein the target probability is the probability that the instance data belongs to the target type, and the first attribute set comprises attributes used for indicating that data is of the target type; a second determining unit, configured to determine acquisition quality information of each first data set based on the target probabilities of all the instance data in that first data set, wherein the acquisition quality information is used for indicating the quality of the first data set acquired according to the target type; and a third determining unit, configured to determine, as the target data set for data analysis, a first data set among the plurality of first data sets whose acquisition quality information meets a preset quality requirement.

In the embodiment of the invention, when an instruction for instructing to acquire a target data set from a plurality of first data sets is received, the probability that example data in the first data sets belong to a target type is determined through a first attribute set, then acquisition quality information of the first data sets is determined based on the target probabilities of all the example data in the first data sets, and the target data sets meeting preset quality requirements are selected for data analysis.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic diagram of a hardware environment for a data set determination method according to an embodiment of the invention;

FIG. 2 is a flow chart of an alternative method of determining a data set in accordance with an embodiment of the present invention;

FIG. 3 is a schematic diagram of software modules of a method for determining a data set according to an embodiment of the invention;

FIG. 4 is a flow chart of an alternative method of determining a data set in accordance with an embodiment of the present invention;

FIG. 5 is a schematic diagram of an alternative data set determination apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an alternative data set determination apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an alternative data set determination apparatus according to an embodiment of the present invention; and

FIG. 8 is a block diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, some of the terms appearing in the description of the embodiments of the present invention are explained as follows:

Ontology: an ontology has two formal definitions, a five-tuple and a seven-tuple. The five-tuple is defined as O = (C, R, H^C, Rel, A^O), where C is a set of concepts, R is a set of relationships, H^C represents the hierarchy of concepts, Rel represents the relationships between concepts, and A^O represents the axioms of the ontology. The seven-tuple is defined as O = (C, A^C, R, A^R, H, I, X), where C is a set of concepts, A^C is a set of concept attributes, R is a set of relationships, A^R is a set of relationship attributes, H represents a set of hierarchies, I is a set of instances, and X is a set of axioms.

Concepts, also called classes, are collections of objects with the same properties in a certain domain, such as animals, humans, and organizations. In RDF, classes are defined by the predefined attribute "RDF: Class". In addition, RDFS (Resource Description Framework Schema) provides predefined classes to represent some simple data types, such as integer (xs), string (xs), etc.

Instances are specializations of a certain concept or class; e.g., Obama is an instance of the concept "human".

RDF (Resource Description Framework): in 1999 the World Wide Web Consortium (W3C) introduced the Resource Description Framework (RDF), a standard language for describing web resources. It is a major ontology description language that provides an information description specification for various applications on the internet. RDF describes resources on the Web in the form of triples "<subject, predicate, object>", has become one of the standards for ontology description, and is widely used in the description of the semantic Web and metadata.

The is-a relationship: generally speaking, an RDF knowledge base is divided into two parts, the TBox and the ABox. The TBox expresses the relationships between concepts in the knowledge base, and the is-a relationship in the TBox expresses the superior-inferior relationship between concepts, i.e., the subclass-of relationship; for example, subclass-of (Writer, Person) expresses that "Writer" is a subclass of "Person". Unlike the TBox, the ABox mainly contains the relationships between instances, and the is-a relationship in the ABox indicates that an instance belongs to a certain concept, i.e., the instance-of relationship; for example, the fact that Tom is an instance of Person is usually expressed as instance-of (Tom, Person). The subclass-of relationship is an abstraction that formally expresses the hierarchy between core concepts. The instance-of relationship reflects the relationship between an instance and its category, and is the basis of the connection between the concept layer and the instance layer. Therefore, the is-a relationship in the ontology is the basis of some key technologies, such as inference and consistency detection.

It should be noted that the is-a relationship in the ontology reflects the relationship between an instance and its category, and is the basis of the connection between the concept layer and the instance layer. For example, if Tom is a Person, then Tom being an instance of Person is usually expressed as "Tom is-a Person"; Tom is not an instance of Organization, so Tom can be called a counterexample of Organization.

In RDF, "Instance is-a Type" is the commonly used expression for Type(a) assertions in ontologies. In the unary relationship Type(a), a represents instance information in the knowledge base and Type represents category or concept information in the knowledge base; such an expression is called a Type assertion.
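As an informal illustration of the terms above, the following Python sketch stores subclass-of and instance-of facts as triples and derives every concept an instance belongs to. The function name `types_of` and the fact that Tom is a Writer are hypothetical and chosen only for the example; the patent does not prescribe any particular storage format.

```python
# Hypothetical TBox (concept-level) and ABox (instance-level) facts as triples.
TBOX = {("Writer", "subclass-of", "Person")}
ABOX = {("Tom", "instance-of", "Writer")}


def types_of(instance, abox, tbox):
    """Return all concepts the instance belongs to, following subclass-of upward."""
    found = {o for s, p, o in abox if s == instance and p == "instance-of"}
    frontier = list(found)
    while frontier:
        concept = frontier.pop()
        for s, p, o in tbox:
            if s == concept and p == "subclass-of" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


tom_types = types_of("Tom", ABOX, TBOX)
```

Here Tom is-a Writer directly and is-a Person by inference, while Organization is absent from the result, which corresponds to Tom being a counterexample of Organization.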

Example 1

According to an embodiment of the present invention, a method embodiment of a method for determining a data set is provided.

Optionally, in the present embodiment, the above-described method for determining a data set may be applied to a hardware environment formed by the server 102 and the terminal 104 as shown in fig. 1. As shown in fig. 1, the server 102 is connected to the terminal 104 via a network, which includes but is not limited to a wide area network, a metropolitan area network, or a local area network; the terminal 104 includes but is not limited to a PC, a mobile phone, a tablet computer, etc. The method of the embodiment of the present invention may be executed by the server 102, by the terminal 104, or by both the server 102 and the terminal 104. When the terminal 104 executes the method of the embodiment of the present invention, it may do so through a client installed on it.

Fig. 2 is a flowchart of an alternative data set determining method according to an embodiment of the present invention, and as shown in fig. 2, the method may include the following steps:

Step S202, receiving an instruction for instructing to acquire a target data set from a plurality of first data sets, wherein each first data set comprises at least one piece of instance data acquired according to a target type, and the data of the target data set is used for data analysis;

Step S204, determining a target probability of each piece of instance data in each first data set according to a first attribute set, wherein the target probability is the probability that the instance data belongs to the target type, and the first attribute set comprises attributes used for indicating that data is of the target type;

Step S206, determining the acquisition quality information of each first data set based on the target probabilities of all the instance data in that first data set, wherein the acquisition quality information is used for indicating the quality of the first data set acquired according to the target type;

Step S208, determining, as the target data set for data analysis, a first data set among the plurality of first data sets whose acquisition quality information meets a preset quality requirement.

Through the steps S202 to S208, when an instruction for instructing to acquire a target data set from a plurality of first data sets is received, the probability that example data in the first data set belongs to a target type is determined through the first attribute set, then the acquisition quality information of the first data set is determined based on the target probabilities of all example data in the first data set, and a target data set meeting the preset quality requirement is selected for data analysis, so that the technical problem that a data set with higher quality cannot be acquired in the related art can be solved, the technical effect of acquiring the data set with higher quality is achieved, and the reliability of the data analysis result is ensured.

The data set is instance data that is acquired according to a target type (namely a concept or a class) and satisfies the is-a relationship. The instance data may be extracted automatically or acquired through heterogeneous data integration; for example, the knowledge base DBpedia acquires instance data by extracting Wikipedia pages.

The data analysis refers to the process of mining and processing the data to find and reveal the rules hidden in the data.

The attribute information included in the first attribute set is attribute information that can be used to describe the target type, and it can be determined whether the instance data belongs to the target type through the attribute information.

The above-mentioned acquisition quality information may be information describing acquisition quality of all the example data in the first data set, such as acquisition accuracy, distribution, degree of mixing, and the like.

In the embodiment of the application, the method can be used in data processing to screen out data of better acquisition quality from multiple data sets. It mainly comprises the following steps: mining one or more judgment attribute sets for each concept C through classification association rule mining, and calculating the confidence that each judgment attribute set belongs to concept C; matching the attributes of each instance against the judgment attribute sets of each concept C to obtain the confidence of the is-a relationship of each instance; and evaluating the quality of the concepts in the ontology by the two proposed measures. Embodiments of the present application are detailed below with reference to fig. 2:

In the technical solution provided in step S202, when a user performs data analysis, a data set with better acquisition quality must first be obtained. The obtaining may be automatic, that is, the computer receives an instruction for instructing to obtain a target data set from a plurality of first data sets.

In the technical solution provided in step S204, before determining a target probability of each instance data in each first data set according to the first attribute set, a second data set is obtained, where each data in the second data set belongs to a target type; and performing data mining on the second data set to obtain a first attribute set.

With the technical solution of the application, the quality of the is-a relationship (namely the target type or concept) in an ontology can be evaluated. The main challenge in this process is how to find the instances whose is-a relationship in the ontology is wrong. Through careful research the applicant found that each concept C_i has one and only one attribute set P_c = {p_1, p_2, …, p_n}, where P_c is a subset of the attributes P in the knowledge base, so there is at least one subset DP_c of P_c that can be used to describe the concept C_i. DP_c is called a judgment attribute set (i.e., a first attribute set). If the attributes of an instance belong to the judgment attribute set of a concept C_i, the instance is likely to belong to C_i; otherwise the instance is likely to be noisy data. For example, a Country instance usually has the attribute Capital, and a Person instance usually has the attribute Birthday. Common sense says that a country has a capital and a person has a birthday, so a Country instance that has the attribute Birthday is very likely to be noisy data.

The classification association rule mining algorithm can be used for calculating a judgment attribute set of each class, then matching the example with the concept by using the matching rule, and taking the confidence coefficient matched to the judgment attribute set as the posterior probability, namely the target probability, of the example belonging to the concept.

Define an instance E(a_1, a_2, …, a_n), where a_i is an attribute of instance E. The probability that E belongs to class C_i is then p(C_i | E) = p(C_i | a_1, a_2, …, a_n), which can be obtained by statistics. However, because the data itself is unreliable and contains atypical attributes, direct statistics would produce large errors; an atypical attribute is one that occurs with extremely low frequency and cannot express any class. To address this, the application uses classification association rules to find the set of association rules that best represents class C_i (i.e., the first attribute set), referred to as the judgment attribute set. Then, according to a certain matching rule, the association rule (first attribute set) most similar to the instance, denoted (s_1, s_2, …, s_n), is found, and its confidence is the closest approximation to the true p(C_i | E).

Association rule mining algorithms include the Apriori algorithm and FP-tree. Both can mine strong association rules (i.e., first attribute sets) and their confidences, but they may also compute association rules unrelated to class C_i, causing information redundancy and additional memory overhead. To overcome this problem, the application may preferably adopt the CAR-Apriori algorithm, a classification association rule mining algorithm, to mine only the association rules related to class C_i and their confidences, and to take the confidence as the probability that an instance belongs to the category.

The CAR-Apriori algorithm can mine the association rules of a specified class and, by adjusting the support threshold, mine the attribute sets that can represent the class together with their confidences. Although the number of attributes of an instance in an ontology is not large, under the basic idea of the CAR-Apriori algorithm different attributes form frequent combinations of different sizes, and the number of combinations grows exponentially. To reduce system overhead, part of the data is filtered during actual processing, that is, attributes that appear very frequently in every category are filtered out. Such attributes provide no information about which type an instance belongs to, much like stop words such as "yes" and "no" in text mining.
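The mining step can be illustrated with a small Python sketch. It is not the CAR-Apriori algorithm itself: brute-force enumeration of attribute combinations up to `max_size` stands in for Apriori's level-wise candidate generation and pruning, and the function name, the parameters `min_support`, `max_size`, and `stop_fraction`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_class_rules(transactions, target, min_support=0.3, max_size=2,
                     stop_fraction=0.95):
    """Return {attribute set: confidence} for rules 'attribute set -> target'.

    transactions: list of (set of attribute names, class label).
    Support is the share of all transactions that contain the attribute set
    and carry the target label; confidence is the share of transactions
    containing the attribute set that carry the target label.
    """
    n = len(transactions)
    class_sizes = Counter(label for _, label in transactions)
    attr_in_class = Counter(
        (attr, label) for attrs, label in transactions for attr in set(attrs)
    )
    all_attrs = {attr for attrs, _ in transactions for attr in attrs}
    # Filter "stop-word" attributes that are very frequent in every class.
    stop = {
        attr for attr in all_attrs
        if all(attr_in_class[(attr, c)] / class_sizes[c] >= stop_fraction
               for c in class_sizes)
    }
    total = Counter()        # transactions containing the attribute set
    with_target = Counter()  # ... that are also labelled with the target class
    for attrs, label in transactions:
        kept = sorted(set(attrs) - stop)
        for size in range(1, max_size + 1):
            for combo in combinations(kept, size):
                key = frozenset(combo)
                total[key] += 1
                if label == target:
                    with_target[key] += 1
    return {
        key: hits / total[key]
        for key, hits in with_target.items()
        if hits / n >= min_support
    }


# Hypothetical sample: the third Country instance is noisy (it has a birthday).
transactions = [
    ({"name", "capital", "population"}, "Country"),
    ({"name", "capital"}, "Country"),
    ({"name", "birthday"}, "Country"),
    ({"name", "birthday"}, "Person"),
    ({"name", "birthday", "occupation"}, "Person"),
]
rules = mine_class_rules(transactions, "Country")
```

In this sample, "name" occurs in every instance of both classes and is filtered out as a stop attribute, and only the attribute set {capital} is frequent enough to become a judgment attribute set for Country.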

Optionally, after obtaining the first attribute sets, the target probability of each instance data in each first data set may be determined according to the first attribute sets, specifically as follows: acquiring a second attribute set of each instance data, wherein the second attribute set comprises attribute information of the instance data; a target probability for each instance data is determined based on the first set of attributes and the second set of attributes for each instance data.

The number of the first attribute sets is at least one, and before determining the target probability of each instance data based on the first attribute sets and the second attribute sets of each instance data, the confidence of each first attribute set can be obtained, and the confidence is used for indicating the probability that the data with all the attribute information in the first attribute sets belong to the target type.

When the target probability of each instance data is determined based on the first attribute set and the second attribute set of each instance data, determining a target matching degree of the instance data relative to each first attribute set by matching the second attribute set of the instance data with each first attribute set, namely, each instance data can obtain at least one target matching degree; and taking the confidence coefficient of the first attribute set corresponding to the maximum matching degree in the at least one target matching degree as the target probability of the example data.

Optionally, by matching the second attribute set of the instance data with each first attribute set, determining the target matching degree of the instance data with respect to each first attribute set may be implemented as follows: determining a first matching degree of attribute information in a second attribute set of the example data and attribute information in a first attribute set; determining a second matching degree of the attribute information in the first attribute set and the attribute information in the second attribute set of the example data; and determining the target matching degree of the example data relative to the first attribute set according to the first matching degree and the second matching degree.

The determining the first matching degree between the attribute information in the second attribute set of the example data and the attribute information in the first attribute set includes: determining the number of target attribute information in the second attribute set, which matches the attribute information in the first attribute set, for example, the number of target attribute information in the second attribute set, which is the same as the attribute information in the first attribute set; and taking the ratio of the number of the target attribute information to the number of the attribute information in the first attribute set as a first matching degree.

The determining a second matching degree between the attribute information in the first attribute set and the attribute information in the second attribute set of the instance data includes: determining the number of target attribute information in the first attribute set, which is matched with the attribute information in the second attribute set, for example, the number of target attribute information in the first attribute set, which is the same as the attribute information in the second attribute set; and taking the ratio of the number of the target attribute information to the number of the attribute information in the second attribute set as a second matching degree.

The determining the target matching degree of the instance data relative to the first attribute set according to the first matching degree and the second matching degree includes: and taking the sum of the first matching degree and the second matching degree as the target matching degree, or taking the product of the first matching degree and the second matching degree as the target matching degree.
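The matching procedure described above can be sketched in Python as follows. The function names, the `combine` parameter, and the sample rule confidences are hypothetical; returning 0.0 when an instance shares no attribute with any first attribute set is also an assumption, since the text does not cover that case.

```python
def matching_degree(instance_attrs, rule_attrs, combine="product"):
    """Target matching degree of an instance relative to one first attribute set."""
    if not instance_attrs or not rule_attrs:
        return 0.0
    common = len(instance_attrs & rule_attrs)
    first = common / len(rule_attrs)       # ratio to the first attribute set size
    second = common / len(instance_attrs)  # ratio to the second attribute set size
    return first + second if combine == "sum" else first * second


def target_probability(instance_attrs, rules, combine="product"):
    """Confidence of the best-matching first attribute set.

    rules: {frozenset of attribute names: confidence}.
    Returns 0.0 if nothing matches (an assumption, not stated in the source).
    """
    best_rule, best_degree = None, 0.0
    for rule_attrs in rules:
        degree = matching_degree(instance_attrs, rule_attrs, combine)
        if degree > best_degree:
            best_rule, best_degree = rule_attrs, degree
    return rules[best_rule] if best_rule is not None else 0.0


# Hypothetical judgment attribute sets for the concept Country.
country_rules = {
    frozenset({"capital"}): 0.9,
    frozenset({"capital", "population"}): 0.95,
}
p_good = target_probability({"capital", "population", "area"}, country_rules)
p_noisy = target_probability({"birthday"}, country_rules)
```

With the product rule, the instance with attributes {capital, population, area} matches {capital, population} best (1 × 2/3) and takes that set's confidence, while the instance with only a birthday matches nothing.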

In the technical solution provided in step S206, determining the collection quality information of each first data set based on the target probability of all the example data in each first data set includes at least one of: determining a first average of the target probabilities of all instance data in the first data set, the first average being used to indicate the accuracy of the data collected according to the target type; entropy values of the target probabilities of all instance data in the first data set are determined, the entropy values indicating a degree of clutter of the data acquired according to the target type, and the acquisition quality information comprises a first average value and/or entropy values.

Optionally, determining entropy values of the target probabilities for all instance data in the first set of data comprises: the entropy value is determined by taking a logarithm of the target probabilities for all instance data in the first set of data.

The application proposes two measures for evaluating the quality of the is-a relationship in a concept. The first is the mean probability Z(C_i), which evaluates the correctness of the is-a relationship. Writing p(C_i | e) for the target probability of an instance e collected under concept C_i, and |C_i| for the number of such instances, the formula is as follows:

Z(C_i) = \frac{1}{|C_i|} \sum_{e \in C_i} p(C_i \mid e)

However, Z(C_i) cannot reflect the distribution of the data, that is, it cannot express the degree of mixing of the instances in the class.

Therefore, the application also proposes reflecting the distribution of the data by the information entropy, denoted M(C_i); the higher the degree of mixing, the larger the entropy value. The probability range of the instances belonging to a category is divided into n intervals, and the probability of falling in the k-th interval is denoted q_k. M(C_i) is calculated as follows:

M(C_i) = -\sum_{k=1}^{n} q_k \log q_k

With the rapid development of semantic web technology, ontologies have been applied in more and more fields, and ontology quality assessment has become an indispensable part of semantic web technology. The application provides a method for evaluating the quality of the is-a relationship of RDF data: the probability that an instance belongs to its class is calculated based on classification association rules, the degree of mixing of the class is expressed through entropy, and the correctness of the is-a relationship in the class is expressed through the mean probability that the instances belong to the class. Together, the two measures reflect the quality of the is-a relationship of RDF data more comprehensively and correctly. This gives ontology builders an assessment reference for finding problems in the knowledge base, and gives ontology users a reference for selecting a "best" ontology.
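As a rough illustration of the two measures, the following Python sketch computes the mean target probability and the interval-based entropy for one data set. The function names, the choice of ten equal-width intervals over [0, 1], and the natural logarithm are assumptions; the text fixes neither the number of intervals nor the logarithm base.

```python
import math


def mean_probability(probs):
    """Mean of the target probabilities of all instances (correctness measure)."""
    return sum(probs) / len(probs)


def interval_entropy(probs, n_intervals=10):
    """Entropy of the distribution of target probabilities over equal intervals."""
    counts = [0] * n_intervals
    for p in probs:
        # p == 1.0 is placed in the last interval.
        index = min(int(p * n_intervals), n_intervals - 1)
        counts[index] += 1
    total = len(probs)
    return -sum((c / total) * math.log(c / total) for c in counts if c)


clean = [0.9, 0.9, 0.9, 0.9]   # hypothetical: consistent, high probabilities
mixed = [0.1, 0.9]             # hypothetical: half the instances look noisy
clean_scores = (mean_probability(clean), interval_entropy(clean))
mixed_scores = (mean_probability(mixed), interval_entropy(mixed))
```

The consistent set has a high mean and zero entropy, while the mixed set has a lower mean and an entropy of log 2, which is why the mean alone is not enough to describe a data set.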

In the technical solution provided in step S208, determining, as the target data set for data analysis, a first data set whose acquisition quality information meets the preset quality requirement includes: taking, as the target data set, a data set among the plurality of first data sets whose first average value reaches a first preset value and/or whose entropy value reaches a second preset value.

The first preset value and the second preset value are preset values according to requirements, and data sets with better collection quality can be filtered by using the values.
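A minimal sketch of this selection step in Python is shown below. It assumes that "reaching" the first preset value means the mean is at least `min_mean`, and that "reaching" the second preset value means the entropy is at most `max_entropy` (since higher entropy indicates more mixing); both readings, the function name, and the sample values are assumptions.

```python
def select_target_sets(quality, min_mean, max_entropy, require_both=True):
    """Return the names of data sets whose quality meets the preset requirement.

    quality: {data set name: (mean target probability, entropy)}.
    require_both: True for the "and" reading, False for the "or" reading.
    """
    selected = []
    for name, (mean, entropy) in quality.items():
        mean_ok = mean >= min_mean
        entropy_ok = entropy <= max_entropy
        passed = (mean_ok and entropy_ok) if require_both else (mean_ok or entropy_ok)
        if passed:
            selected.append(name)
    return selected


# Hypothetical quality information for three first data sets.
quality = {"set_a": (0.92, 0.35), "set_b": (0.61, 1.20), "set_c": (0.88, 0.95)}
strict = select_target_sets(quality, min_mean=0.8, max_entropy=0.5)
loose = select_target_sets(quality, min_mean=0.8, max_entropy=0.5, require_both=False)
```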

In the embodiment, the data set with better acquisition quality is selected from the plurality of data sets by the method for data analysis, so that a correct analysis result can be obtained.

By using the method, the quality evaluation of is-a in RDF can be realized. Specifically, after the acquisition quality information of each first data set is determined based on the target probability of all instance data in each first data set, a second average value of the first average values of the plurality of first data sets is obtained, the data in the plurality of first data sets meet a preset relation, the preset relation is used for indicating the data and the type of the data, and the second average value is used for indicating the accuracy of the data acquired according to the preset relation; and acquiring a third average value of entropy values of the plurality of first data sets, wherein the third average value is used for indicating the degree of mixing of the data acquired according to the preset relation.

In the embodiments of the application, Z(O) may be used to describe the correctness of the data quality, and M(O) may be used to describe the degree of mixing within the categories, where O represents an ontology and C_j represents a concept in O. The higher the degree of mixing, the larger M(O); the higher the correctness of the is-a relationship in the data, the larger Z(O). Taking the averages described above over the m concepts of O, Z(O) and M(O) are calculated as follows:

Z(O) = \frac{1}{m} \sum_{j=1}^{m} Z(C_j), \qquad M(O) = \frac{1}{m} \sum_{j=1}^{m} M(C_j)

The method evaluates the quality of the is-a relations in the ontology based on classification association rule mining and provides two measures for evaluating the quality of categories and concepts. This guarantees the quality of the ontology, allows the ontology to be maintained effectively, gives ontology builders a reference, and provides a basis for selecting an ontology.

Examples of the present application are described in detail below with reference to the embodiment shown in fig. 3. The method of the present application can be divided into four modules as shown in fig. 3 in terms of software:

A data preprocessing module 32, configured to extract data from the knowledge base, construct the transaction table used for classification association rule mining, and mine the judgment attribute sets and confidence of each concept C from the transaction table.

A probability calculation module 34, configured to match instances with concepts according to the judgment attribute sets of C.

A quality calculation module 36, configured to calculate the two measures of concept quality.

A quality evaluation module 38, configured to complete the quality evaluation of the Type assertions in each concept and output evaluation data and logs for the user's reference.

Step S402, the data preprocessing module acquires data (for example, term data) from a data source;

Step S404, a data table is acquired and an entity, data, and class matrix is constructed;

Step S406, classification association rule mining is performed on the matrix to obtain the strong association rule sets of each class and their confidence;

Step S408, the instances are matched against the strong association rule sets of the categories to obtain the target probability of each instance data;

Step S410, the probability-weighted sum and the entropy of each category are obtained through interval segmentation;

Step S412, the is-a indicator score of the data set is calculated, and a log is output for the user's reference.

(1) Data preprocessing module

The data preprocessing module serves the subsequent association rule mining. It acquires the attribute and type information of instances from a data source through SPARQL queries and then constructs the transaction table T used for classification association rule mining. The transaction table T divides each piece of transaction data (Transaction) into two parts. The first part is T_p = {t_p1, t_p2, …, t_pn}, where T_p is the set of attribute sets (i.e., the second attribute sets) and t_pn denotes the attribute set of the n-th transaction. The second part is T_c = {C₁, C₂, …, C_n}, where C_n denotes the concept to which the n-th transaction belongs. The present application adopts the Closed World Assumption, i.e., an attribute that is not recorded for an instance is treated as absent from that instance. The final transaction table is shown in Table 1:

TABLE 1

Instance name	name	Birthday	height	weight	Class
Aaron_Line	1	1	1	1	Person
Washington	1	0	0	0	Place
Bummer	1	1	1	1	Person
Edmond	1	1	1	0	Person
……	…	…	…	…	…

In Table 1, name, Birthday, height, weight, denote attributes, and Class denotes a Class (i.e., data type or concept).
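The construction of such a transaction table can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the present application: the function name `build_transaction_table` and the in-memory input format are hypothetical, and the instance data are taken from Table 1 instead of being fetched through SPARQL.

```python
from typing import Dict, List, Set, Tuple

# Attribute columns of Table 1 (lower-cased for consistency).
ATTRIBUTES = ["name", "birthday", "height", "weight"]


def build_transaction_table(
    instances: Dict[str, Tuple[Set[str], str]]
) -> List[Tuple[str, List[int], str]]:
    """Build rows of (instance name, 0/1 attribute flags, class).

    Closed World Assumption: an attribute that is not recorded for an
    instance is encoded as 0, i.e. treated as absent.
    """
    rows = []
    for instance_name, (attrs, cls) in instances.items():
        flags = [1 if attribute in attrs else 0 for attribute in ATTRIBUTES]
        rows.append((instance_name, flags, cls))
    return rows


# Instance data corresponding to the rows of Table 1.
instances = {
    "Aaron_Line": ({"name", "birthday", "height", "weight"}, "Person"),
    "Washington": ({"name"}, "Place"),
    "Bummer": ({"name", "birthday", "height", "weight"}, "Person"),
    "Edmond": ({"name", "birthday", "height"}, "Person"),
}
table = build_transaction_table(instances)
```

Each row keeps the attribute part (T_p) and the class part (T_c) side by side, which is the shape a classification association rule miner expects.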

(2) Probability calculation module

This module computes the judgment attribute sets of each class using a classification association rule mining algorithm, matches instances with concepts according to a matching rule, and takes the confidence of the matched judgment attribute set as the posterior probability that the instance belongs to the concept. The specific matching rules are as follows:

After the association rule sets (i.e., the first attribute sets) capable of representing each category are obtained, the decision set with the highest similarity to the instance must be found according to a matching strategy; the confidence of that decision set represents the probability that the instance belongs to the category. When the instance attribute set E is matched against a candidate set N_i, the proportion |S| / |N_i| of the matched attributes in the candidate set expresses the matching accuracy, and the proportion |S| / |E| of the matched attributes in the instance expresses the contribution of S. The candidates are ranked by the product of the two, and the confidence of the candidate with the largest product is taken as the probability that instance E belongs to the class. The matching score is calculated as follows:

$$\mathrm{match}(E, N_i) = \frac{|S|}{|N_i|} \times \frac{|S|}{|E|}$$

where N_i is the i-th candidate set, S is the set of matching items between E and N_i, and E is the set of instance attributes.

To further illustrate how the probability that an instance belongs to a certain class is calculated, specific data are used as an example below. Suppose the association rule sets (i.e., the first attribute sets) mined for the Person class include the following two:

[[birthday, name, age, address] -> person; confidence = 0.9];

[[birthday, gender, name, graduation, email, tell, blogAddress] -> person; confidence = 0.8].

Here birthday, name, age, address, gender, graduation, email, tell, and blogAddress in the attribute sets are the judgment attribute information of person, and confidence denotes the confidence of the rule.

Now consider an instance E = [birthday, name, gender, age]. According to the matching strategy, its matching score with the first candidate is 3/4 × 3/4 = 0.5625 (S = {birthday, name, age}), and its matching score with the second candidate is 3/7 × 3/4 ≈ 0.32 (S = {birthday, name, gender}). The candidate that best matches instance E is therefore [[birthday, name, age, address] -> person; confidence = 0.9], and the probability that this instance belongs to the Person class is 0.9.
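The matching strategy and the worked example can be reproduced with a short sketch. The names `match_score` and `target_probability` are hypothetical, and returning 0.0 when no candidate shares any attribute with the instance is an assumption made here; the text does not specify that case.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (decision attribute set N_i, confidence).
Rule = Tuple[FrozenSet[str], float]


def match_score(instance_attrs: Set[str], rule_attrs: FrozenSet[str]) -> float:
    """Product of |S|/|N_i| (accuracy) and |S|/|E| (contribution)."""
    if not instance_attrs or not rule_attrs:
        return 0.0
    matched = instance_attrs & rule_attrs  # S: attributes shared by E and N_i
    return (len(matched) / len(rule_attrs)) * (len(matched) / len(instance_attrs))


def target_probability(instance_attrs: Set[str], rules: List[Rule]) -> float:
    """Confidence of the best-matching rule, used as the class probability."""
    if not rules:
        return 0.0
    best_attrs, best_conf = max(rules, key=lambda r: match_score(instance_attrs, r[0]))
    # Assumption: with no overlap at all, the instance gets probability 0.
    if match_score(instance_attrs, best_attrs) == 0.0:
        return 0.0
    return best_conf


# The two Person rules and the instance E from the example above.
person_rules: List[Rule] = [
    (frozenset({"birthday", "name", "age", "address"}), 0.9),
    (frozenset({"birthday", "gender", "name", "graduation",
                "email", "tell", "blogAddress"}), 0.8),
]
instance_e = {"birthday", "name", "gender", "age"}
probability = target_probability(instance_e, person_rules)
```

Multiplying the two ratios penalizes both a rule that is only partly satisfied and an instance that is only partly explained, so neither a very large rule nor a very small one wins by default.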

(3) Quality calculation module

The quality of the is-a relation in a concept is evaluated by two measures. The first is the average of the probabilities, Z(C_i), which assesses the correctness of the is-a relationship.

However, Z(C_i) does not reflect the distribution of the data; that is, it cannot express the degree of mixing of the instances in the class. For example, consider two groups of instances A and B with the probability distributions A: {0.1, 0.8, 0.8, 0.8, 0.8} and B: {0.2, 0.4, 0.9, 0.9, 0.9}, both of which have an average value of 0.66. In fact, group A contains almost only one noisy item, while group B very likely contains two, so group A should be of higher quality; yet the two groups receive equal scores. The reason is that Z(C_i) ignores the data distribution, so the present application additionally uses information entropy, denoted M(C_i), to reflect the distribution. The higher the degree of information mixing, the larger the entropy value.
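The contrast between groups A and B can be checked with a small sketch. The exact entropy formula is not reproduced in this text, so the sketch assumes one plausible reading of "interval segmentation": the probabilities are sorted into equal-width intervals and the Shannon entropy of the resulting histogram is taken. The function names and the choice of five intervals are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def mean_probability(probs: Sequence[float]) -> float:
    """Z(C_i): average of the target probabilities of a class."""
    return sum(probs) / len(probs)


def binned_entropy(probs: Sequence[float], n_bins: int = 5) -> float:
    """M(C_i) under the assumed reading: Shannon entropy (base 2) of the
    histogram obtained by splitting [0, 1] into n_bins equal intervals."""
    # A probability of exactly 1.0 is placed in the last interval.
    bins = Counter(min(int(p * n_bins), n_bins - 1) for p in probs)
    total = len(probs)
    return -sum((c / total) * math.log2(c / total) for c in bins.values())


group_a = [0.1, 0.8, 0.8, 0.8, 0.8]
group_b = [0.2, 0.4, 0.9, 0.9, 0.9]

mean_a, mean_b = mean_probability(group_a), mean_probability(group_b)
entropy_a, entropy_b = binned_entropy(group_a), binned_entropy(group_b)
```

Under this assumption the two groups share the same mean, while group B, whose probabilities fall into more intervals, has the larger entropy, which matches the intuition that it is the more mixed group.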

In addition, in order to verify the correctness of the method proposed by the present application, the noise data in each set may be subjected to spot check in the following manner, for example: the following SPARQL statement is executed in DBpedia:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?x WHERE {
  ?x a dbpedia-owl:Person .
  ?x a dbpedia-owl:Organization .
}

The above statement looks for instances in DBpedia that belong to both the Person class and the Organization class, where an instance refers to a specific person or organization. Person and Organization are known to be disjoint, that is, an instance of Person cannot also be an instance of Organization. Nevertheless, the above SPARQL query returns a series of results, such as Jordanhill_College. This indicates that DBpedia contains instances that belong to both classes, so Jordanhill_College can be considered noisy data. The data in a data set can be verified in this manner.

(4) Quality evaluation module

The application uses Z(O) to describe the correctness of the data quality and M(O) to describe the promiscuity within the categories, where O represents an ontology and C_j represents a concept (category) in O. The higher the degree of information mixing, the larger M(O); the higher the correctness of the is-a relations in the data, the larger Z(O).

The quality evaluation of the Type assertions in each concept is completed through the above formulas, and evaluation data and logs are output for the user's reference.

The is-a relationship in the ontology reflects the relationship between the instance and the category, the instance is the basis of other axioms in the ontology, and most instances in the ontology are obtained by automatic extraction or heterogeneous data integration, so a large amount of noise data exists in the instance layer. Such noisy data can result in erroneous data and information being obtained for ontology-based applications. The method comprises the steps of calculating the probability that an example belongs to the class of the RDF data based on a classification association rule, expressing the promiscuous degree of the class through entropy, and expressing the correctness of the is-a relation in the class through the probability mean value of the class to which the example belongs. The two measures can reflect the quality of the is-a relation of the RDF data more comprehensively and correctly.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

According to the embodiment of the invention, a data set determining device for implementing the data set determining method is also provided. Fig. 5 is a schematic diagram of an alternative data set determining apparatus according to an embodiment of the present invention, and as shown in fig. 5, the apparatus may include: a receiving unit 52, a first determining unit 54, a second determining unit 56 and a third determining unit 58.

A receiving unit 52, configured to receive an instruction for instructing to obtain a target data set from a plurality of first data sets, where the first data sets include at least one instance data collected according to a target type, and data of the target data set is used for data analysis;

a first determining unit 54, configured to determine a target probability of each instance data in each first data set according to a first attribute set, where the target probability is a probability that the instance data belongs to a target type, and the first attribute set includes an attribute indicating data of the target type;

a second determining unit 56, configured to determine, based on the target probabilities of all the instance data in each first data set, acquisition quality information of each first data set, where the acquisition quality information is used to indicate quality of the first data sets acquired according to the target type;

a third determining unit 58, configured to determine that the collected quality information in the plurality of first data sets meets a preset quality requirement as a target data set for performing data analysis.

It should be noted that the receiving unit 52 in this embodiment may be configured to execute step S202 in embodiment 1 of this application, the first determining unit 54 in this embodiment may be configured to execute step S204 in embodiment 1 of this application, the second determining unit 56 in this embodiment may be configured to execute step S206 in embodiment 1 of this application, and the third determining unit 58 in this embodiment may be configured to execute step S208 in embodiment 1 of this application.

It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of embodiment 1 described above. It should be noted that the modules described above as a part of the apparatus may operate in a hardware environment as shown in fig. 1, and may be implemented by software or hardware.

Through the module, when an instruction for indicating that a target data set is obtained from a plurality of first data sets is received, the probability that example data in the first data sets belong to a target type is determined through the first attribute set, then the acquisition quality information of the first data sets is determined based on the target probability of all example data in the first data sets, and the target data sets meeting the preset quality requirement are selected for data analysis.

Optionally, the first determining unit is further configured to obtain a second data set before determining the target probability of each instance data in each first data set according to the first attribute set, where each data in the second data set belongs to a target type; and performing data mining on the second data set to obtain a first attribute set.

Define an instance E(a₁, a₂, …, a_n), where a_i is an attribute of instance E. The probability that E belongs to class C_i (i.e., the target type) is then p(C_i | E) = p(C_i | a₁, a₂, …, a_n), which can in principle be obtained by statistics. However, because the data are unreliable and atypical attributes exist, direct statistics would cause large errors; atypical attributes are attributes that occur with extremely low frequency and cannot characterize any class. To address this, the present application uses classification association rules to find the association rule sets (i.e., the first attribute sets) that best represent class C_i, referred to as the judgment attribute sets. Then, according to a certain matching rule, the association rule (first attribute set) most similar to the instance, (s₁, s₂, …, s_n), is found, and its confidence is taken as the closest approximation to the true p(C_i | E).

Association rule mining algorithms include the Apriori algorithm and the FP-tree algorithm. They can mine strong association rules (i.e., the first attribute sets) and their confidence, but they also compute association rules unrelated to class C_i, which causes information redundancy and additional memory overhead. To overcome this problem, the present application may preferably adopt the CAR-Apriori algorithm, a classification association rule mining algorithm, to mine only the rules related to class C_i and their confidence, and take that confidence as the probability that an instance belongs to the category.
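The idea of mining only rules whose consequent is the target class can be illustrated with a deliberately simple sketch. It is not CAR-Apriori: it enumerates attribute subsets by brute force without candidate pruning, and the function name `mine_class_rules`, the thresholds, and the `max_size` limit are hypothetical. The transactions are the rows of Table 1.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Transaction = Tuple[Set[str], str]  # (attribute set, class)


def mine_class_rules(
    transactions: List[Transaction],
    target_class: str,
    min_support: float,
    min_confidence: float,
    max_size: int = 3,
) -> Dict[FrozenSet[str], float]:
    """Return {attribute set: confidence} for rules 'attribute set -> target_class'.

    support    = transactions containing the set AND of the target class / all
    confidence = transactions containing the set AND of the target class
                 / transactions containing the set
    """
    n = len(transactions)
    universe = sorted({a for attrs, _ in transactions for a in attrs})
    rules: Dict[FrozenSet[str], float] = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(universe, size):
            items = frozenset(itemset)
            covered = [cls for attrs, cls in transactions if items <= attrs]
            if not covered:
                continue
            hits = sum(1 for cls in covered if cls == target_class)
            if hits / n >= min_support and hits / len(covered) >= min_confidence:
                rules[items] = hits / len(covered)
    return rules


# Rows of Table 1 as (attributes present, class).
table1 = [
    ({"name", "birthday", "height", "weight"}, "Person"),
    ({"name"}, "Place"),
    ({"name", "birthday", "height", "weight"}, "Person"),
    ({"name", "birthday", "height"}, "Person"),
]
person_rules_mined = mine_class_rules(
    table1, "Person", min_support=0.5, min_confidence=0.9
)
```

On this toy data the attribute name alone is not a usable judgment attribute for Person, because the Place instance also has it, whereas birthday alone already separates the two classes.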

Alternatively, as shown in fig. 6, the first determination unit 54 includes: an obtaining module 542, configured to obtain a second attribute set of each instance data, where the second attribute set includes attribute information of the instance data; a first determining module 544, configured to determine a target probability for each instance data based on the first set of attributes and the second set of attributes for each instance data.

It should be noted that, the number of the first attribute sets is at least one, and the first determining module includes: the obtaining sub-module is used for obtaining the confidence of each first attribute set, wherein the confidence is used for indicating the probability that the data with all the attribute information in the first attribute set belong to the target type; the determining submodule is used for determining the target matching degree of the example data relative to each first attribute set by matching the second attribute set of the example data with each first attribute set; and the processing submodule is used for taking the confidence coefficient of the first attribute set corresponding to the maximum matching degree in the at least one target matching degree as the target probability of the example data.

The determination submodule further configured to: determining a first matching degree of attribute information in a second attribute set of the example data and attribute information in a first attribute set; determining a second matching degree of the attribute information in the first attribute set and the attribute information in the second attribute set of the example data; and determining the target matching degree of the example data relative to the first attribute set according to the first matching degree and the second matching degree.

Specifically, the determining sub-module determines a first matching degree between the attribute information in the second attribute set of the instance data and the attribute information in the first attribute set by: determining the number of target attribute information in the second attribute set, which matches the attribute information in the first attribute set, for example, the number of target attribute information in the second attribute set, which is the same as the attribute information in the first attribute set; and taking the ratio of the number of the target attribute information to the number of the attribute information in the first attribute set as a first matching degree.

The determining submodule determines a second matching degree between the attribute information in the first attribute set and the attribute information in the second attribute set of the instance data by: determining the number of target attribute information in the first attribute set, which is matched with the attribute information in the second attribute set, for example, the number of target attribute information in the first attribute set, which is the same as the attribute information in the second attribute set; and taking the ratio of the number of the target attribute information to the number of the attribute information in the second attribute set as a second matching degree.

Alternatively, as shown in fig. 7, the second determination unit 56 includes: a second determining module 562, configured to determine a first average of the target probabilities of all the instance data in the first data set, where the first average is used to indicate accuracy of the data collected according to the target type; a third determining module 564, configured to determine entropy values of the target probabilities of all the instance data in the first data set, where the entropy values are used to indicate promiscuity of the data acquired according to the target type, and the acquisition quality information includes the first average value and/or the entropy values.

Optionally, the third determining unit is further configured to use, as the target data set, a data set in which the first average value of the plurality of first data sets reaches the first preset value and/or the entropy value reaches the second preset value.

It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of embodiment 1 described above. It should be noted that the modules described above as a part of the apparatus may be operated in a hardware environment as shown in fig. 1, and may be implemented by software, or may be implemented by hardware, where the hardware environment includes a network environment.

Example 3

According to the embodiment of the invention, the invention further provides a server or a terminal for implementing the data set determination method.

Fig. 8 is a block diagram of a terminal according to an embodiment of the present invention. As shown in fig. 8, the terminal may include one or more processors 801 (only one of which is shown), a memory 803, and a transmission apparatus 805 (such as the transmission apparatus in the above embodiment); the terminal may further include an input/output device 807.

The memory 803 may be used to store software programs and modules, such as program instructions/modules corresponding to the methods and apparatuses in the embodiments of the present invention, and the processor 801 executes various functional applications and data processing by operating the software programs and modules stored in the memory 803, so as to implement the above-described methods. The memory 803 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 803 may further include memory located remotely from the processor 801, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 805 is used for receiving or sending data via a network, and may also be used for data transmission between the processor and the memory. Examples of the network include wired networks and wireless networks. In one example, the transmission device 805 includes a Network Interface Controller (NIC) that can be connected to a router and other network devices via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 805 is a Radio Frequency (RF) module, which is used for communicating with the internet wirelessly.

Specifically, the memory 803 is used to store an application program.

The processor 801 may call an application stored in the memory 803 via the transmission means 805 to perform the following steps: receiving an instruction for instructing to acquire a target data set from a plurality of first data sets, wherein the first data sets comprise at least one example data collected according to a target type, and the data of the target data set is used for data analysis; determining a target probability of each instance data in each first data set according to the first attribute set, wherein the target probability is the probability that the instance data belongs to a target type, and the first attribute set comprises an attribute used for indicating the data of the target type; determining acquisition quality information of each first data set based on the target probability of all the example data in each first data set, wherein the acquisition quality information is used for indicating the quality of the first data sets acquired according to the target type; and determining that the collected quality information in the first data sets meets the preset quality requirement as a target data set for data analysis.

The processor 801 is further configured to perform the following steps: acquiring a second attribute set of each instance data, wherein the second attribute set comprises attribute information of the instance data; a target probability for each instance data is determined based on the first set of attributes and the second set of attributes for each instance data.

By adopting the embodiment of the invention, when an instruction for indicating to acquire the target data set from the plurality of first data sets is received, the probability that the example data in the first data sets belong to the target type is determined through the first attribute set, then the acquisition quality information of the first data sets is determined based on the target probabilities of all the example data in the first data sets, and the target data sets meeting the preset quality requirement are selected for data analysis.

Optionally, the specific examples in this embodiment may refer to the examples described in embodiment 1 and embodiment 2, and this embodiment is not described herein again.

It can be understood by those skilled in the art that the structure shown in fig. 8 is only illustrative, and the terminal may be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), or a PAD. Fig. 8 does not limit the structure of the electronic device. For example, the terminal may include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 8, or have a different configuration from that shown in fig. 8.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

Example 4

The embodiment of the invention also provides a storage medium. Alternatively, in the present embodiment, the storage medium described above may be used for a program code that executes the determination method of the data set.

Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:

s11, receiving an instruction for instructing to acquire a target data set from a plurality of first data sets, wherein the first data sets comprise at least one example data collected according to a target type, and the data of the target data set is used for data analysis;

s12, determining a target probability of each instance data in each first data set according to the first attribute sets, wherein the target probability is the probability that the instance data belongs to a target type, and the first attribute sets comprise attributes used for indicating data of the target type;

s13, determining the collection quality information of each first data set based on the target probability of all the example data in each first data set, wherein the collection quality information is used for indicating the quality of the first data sets collected according to the target type;

and S14, determining the collected quality information in the first data sets to meet the preset quality requirement as a target data set for data analysis.

Optionally, the storage medium is further arranged to store program code for performing the steps of:

s21, acquiring a second attribute set of each instance data, wherein the second attribute set comprises attribute information of the instance data;

s22, a target probability for each instance data is determined based on the first set of attributes and the second set of attributes for each instance data.

Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims

1. A method for determining a data set, comprising:

receiving an instruction for instructing to acquire a target data set from a plurality of first data sets, wherein the first data sets comprise at least one example data collected according to a target type, and the data of the target data set is used for data analysis;

determining a target probability of each instance data in each first data set according to a first attribute set, wherein the target probability is a probability that the instance data belongs to the target type, and the first attribute set comprises an attribute used for indicating data of the target type;

determining acquisition quality information of each first data set based on target probabilities of all instance data in the first data sets, wherein the acquisition quality information is used for indicating the quality of the first data sets acquired according to the target type and belonging to the target type;

and determining the target data set for data analysis, wherein the acquired quality information in the first data sets meets the preset quality requirement.

2. The method of claim 1, wherein determining the target probability for each instance of data in each of the first data sets based on the first set of attributes comprises:

acquiring a second attribute set of each instance data, wherein the second attribute set comprises attribute information of the instance data;

determining a target probability for each of the instance data based on the first set of attributes and the second set of attributes for each of the instance data.

3. The method of claim 2, wherein there is at least one first attribute set,

prior to determining the target probability of each instance data based on the first attribute set and the second attribute set of each instance data, the method further comprises: obtaining a confidence of each first attribute set, wherein the confidence indicates the probability that data having all the attribute information in the first attribute set belongs to the target type;

determining the target probability of each instance data based on the first attribute set and the second attribute set of each instance data comprises: determining a target matching degree of the instance data relative to each first attribute set by matching the second attribute set of the instance data with each first attribute set; and taking the confidence of the first attribute set corresponding to the maximum matching degree among the at least one target matching degree as the target probability of the instance data.
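The selection rule in claim 3 can be illustrated with a minimal Python sketch. The names `overlap_ratio` and `target_probability`, and the representation of each first attribute set as an `(attributes, confidence)` pair, are assumptions made for illustration and do not appear in the claims. The matching function is a placeholder, since claim 3 itself does not fix how the target matching degree is computed.

```python
from typing import Callable, FrozenSet, Sequence, Tuple

AttributeSet = FrozenSet[str]
Rule = Tuple[AttributeSet, float]  # (first attribute set, its confidence)


def overlap_ratio(instance_attrs: AttributeSet, rule_attrs: AttributeSet) -> float:
    """Placeholder matching degree (assumption): share of the rule's
    attributes that also appear on the instance."""
    if not rule_attrs:
        return 0.0
    return len(instance_attrs & rule_attrs) / len(rule_attrs)


def target_probability(
    instance_attrs: AttributeSet,
    rules: Sequence[Rule],
    match: Callable[[AttributeSet, AttributeSet], float] = overlap_ratio,
) -> float:
    """Return the confidence of the first attribute set with the highest
    matching degree. Ties go to the earliest rule (an assumption)."""
    if not rules:
        raise ValueError("at least one first attribute set is required")
    _, confidence = max(rules, key=lambda rule: match(instance_attrs, rule[0]))
    return confidence
```

Passing a different `match` function allows the same selection step to be reused with whatever matching degree an embodiment defines.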

4. The method of claim 3, wherein determining a target degree of matching of the instance data with respect to each of the first attribute sets by matching the second attribute set of the instance data with each of the first attribute sets comprises:

determining a first matching degree of attribute information in the second attribute set of the instance data and attribute information in the first attribute set;

determining a second matching degree of the attribute information in the first attribute set and the attribute information in the second attribute set of the instance data;

and determining the target matching degree of the instance data relative to the first attribute set according to the first matching degree and the second matching degree.

5. The method of claim 4, wherein determining a first degree of matching of attribute information in the second set of attributes of the instance data to attribute information in the first set of attributes comprises:

determining the number of pieces of target attribute information in the second attribute set that match attribute information in the first attribute set;

and taking the ratio of the number of pieces of target attribute information to the number of pieces of attribute information in the first attribute set as the first matching degree.
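The ratio in claim 5 can be written out as a short Python sketch. Only `first_matching_degree` follows the wording of the claim. The definition of `second_matching_degree` (the same count divided by the size of the second attribute set) and the combination by multiplication in `target_matching_degree` are assumptions for illustration, because claims 4 and 5 do not specify them. All function names are hypothetical.

```python
from typing import AbstractSet


def first_matching_degree(second_attrs: AbstractSet[str],
                          first_attrs: AbstractSet[str]) -> float:
    """Claim 5: number of matched attributes divided by the number of
    attributes in the first attribute set."""
    if not first_attrs:
        return 0.0
    return len(second_attrs & first_attrs) / len(first_attrs)


def second_matching_degree(second_attrs: AbstractSet[str],
                           first_attrs: AbstractSet[str]) -> float:
    """Assumption: the mirror ratio, taken over the size of the
    instance's second attribute set."""
    if not second_attrs:
        return 0.0
    return len(second_attrs & first_attrs) / len(second_attrs)


def target_matching_degree(second_attrs: AbstractSet[str],
                           first_attrs: AbstractSet[str]) -> float:
    """Assumption: combine the two degrees by multiplication."""
    return (first_matching_degree(second_attrs, first_attrs)
            * second_matching_degree(second_attrs, first_attrs))
```

For example, an instance with attributes `{"a", "b", "x"}` matched against a first attribute set `{"a", "b", "c", "d"}` shares two attributes, so the first matching degree is 2/4.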

6. The method of claim 1, wherein determining the acquisition quality information for each of the first data sets based on the target probabilities for all instance data in each of the first data sets comprises at least one of:

determining a first average of the target probabilities of all instance data in the first data set, wherein the first average is used to indicate the accuracy of the data collected according to the target type;

determining an entropy value of the target probabilities of all instance data in the first data set, wherein the entropy value is used for indicating the degree of mixing of the data acquired according to the target type, and the acquisition quality information comprises the first average value and/or the entropy value.

7. The method of claim 6, wherein determining the entropy value of the target probabilities of all instance data in the first data set comprises:

determining the entropy value by performing a logarithmic operation on the target probabilities of all instance data in the first data set.
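The two quality measures of claims 6 and 7 can be sketched in Python as follows. The claims state only that the entropy value is obtained through a logarithmic operation on the target probabilities, so the specific form used here, the sum of -p·log2(p) over all instances, is an assumption and only one possible reading. The function names are hypothetical.

```python
import math
from typing import Sequence


def _validate(probabilities: Sequence[float]) -> None:
    if not probabilities:
        raise ValueError("the first data set must contain instance data")
    if any(p < 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("target probabilities must lie in [0, 1]")


def first_average(probabilities: Sequence[float]) -> float:
    """Mean target probability of one first data set (accuracy indicator)."""
    _validate(probabilities)
    return sum(probabilities) / len(probabilities)


def entropy_value(probabilities: Sequence[float]) -> float:
    """Assumed entropy form: sum of -p * log2(p), skipping p == 0.
    Instances with p == 1 contribute nothing, so a set whose instances
    all clearly belong to the target type scores 0."""
    _validate(probabilities)
    return sum(-p * math.log2(p) for p in probabilities if p > 0.0)
```

An embodiment could equally normalize the probabilities or average the per-instance terms; the sketch leaves those choices open.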

8. The method of claim 6, wherein determining, as the target data set for data analysis, a first data set among the plurality of first data sets whose acquisition quality information meets the preset quality requirement comprises:

taking, as the target data set, a first data set among the plurality of first data sets whose first average value reaches a first preset value and/or whose entropy value reaches a second preset value.
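The threshold test of claim 8 can be sketched as a simple filter. The claim does not state the direction in which each value must "reach" its preset value; this sketch assumes that the first average must be at least the first preset value and that the entropy value must be at most the second preset value. The `Quality` class, `select_target_sets`, and the `require_both` switch for the "and/or" wording are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Quality:
    first_average: float  # accuracy indicator of one first data set
    entropy: float        # mixing indicator of the same data set


def select_target_sets(
    quality_by_set: Dict[str, Quality],
    min_average: float,
    max_entropy: float,
    require_both: bool = True,
) -> List[str]:
    """Return the names of the first data sets that pass the thresholds.

    Assumed directions: average >= min_average, entropy <= max_entropy.
    require_both=True models "and"; False models "or".
    """
    selected = []
    for name, quality in quality_by_set.items():
        accurate = quality.first_average >= min_average
        unmixed = quality.entropy <= max_entropy
        keep = (accurate and unmixed) if require_both else (accurate or unmixed)
        if keep:
            selected.append(name)
    return selected
```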

9. The method of claim 6, wherein after determining the acquisition quality information for each of the first data sets based on the target probabilities for all instance data in each of the first data sets, the method further comprises:

acquiring a second average value of the first average values of the plurality of first data sets, wherein the data in the plurality of first data sets meet a preset relationship, the preset relationship is used for indicating a relationship between data and the type of the data, and the second average value is used for indicating the accuracy of the data acquired according to the preset relationship;

and acquiring a third average value of the entropy values of the plurality of first data sets, wherein the third average value is used for indicating the degree of mixing of the data acquired according to the preset relation.
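The aggregation in claim 9 amounts to averaging the per-set measures over all first data sets that share the preset relationship. A minimal Python sketch, assuming each first data set has already been reduced to a `(first average, entropy value)` pair; the function name `relationship_quality` is hypothetical:

```python
from statistics import mean
from typing import Sequence, Tuple


def relationship_quality(
    per_set: Sequence[Tuple[float, float]],
) -> Tuple[float, float]:
    """Return (second average, third average) for one preset relationship.

    per_set holds one (first average, entropy value) pair per first data set.
    """
    if not per_set:
        raise ValueError("at least one first data set is required")
    second_average = mean([average for average, _ in per_set])
    third_average = mean([entropy for _, entropy in per_set])
    return second_average, third_average
```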

10. The method of claim 1, wherein prior to determining the target probability of each instance data in each first data set according to the first attribute set, the method further comprises:

acquiring a second data set, wherein each data in the second data set belongs to the target type;

and performing data mining on the second data set to obtain the first attribute set.
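Claim 10 does not name a data mining technique. As a stand-in, the following Python sketch counts attribute combinations by brute force and keeps those that occur in a sufficient share of the target-type instances. The function name `mine_first_attribute_sets`, the `min_support` and `max_size` parameters, and the choice of frequent-combination counting are all assumptions. The returned support fraction is not the confidence described in claim 3.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set


def mine_first_attribute_sets(
    target_instances: Iterable[Set[str]],
    min_support: float = 0.5,
    max_size: int = 2,
) -> Dict[FrozenSet[str], float]:
    """Map each frequent attribute combination to its support fraction.

    target_instances: attribute sets of data known to be of the target type
    (the second data set). Combinations of up to max_size attributes are
    counted; those present in at least min_support of instances are kept.
    """
    instances = [frozenset(attrs) for attrs in target_instances]
    if not instances:
        return {}
    counts: Dict[FrozenSet[str], int] = {}
    for attrs in instances:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(attrs), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    total = len(instances)
    return {key: count / total
            for key, count in counts.items()
            if count / total >= min_support}
```

Brute-force enumeration grows quickly with `max_size`, so a larger attribute vocabulary would call for a dedicated frequent-itemset algorithm.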

11. An apparatus for determining a data set, comprising:

a receiving unit, configured to receive an instruction to acquire a target data set from a plurality of first data sets, wherein each first data set comprises at least one piece of instance data collected according to a target type, and the data of the target data set is used for data analysis;

a first determining unit, configured to determine, according to a first attribute set, a target probability of each instance data in each first data set, where the target probability is a probability that the instance data belongs to the target type, and the first attribute set includes an attribute indicating data of the target type;

a second determining unit, configured to determine acquisition quality information of each first data set based on the target probabilities of all instance data in that first data set, wherein the acquisition quality information indicates the quality of the first data set, which is acquired according to the target type, in terms of belonging to the target type;

and a third determining unit, configured to determine, as the target data set for data analysis, a first data set among the plurality of first data sets whose acquisition quality information meets a preset quality requirement.

12. The apparatus according to claim 11, wherein the first determining unit comprises:

an obtaining module, configured to obtain a second attribute set of each instance data, where the second attribute set includes attribute information of the instance data;

a first determining module, configured to determine the target probability of each instance data based on the first attribute set and the second attribute set of each instance data.

13. The apparatus of claim 12, wherein there is at least one first attribute set, and wherein the first determining module comprises:

an obtaining sub-module, configured to obtain a confidence level of each first attribute set, where the confidence level is used to indicate a probability that data with all attribute information in the first attribute set belongs to the target type;

a determining submodule, configured to determine a target matching degree of the instance data with respect to each of the first attribute sets by matching the second attribute set of the instance data with each of the first attribute sets;

a processing submodule, configured to use a confidence of the first attribute set corresponding to a maximum matching degree of the at least one target matching degree as a target probability of the instance data.

14. The apparatus of claim 13, wherein the determination sub-module is further configured to:

15. The apparatus according to claim 11, wherein the second determining unit comprises:

a second determining module, configured to determine a first average of the target probabilities of all instance data in the first data set, where the first average is used to indicate accuracy of the data collected according to the target type;

a third determining module, configured to determine entropy values of target probabilities of all instance data in the first data set, where the entropy values are used to indicate a degree of mixing of data acquired according to the target type, and the acquisition quality information includes the first average value and/or the entropy values.

16. The apparatus according to claim 15, wherein the third determining unit is further configured to take, as the target data set, a first data set among the plurality of first data sets whose first average value reaches a first preset value and/or whose entropy value reaches a second preset value.