CN110321377B - Multi-source heterogeneous data truth value determination method and device - Google Patents

Info

Publication number
CN110321377B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910340361.3A
Other languages
Chinese (zh)
Other versions
CN110321377A (en
Inventor
许海涛
王铮
周贤伟
林福宏
吕兴
安建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/256 Integrating or interfacing systems involving database management systems in federated or virtual databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a device for determining the truth value of multi-source heterogeneous data, which can process heterogeneous conflict data jointly and improve the accuracy of truth discovery. The method comprises the following steps: S1, obtaining heterogeneous conflict data from different data sources; S2, for the conflict data describing the same object, constructing an objective function G for each object over all data sources and an optimization model F for all objects over all data sources, each with the goal of maximizing the weighted sum of declaration value credibility; S3, for each object, adopting a truth value selection strategy based on an exhaustive method and updating the weights of all data sources; S4, calculating the F value according to the updated weights of all data sources and judging from the obtained F value whether the optimization model F has converged; if it has not converged, returning to S3 to continue execution; if it has converged, the obtained optimal truth values of all objects form the optimal truth value set. The invention relates to the field of data mining.

Description

Multi-source heterogeneous data truth value determination method and device
Technical Field
The invention relates to the field of data mining, in particular to a method and a device for determining true values of multi-source heterogeneous data.
Background
With the advent of the big data era, data has become a vast resource, and many websites and companies collect it to serve governments, businesses, and the public. People can obtain descriptions of the same entity from many data sources, which brings great convenience to daily life. For the same entity, however, different data sources may provide different descriptions, some of which are untrue; this causes severe data conflicts and impairs people's judgment of the facts.
An effective way to resolve data conflicts is truth discovery, i.e., finding the true description among the conflicting data that describe the same entity. The declarations a data source makes about different aspects of an entity can be heterogeneous, but existing truth discovery methods handle only a single type of data, cannot process heterogeneous data jointly, and neglect the truth value selection strategy, so they cannot find truth values accurately.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a device for determining the truth value of multi-source heterogeneous data, so as to solve the problems in the prior art that heterogeneous data cannot be processed jointly and that the solution easily falls into a local optimum.
In order to solve the above technical problem, an embodiment of the present invention provides a method for determining a true value of multi-source heterogeneous data, including:
S1, obtaining heterogeneous conflict data from different data sources;
S2, for the conflict data describing the same object, taking the data source weights and the declaration value credibility as optimization variables, constructing an objective function G for each object over all data sources and an optimization model F for all objects over all data sources, each with the goal of maximizing the weighted sum of declaration value credibility;
S3, for each object, adopting a truth value selection strategy based on an exhaustive method: selecting each declaration value in the declaration value set in turn as a reference truth value and determining the G value for the selected reference truth value, the reference truth value for which the G value is largest being the optimal truth value of the current object; and updating the weights of all data sources according to the obtained optimal truth values of all objects;
S4, calculating the F value according to the updated weights of all data sources and judging from the obtained F value whether the optimization model F has converged; if it has not converged, returning to S3 to continue execution; if it has converged, the obtained optimal truth values of all objects form the optimal truth value set.
Further, for conflict data describing the same object, the optimization model F, constructed for all objects of all data sources with the goal of maximizing the weighted sum of declaration value credibility, is expressed as:
$$\max F = \sum_{k=1}^{K} \sum_{n=1}^{N} w_n \, f(A_{n,k})$$
$$\text{s.t.} \quad T_k \subset A_{*,k}$$
where $w_n$ is the weight of data source $S_n$, $N$ denotes the number of data sources, $K$ denotes the number of objects, $A_{n,k}$ is the declaration value provided by data source $S_n$ for object $O_k$, $f(A_{n,k})$ is the credibility of declaration value $A_{n,k}$, s.t. denotes a constraint, $A_{*,k}$ is the declaration value set of object $O_k$, and $T_k$, the truth value of object $O_k$, is a subset of $A_{*,k}$;
where the objective function $G(k)$, constructed for object $O_k$ over all data sources with the goal of maximizing the weighted sum of declaration value credibility, is expressed as:
$$G(k) = \sum_{n=1}^{N} w_n \, f(A_{n,k})$$
where, when a declaration value of object $O_k$ maximizes the objective function $G(k)$, that declaration value is the truth value of object $O_k$, expressed as:
$$T_k = \arg\max_{A_{n,k} \in A_{*,k}} G(k)$$
Further, the weight $w_n$ of data source $S_n$ is expressed as:
Figure GDA0002073493650000025
Further, the credibility $f(A_{n,k})$ of declaration value $A_{n,k}$ is expressed as:
Figure GDA0002073493650000031
where $\beta$ is the support factor, $N_k$ is the number of declaration values provided for object $O_k$, sim(·) denotes the similarity function, and sup(·) denotes the declaration value support function.
Further, if the conflict data is categorical data, the similarity function is expressed as:
$$\mathrm{sim}(A_{n,k}, T_k) = \begin{cases} 1, & A_{n,k} = T_k \\ 0, & \text{otherwise} \end{cases}$$
Further, if the conflict data is continuous data, the similarity function is expressed as:
Figure GDA0002073493650000033
Further, the declaration value support is expressed as:
$$\mathrm{sup}(A_{n,k}, A_{i,k}) = \mathrm{sim}(A_{n,k}, A_{i,k})$$
Further, the step of adopting, for each object, a truth value selection strategy based on an exhaustive method, selecting a declaration value in the declaration value set as a reference truth value, determining the G value for the selected reference truth value, taking the reference truth value for which the G value is largest as the optimal truth value of the current object, and updating the weights of all data sources according to the obtained optimal truth values of all objects includes:
S31, for each object, adopting a truth value selection strategy based on an exhaustive method and selecting a declaration value in the declaration value set as a reference truth value;
S32, calculating the similarity between each declaration value of the object and the reference truth value, determining the declaration value credibility in combination with the declaration value support, and determining the G value according to the declaration value credibility and the data source weights obtained in the previous iteration;
S33, judging whether the G value is the maximum; if so, taking the current reference truth value as the optimal truth value of the current object; otherwise, returning to S31 to continue with the next reference truth value;
S34, forming a preliminary optimal truth value set from the obtained optimal truth values of all objects, and updating the weights of all data sources according to the declaration value credibility sets corresponding to the preliminary optimal truth value set.
An embodiment of the present invention further provides a device for determining a true value of multi-source heterogeneous data, including:
an acquisition module, used for obtaining heterogeneous conflict data from different data sources;
a construction module, used for constructing, for the conflict data describing the same object and with the data source weights and the declaration value credibility as optimization variables, an objective function G for each object over all data sources and an optimization model F for all objects over all data sources, each with the goal of maximizing the weighted sum of declaration value credibility;
an updating module, used for adopting, for each object, a truth value selection strategy based on an exhaustive method: selecting each declaration value in the declaration value set in turn as a reference truth value and determining the G value for the selected reference truth value, the reference truth value for which the G value is largest being the optimal truth value of the current object; and for updating the weights of all data sources according to the obtained optimal truth values of all objects;
a determining module, used for calculating the F value according to the updated weights of all data sources and judging from the obtained F value whether the optimization model F has converged; if it has not converged, execution returns to the updating module; if it has converged, the obtained optimal truth values of all objects form the optimal truth value set.
The technical scheme of the invention has the following beneficial effects:
In this scheme, heterogeneous conflict data from different data sources are obtained; for the conflict data describing the same object, with the data source weights and the declaration value credibility as optimization variables, an objective function G is constructed for each object over all data sources and an optimization model F is constructed for all objects over all data sources, each with the goal of maximizing the weighted sum of declaration value credibility; for each object, a truth value selection strategy based on an exhaustive method is adopted: each declaration value in the declaration value set is selected in turn as a reference truth value, the G value is determined for the selected reference truth value, and the reference truth value for which the G value is largest is taken as the optimal truth value of the current object; the weights of all data sources are updated according to the obtained optimal truth values of all objects; the F value is calculated according to the updated weights of all data sources, whether the optimization model F has converged is judged from the obtained F value, and if it has converged, the obtained optimal truth values of all objects form the optimal truth value set. In this way, heterogeneous conflict data can be processed jointly, and the truth value selection strategy based on the exhaustive method overcomes the influence of local optima and accurately finds the global optimal solution, thereby improving the accuracy of truth discovery.
Drawings
Fig. 1 is a schematic flow chart of a method for determining the truth value of multi-source heterogeneous data according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating the principle of the method for determining the truth value of multi-source heterogeneous data according to an embodiment of the present invention;
Fig. 3 is a detailed flow chart of the method for determining the truth value of multi-source heterogeneous data according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for determining the truth value of multi-source heterogeneous data according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
To address the problems that existing methods cannot process heterogeneous data jointly and easily fall into a local optimum, the invention provides a method and a device for determining the truth value of multi-source heterogeneous data.
Example one
As shown in fig. 1, the method for determining the truth value of multi-source heterogeneous data according to the embodiment of the present invention includes:
S1, obtaining heterogeneous conflict data from different data sources;
S2, for the conflict data describing the same object, taking the data source weights and the declaration value credibility as optimization variables, constructing an objective function G for each object over all data sources and an optimization model F for all objects over all data sources, each with the goal of maximizing the weighted sum of declaration value credibility;
S3, for each object, adopting a truth value selection strategy based on an exhaustive method: selecting each declaration value in the declaration value set in turn as a reference truth value and determining the G value for the selected reference truth value, the reference truth value for which the G value is largest being the optimal truth value of the current object; and updating the weights of all data sources according to the obtained optimal truth values of all objects;
S4, calculating the F value according to the updated weights of all data sources and judging from the obtained F value whether the optimization model F has converged; if it has not converged, returning to S3 to continue execution; if it has converged, the obtained optimal truth values of all objects form the optimal truth value set.
In the method for determining the truth value of multi-source heterogeneous data according to this embodiment, heterogeneous conflict data from different data sources are obtained; for the conflict data describing the same object, with the data source weights and the declaration value credibility as optimization variables, an objective function G is constructed for each object over all data sources and an optimization model F is constructed for all objects over all data sources, each with the goal of maximizing the weighted sum of declaration value credibility; for each object, a truth value selection strategy based on an exhaustive method is adopted: each declaration value in the declaration value set is selected in turn as a reference truth value, the G value is determined for the selected reference truth value, and the reference truth value for which the G value is largest is taken as the optimal truth value of the current object; the weights of all data sources are updated according to the obtained optimal truth values of all objects; the F value is calculated according to the updated weights of all data sources, whether the optimization model F has converged is judged from the obtained F value, and if it has converged, the obtained optimal truth values of all objects form the optimal truth value set. In this way, heterogeneous conflict data can be processed jointly, and the truth value selection strategy based on the exhaustive method overcomes the influence of local optima and accurately finds the global optimal solution, thereby improving the accuracy of truth discovery.
In this embodiment, each object has a set of declaration values, and the set is deduplicated so that it contains no repeated values.
In this embodiment, based on observation of real-world conflict data, the following heuristics are proposed:
(1) the truth value of an object appears in the same or a similar form on most data sources;
(2) the more credible declaration values a data source provides, the more likely it is to provide true declaration values;
(3) the higher the weight of a data source, the more likely the declaration values it provides are close to the truth value of the object.
In this embodiment, for convenience of study, it is assumed that the objects are independent of one another and that each has only one truth value. As shown in Fig. 2, based on the above heuristics, the truth discovery problem for conflict data describing the same object is converted into an optimization problem, and a multi-source heterogeneous data truth discovery optimization model is designed. The model aims to maximize the weighted sum of declaration value credibility, where the weights are the data source weights; it takes the data source weights and the declaration value credibility as optimization variables and updates these two variables iteratively so that the model is continuously optimized. The optimization model is expressed as:
$$\max F = \sum_{k=1}^{K} \sum_{n=1}^{N} w_n \, f(A_{n,k}) \qquad \text{s.t.} \quad T_k \subset A_{*,k} \qquad (1)$$
where F is the optimization model constructed for all objects of all data sources with the goal of maximizing the weighted sum of declaration value credibility; $w_n$ is the weight of data source $S_n$, $N$ denotes the number of data sources, $K$ denotes the number of objects, $A_{n,k}$ is the declaration value provided by data source $S_n$ for object $O_k$, $f(A_{n,k})$ is the credibility of declaration value $A_{n,k}$, s.t. denotes a constraint, $A_{*,k}$ is the declaration value set of object $O_k$, and $T_k$, the truth value of object $O_k$, is a subset of $A_{*,k}$.
In the optimization model, the data source weights and the declaration value credibility are both unknown, so the model is optimized by iteratively updating the data source weights and the declaration value credibility.
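As a small worked illustration of the model value, the following Python sketch evaluates F as the weighted sum of declaration value credibility over all sources and objects, as the text describes. The function name `model_value`, the nested-list layout `credibility[n][k]`, and the numbers are assumptions made for illustration only; they do not come from the patent.

```python
from typing import Sequence


def model_value(weights: Sequence[float],
                credibility: Sequence[Sequence[float]]) -> float:
    """Return F, the sum over sources n and objects k of w_n * f(A_{n,k}).

    credibility[n][k] is assumed to hold f(A_{n,k}) for source n, object k.
    """
    return sum(
        w_n * f_nk
        for w_n, row in zip(weights, credibility)
        for f_nk in row
    )


# Illustrative numbers only: three sources, two objects.
weights = [0.5, 0.3, 0.2]
credibility = [
    [1.0, 0.8],  # source 0
    [0.6, 0.4],  # source 1
    [0.2, 1.0],  # source 2
]
f_value = model_value(weights, credibility)
```

Because both the weights and the credibility values change between iterations, F has to be re-evaluated after every update.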
In this embodiment, since the objects are independent of one another, the objective function $G(k)$, constructed for object $O_k$ over all data sources with the goal of maximizing the weighted sum of declaration value credibility, can be expressed as:
$$G(k) = \sum_{n=1}^{N} w_n \, f(A_{n,k}) \qquad (2)$$
When a declaration value of object $O_k$ maximizes the objective function $G(k)$, that declaration value is the truth value of object $O_k$, expressed as:
$$T_k = \arg\max_{A_{n,k} \in A_{*,k}} G(k) \qquad (3)$$
In this embodiment, each iteration is regarded as one optimization pass. Each pass adopts the truth value selection strategy based on the exhaustive method: all declaration values in the declaration value set are taken in turn as the reference truth value, and the declaration value for which the objective function $G(k)$ attains its maximum is output as the truth value of object $O_k$. The model is then optimized further until the optimization model F converges, at which point the truth value set obtained by the optimization model is the optimal truth value set.
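The exhaustive selection for a single object can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `select_truth` and its `credibility(claim, reference)` callback are hypothetical names, and the example passes a bare 0-1 match as the credibility, which is a simplification and not the patent's credibility formula.

```python
from typing import Callable, Hashable, Sequence, Tuple


def select_truth(
    claims: Sequence[Hashable],
    weights: Sequence[float],
    credibility: Callable[[Hashable, Hashable], float],
) -> Tuple[Hashable, float]:
    """Try every distinct declaration value as the reference truth value
    and return the one that maximizes G(k) = sum_n w_n * f(A_{n,k})."""
    best_value, best_g = None, float("-inf")
    # dict.fromkeys de-duplicates the claims while keeping their order.
    for candidate in dict.fromkeys(claims):
        g = sum(w * credibility(claim, candidate)
                for w, claim in zip(weights, claims))
        if g > best_g:
            best_value, best_g = candidate, g
    return best_value, best_g


def match(claim, reference):
    # Simplified stand-in for f: 1 if the claim equals the reference.
    return 1.0 if claim == reference else 0.0


# claims[n] is the value source n declares for this one object.
claims = ["Beijing", "Beijing", "Shanghai"]
weights = [0.3, 0.3, 0.4]
truth, g_value = select_truth(claims, weights, match)
```

The cost of one pass grows with the number of distinct declaration values per object times the number of sources, which is the price of not relying on a local search.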
In this embodiment, the data source weight is calculated from the declaration value credibility, and the weight $w_n$ of data source $S_n$ is expressed as:
Figure GDA0002073493650000072
where
$$\sum_{k=1}^{K} f(A_{n,k})$$
represents the sum of the credibility of the declaration values that data source $S_n$ provides for all objects, and
$$\sum_{n=1}^{N} \sum_{k=1}^{K} f(A_{n,k})$$
represents the sum of the credibility of the declaration values that all data sources provide for all objects.
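The weight formula itself is not reproduced in this text, only the two sums it is built from. The Python sketch below therefore assumes the simplest combination of them, a source's own credibility sum divided by the total over all sources; this normalized ratio is an assumption, and the actual formula may differ. The name `update_weights` and the equal-weight fallback for an all-zero total are likewise hypothetical.

```python
from typing import List, Sequence


def update_weights(credibility: Sequence[Sequence[float]]) -> List[float]:
    """ASSUMED form: w_n = sum_k f(A_{n,k}) / sum_n sum_k f(A_{n,k}).

    credibility[n][k] holds f(A_{n,k}) for source n and object k.
    """
    per_source = [sum(row) for row in credibility]
    total = sum(per_source)
    if total == 0:
        # Hypothetical fallback: no evidence, so keep all sources equal.
        return [1.0 / len(per_source)] * len(per_source)
    return [s / total for s in per_source]


# Illustrative numbers only: source 0 is credible on both objects,
# source 1 on one of them, source 2 on neither.
new_weights = update_weights([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
```

Under this assumed form the weights are non-negative and sum to one, so a source gains weight only at the expense of the others.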
In this embodiment, the declaration value credibility is composed of the declaration value similarity and the declaration value support, and the credibility $f(A_{n,k})$ of declaration value $A_{n,k}$ is expressed as:
Figure GDA0002073493650000075
where $\beta$ is the support factor, $N_k$ is the number of declaration values provided for object $O_k$, sim(·) denotes the similarity function, and sup(·) denotes the declaration value support function.
In this embodiment, the conflict data is divided into two data types, categorical data and continuous data, which are two common kinds of heterogeneous data, and a corresponding similarity function is given for each type.
Categorical data uses a 0-1 similarity function, expressed as:
$$\mathrm{sim}(A_{n,k}, T_k) = \begin{cases} 1, & A_{n,k} = T_k \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$
Continuous data uses the normalized square-root relative error as the similarity function, expressed as:
Figure GDA0002073493650000081
In this embodiment, a corresponding similarity function is adopted for each type of data, so that multiple types of data can be processed jointly; the quality of the data sources is thereby evaluated better and the accuracy of truth discovery is improved.
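The idea of one similarity function per data type can be sketched in Python as below. `sim_categorical` follows the 0-1 function described in the text. `sim_continuous` is only a stand-in: the continuous formula is not reproduced here, so the code assumes one possible reading of a normalized square-root relative error, and the actual formula may differ. The dispatcher `similarity` and all function names are hypothetical.

```python
import math


def sim_categorical(a, b) -> float:
    """0-1 similarity: 1 when the two values are equal, otherwise 0."""
    return 1.0 if a == b else 0.0


def sim_continuous(a: float, b: float, eps: float = 1e-12) -> float:
    """ASSUMED stand-in, not the patent's formula: one minus the square
    root of the relative error, with the error clipped to [0, 1]."""
    rel = abs(a - b) / max(abs(a), abs(b), eps)
    return 1.0 - math.sqrt(min(rel, 1.0))


def _is_number(x) -> bool:
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def similarity(a, b) -> float:
    """Pick the similarity function from the data type of the values."""
    if _is_number(a) and _is_number(b):
        return sim_continuous(a, b)
    return sim_categorical(a, b)
```

With such a dispatcher, a caller can score a population count and a city name through the same interface, which is what allows both kinds of claim to feed one credibility and one source weight.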
In this embodiment, the declaration value support is the similarity between a declaration value and the other declaration values describing the same object, expressed as:
$$\mathrm{sup}(A_{n,k}, A_{i,k}) = \mathrm{sim}(A_{n,k}, A_{i,k}) \qquad (8)$$
In this embodiment, introducing the declaration value support into the declaration value credibility calculation helps the solution of the optimization problem escape local optima and approach the current optimum quickly, which improves the accuracy of the result and speeds up the convergence of the multi-source heterogeneous data truth value determination method.
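Since the support depends only on the declaration values and not on the current weights or reference truth value, it can be computed once before iterating. The Python sketch below builds the pairwise support table for one object; the name `support_matrix` and the list-of-lists layout are assumptions, and a 0-1 similarity is used only to keep the example short.

```python
from typing import Callable, List, Sequence


def support_matrix(claims: Sequence,
                   sim: Callable[[object, object], float]) -> List[List[float]]:
    """Entry [n][i] is sup(A_{n,k}, A_{i,k}) = sim(A_{n,k}, A_{i,k}).

    The diagonal is a claim's similarity to itself; a caller that wants
    support from the *other* claims only should skip i == n.
    """
    return [[sim(a, b) for b in claims] for a in claims]


def sim01(a, b) -> float:
    # 0-1 similarity, used here purely for illustration.
    return 1.0 if a == b else 0.0


# claims[n] is the value source n declares for one object.
matrix = support_matrix(["red", "red", "blue"], sim01)
```

In this example the two "red" claims support each other while "blue" is supported by no other claim.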
In a specific implementation of the foregoing method, further, the step of adopting, for each object, a truth value selection strategy based on an exhaustive method, selecting a declaration value in the declaration value set as a reference truth value, determining the G value for the selected reference truth value, taking the reference truth value for which the G value is largest as the optimal truth value of the current object, and updating the weights of all data sources according to the obtained optimal truth values of all objects includes:
S31, for each object, adopting a truth value selection strategy based on an exhaustive method and selecting a declaration value in the declaration value set as a reference truth value;
S32, calculating the similarity between each declaration value of the object and the reference truth value, determining the declaration value credibility in combination with the declaration value support, and determining the G value according to the declaration value credibility and the data source weights obtained in the previous iteration;
S33, judging whether the G value is the maximum; if so, taking the current reference truth value as the optimal truth value of the current object; otherwise, returning to S31 to continue with the next reference truth value;
S34, forming a preliminary optimal truth value set from the obtained optimal truth values of all objects, and updating the weights of all data sources according to the declaration value credibility sets corresponding to the preliminary optimal truth value set.
In a specific implementation of the foregoing method, further, the step of calculating the F value according to the updated weights of all data sources, judging from the obtained F value whether the optimization model F has converged, returning to S3 to continue execution if it has not converged, and forming the optimal truth value set from the obtained optimal truth values of all objects if it has converged includes:
calculating the value of the optimization model F according to the updated weights of all data sources and the declaration value credibility sets corresponding to the preliminary optimal truth value set of all objects;
judging from the obtained F value whether the optimization model F has converged; if it has not converged, returning to S3 to continue execution; if it has converged, the current preliminary optimal truth value set is the final optimal truth value set.
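The text does not state how convergence of F is judged. One common choice, assumed here purely for illustration, is to stop when F changes by less than a small tolerance between two consecutive iterations. The function name `has_converged`, the `f_history` list, and the default tolerance are all hypothetical.

```python
from typing import Sequence


def has_converged(f_history: Sequence[float], tol: float = 1e-6) -> bool:
    """ASSUMED criterion: the last two F values differ by less than tol.

    f_history holds the F value computed at the end of each iteration.
    """
    if len(f_history) < 2:
        return False  # at least two iterations are needed to compare
    return abs(f_history[-1] - f_history[-2]) < tol


still_moving = has_converged([2.10, 2.75])
settled = has_converged([2.10, 2.75, 2.9166, 2.9166])
```

A cap on the number of iterations is a sensible companion to such a test, since the source does not guarantee how quickly F settles.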
As shown in Fig. 3, for a better understanding of the method described in this embodiment, its workflow is described in detail below; the method may specifically include the following steps:
Step one: perform data mining by means such as a web crawler to obtain from the network heterogeneous conflict data from different data sources (for example, microblogs such as Sina and Tencent, official accounts, and search engines such as Baidu and Google), and initialize the weights of all data sources to the same value;
Step two: before iteration starts, calculate the declaration value support of all declaration values according to equation (8);
Step three: start iterating; in each iteration, adopt for each object the truth value selection strategy based on the exhaustive method, selecting the declaration values in the declaration value set in turn as the reference truth value; for each reference truth value, calculate the similarity between each declaration value of the object and the reference truth value, determine the declaration value credibility by equation (5) in combination with the declaration value support, and calculate the G value by equation (2) from the determined declaration value credibility and the data source weights obtained in the previous iteration; the reference truth value for which the G value is largest is the optimal truth value;
Step four: form a preliminary optimal truth value set from the obtained optimal truth values of all objects, and update the weights of all data sources by equation (4) according to the declaration value credibility sets corresponding to the preliminary optimal truth value set, which ends the current iteration;
Step five: calculate the value of the optimization model F according to the updated weights of all data sources and the declaration value credibility sets corresponding to the preliminary optimal truth value set of all objects; if the optimization model F has not converged, return to step three and repeat the iteration of steps three and four; once it has converged, output the current truth value set as the optimal truth value set.
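Steps one to five can be sketched end to end in Python as follows. This is an illustrative sketch, not the patented implementation: equations (4) and (5) are not reproduced in this text, so the code assumes a credibility of the form sim + beta * (mean support from the other sources) and a weight equal to a source's share of the total credibility. The names `truth_discovery`, `beta`, `tol` and `max_iter` are hypothetical, and only categorical data with a 0-1 similarity is handled.

```python
from typing import List, Sequence, Tuple


def sim(a, b) -> float:
    """0-1 similarity; the example data below is categorical only."""
    return 1.0 if a == b else 0.0


def truth_discovery(claims: Sequence[Sequence], beta: float = 0.5,
                    tol: float = 1e-9,
                    max_iter: int = 100) -> Tuple[List, List[float]]:
    """claims[n][k] is the value source n declares for object k."""
    n_src, n_obj = len(claims), len(claims[0])
    weights = [1.0 / n_src] * n_src  # step one: equal initial weights
    # Step two: support of each claim, taken here as its mean similarity
    # to the other sources' claims on the same object (ASSUMED form).
    support = [
        [sum(sim(claims[n][k], claims[i][k])
             for i in range(n_src) if i != n) / max(n_src - 1, 1)
         for k in range(n_obj)]
        for n in range(n_src)
    ]
    prev_f, truths = None, []
    for _ in range(max_iter):
        truths = []
        cred = [[0.0] * n_obj for _ in range(n_src)]
        for k in range(n_obj):  # step three: exhaustive selection
            column = [claims[n][k] for n in range(n_src)]
            best_g, best_t, best_f = float("-inf"), None, None
            for cand in dict.fromkeys(column):  # distinct candidates
                # ASSUMED credibility: similarity plus weighted support.
                f_k = [sim(column[n], cand) + beta * support[n][k]
                       for n in range(n_src)]
                g = sum(w * f for w, f in zip(weights, f_k))
                if g > best_g:
                    best_g, best_t, best_f = g, cand, f_k
            truths.append(best_t)
            for n in range(n_src):
                cred[n][k] = best_f[n]
        # Step four: weight update (ASSUMED normalized ratio).
        per_source = [sum(row) for row in cred]
        total = sum(per_source)
        if total > 0:
            weights = [s / total for s in per_source]
        # Step five: evaluate F and stop once it no longer changes.
        f_val = sum(weights[n] * cred[n][k]
                    for n in range(n_src) for k in range(n_obj))
        if prev_f is not None and abs(f_val - prev_f) < tol:
            break
        prev_f = f_val
    return truths, weights


# Three sources describing three objects (illustrative data only).
claims = [
    ["Paris", "blue", "cat"],
    ["Paris", "blue", "dog"],
    ["Lyon", "red", "dog"],
]
truths, weights = truth_discovery(claims)
```

In this toy data the second source agrees with the majority on every object, so under the assumed formulas it ends up with the largest weight.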
The method for determining the truth value of multi-source heterogeneous data can be used in the field of data mining: conflict data mined from the network is cleaned and the truth values are found among it, which improves the information quality of the network environment, provides more accurate information services for the public, enterprises and governments, and reduces the losses caused by conflict data.
The technical scheme of the invention has the following beneficial effects:
(1) The multi-source heterogeneous data truth discovery optimization model provided by the invention solves for truth values by an optimization method and is not easily affected by the data distribution; it adopts a corresponding similarity function for each type of data and can process multiple types of data jointly, thereby evaluating the quality of the data sources better and improving the accuracy of truth discovery.
(2) The invention introduces the declaration value support into the calculation of the declaration value credibility; adding the declaration value support helps the solution of the optimization problem escape local optima and approach the current optimum quickly, which improves the accuracy of the result and speeds up the convergence of the algorithm.
(3) The invention provides a truth value selection strategy based on an exhaustive method; although the exhaustive method is computationally expensive, the strategy overcomes the influence of local optima and accurately finds the global optimal solution.
Example two
The present invention further provides a specific embodiment of a device for determining the truth value of multi-source heterogeneous data. The device corresponds to the specific embodiment of the method described above and achieves the object of the invention by executing the process steps of that method; the explanations given for the method embodiment therefore also apply to the device embodiment and are not repeated below.
As shown in fig. 4, an embodiment of the present invention further provides a multi-source heterogeneous data truth value determining apparatus, including:
an acquisition module 11, configured to obtain heterogeneous conflict data from different data sources;
a construction module 12, configured to construct, for the conflict data describing the same object and with the data source weights and the declaration value credibility as optimization variables, an objective function G for each object over all data sources and an optimization model F for all objects over all data sources, each with the goal of maximizing the weighted sum of declaration value credibility;
an updating module 13, configured to adopt, for each object, a truth value selection strategy based on an exhaustive method: selecting each declaration value in the declaration value set in turn as a reference truth value and determining the G value for the selected reference truth value, the reference truth value for which the G value is largest being the optimal truth value of the current object; and to update the weights of all data sources according to the obtained optimal truth values of all objects;
a determining module 14, configured to calculate the F value according to the updated weights of all data sources and to judge from the obtained F value whether the optimization model F has converged; if it has not converged, execution returns to the updating module 13; if it has converged, the obtained optimal truth values of all objects form the optimal truth value set.
The multi-source heterogeneous data truth value determining apparatus of this embodiment acquires heterogeneous conflict data from different data sources. For conflict data describing the same object, it constructs an objective function G for each object and an optimization model F for all objects of all data sources, each aiming to maximize the weighted sum of declaration value confidences, with the data source weights and the declaration value confidences as optimization variables. For each object, it applies an exhaustive truth value selection strategy: a declaration value in the declaration value set is selected as a reference truth value, the G value is determined for that reference truth value, and the reference truth value that yields the largest G value is taken as the optimal truth value of the current object. The weights of all data sources are then updated according to the obtained optimal truth values of all objects. The F value is calculated from the updated weights, and convergence of the optimization model F is judged from it; if F has not converged, the updating step is repeated, and if it has converged, the obtained optimal truth values of all objects form the optimal truth value set. Because the exhaustive truth value selection strategy processes heterogeneous conflict data jointly, the apparatus overcomes the influence of local optima and can find the global optimal solution, thereby improving the accuracy of truth discovery.
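The loop described above can be illustrated with a minimal Python sketch. The formulas for the confidence, the weight update, and F are published as figures and are not reproduced in this text, so every concrete form below is an assumption made only for illustration: the confidence of a declaration value is taken to be its similarity to the reference truth value, the similarity is `1 / (1 + |a - b|)`, and a source's weight is its normalized mean similarity to the current truths. The names `sim`, `g_value`, and `discover_truths` are hypothetical and do not come from the patent.

```python
import math
from typing import Dict, Tuple

# claims[object][source] = value declared by that source for that object
Claims = Dict[str, Dict[str, float]]


def sim(a: float, b: float) -> float:
    """Assumed similarity for continuous values: 1 when equal, decaying with distance."""
    return 1.0 / (1.0 + abs(a - b))


def g_value(ref: float, claims_k: Dict[str, float], weights: Dict[str, float]) -> float:
    """G(k) for one object: weighted sum of confidences against a reference truth value.

    Assumption: the confidence of a declaration value is its similarity to the reference.
    """
    return sum(weights[s] * sim(v, ref) for s, v in claims_k.items())


def discover_truths(claims: Claims, tol: float = 1e-9, max_iter: int = 100
                    ) -> Tuple[Dict[str, float], Dict[str, float]]:
    sources = sorted({s for c in claims.values() for s in c})
    weights = {s: 1.0 / len(sources) for s in sources}  # start from equal weights
    truths: Dict[str, float] = {}
    prev_f = -math.inf
    for _ in range(max_iter):
        # Exhaustive selection: try every declared value as the reference truth value
        # and keep the one with the largest G value.
        for k, c in claims.items():
            candidates = sorted(set(c.values()))
            truths[k] = max(candidates, key=lambda r: g_value(r, c, weights))
        # Assumed weight update: mean similarity of a source's values to the
        # current truths, normalized so that the weights sum to 1.
        score = {}
        for s in sources:
            sims = [sim(c[s], truths[k]) for k, c in claims.items() if s in c]
            score[s] = sum(sims) / len(sims)
        total = sum(score.values())
        weights = {s: score[s] / total for s in sources}
        # F: sum of G(k) over all objects under the updated weights.
        f = sum(g_value(truths[k], c, weights) for k, c in claims.items())
        if abs(f - prev_f) < tol:  # F has stopped changing: treat as converged
            break
        prev_f = f
    return truths, weights


example_claims = {
    "temp": {"s1": 20.0, "s2": 20.0, "s3": 35.0},
    "height": {"s1": 1.8, "s2": 1.8, "s3": 1.2},
}
truths, weights = discover_truths(example_claims)
print(truths, weights)
```

In this toy input, source `s3` disagrees with the other two sources on both objects, so under the assumed update rule it ends up with a lower weight than `s1` and `s2`. The convergence test on the change in F is likewise an assumed criterion; the text only states that convergence is judged from the F value.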
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (6)

1. A multi-source heterogeneous data truth value determining method, characterized by comprising the following steps:
S1, acquiring heterogeneous conflict data from different data sources;
S2, for conflict data describing the same object, constructing an objective function G for each object and an optimization model F for all objects of all data sources, each aiming to maximize the weighted sum of declaration value confidences, with the data source weights and the declaration value confidences as optimization variables;
S3, for each object, applying an exhaustive truth value selection strategy: selecting a declaration value in the declaration value set as a reference truth value and determining the G value for the selected reference truth value, the reference truth value that yields the largest G value being the optimal truth value of the current object; and updating the weights of all data sources according to the obtained optimal truth values of all objects;
S4, calculating the F value according to the updated weights of all data sources and judging from the obtained F value whether the optimization model F has converged; if it has not converged, returning to S3 to continue execution; if it has converged, the obtained optimal truth values of all objects form the optimal truth value set;
wherein, for conflict data describing the same object, the optimization model F constructed for all objects of all data sources with the goal of maximizing the weighted sum of declaration value confidences is expressed as:
Figure FDA0003056023920000011
Figure FDA0003056023920000012
wherein w_n is the weight of data source S_n, N is the number of data sources, K is the number of objects, A_{n,k} is the declaration value provided by data source S_n for object O_k, f(A_{n,k}) is the confidence of declaration value A_{n,k}, s.t. denotes a constraint, A_{*,k} is the set of declaration values of object O_k, and T_k is the truth value of object O_k, which is a subset of A_{*,k};
wherein, for object O_k across all data sources, the objective function G(k) constructed with the goal of maximizing the weighted sum of declaration value confidences is expressed as:
Figure FDA0003056023920000013
wherein, when a declaration value of object O_k maximizes the objective function G(k), that declaration value is the truth value of object O_k, expressed as:
Figure FDA0003056023920000021
wherein the weight w_n of data source S_n is expressed as:
Figure FDA0003056023920000022
wherein the confidence f(A_{n,k}) of declaration value A_{n,k} is expressed as:
Figure FDA0003056023920000023
wherein β is the support factor, N_k is the number of declaration values provided for object O_k by data source S_n, sim(·) denotes the similarity function, and sup(·) denotes the declaration value support function.
2. The method for determining the truth value of multi-source heterogeneous data according to claim 1, wherein, if the conflict data is categorical data, the similarity function is expressed as:
Figure FDA0003056023920000024
3. The method for determining the truth value of multi-source heterogeneous data according to claim 2, wherein, if the conflict data is continuous data, the similarity function is expressed as:
Figure FDA0003056023920000025
4. The method for determining the truth value of multi-source heterogeneous data according to claim 3, wherein the declaration value support is expressed as:
sup(A_{n,k}, A_{i,k}) = sim(A_{n,k}, A_{i,k}).
5. The method for determining the truth value of multi-source heterogeneous data according to claim 1, wherein, for each object, applying the exhaustive truth value selection strategy, selecting a declaration value in the declaration value set as a reference truth value, determining the G value for the selected reference truth value, taking the reference truth value that yields the largest G value as the optimal truth value of the current object, and updating the weights of all data sources according to the obtained optimal truth values of all objects comprises:
S31, for each object, applying the exhaustive truth value selection strategy and selecting a declaration value in the declaration value set as a reference truth value;
S32, calculating the similarity between each declaration value of the object and each reference truth value, determining the declaration value confidence by combining the similarity with the declaration value support, and determining the G value according to the declaration value confidence and the data source weights obtained in the previous iteration;
S33, judging whether the G value is the maximum; if so, taking the current reference truth value as the optimal truth value of the current object; otherwise, returning to S31 to continue iterating;
S34, forming a preliminary optimal truth value set from the obtained optimal truth values of all objects, and updating the weights of all data sources according to the declaration value confidence sets corresponding to the preliminary optimal truth value set of all objects.
6. The method for determining the truth value of multi-source heterogeneous data according to claim 5, wherein calculating the F value according to the updated weights of all data sources, judging from the obtained F value whether the optimization model F has converged, returning to S3 to continue execution if it has not converged, and forming the optimal truth value set from the obtained optimal truth values of all objects if it has converged comprises:
calculating the value of the optimization model F according to the updated weights of all data sources and the declaration value confidence sets corresponding to the preliminary optimal truth value set of all objects;
judging from the obtained F value whether the optimization model F has converged; if it has not converged, returning to S3 to continue execution; if it has converged, the current preliminary optimal truth value set is the optimal truth value set.
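As an illustration of how the similarity, support, and confidence terms named in claims 1 to 5 might fit together, the following Python sketch uses assumed formulas, because the claimed expressions are published as figures and are not reproduced in this text. The exact-match similarity for categorical data, the exponential decay for continuous data, the `scale` parameter, and the way `beta` combines similarity with averaged support are all hypothetical choices. Only the relation sup = sim is taken from claim 4.

```python
import math
from typing import Callable, Sequence, TypeVar

V = TypeVar("V")


def sim_categorical(a: V, b: V) -> float:
    """Assumed categorical similarity: 1 for an exact match, 0 otherwise."""
    return 1.0 if a == b else 0.0


def sim_continuous(a: float, b: float, scale: float = 1.0) -> float:
    """Assumed continuous similarity: exponential decay with absolute distance."""
    return math.exp(-abs(a - b) / scale)


def support(a: V, others: Sequence[V], sim: Callable[[V, V], float]) -> float:
    """Support of a declaration value from the other declaration values of the same object.

    Claim 4 states sup(a, b) = sim(a, b); averaging over the other values is an assumption.
    """
    if not others:
        return 0.0
    return sum(sim(a, o) for o in others) / len(others)


def confidence(a: V, reference: V, others: Sequence[V],
               sim: Callable[[V, V], float], beta: float = 0.5) -> float:
    """Assumed confidence: similarity to the reference truth value plus beta times support."""
    return sim(a, reference) + beta * support(a, others, sim)


# A value that matches the reference and is backed by one of two other sources.
c = confidence("red", "red", ["red", "blue"], sim_categorical, beta=0.5)
print(c)
```

Passing the similarity function as a parameter keeps the confidence computation identical for categorical and continuous data, which is one way to handle heterogeneous conflict data with a single objective.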

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910340361.3A CN110321377B (en) 2019-04-25 2019-04-25 Multi-source heterogeneous data truth value determination method and device

Publications (2)

Publication Number Publication Date
CN110321377A (en) 2019-10-11
CN110321377B (en) 2021-07-23

Family

ID=68113240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910340361.3A Active CN110321377B (en) 2019-04-25 2019-04-25 Multi-source heterogeneous data truth value determination method and device

Country Status (1)

Country Link
CN (1) CN110321377B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535693B (en) * 2020-04-20 2023-04-07 中国移动通信集团湖南有限公司 Data true value determination method and device for mobile platform and electronic equipment
CN111708816B (en) * 2020-05-15 2022-12-09 西安交通大学 Multi-truth-value conflict resolution method based on Bayesian model
CN115932702B (en) * 2023-03-14 2023-05-26 武汉格蓝若智能技术股份有限公司 Virtual standard based voltage transformer online operation calibration method and device

Citations (4)

Publication number Priority date Publication date Assignee Title
CN104933052A (en) * 2014-03-17 2015-09-23 华为技术有限公司 Data true value estimation method and data true value estimation device
CN107193967A (en) * 2017-05-25 2017-09-22 南开大学 A kind of multi-source heterogeneous industry field big data handles full link solution
CN108564101A (en) * 2017-12-29 2018-09-21 天津南大通用数据技术股份有限公司 A kind of data fusion method and device based on more hierarchical cluster attributes
CN109284316A (en) * 2018-09-11 2019-01-29 中国人民解放军战略支援部队信息工程大学 True value based on data source Multi-attributes finds method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10552486B2 (en) * 2016-05-26 2020-02-04 International Business Machines Corporation Graph method for system sensitivity analyses
US20180365779A1 (en) * 2017-06-14 2018-12-20 Global Tel*Link Corporation Administering pre-trial judicial services

Non-Patent Citations (2)

Title
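MTruths: a multi-truth discovery method for Web information (MTruths:Web信息多真值发现方法); 马如霞 et al.; 《计算机研究与发展》 (Journal of Computer Research and Development); 2016-12-15; Vol. 53, No. 12; pp. 2858-2866 *
Multi-truth discovery algorithm based on synchronous optimization of multiple ant colonies (基于多蚁群同步优化的多真值发现算法); 冯钦 et al.; 《计算机软件及计算机应用》 (Computer Software and Computer Applications); 2018-11-05; Vol. 37, No. 1; pp. 44-49 *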

Also Published As

Publication number Publication date
CN110321377A (en) 2019-10-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant