CN110147393A - Entity resolution method for data spaces - Google Patents
Entity resolution method for data spaces
- Publication number
- CN110147393A (application CN201910435269.5A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- record
- similarity
- mapping
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
Abstract
An entity resolution method for data spaces; the present invention relates to entity resolution methods. The invention addresses the problem that existing entity resolution in a data space compares records pairwise, even though record pairs from different domains have a very low matching probability, so pairwise comparison wastes resources. The process is: Step 1, build a record graph; Step 2, simplify the record graph with a pruning method; Step 3, partition the pruned record graph into blocks; Step 4, build the attribute mapping cluster; Step 5, compute the goodness of each attribute mapping set; Step 6, after obtaining the goodness of each mapping set in the attribute mapping cluster, perform entity resolution within each block. The invention is used in the field of data entity resolution.
Description
Technical field
The present invention relates to entity resolution methods.
Background art
Entity resolution is the process of identifying the different descriptions of the same entity. It is intended to ensure data quality and is a key technology in data cleaning, data integration and data mining [1] (Vasilis Efthymiou, Kostas Stefanidis, Vassilis Christophides. Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web [C]// Proceedings of the 2015 IEEE International Conference on Big Data, 2015, 11(1): 401-410). Most traditional entity resolution work depends on schemas or semantic mappings between data. A data space is a new form of data integration: it has no strict data schema or semantic mapping, but instead gradually brings data in and establishes relationships according to the needs of its subject. It is a heterogeneous data set characterized by data drawn from multiple data sources [2] (GE Jingjun, HU Changjun, LIU Xin. Virtual Data Space Sharing Model for Domain Science [J]. Minicomputer System, 2014, 35(3): 514-519). When entity resolution is performed in a data space, its most powerful tool, semantic mapping, is lost. Entity resolution must also compare records, and because record pairs from different domains have a very low matching probability, pairwise comparison wastes resources.
Summary of the invention
The purpose of the present invention is to solve the problem that existing entity resolution in a data space compares records pairwise, which wastes resources because record pairs from different domains have a very low matching probability; to this end, an entity resolution method for data spaces is proposed.
The entity resolution method for data spaces proceeds as follows:
Step 1: build a record graph;
Step 2: simplify the record graph using a pruning method;
Step 3: partition the pruned record graph into blocks;
Step 4: build the attribute mapping cluster;
Step 5: compute the goodness of each attribute mapping set;
Step 6: after obtaining the goodness of each mapping set in the attribute mapping cluster, perform entity resolution within each block.
The beneficial effects of the invention are as follows:
The invention adopts blocking [3] (Batya Kening, Avigdor Gal. MFIBlocks: An effective blocking algorithm for entity resolution [J]. Information Systems, 2013, 38(6): 908-926): the data are pre-screened with a low-cost computation, data records that may belong to the same entity are placed in one block, and records are compared only within a block. This solves the problem that existing entity resolution in a data space compares records pairwise and wastes resources, because record pairs from different domains have a very low matching probability.
This work is a theoretical study of entity resolution over multi-source heterogeneous data in a data space. It takes into account that, even without semantic mapping, two records pointing to the same entity still overlap in their attribute values, and it also brings the relationships between records into the computation; the record graph is built from both. For record sets in different situations, the record graph is simplified with the pruning method that suits them, and an algorithm is proposed that partitions the pruned record graph into blocks.
For entity resolution within a block, attribute values are used to map attributes. By gathering what the attribute names of all records in the block refer to, data that share common values with the other data in the block but still do not match can be separated out. A method similar to regular expressions is proposed to compute the similarity of attribute values and to merge the attribute values of mapped attributes in matching records, returning a more complete description of the entity to the user.
Experiments show that the proposed method has a positive effect on entity resolution.
Brief description of the drawings
Fig. 1 is a flow chart of building the record graph according to the invention;
Fig. 2 is a flow chart of pruning the record graph according to the invention;
Fig. 3 is a flow chart of blocking based on the pruned record graph;
Fig. 4a shows the data used for heterogeneous attribute mapping;
Fig. 4b shows the global attribute mapping for heterogeneous attribute mapping;
Fig. 4c shows the goodness computation for the attribute mapping cluster in heterogeneous attribute mapping;
Fig. 5a shows how the results of the two methods change on the two data sets;
Fig. 5b shows how the results of the two methods change on the two data sets;
Fig. 6 compares the entities generated by the two algorithms.
Detailed description of the embodiments
Embodiment 1: the entity resolution method for data spaces of this embodiment proceeds as follows:
Step 1: build a record graph;
Step 2: simplify the record graph using a pruning method;
Step 3: partition the pruned record graph into blocks;
Step 4: build the attribute mapping cluster;
Step 5: compute the goodness of each attribute mapping set;
Step 6: after obtaining the goodness of each mapping set in the attribute mapping cluster, perform entity resolution within each block, thereby excluding data records that were mistakenly placed in the block and point to other entities.
Embodiment 2: this embodiment differs from Embodiment 1 in how the record graph is built in Step 1. The specific process is:
Data records are represented with a tagging method, which treats a data record as a set of attribute values. This rests on a common-sense assumption [4] (S. Prabhakar Benny, S. Vasavi, P. Anupriya. Hadoop Framework For Entity Resolution Within High Velocity Streams [J]. Procedia Computer Science, 2016, 85: 550-557): if two records point to the same entity, they necessarily contain some identical attribute values. The relationships between data records are also counted to improve accuracy [5] (XIAO Qihua, CHEN Ke, HUANG Dongmei. Data Spatial Feature Extraction Method Considering Spatial Relevance [J]. Computer Simulation, 2014, 31(12): 425-428, 433). A record graph model represents the record nodes in the data space and the relationships between them: the similarity between two records is computed, an edge is drawn between the two records, and the edge weight is the similarity value. Because this tag-style blocking method has a simple representation, needs only the attribute values of the records, and does not depend on a fixed data schema or strong mapping semantics, it is widely applicable to the heterogeneous data sets found in a data space.
Let the record set be R = {r1{FullName: Tom Lloyd Malik; Job: producer, Actor; Address: L.A.}, r2{Name: Tom Malik; Producer; birthPlace: L.A.}, r3{Label: Mike Styles; Profession: producer; Place_of_birth: L.A.; Place_of_birth: 1964}, r4{Mike Harry Styles; birthPlace: L.A.; Gender: male}, r5{FullName: Harry Green; Address: LOS; Sex: male; Profession: Writher}, r6{Label: Harry Green; Gender: male; birthYear: 1980; married}}. An overview of record-graph blocking based on R is shown in Fig. 1, Fig. 2 and Fig. 3 (to keep the example figures simple, the relationships between records are not marked in the figures).
Step 1.1: compute the similarity between two records;
Step 1.2: build the record graph from the records of the data space and their similarities.
The other steps and parameters are the same as in Embodiment 1.
Embodiment 3: this embodiment differs from Embodiment 1 or 2 in how the similarity between two records is computed in Step 1.1. The specific process is:
A record is converted into a tag set by the tag transfer function tag() (i.e. tag: r_i → T(r_i)). The tag similarity of two records is the ratio of the size of the intersection of their tag sets to the size of their union.
Step 1.1.1: compute the tag similarity:
Convert each record into a tag set with the tag transfer function tag() and compute the tag similarity of the two records, denoted sim_tag(r_i, r_j):
sim_tag(r_i, r_j) = |T(r_i) ∩ T(r_j)| / |T(r_i) ∪ T(r_j)|
where T(r_i) is the normalized tag set obtained by converting record r_i with the tag transfer function, and T(r_j) is the normalized tag set obtained by converting record r_j;
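The tag similarity can be illustrated with a short Python sketch. It is only an illustration: the tag() implementation shown here (lower-casing attribute values and ignoring attribute names), the dictionary record layout, and the key "Name" given to the unnamed value of r4 are assumptions, since the text does not specify how tags are normalized.

```python
def tag(record):
    """Hypothetical tag transfer function: the set of normalized attribute values."""
    return {str(value).strip().lower() for value in record.values() if value}


def sim_tag(ri, rj):
    """Ratio of the intersection size to the union size of the two tag sets."""
    ti, tj = tag(ri), tag(rj)
    union = ti | tj
    return len(ti & tj) / len(union) if union else 0.0


# Simplified versions of records r3 and r4 from the example record set.
r3 = {"Label": "Mike Styles", "Profession": "producer", "Place_of_birth": "L.A."}
r4 = {"Name": "Mike Harry Styles", "birthPlace": "L.A.", "Gender": "male"}
score = sim_tag(r3, r4)  # the two records share only the value "l.a."
```

With this normalization the two records share one tag out of five distinct tags, so the attribute names play no part in the score.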
Step 1.1.2: compute the relationship similarity:
The relationship similarity combines the similarity of the two records over all the relationships they take part in and is denoted sim_rel(r_i, r_j),
where Nbr(r_i) denotes the set of records connected to r_i under relationship rel, Nbr(r_j) denotes the set of records connected to r_j under relationship rel, and REL denotes the set of all relationships that occur on records r_i and r_j, such as collaboration, teacher-student and teaching relationships;
Step 1.1.3: combine the tag similarity and the relationship similarity to obtain the overall similarity sim(r_i, r_j):
sim(r_i, r_j) = α·sim_tag(r_i, r_j) + (1-α)·sim_rel(r_i, r_j)
where α denotes the weight of the tag similarity.
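A minimal Python sketch of the weighted combination follows. The combination itself is the formula above; the form of sim_rel used here (the neighbour-set overlap averaged over all relationships in REL) is an assumption made for illustration, because the text does not reproduce the formula for sim_rel. The function names and the dictionary layout are likewise hypothetical.

```python
def jaccard(a, b):
    """Intersection size over union size of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_rel(nbr_i, nbr_j):
    """Assumed form: mean neighbour-set overlap over every relationship in REL.

    nbr_i, nbr_j: dicts mapping a relationship name to the set of record ids
    connected to r_i (or r_j) under that relationship.
    """
    rels = set(nbr_i) | set(nbr_j)  # REL: relationships occurring on either record
    if not rels:
        return 0.0
    total = sum(jaccard(nbr_i.get(rel, set()), nbr_j.get(rel, set())) for rel in rels)
    return total / len(rels)


def sim(s_tag, s_rel, alpha=0.5):
    """Overall similarity: alpha * sim_tag + (1 - alpha) * sim_rel."""
    return alpha * s_tag + (1 - alpha) * s_rel


nbr_i = {"collaboration": {"a", "b"}}
nbr_j = {"collaboration": {"b", "c"}, "teaching": {"d"}}
overall = sim(0.2, sim_rel(nbr_i, nbr_j), alpha=0.5)
```

Setting alpha closer to 1 makes the result depend mostly on shared attribute values, which is the only option when no relationship data is available.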
The other steps and parameters are the same as in Embodiment 1 or 2.
Embodiment 4: this embodiment differs from Embodiments 1 to 3 in how the record graph is built from the records of the data space and their similarities in Step 1.2. The specific process is:
Given the record set R of a data space, construct an undirected graph G = (R, E), called the record graph,
where R is the record set, representing the records in the data space, and E is the edge set; an edge between two records represents the similarity of that record pair.
Once the record graph is built, record pairs joined by low-weight edges have a low matching probability, so processing the edges of the record graph reduces unnecessary record comparisons.
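As a rough illustration, the record graph can be held as a node set plus a dictionary of weighted edges. The sketch below is written under stated assumptions: records are already reduced to sets of normalized values, the helper names are hypothetical, and pairs with zero similarity receive no edge.

```python
from itertools import combinations


def build_record_graph(records, similarity):
    """Return (nodes, edges) for the undirected record graph G = (R, E).

    records: dict mapping a record id to its record.
    similarity: any function of two records returning a weight in [0, 1].
    Pairs with zero similarity get no edge (an assumption of this sketch).
    """
    edges = {}
    for (i, ri), (j, rj) in combinations(records.items(), 2):
        weight = similarity(ri, rj)
        if weight > 0:
            edges[(i, j)] = weight  # edge weight is the similarity value
    return set(records), edges


def value_jaccard(ri, rj):
    """Stand-in similarity: overlap of two sets of normalized attribute values."""
    union = ri | rj
    return len(ri & rj) / len(union) if union else 0.0


records = {
    "r1": {"tom malik", "producer", "l.a."},
    "r2": {"mike styles", "producer", "l.a."},
    "r3": {"harry green", "male"},
}
nodes, edges = build_record_graph(records, value_jaccard)
```

Comparing every pair is quadratic in the number of records; the pruning and blocking steps exist to avoid repeating that cost during the detailed comparison.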
The other steps and parameters are the same as in one of Embodiments 1 to 3.
Embodiment 5: this embodiment differs from Embodiments 1 to 4 in how the record graph is simplified with a pruning method in Step 2. The specific process is:
Edges with small weights are deleted according to a rule, i.e. the graph is pruned, which reduces redundant matching.
To simplify the description that follows, the definition of a region is given first.
Definition 4 (region): a record r appears in the record graph G as a node. The record node itself, the neighbouring records that share an edge with it, and the edges connecting them form a region called the point region. The point region is a subgraph of the record graph G, denoted G_r = {{r} ∪ R_r, E_r}, where R_r is the set of neighbour nodes joined to r by an edge and E_r is the set of edges connecting r to those neighbours.
The pruning process has two main components: the pruning center and the pruning rule.
Pruning centers fall into two classes. An edge-centric scheme traverses the edge set of the graph to select the globally best pairs to compare, filtering out the edges that do not satisfy the pruning rule. A node-centric scheme traverses all nodes of the graph and, for each record node, looks within its point region for the best set of pairs to compare with it, that is, the few records whose edges to this record have the largest weights.
Pruning rules are classified by function into weight thresholds and cardinality thresholds. A weight threshold specifies the minimum weight of a retained edge; all edges below this weight are deleted. A cardinality threshold gives the maximum number of edges retained in the graph; the k edges with the highest weights are kept. The cardinality threshold bounds the number of pairs to compare and suits applications with limited time resources. The weight threshold decides whether to trim the edge joining a record pair according to that pair's own matching probability and suits applications that value effectiveness. Pruning rules are also classified by scope into global and local thresholds: a global threshold applies to the whole graph, i.e. to all its edges, while a local threshold applies to a subset of the graph, i.e. the point region of one node.
The pruning method is one of edge-centric cardinality pruning, node-centric cardinality pruning, edge-centric threshold pruning, or node-centric threshold pruning:
(1) Edge-centric cardinality pruning:
The global cardinality threshold k specifies the total number of edges the record graph retains: the k edges with the largest weights are kept. Sorting the edge set by weight in descending order makes it efficient to delete the low-weight edges.
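A small Python sketch of edge-centric cardinality pruning, assuming the edges are stored in a dictionary keyed by node pair (a hypothetical layout). It uses heapq.nlargest instead of a full descending sort, which gives the same k edges.

```python
import heapq


def prune_edges_top_k(edges, k):
    """Keep only the k highest-weight edges of the whole graph."""
    return dict(heapq.nlargest(k, edges.items(), key=lambda item: item[1]))


edges = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.5}
kept = prune_edges_top_k(edges, k=2)
```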
(2) Node-centric cardinality pruning:
For each node r_i, retain the top-k edges incident to r_i (the k edges with the largest weights). Given the point region G_ri = {{r_i} ∪ R_ri, E_ri} of node r_i and the cardinality threshold k_ri of the current node, traverse the edges of the subgraph and keep the top-k by weight (of the n edges connected to r_i, keep the k with the highest weights), deleting all other edges connected to r_i;
where R_ri is the set of neighbour nodes joined to record node r_i by an edge, E_ri is the set of edges connecting r_i to its neighbours, and the point region G_ri is a subgraph of the record graph G.
In general, the cardinality threshold of each node should depend on the size of the edge set in its region (e.g. k_ri = 0.1 × |E_ri|).
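Node-centric cardinality pruning might be sketched as follows. Several details are assumptions of this sketch rather than statements of the method: nodes are visited in the order given, each visit prunes the graph left by the previous visits, and k_ri is rounded up with a floor of one edge so that small regions are not emptied.

```python
import math


def prune_node_top_k(nodes, edges, ratio=0.1):
    """For each node keep its k_ri heaviest incident edges, k_ri = ratio * |E_ri|."""
    edges = dict(edges)  # work on a copy
    for r in nodes:
        incident = sorted((e for e in edges if r in e), key=edges.get, reverse=True)
        k = max(1, math.ceil(ratio * len(incident)))  # rounding rule is assumed
        for edge in incident[k:]:
            del edges[edge]  # drop the other edges connected to r
    return edges


edges = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.5}
kept = prune_node_top_k(["a", "b", "c"], edges, ratio=0.5)
```

Because an edge has two endpoints, it survives only if neither endpoint discards it; a variant that keeps an edge retained by either endpoint would be more permissive.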
(3) Edge-centric threshold pruning:
Pruning is applied globally with a weight threshold. Choose a minimum edge weight w_min (determined experimentally), traverse all edges of the graph, and delete every edge whose weight is below w_min; the remaining edges are retained and output. In general, the weight between matching records is greater than the weight between non-matching records, so the goal in choosing w_min is to find the balance point between the two.
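Edge-centric threshold pruning reduces to a single filter over the edge set. The sketch below assumes the same hypothetical edge dictionary as before; the value of w_min is illustrative only, since the text says it is determined experimentally.

```python
def prune_edges_by_weight(edges, w_min):
    """Delete every edge whose weight is below w_min; keep the rest."""
    return {edge: weight for edge, weight in edges.items() if weight >= w_min}


edges = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.5}
kept = prune_edges_by_weight(edges, w_min=0.5)
```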
(4) Node-centric threshold pruning:
The scope of pruning determines the threshold. With a global threshold, one uniform threshold is used for all nodes of the record graph, and the pruning process is the same as edge-centric threshold pruning. With local thresholds, a specific threshold can be chosen for particular nodes according to the user's needs; in essence, this applies edge-centric threshold pruning to the point region of node r_i. The main difference from edge-centric threshold pruning is that each node can use a different threshold. First obtain the point region G_ri = {{r_i} ∪ R_ri, E_ri} of r_i, then set the minimum edge weight for pruning this subgraph according to the local threshold supplied as input, and finally traverse the edges in E_ri and delete those whose weight is below the threshold.
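A sketch of node-centric threshold pruning with per-node thresholds. The dictionary of local thresholds and the default value for nodes without one are hypothetical; the sketch also relies on the observation that pruning each point region in turn is equivalent to keeping an edge only if it meets the threshold of both of its endpoints.

```python
def prune_node_threshold(edges, thresholds, default=0.0):
    """Delete an edge if its weight is below the local threshold of either endpoint.

    thresholds: dict mapping a node to its local minimum edge weight.
    default: threshold for nodes without a local one (a global threshold).
    """
    def w_min(node):
        return thresholds.get(node, default)

    return {
        (u, v): weight
        for (u, v), weight in edges.items()
        if weight >= w_min(u) and weight >= w_min(v)
    }


edges = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.5}
kept = prune_node_threshold(edges, {"a": 0.3})
```

With an empty thresholds dictionary the function behaves like edge-centric threshold pruning using default as the global w_min.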
The other steps and parameters are the same as in one of Embodiments 1 to 4.
Embodiment 6: this embodiment differs from Embodiments 1 to 5 in how the pruned record graph is partitioned into blocks in Step 3. The specific process is:
The pruned graph is G = {R, E}, where R is the record set and E is the edge set.
Take any record r_i, create a block b_i, and place r_i in b_i. If a node r_j in the point region of r_i has an edge to every node in b_i, place r_j in b_i and delete the edges between r_j and all nodes in b_i; repeat this until all neighbours of r_i have been traversed. If a node in b_i has then become isolated, with no edge to any node in the graph, delete it from the graph. Repeat Step 3 until the graph G is empty. Blocking is then complete, yielding the block set B = {b_1, b_2, ..., b_|B|}.
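The blocking procedure can be sketched in Python as below. This is one reading of the description, with stated assumptions: the graph is held as adjacency sets, the seed record is simply the first one remaining, and neighbours are visited in sorted order to make the result reproducible.

```python
def block_records(nodes, edges):
    """Partition the pruned record graph into (possibly overlapping) blocks."""
    adj = {r: set() for r in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    blocks = []
    while adj:  # repeat until the graph is empty
        ri = next(iter(adj))  # take any remaining record as the seed
        block = [ri]
        for rj in sorted(adj[ri]):  # snapshot of the neighbours of r_i
            if all(rj in adj[member] for member in block):
                for member in block:  # delete the edges between r_j and the block
                    adj[member].discard(rj)
                    adj[rj].discard(member)
                block.append(rj)
        for member in block:
            if not adj[member]:  # isolated nodes leave the graph
                del adj[member]
        blocks.append(block)
    return blocks


blocks = block_records(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")],
)
```

In this example record c lands in two blocks, because its edge to d survives the first pass; each pass removes at least one edge or one node, so the loop terminates.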
Record comparison and linking based on attribute mapping
After blocking, the records within a block contain many repeated attribute values, which helps in mapping attributes. By observing the characteristics of the data as a whole and applying semantic mapping, it can be found that one or a few data records differ from the rest; such records can be split off to improve accuracy. Integrating the information of matching records gives users what they ask for and reduces the time spent processing record information.
Heterogeneous attribute mapping
The usual approach in entity resolution computes attribute value similarity over unified attributes. In a multi-source heterogeneous environment such as a data space, however, no accurate attribute mapping is available, so attribute values are used in turn to match attributes.
Instantiation: an entity record consists of a set of attributes and values. The value of an attribute may be misspelled or null, and null values cannot be compared. If an entity is described by an attribute whose value is not null, the attribute is said to instantiate the entity. The whole attribute semantic mapping process is shown in Fig. 4a, 4b and 4c.
The other steps and parameters are the same as in one of Embodiments 1 to 5.
Embodiment 7: this embodiment differs from Embodiments 1 to 6 in how the attribute mapping cluster is built in Step 4. The specific process is:
Step 4.1: for two attributes from different entities, compute the attribute similarity values:
1) Attribute name similarity, denoted SL (edit distance or another suitable string similarity function may be used), obtained by comparing the two normalized attribute names (depending on the characteristics of the data set, letter case can be normalized and abbreviations expanded to full names);
Whether to include this part in the computation can be decided according to how similar the attribute names in the data set are.
2) Attribute value similarity, denoted SV: all values of the two attributes are compared and the highest similarity score is kept (two attributes may take different values on different records; for example, if attribute att1 has 3 values in the record set and attribute att2 has 4, the similarity of the two most similar values of att1 and att2 is taken as the attribute value similarity of the two attributes).
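The two similarity values can be sketched with a plain edit distance. The normalization used here (trimming and lower-casing, and dividing the distance by the longer string length) is an assumption for illustration; the text only says that edit distance or another suitable string similarity may be used.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def str_sim(a, b):
    """Edit distance turned into a similarity in [0, 1] (assumed normalization)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def name_sim(name1, name2):
    """SL: similarity of the two normalized attribute names."""
    return str_sim(name1, name2)


def value_sim(values1, values2):
    """SV: the highest similarity over all pairs of values of the two attributes."""
    return max((str_sim(x, y) for x in values1 for y in values2), default=0.0)


sv = value_sim(["L.A.", "LOS"], ["l.a.", "1964"])
```

Taking the maximum means a single shared value is enough to give two attributes a high SV, which is why SL and the later 1:1 constraint are needed to filter coincidental matches.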
Attribute matching pairs are obtained with the attribute name similarity and attribute value similarity methods; from the set of attribute matching pairs, attribute matching sets are computed, and the collection of attribute matching sets is called the attribute matching cluster.
The attributes in an attribute matching set match one another exactly. The method does not allow an attribute mapping set to contain many loosely related attributes and follows the widely used no-duplicate assumption [6] (Imen Megdiche, Oliver Teste, Cassia Trojahn. An extensible linear approach for holistic ontology matching [C]. In ISWC, Part I, vol. LNCS 9981. Springer, Kobe, Japan, 2016: 393-410): each attribute under one namespace (an entity, or a data source with semantic restrictions) may match at most one attribute under another namespace, which imposes a global 1:1 matching constraint [7] (Chuncheng Xiang, Baobao Chang, Zhifang Sui. An ontology matching approach based on affinity-preserving random walks [C]// Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI'15), 2015: 1471-1477). However, deriving attribute mapping sets under the global 1:1 matching constraint is not straightforward, because an attribute is often involved in more than one matching attribute pair, and simply choosing the pair with the highest estimated matching probability may lead to conflicts.
The process of obtaining the whole attribute mapping cluster is called global attribute mapping; it takes a set of attribute matching pairs as input and returns an attribute mapping cluster. Let I be the number of namespaces and J the number of matched attribute pairs. For two different namespaces n_s and n_t, let M_s and M_t be the numbers of attributes under n_s and n_t, and let p_a^s and p_b^t denote the a-th attribute under n_s and the b-th attribute under n_t. The overall attribute matching is optimized by maximizing the total matching probability of all matched attribute pairs, subject to the global 1:1 constraint. In this objective, σ(p_a^s, p_b^t) is the matching probability of the attribute pair p_a^s, p_b^t, and Θ(p_a^s, p_b^t) is an indicator function whose value is 1 when the attribute pair p_a^s, p_b^t is chosen to form an attribute mapping set and 0 otherwise.
Algorithm 1 forms the attribute matching cluster from the set of matched attribute pairs:
Step 4.2: sort the set of attribute matching pairs by matching probability in descending order (line 2) and process the pairs in sequence. For an attribute pair p_a, p_b: if the attribute mapping cluster N contains no attribute mapping sets N_i, N_j that contain p_a and p_b respectively, add the attribute mapping set {p_a, p_b} to N (lines 6-7); if N contains an attribute mapping set N_i that contains p_a and that set already contains an attribute from the same namespace as p_b, delete the attribute pair p_a ≈ p_b (lines 8-9); otherwise, that is, when neither of the two conditions above holds, merge N_i and N_j into a larger attribute mapping set N_k, add N_k to N, and delete N_i and N_j from N (lines 11-12). Repeat Step 4.2 until the set of attribute pairs is empty.
Here a data source is a namespace; if it is uncertain which data source a record comes from, the record itself is its own namespace. An attribute from the same namespace as p_b is therefore an attribute that belongs to the same record as p_b: the two are attributes of one record.
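One possible reading of Step 4.2 is sketched below in Python. It is not a transcription of Algorithm 1, which the text does not reproduce. The sketch assumes that an attribute is a (namespace, attribute name) tuple, that a pair is dropped whenever merging would place two attributes of one namespace in the same mapping set, and that an attribute not yet in any mapping set is treated as a singleton set.

```python
def build_mapping_cluster(pairs):
    """Greedily merge matched attribute pairs under the global 1:1 constraint.

    pairs: iterable of (matching probability, attribute a, attribute b),
    where each attribute is a (namespace, attribute name) tuple.
    """
    cluster = []  # the attribute mapping cluster N, a list of disjoint sets
    for _, pa, pb in sorted(pairs, key=lambda pair: pair[0], reverse=True):
        ni = next((n for n in cluster if pa in n), None)
        nj = next((n for n in cluster if pb in n), None)
        if ni is not None and ni is nj:
            continue  # already mapped to each other
        si = ni if ni is not None else {pa}
        sj = nj if nj is not None else {pb}
        if {ns for ns, _ in si} & {ns for ns, _ in sj}:
            continue  # merging would put two attributes of one namespace together
        if ni is not None:
            cluster.remove(ni)
        if nj is not None:
            cluster.remove(nj)
        cluster.append(si | sj)  # the merged mapping set N_k
    return cluster


pairs = [
    (0.9, ("s1", "Name"), ("s2", "FullName")),
    (0.8, ("s2", "FullName"), ("s3", "Label")),
    (0.7, ("s1", "Name"), ("s3", "Title")),
    (0.6, ("s1", "Job"), ("s2", "Profession")),
]
cluster = build_mapping_cluster(pairs)
```

In the example the pair with probability 0.7 is discarded, because the set holding ("s1", "Name") already contains an attribute from namespace s3.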
The other steps and parameters are the same as in one of Embodiments 1 to 6.
Embodiment 8: this embodiment differs from Embodiments 1 to 7 in how the goodness of the attribute mapping sets is computed in Step 5. The specific process is:
At this point the attribute mapping cluster has been obtained. All attributes in one mapping set form semantic mappings to one another, and the information they correspond to is one category of information. Zhen Lingmin et al. [8] (ZHEN Lingmin, YANG Xiaochun, WANG Bin, et al. Entity Analysis Technology Based on Attribute Weight [J]. Computer Research and Development, 2013, 50(Suppl.): 281-289) hold that assigning higher weights to important attributes, and possibly discarding unimportant ones, increases the accuracy and efficiency of entity resolution. The goodness of an attribute mapping set plays the role of this weight: it is used as a weight in the subsequent entity resolution to improve accuracy.
Goodness of an attribute mapping set (good()): the attribute names in an attribute mapping set may differ, but they correspond to one category of information. The goodness of an attribute mapping set is how much the category of information it points to, i.e. the corresponding attribute values, helps entity resolution.
The invention computes the goodness of an attribute mapping set from the following three aspects:
A. Discrimination:
If the values of an attribute mapping set vary widely over a large span, the set is of little help to entity resolution; values that vary within a relatively small range are more useful. Let R be the record set. For an attribute mapping set N_i, a discrimination goodness function is defined on R, denoted discr(N_i),
where val(r, N_i) extracts the attribute value of record r on the attributes of the attribute mapping set N_i, norm() normalizes attribute values from different sources, and ∪ is the set union symbol. The attributes in an attribute mapping set come from different namespaces and are assumed, by computation, to point to the same category of information. Weights are assigned to the mapping set to increase the accuracy of entity resolution, and discrimination is one way of computing such a weight: if the attribute values corresponding to the attributes of the mapping set differ greatly, the mapping set is considered less helpful and less likely to point to one category of information, so it is given a lower weight to avoid harming accuracy;
B. Richness:
The more values an attribute mapping set has, the richer the information it provides for entity resolution; that is, a record whose value on the attribute mapping set is not null is useful for entity resolution. Let R be the record set. For an attribute mapping set N_i, a richness goodness function is defined on R, denoted abund(N_i),
where Θ() is an indicator function: its value is 1 if record r has a value on the attribute set N_i and 0 otherwise.
Discrimination and richness are summed and the mapping sets are sorted by the sum in descending order; diversity is then computed in that order;
C. Diversity:
To increase diversity and reduce repetition between different attribute mapping sets, the goodness of a redundant attribute mapping set can be reduced. Each time an attribute mapping set is selected, it is compared with the mapping sets already selected for use; if its information largely repeats that of the earlier mapping sets, its diversity is reduced. Let N_i be the current mapping set and N_selected the mapping cluster already selected; given N_selected, the diversity of N_i is denoted div(N_i | N_selected),
where S_i denotes the set of records instantiated by N_i and S_j the set of records instantiated by N_j (if N_i contains n attributes {att1, att2, ..., attn} and a record r contains one of them with a non-null value, N_i is said to instantiate r); S_v(v_x, v_y) is the similarity of v_x and v_y (computed with edit distance or another suitable string similarity function), v_x is the attribute value of record r under N_i, and v_y is the attribute value of record r under N_j; div(N_i | N_j) expresses how much the information of N_i overlaps with that of N_j when, N_j having already been chosen as a mapping set for computing record similarity, N_i is considered as a further one;
Goodness is combined in two steps. The first step merges discriminability and richness, which reflect the static goodness of an attribute mapping set, and sorts the attribute mapping sets in descending order of static goodness. The second step updates the goodness by incorporating diversity. Let N be the sorted attribute mapping cluster; for a mapping set N_i ∈ N, the overall goodness of N_i is good(N_i):
comb(N_i) = α·discr(N_i) + (1-α)·abund(N_i)
where α is the weight of the discriminability goodness and γ is the weight of the static goodness, with 0 ≤ α, γ ≤ 1; comb(N_i) is the static goodness of the attribute set, which combines discriminability and richness.
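The static goodness comb(N_i) and the descending sort can be sketched as follows. This is a minimal illustration, not the patented implementation: the attribute names and the discr/abund scores are hypothetical, and the overall goodness good(N_i) is left out because its formula is not preserved in this text.

```python
from typing import Dict, List, Tuple


def comb(discr: float, abund: float, alpha: float = 0.5) -> float:
    """Static goodness: comb(N_i) = alpha * discr(N_i) + (1 - alpha) * abund(N_i)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * discr + (1.0 - alpha) * abund


def rank_by_static_goodness(
    scores: Dict[str, Tuple[float, float]], alpha: float = 0.5
) -> List[Tuple[str, float]]:
    """Sort mapping sets by static goodness, largest first."""
    ranked = [(name, comb(d, a, alpha)) for name, (d, a) in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical (discr, abund) scores for three attribute mapping sets.
scores = {"title": (0.9, 0.8), "year": (0.3, 0.9), "director": (0.7, 0.4)}
ranking = rank_by_static_goodness(scores, alpha=0.5)
```

Diversity would then be evaluated in this order, each mapping set being compared against the ones ranked before it.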
Other steps and parameters are the same as in any one of specific embodiments one to seven.
Specific embodiment nine: this embodiment differs from any one of specific embodiments one to eight in that, in step six, after the goodness of each mapping set in the attribute mapping cluster has been obtained, entity resolution is performed within each block, thereby excluding data records that were mistakenly placed in the block but refer to other entities. The calculation proceeds as follows:
Given a record pair r_i, r_j, where r_i has m attributes and r_j has n attributes, of which p attributes are mapped, with p ≤ min(m, n), and the mapped attributes are {att_1, att_2, ..., att_p}, the similarity of r_i and r_j is:
where N_l is an attribute mapping set and sim_content(r_i.att_l, r_j.att_l) is the similarity of the attribute values of the mapped attribute in the two records r_i and r_j; att_l is the attribute corresponding to an attribute mapping set that contains a mapping between an attribute of r_i and an attribute of r_j. The similarity of the two records is compared with a preset threshold λ; if it exceeds λ, the records are considered a match. Because records are processed sequentially, a matching record pair is merged into one new record that covers the information of the original pair. The computation of sim_content() and the integration process are described in detail below.
The specific procedure for computing sim_content(r_i.att_l, r_j.att_l) is the integration of record information with regex-like expressions:
When records are merged, information can be combined with a method similar to regular expressions, because merging information resembles writing a regular expression: both determine one rule for a class of information. When a matching record pair is merged, the two values of a mapped attribute are combined into one regex-like expression.
Concept of the regex-like expression
Since the values of mapped attributes are similar, their identical parts can be unified and their differing parts retained. This forms a regex-like expression that merges the matching information effectively.
Regex-like expression (Similar to Regular Expression, StRE): let Σ be an alphabet and ε the empty character. A regex-like expression is StRE = S[1]S[2]...S[n], where for any i (1 ≤ i ≤ n) the element S[i] = {c_{i,1}, c_{i,2}, ..., c_{i,n_i}} and, for every j (1 ≤ j ≤ n_i), c_{i,j} ∈ Σ ∪ {ε}. Hereinafter a regex-like expression is simply called an expression.
Let R be an attribute value pair, S the expression derived from the pair, and G the set of instances of S; the values in R then belong to G. For example, the regex-like expression {t, n}ight can be instantiated as tight and night.
The expression of a single attribute value is the value itself, whereas an expression formed by merging two attribute values, or by merging an attribute value with an expression, has to be derived by computation.
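The definition of an StRE and its instance set G can be made concrete with a short sketch. The representation below (a list of character sets, with the empty string standing for ε) is an assumption chosen for illustration; the source does not prescribe a data structure.

```python
from itertools import product
from typing import List, Set

EPS = ""  # stands for the empty character ε


def instantiate(expr: List[Set[str]]) -> Set[str]:
    """Return every string obtained by picking one character per element."""
    return {"".join(choice) for choice in product(*expr)}


# The expression {t, n}ight from the text.
expr = [{"t", "n"}, {"i"}, {"g"}, {"h"}, {"t"}]
instances = instantiate(expr)
```

An element that contains EPS simply contributes nothing to the instance when the empty character is picked.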
(1) Computing the similarity of regex-like expressions. The procedure is as follows:
The expression of a single attribute value is the value itself, so the similarity of two such expressions can be computed with the edit distance. Denote the edit distance function by D(i, j), where i and j are the lengths of the two strings a and b; the edit distance is computed by dynamic programming;
This gives the edit distance similarity function between strings, sim_edit(a, b):
Computing the similarity between an expression and an attribute value, or between two expressions, is similar to computing the edit distance. Because an expression element may contain several characters or the empty character, the edit distance function is modified slightly, according to the following principles: (1) for multiple characters, if two expression elements contain a common character, the elements match; (2) for the empty character, if the elements do not match, an expression element that contains the empty character takes the empty character; it then contributes no length and does not affect the edit distance.
These principles yield the minimum edit distance of two expressions. The edit distance D(|S_1|, |S_2|) of expressions S_1 and S_2 is as follows:
S[i] denotes the i-th element of an expression: S_1[i] is the i-th element of S_1 and S_2[j] is the j-th element of S_2, with the initial conditions:
where the function MU corresponds to principle (1): two expression elements match as long as they contain a common character;
the function NU corresponds to principle (2): if an expression element contains the empty character, it is simply ignored and has no effect on the edit distance computation, which keeps the edit distance minimal. The edit distance of the two expressions is finally D(|S_1|, |S_2|);
The edit distance similarity between the expressions is then sim_edit(S_1, S_2):
where sim_edit(S_1, S_2) is sim_content(), and S_1 and S_2 are the two attribute values att_l in the function sim_content(r_i.att_l, r_j.att_l);
During comparison and merging, each attribute value of a record is treated as an expression; once the comparison is complete, attribute information is generated from the expression.
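The modified edit distance can be sketched as a standard dynamic program in which MU and NU supply the costs. Several details are assumptions, because the corresponding formulas are not preserved in this text: the recurrence is the usual three-way minimum, the initial conditions accumulate NU costs, and the similarity is normalised by the longer expression, which agrees with the worked value 1 - 2/7 given for the example pair.

```python
from typing import FrozenSet, List

EPS = ""  # stands for the empty character ε
Element = FrozenSet[str]


def to_expr(value: str) -> List[Element]:
    """A plain attribute value is its own expression: one character per element."""
    return [frozenset({ch}) for ch in value]


def mu(a: Element, b: Element) -> int:
    """Principle (1): cost 0 if the two elements share a character, else 1."""
    return 0 if (a - {EPS}) & (b - {EPS}) else 1


def nu(a: Element) -> int:
    """Principle (2): skipping an element that may be empty costs nothing."""
    return 0 if EPS in a else 1


def edit_distance(s1: List[Element], s2: List[Element]) -> int:
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # assumed initial condition
        d[i][0] = d[i - 1][0] + nu(s1[i - 1])
    for j in range(1, n + 1):  # assumed initial condition
        d[0][j] = d[0][j - 1] + nu(s2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]),
                d[i - 1][j] + nu(s1[i - 1]),
                d[i][j - 1] + nu(s2[j - 1]),
            )
    return d[m][n]


def sim_edit(s1: List[Element], s2: List[Element]) -> float:
    """Assumed normalisation: 1 - D / (length of the longer expression)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s1, s2) / longest


distance = edit_distance(to_expr("cutekid"), to_expr("cutkind"))
similarity = sim_edit(to_expr("cutekid"), to_expr("cutkind"))
```

The example strings are written without the space, matching the seven-element matrix used in the text.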
(2) Generating regex-like expressions
A new expression is generated from two expressions on the principle of introducing as few unrelated instances as possible. For example, for the attribute value pair cute kid and cut kind, one possible expression is S_1 = cut{e, ε}ki{d, n}{d, ε}, which has 8 instances and therefore introduces 6 unrelated instances; another is S_2 = cut{e, ε}ki{n, ε}d, which has 4 instances and introduces 2 unrelated instances. S_2 is therefore better than S_1.
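The instance counts quoted for S_1 and S_2 can be checked by enumerating both expressions. This is an illustrative sketch; the list-of-sets representation and the helper names are assumptions, and the attribute values are written without the space, as in the expressions themselves.

```python
from itertools import product
from typing import List, Set

EPS = ""  # stands for the empty character ε


def lit(text: str) -> List[Set[str]]:
    """Literal characters, one single-character element each."""
    return [{ch} for ch in text]


def instances(expr: List[Set[str]]) -> Set[str]:
    return {"".join(choice) for choice in product(*expr)}


def unrelated(expr: List[Set[str]], values: Set[str]) -> Set[str]:
    """Instances of the expression that are not one of the merged values."""
    return instances(expr) - values


values = {"cutekid", "cutkind"}
s1 = lit("cut") + [{"e", EPS}] + lit("ki") + [{"d", "n"}, {"d", EPS}]
s2 = lit("cut") + [{"e", EPS}] + lit("ki") + [{"n", EPS}] + lit("d")
```

Both expressions cover the two original values; S_2 does so with four fewer unrelated instances.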
The invention uses the expression edit distance matrix M: starting from M[|S_1|, |S_2|] and tracing back to M[0, 0] yields the merged expression. The generation rule is as follows:
In particular
where k is a position in the backtracking process and S[k] is the element at the current position; backtracking ends when i = j = 0. The regex-like expression into which S_1 and S_2 are merged is then S = ...S[k]...S[0] (with k arranged in the reverse of the backtracking order). When the three generation conditions hold at the same time, the function MU takes priority over the function NU, which helps merge identical characters and reduces the introduction of unrelated instances. NU(S_2[j]) is the function value for the j-th element of expression S_2, NU(S_1[i]) is the function value for the i-th element of expression S_1, and MU(S_1[i], S_2[j]) is the function value for the i-th element of expression S_1 and the j-th element of expression S_2;
The similarity computation and generation of a regex-like expression are illustrated with an example. Let the attribute values be attvalue_1 = "cute kid" and attvalue_2 = "cut kind"; the corresponding expressions are S_1 = cute kid and S_2 = cut kind. The distance matrix obtained from the generation formula is shown in Table 1.
Table 1: Similarity computation and generation matrix of the regex-like expression
Each value M[i, j] in the matrix is the edit distance between the first i elements of S_1 and the first j elements of S_2; since each expression here is the attribute value itself, the edit distance of the expressions equals the edit distance of the attribute values. The edit distance of S_1 and S_2 is M[7, 7] = 2, so the expression similarity is 1 - 2/7 = 5/7. Suppose the similarity of the two records exceeds the threshold and the match is confirmed, so that the two attribute values must be merged. Following the generation rule, backtracking starts from M[7, 7] on the expression matrix M. Since M[7, 7] = M[6, 6] + MU(S_1[6], S_2[6]), the trace moves to M[6, 6], and so on. The backtracking path is marked in bold in Table 1. The resulting expression is cut{e, ε}ki{ε, n}d.
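The backtracking step can be sketched as follows. Since the generation rule itself is not preserved in this text, the merge rule used here is an assumption inferred from the example: a diagonal move takes the union of the two elements, and a vertical or horizontal move keeps the skipped element and adds ε to it. Diagonal (MU) moves are tried first, as the text requires.

```python
from typing import FrozenSet, List

EPS = ""  # stands for the empty character ε
Element = FrozenSet[str]


def to_expr(value: str) -> List[Element]:
    return [frozenset({ch}) for ch in value]


def mu(a: Element, b: Element) -> int:
    return 0 if (a - {EPS}) & (b - {EPS}) else 1


def nu(a: Element) -> int:
    return 0 if EPS in a else 1


def merge(s1: List[Element], s2: List[Element]) -> List[Element]:
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + nu(s1[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + nu(s2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]),
                          d[i - 1][j] + nu(s1[i - 1]),
                          d[i][j - 1] + nu(s2[j - 1]))
    merged: List[Element] = []
    i, j = m, n
    while i > 0 or j > 0:  # trace back from M[m, n] to M[0, 0]
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]):
            merged.append(s1[i - 1] | s2[j - 1])  # MU move, tried first
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + nu(s1[i - 1]):
            merged.append(s1[i - 1] | {EPS})      # element of S1 becomes optional
            i -= 1
        else:
            merged.append(s2[j - 1] | {EPS})      # element of S2 becomes optional
            j -= 1
    merged.reverse()
    return merged


def render(expr: List[Element]) -> str:
    parts = []
    for el in expr:
        if len(el) == 1:
            parts.append(next(iter(el)))
        else:
            parts.append("{" + ",".join(sorted(c if c else "ε" for c in el)) + "}")
    return "".join(parts)


result = render(merge(to_expr("cutekid"), to_expr("cutkind")))
```

For the example pair the sketch produces the same expression as the text, with the characters inside each braces group listed in sorted order.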
(3) Generating entity information
After the records in a block have been compared and their information merged step by step, the attribute values in an attribute mapping set are merged into a single final expression for that mapping set. The frequency with which each character occurs can be recorded while the regex-like expression is generated. Using this frequency-annotated expression, the character with the highest frequency in each element is chosen as the value of that element, and the resulting string of most frequent characters becomes the attribute value of the attribute mapping set. For example, the three attribute values Mike Doe, M.Doe and Mike D. give the expression M{i:2, .:1}{k:2, ε:1}{e:2, ε:1}D{o:2, .:1}{e:2, ε:1}, from which the attribute value Mike Doe is obtained according to the frequencies.
With this kind of regex-like expression, characters that are wrong because of typesetting, spelling or similar errors necessarily occur less often than the correctly spelled characters, so taking the most frequent character helps filter out noise.
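The frequency-based choice can be sketched on the expression from the example. The counts for M and D are not shown in the text and are assumed to be 3 (all three values); the space between the two name parts is not represented in the expression, so the sketch yields the characters without it.

```python
from typing import Dict, List

EPS = ""  # stands for the empty character ε


def most_frequent_value(expr: List[Dict[str, int]]) -> str:
    """Pick the highest-frequency character of every element and join them."""
    return "".join(max(element, key=element.get) for element in expr)


# Frequency-annotated expression for Mike Doe, M.Doe and Mike D.
expr = [
    {"M": 3},            # assumed count
    {"i": 2, ".": 1},
    {"k": 2, EPS: 1},
    {"e": 2, EPS: 1},
    {"D": 3},            # assumed count
    {"o": 2, ".": 1},
    {"e": 2, EPS: 1},
]
value = most_frequent_value(expr)
```

When two characters of an element have the same count, max() keeps the first one encountered; the text does not say how such ties should be resolved.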
Other steps and parameters are the same as in any one of specific embodiments one to eight.
The beneficial effects of the present invention are verified by the following example:
Example one:
This example was carried out as follows:
Experimental data sets
The blocking experiments use two data sets, referred to as D_1 and D_2. D_1 is a movie information data set drawn from the movies common to DBPedia and IMDB. D_2 is drawn from a data set built from two versions of DBPedia [9] (George Papadakis, Jonathan Svirsky, Avigdor Gal, Themis Palpanas. Comparative analysis of approximate blocking techniques for entity resolution [J]. Proceedings of the VLDB Endowment, 2016, 9(9): 684-695). D_1 contains 50796 records covering 22403 entities; D_2 contains 335479 records covering 89258 entities.
Evaluation criteria
The evaluation criteria are those used in [10] (Chirag Nagpal, Kyle Miller, Benedikt Boecking, Artur Dubrawski. An Entity Resolution approach to isolate instances of Human Trafficking online [J]. Computer Science, 2017, 3(18): 10-18):
Blocking criteria: pair completeness (PC), reduction ratio (RR) and F value (F = 2 × PC × RR / (PC + RR));
Resolution criteria: precision (P), recall (R) and F value (F = 2 × P × R / (P + R)).
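A short sketch of these criteria follows. The F value is taken directly from the formulas above; PC, RR, P and R use their customary definitions from the entity resolution literature, which is an assumption since this text does not define them. The record identifiers are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def f_value(a: float, b: float) -> float:
    """Harmonic mean used for both F values: 2ab / (a + b)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def pair_completeness(candidates: Set[Pair], true_matches: Set[Pair]) -> float:
    """Share of true matching pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


def reduction_ratio(candidates: Set[Pair], total_pairs: int) -> float:
    """Share of the exhaustive pairwise comparisons that blocking avoids."""
    return 1.0 - len(candidates) / total_pairs


def precision(predicted: Set[Pair], true_matches: Set[Pair]) -> float:
    return len(predicted & true_matches) / len(predicted)


def recall(predicted: Set[Pair], true_matches: Set[Pair]) -> float:
    return len(predicted & true_matches) / len(true_matches)


# Six hypothetical records give 6 * 5 / 2 = 15 possible pairs.
candidates = {("r1", "r2"), ("r3", "r4"), ("r1", "r3")}
true_matches = {("r1", "r2"), ("r3", "r4"), ("r5", "r6")}
pc = pair_completeness(candidates, true_matches)
rr = reduction_ratio(candidates, total_pairs=15)
f_blocking = f_value(pc, rr)
```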
Experimental analysis of the blocking method
For the blocking process the invention describes four pruning methods. The parameter α used to compute record similarity is set to 0.6, and pruning is then performed. For edge-centric weight (threshold) pruning (ECWP), the F value is highest when the threshold w_min is set to 0.6 and 0.5 on the two data sets respectively. For node-centric weight pruning (NCWP), the experiment assumes a uniform distribution and uses a unified weight threshold; the procedure and the thresholds are the same as for ECWP. For edge-centric cardinality pruning (ECCP), a total of k edges are retained, where k varies with the total number of edges |E|; with k = 0.5·|E| the F value is highest on both data sets. For node-centric cardinality pruning (NCCP), k_{r_i} of the edges connected to each node r_i are retained; again under the uniform-distribution assumption, the F value is highest with k_{r_i} = 0.5·|E_{r_i}|. Each of the four pruning methods is then run with its best-performing threshold, and the results are compared in Table 2.
Table 2: PC and RR values of the pruning methods on the two data sets
It can be seen that the edge-centric pruning methods remove more useless record pairs and emphasize efficiency, which makes them suitable for large-scale entity resolution tasks, especially when few matching entities are expected; the node-centric pruning methods retain more matching record pairs. Edge-centric algorithms keep the edges whose weight is in the top k or exceeds a preset threshold, whereas node-centric algorithms ensure that every node stays connected to its most similar records and are better suited to applications that emphasize accuracy. Weight-threshold algorithms keep the record pairs with higher similarity, which preserves accuracy; cardinality thresholds control the number of record pairs to be compared by keeping the top-k weighted edges, which may affect accuracy but guarantees efficiency.
Overall, accuracy stays above 97% and the reduction ratio stays above 60%, which shows that pruning the blocking graph discards a certain number of negligible record pairs and, to some extent, ensures the efficiency of the algorithm.
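The four pruning schemes compared above can be sketched on a weighted edge map. This is an illustration under stated assumptions: the graph is a hypothetical dictionary from record pairs to similarity weights, NCWP is modelled as the same unified threshold as ECWP (as the experiment describes), and in NCCP an edge is kept when it is among the top k of at least one of its endpoints, a detail the text does not specify.

```python
from collections import defaultdict
from typing import Dict, Tuple

Edge = Tuple[str, str]


def ecwp(edges: Dict[Edge, float], w_min: float) -> Dict[Edge, float]:
    """Edge-centric weight pruning: delete edges whose weight is below w_min."""
    return {e: w for e, w in edges.items() if w >= w_min}


def ncwp(edges: Dict[Edge, float], w_min: float) -> Dict[Edge, float]:
    """Node-centric weight pruning with one unified threshold (same as ECWP)."""
    return ecwp(edges, w_min)


def eccp(edges: Dict[Edge, float], k: int) -> Dict[Edge, float]:
    """Edge-centric cardinality pruning: keep the k heaviest edges overall."""
    return dict(sorted(edges.items(), key=lambda item: item[1], reverse=True)[:k])


def nccp(edges: Dict[Edge, float], k: int) -> Dict[Edge, float]:
    """Node-centric cardinality pruning: keep each node's k heaviest edges."""
    incident = defaultdict(list)
    for (u, v), w in edges.items():
        incident[u].append(((u, v), w))
        incident[v].append(((u, v), w))
    kept: Dict[Edge, float] = {}
    for node_edges in incident.values():
        for edge, w in sorted(node_edges, key=lambda item: item[1], reverse=True)[:k]:
            kept[edge] = w
    return kept


# Hypothetical record graph with four records.
graph = {("a", "b"): 0.9, ("a", "c"): 0.4, ("b", "c"): 0.7, ("c", "d"): 0.2}
```

On this graph the node-centric variant keeps the weak edge (c, d) because it is the only edge of d, which illustrates why node-centric pruning retains more record pairs.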
Experimental analysis of attribute mapping and expressions
When computing the goodness of the attribute mapping cluster, the experiment sets the parameters to α = 0.5 and λ = 0.4. The shortest-distance method is chosen for comparison. The results are evaluated for two stages: the entity resolution result, and the generation of real entity information.
For the entity resolution result, the precision and recall of the two methods are shown in Figs. 5a and 5b. As the threshold rises, recall falls and precision rises substantially. On both data sets the present method reaches its higher F value when the threshold is 0.6. On both data sets the recall of the two methods differs little, with the present method, based on attribute mapping and regex-like expressions, being slightly better. Its precision is significantly higher than that of the shortest-distance method. This shows that the present method is better adapted to the record characteristics of a data space, whereas the shortest-distance method relies more on the semantics of the data and can play a greater role in an environment with stronger semantics.
For the entity information generation stage: most entity resolution work focuses on the resolution result and pays little attention to generating entity information and merging entities. Using data set D_1, the number of generated entities and the number of actual entities are shown in Fig. 6. Entities are generated with the optimal threshold found during entity resolution. The figure shows that the resolution result of the present method is more accurate, with the number of generated entities closer to the number of real entities. Its accuracy also remains significantly higher than that of the nearest-neighbor method.
The present invention may also have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art may make corresponding changes and modifications in accordance with the invention, but all such changes and modifications shall fall within the protection scope of the appended claims.
Claims (9)
1. An entity resolution method for data spaces, characterized in that the method comprises the following steps:
Step one: construct a record graph;
Step two: simplify the record graph using a pruning method;
Step three: partition the pruned record graph into blocks;
Step four: establish an attribute mapping cluster;
Step five: compute the goodness of the attribute mapping sets;
Step six: after the goodness of each mapping set in the attribute mapping cluster has been obtained, perform entity resolution within each block.
2. The entity resolution method for data spaces according to claim 1, characterized in that the record graph is constructed in step one as follows:
Step 1.1: compute the similarity between two records;
Step 1.2: construct the record graph from the records of the data space and their similarities.
3. The entity resolution method for data spaces according to claim 2, characterized in that the similarity between two records is computed in step 1.1 as follows:
Step 1.1.1: compute the tag similarity:
Each record is converted into a tag set by the tag conversion function tag(), and the tag similarity of the two records, denoted sim_tag(r_i, r_j), is computed:
where T(r_i) is the normalized tag set into which record r_i is converted by the tag conversion function, and T(r_j) is the normalized tag set into which record r_j is converted by the tag conversion function;
Step 1.1.2: compute the relationship similarity:
The similarity of the two records over all the relationships they have is combined and denoted sim_rel(r_i, r_j):
where Nbr(r_i) denotes the set of records linked to record r_i under relationship rel, Nbr(r_j) denotes the set of records linked to record r_j under relationship rel, and REL denotes the set of all relationships that appear on records r_i and r_j;
Step 1.1.3: combine the tag similarity and the relationship similarity to obtain the overall similarity sim(r_i, r_j):
sim(r_i, r_j) = α·sim_tag(r_i, r_j) + (1-α)·sim_rel(r_i, r_j)
where α is the weight of the tag similarity.
4. The entity resolution method for data spaces according to claim 3, characterized in that the record graph is constructed in step 1.2 from the records of the data space and their similarities as follows:
Given the record set R of a data space, an undirected graph G = (R, E), called the record graph, is constructed;
where R is the record set, representing the records in the data space, and E is the edge set; an edge between two records represents the similarity of that record pair.
5. The entity resolution method for data spaces according to claim 4, characterized in that the record graph is simplified in step two using a pruning method as follows:
The pruning method is one of edge-centric cardinality pruning, node-centric cardinality pruning, edge-centric threshold pruning and node-centric threshold pruning;
(1) Edge-centric cardinality pruning:
A global cardinality threshold k specifies the total number of edges the record graph retains; the k edges with the largest weights are kept;
(2) Node-centric cardinality pruning:
For each node r_i, the top-k weighted edges connected to r_i are kept;
(3) Edge-centric threshold pruning:
Pruning is applied globally with a weight threshold: a minimum edge weight w_min is chosen, all edges of the graph are traversed, and every edge whose weight is below w_min is deleted;
(4) Node-centric threshold pruning:
A unified threshold is used for all nodes of the record graph, and the pruning procedure is the same as in the edge-centric weight pruning scheme.
6. The entity resolution method for data spaces according to claim 5, characterized in that the pruned record graph is partitioned into blocks in step three as follows:
The pruned graph is G = {R, E}, where R is the record set and E is the edge set;
Take any record r_i, create a block b_i and place r_i in b_i; if a node r_j in the neighborhood of r_i is connected by an edge to every node in b_i, place r_j in b_i and delete the edges that connect r_j to the nodes in b_i; repeat this operation until all neighbor nodes of r_i have been traversed. If a node in b_i has then become an isolated node of the graph, with no edge connected to it, delete that node from the graph. Repeat step three until graph G is empty; the blocking is then complete and yields the block set B = {b_1, b_2, ..., b_|B|}.
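The blocking procedure of step three can be sketched as below. This is one possible reading, not a definitive implementation: the choice of the seed record and the order in which its neighbors are visited are assumptions (the text only says to take any record), and the record identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_blocks(nodes: Iterable[str], edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    adj: Dict[str, Set[str]] = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    blocks: List[List[str]] = []
    while adj:  # repeat until the graph is empty
        seed = next(iter(adj))  # assumed choice: first remaining record
        block = [seed]
        for cand in sorted(adj[seed]):  # snapshot of the seed's neighbors
            # cand joins only if it is linked to every record already in the block
            if all(cand in adj[member] for member in block):
                for member in block:
                    adj[member].discard(cand)
                    adj[cand].discard(member)
                block.append(cand)
        for member in block:  # drop block members that became isolated
            if not adj[member]:
                del adj[member]
        blocks.append(block)
    return blocks


# Hypothetical pruned graph: a triangle a-b-c plus one extra edge a-d.
blocks = build_blocks(["a", "b", "c", "d"],
                      [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")])
```

A record that still has edges after its block is formed stays in the graph and can seed a later block, so blocks may overlap, as record a does here.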
7. The entity resolution method for data spaces according to claim 6, characterized in that the attribute mapping cluster is established in step four as follows:
Step 4.1: for two attributes from different entities, compute the similarity value of the attributes:
The similarity value of the attributes is denoted SV; all values of the two attributes are compared and the highest similarity score is retained, which yields the attribute matching pairs;
Let I be the number of namespaces and J the number of matched attribute pairs; for two different namespaces n_s and n_t, the numbers of attributes under n_s and n_t are denoted M_s and M_t, and the a-th attribute under n_s and the b-th attribute under n_t are denoted p_a^s and p_b^t respectively; the overall attribute matching is optimized by maximizing the total matching probability of all attribute pairs, which satisfies the global 1:1 constraint:
where σ(p_a^s, p_b^t) is the matching probability of the attribute pair p_a^s and p_b^t; Θ(p_a^s, p_b^t) is an indicator function whose value is 1 when the attribute pair p_a^s, p_b^t is chosen to form an attribute mapping set and 0 otherwise;
Step 4.2: the attribute matching pairs are sorted in descending order of matching probability and processed in turn. For an attribute pair p_a, p_b: if the attribute mapping cluster N contains no attribute mapping sets N_i, N_j that contain p_a and p_b respectively, the attribute mapping set {p_a, p_b} is added to N; if N contains an attribute mapping set N_i that contains p_a and N_i also contains an attribute from the same namespace as p_b, the attribute pair p_a ≈ p_b is deleted; otherwise N_i and N_j are merged into a larger attribute mapping set N_k, N_k is added to N, and N_i and N_j are deleted from N. Step 4.2 is repeated until J is empty.
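The greedy construction of step 4.2 can be sketched as follows. The sketch rests on assumptions: attributes are modelled as (namespace, name) tuples, the namespace check is applied symmetrically to both sides of a pair, an attribute that is not yet in any mapping set is treated as a singleton, and the matching probabilities are hypothetical inputs rather than the result of the optimization in step 4.1.

```python
from typing import List, Optional, Set, Tuple

Attr = Tuple[str, str]              # (namespace, attribute name)
Match = Tuple[Attr, Attr, float]    # (p_a, p_b, matching probability)


def build_mapping_cluster(pairs: List[Match]) -> List[Set[Attr]]:
    cluster: List[Set[Attr]] = []

    def find(attr: Attr) -> Optional[Set[Attr]]:
        for mapping_set in cluster:
            if attr in mapping_set:
                return mapping_set
        return None

    # Process the attribute pairs in descending order of matching probability.
    for p_a, p_b, _ in sorted(pairs, key=lambda p: p[2], reverse=True):
        n_i, n_j = find(p_a), find(p_b)
        if n_i is None and n_j is None:
            cluster.append({p_a, p_b})
            continue
        if n_i is not None and n_i is n_j:
            continue  # both attributes are already in the same mapping set
        set_a = n_i if n_i is not None else {p_a}
        set_b = n_j if n_j is not None else {p_b}
        if {ns for ns, _ in set_a} & {ns for ns, _ in set_b}:
            continue  # merging would put two attributes of one namespace together
        merged = set_a | set_b
        cluster = [s for s in cluster if s is not n_i and s is not n_j]
        cluster.append(merged)
    return cluster


# Hypothetical matching pairs across three namespaces.
pairs = [
    (("imdb", "title"), ("dbp", "name"), 0.9),
    (("dbp", "name"), ("wiki", "label"), 0.8),
    (("imdb", "title"), ("dbp", "label"), 0.7),
]
cluster = build_mapping_cluster(pairs)
```

The third pair is discarded because the mapping set that holds the imdb title already contains an attribute from the dbp namespace.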
8. The entity resolution method for data spaces according to claim 7, characterized in that the goodness of the attribute mapping sets is computed in step five as follows:
The goodness of an attribute mapping set is computed from the following three aspects:
A. Discriminability:
Let R be the record set; for an attribute mapping set N_i, a discriminability goodness function is defined on R, denoted discr(N_i):
where val(r, N_i) extracts the attribute value of record r on the attribute C_i in the attribute mapping set, norm() normalizes attribute values from different sources, and ∪ is the union symbol;
B. Richness:
Let R be the record set; for an attribute mapping set N_i, a richness goodness function is defined on R, denoted abund(N_i):
where Θ() denotes an indicator function: its value is 1 if record r has a value on the attribute set N_i and 0 if it has none;
Discriminability and richness are combined, the mapping sets are sorted in descending order, and diversity is computed in that order;
C. Diversity:
Let N_i be the current mapping set and N_selected the mapping cluster already selected; given N_selected, the diversity of N_i is denoted div(N_i | N_selected):
where S_i denotes the set of records instantiated by attribute mapping set N_i; S_j denotes the set of records instantiated by attribute mapping set N_j; Sv(v_x, v_y) is the similarity of v_x and v_y, v_x is the attribute value of record r under N_i, and v_y is the attribute value of record r under N_j; div(N_i | N_j) expresses, once N_j has been chosen as a mapping set for computing record similarity, how much the information of N_i differs from that of N_j when deciding whether N_i should also be chosen for computing record similarity;
Let N be the sorted attribute mapping cluster; for a mapping set N_i ∈ N, the overall goodness of N_i is good(N_i):
comb(N_i) = α·discr(N_i) + (1-α)·abund(N_i)
where α is the weight of the discriminability goodness and γ is the weight of the static goodness, with 0 ≤ α, γ ≤ 1; comb(N_i) is the static goodness of the attribute set, which combines discriminability and richness.
9. The entity resolution method for data spaces according to claim 8, characterized in that in step six, after the goodness of each mapping set in the attribute mapping cluster has been obtained, entity resolution is performed within each block; the calculation proceeds as follows:
Given a record pair r_i, r_j, where r_i has m attributes and r_j has n attributes, of which p attributes are mapped, with p ≤ min(m, n), and the mapped attributes are {att_1, att_2, ..., att_p}, the similarity of r_i and r_j is:
where N_l is an attribute mapping set, and sim_content(r_i.att_l, r_j.att_l) is the similarity of the attribute values of the mapped attribute in the two records r_i and r_j; att_l is the attribute corresponding to an attribute mapping set;
The similarity of the two records is compared with a preset threshold λ; if it exceeds λ, the records are considered a match. Because records are processed sequentially, a matching record pair is merged into one new record that covers the information of the original pair.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910435269.5A CN110147393B (en) | 2019-05-23 | 2019-05-23 | Entity analysis method for data space in movie information data set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110147393A true CN110147393A (en) | 2019-08-20 |
CN110147393B CN110147393B (en) | 2021-08-13 |
Family
ID=67593022
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910435269.5A Expired - Fee Related CN110147393B (en) | 2019-05-23 | 2019-05-23 | Entity analysis method for data space in movie information data set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110147393B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110147393B (en) | 2021-08-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210813 |