CN110147393A - Entity resolution method for data spaces - Google Patents

Publication number
CN110147393A
Authority
CN
China
Prior art keywords
attribute
record
similarity
mapping
data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910435269.5A
Other languages
Chinese (zh)
Other versions
CN110147393B (en)
Inventor
周连科
赵昱杰
张毅
苏畅
王红滨
王念滨
崔琎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Application filed by Harbin Engineering University
Priority to CN201910435269.5A
Publication of CN110147393A
Application granted
Publication of CN110147393B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to entity resolution, and specifically to an entity resolution method for data spaces. It addresses the problem that existing entity resolution in a data space must compare records pairwise even though record pairs from different domains have a very low probability of matching, so that exhaustive pairwise comparison wastes resources. The process is: Step 1, build a record graph; Step 2, simplify the record graph by pruning; Step 3, partition the pruned record graph into blocks; Step 4, build an attribute mapping cluster; Step 5, compute the goodness of each attribute mapping set; Step 6, once the goodness of every mapping set in the attribute mapping cluster has been obtained, perform entity resolution within each block. The invention is applicable to the field of data entity resolution.

Description

Entity resolution method for data spaces
Technical field
The present invention relates to entity resolution methods.
Background art
Entity resolution is the process of identifying different descriptions of the same entity. It aims to ensure data quality and is a key technique in data cleaning, data integration and data mining [1] (Vasilis Efthymiou, Kostas Stefanidis, Vassilis Christophides. Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web [C] // Proceedings of the 2015 IEEE International Conference on Big Data (Big Data '15), 2015, 11(1): 401-410). Most traditional entity resolution work relies on schemas or semantic mappings between data. A data space is a new mode of data integration: it has no strict data schema or semantic mapping, and instead gradually takes in data and establishes relationships according to the needs of the subject. It is a heterogeneous data set whose defining feature is that the data come from multiple sources [2] (GE Jingjun, HU Changjun, LIU Xin. Virtual Data Space Sharing Model for Domain Science [J]. Minicomputer System, 2014, 35(3): 514-519). When entity resolution is carried out in a data space, its most powerful tool, semantic mapping, is no longer available. Entity resolution has to compare records, and record pairs from different domains have a very low probability of matching, so comparing every pair wastes resources.
Summary of the invention
The purpose of the invention is to solve the problem that existing entity resolution in a data space must compare records pairwise even though record pairs from different domains have a very low probability of matching, so that pairwise comparison wastes resources. To this end, an entity resolution method for data spaces is proposed.
The detailed process of the entity resolution method for data spaces is:
Step 1: build the record graph;
Step 2: simplify the record graph by pruning;
Step 3: partition the pruned record graph into blocks;
Step 4: build the attribute mapping cluster;
Step 5: compute the goodness of each attribute mapping set;
Step 6: once the goodness of every mapping set in the attribute mapping cluster has been obtained, perform entity resolution within each block.
The beneficial effects of the invention are as follows:
The invention uses blocking [3] (Batya Kening, Avigdor Gal. MFIBlocks: An effective blocking algorithm for entity resolution [J]. Information Systems, 2013, 38(6): 908-926): the data are pre-screened with a low-cost computation, records that may belong to the same entity are placed in the same block, and records are compared only within a block. This solves the problem that existing entity resolution in a data space compares records pairwise even though record pairs from different domains have a very low probability of matching, so that pairwise comparison wastes resources.
This work studies entity resolution for multi-source heterogeneous data in a data space. Even without semantic mappings, two records that point to the same entity overlap in their attribute values; the relationships between records are also taken into account, and the two are combined to build the record graph. For record sets in different situations, the record graph is simplified with the pruning method that suits them, and an algorithm is proposed for blocking according to the pruned record graph.
When entity resolution is performed within a block, attribute values are used to map attributes. By gathering what the attribute names refer to across all records in the block, records that share values with the other records in the block but still do not match are separated out. A method similar to regular expressions is proposed for computing attribute value similarity, and the attribute values under the mapped attributes of matching records are merged, so that a more complete entity description is returned to the user.
Experiments verify that the proposed method has a positive effect on entity resolution.
Brief description of the drawings
Fig. 1 is a flow chart of building the record graph according to the invention;
Fig. 2 is a flow chart of pruning the record graph according to the invention;
Fig. 3 is a flow chart of blocking according to the pruned record graph;
Fig. 4a is a data diagram for heterogeneous attribute mapping;
Fig. 4b is a global attribute mapping diagram for heterogeneous attribute mapping;
Fig. 4c is a diagram of the goodness computation of the attribute mapping cluster for heterogeneous attribute mapping;
Fig. 5a is a plot of how the two methods vary on the two data sets;
Fig. 5b is a plot of how the two methods vary on the two data sets;
Fig. 6 compares the entities generated by the two algorithms.
Detailed description of the embodiments
Embodiment 1: The detailed process of the entity resolution method for data spaces in this embodiment is:
Step 1: build the record graph;
Step 2: simplify the record graph by pruning;
Step 3: partition the pruned record graph into blocks;
Step 4: build the attribute mapping cluster;
Step 5: compute the goodness of each attribute mapping set;
Step 6: once the goodness of every mapping set in the attribute mapping cluster has been obtained, perform entity resolution within each block, so as to exclude data records that were mistakenly placed in the block and point to other entities.
Embodiment 2: This embodiment differs from Embodiment 1 in how the record graph is built in Step 1. The detailed process is:
Data records are represented with a tag-based method, which treats a data record as a set of attribute values. This rests on a common-sense assumption [4] (S. Prabhakar Benny, S. Vasavi, P. Anupriya. Hadoop Framework for Entity Resolution Within High Velocity Streams [J]. Procedia Computer Science, 2016, 85: 550-557): if two records point to the same entity, they necessarily share some attribute values. The relationships between data records are also taken into account to improve accuracy [5] (XIAO Qihua, CHEN Ke, HUANG Dongmei. Data Spatial Feature Extraction Method Considering Spatial Relevance [J]. Computer Simulation, 2014, 31(12): 425-428, 433). A record graph model represents the record nodes of the data space and the relations between them: the similarity between two records is computed, an edge is drawn between them, and the edge weight is the similarity value. Because this tag-based blocking method has a simple representation, needs only the attribute values of the records, and depends on neither a fixed data schema nor strong semantic mappings, it is widely applicable to the heterogeneous data sets found in a data space.
Consider the record set R = {r1, r2, r3, r4, r5, r6}, where:
r1 {FullName: Tom Lloyd Malik; Job: producer, Actor; Address: L.A.}
r2 {Name: Tom Malik; Producer; birthPlace: L.A.}
r3 {Label: Mike Styles; Profession: producer; Place_of_birth: L.A.; Place_of_birth: 1964}
r4 {Mike Harry Styles; birthPlace: L.A.; Gender: male}
r5 {FullName: Harry Green; Address: LOS; Sex: male; Profession: Writher}
r6 {Label: Harry Green; Gender: male; birthYear: 1980; married}
An overview of the record-graph blocking method applied to R is shown in Figs. 1, 2 and 3 (to keep the example figures simple, the relationships between the records are not marked in the figures).
Step 1.1: compute the similarity between two records;
Step 1.2: build the record graph from the records of the data space and their similarities.
Other steps and parameters are the same as in Embodiment 1.
Embodiment 3: This embodiment differs from Embodiments 1 and 2 in how the similarity between two records is computed in Step 1.1. The detailed process is:
A record can be converted into a tag set by the tag conversion function tag(), that is, tag: r_i → T(r_i). The tag similarity of two records is the ratio of the size of the intersection of their tag sets to the size of their union.
Step 1.1.1: compute the tag similarity:
Convert the records into tag sets with the tag conversion function tag() and compute the tag similarity of the two records, denoted sim_tag(r_i, r_j):
$$\mathrm{sim}_{tag}(r_i, r_j) = \frac{|T(r_i) \cap T(r_j)|}{|T(r_i) \cup T(r_j)|}$$
where T(r_i) is the normalized tag set obtained by applying the tag conversion function to record r_i, and T(r_j) is the normalized tag set obtained by applying it to record r_j;
Step 1.1.2: compute the relation similarity:
The similarities of the two records over all relations they take part in are combined, denoted sim_rel(r_i, r_j):
where Nbr(r_i) is the set of records linked to record r_i under relation rel, Nbr(r_j) is the set of records linked to record r_j under relation rel, and REL is the set of all relations that occur on records r_i and r_j, such as cooperation, teacher-student and teaching relations;
Step 1.1.3: combine the tag similarity and the relation similarity into the overall similarity sim(r_i, r_j):
$$\mathrm{sim}(r_i, r_j) = \alpha \cdot \mathrm{sim}_{tag}(r_i, r_j) + (1-\alpha) \cdot \mathrm{sim}_{rel}(r_i, r_j)$$
where α is the weight of the tag similarity.
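The similarity computation of Steps 1.1.1 to 1.1.3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tokenizing `tag()` function is a hypothetical normalization (lower-casing and splitting values on whitespace and commas), and because the source text does not preserve the formula for sim_rel, the sketch assumes it is the average Jaccard overlap of neighbour sets over all relations. Only the weighted combination with α is taken directly from the text.

```python
import re

def tag(record):
    """Hypothetical tag(): lower-cased tokens of every attribute value.
    A record is a list of (attribute_name, value) pairs; names are ignored."""
    tokens = set()
    for _, value in record:
        tokens.update(t for t in re.split(r"[\s,;]+", value.lower()) if t)
    return tokens

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sim_tag(ri, rj):
    """Tag similarity: Jaccard ratio of the two tag sets."""
    return jaccard(tag(ri), tag(rj))

def sim_rel(nbr_i, nbr_j):
    """ASSUMPTION: average Jaccard overlap of neighbour sets over all relations.
    nbr_x maps a relation name to the set of record ids linked under it."""
    rels = set(nbr_i) | set(nbr_j)
    if not rels:
        return 0.0
    return sum(jaccard(nbr_i.get(r, set()), nbr_j.get(r, set())) for r in rels) / len(rels)

def sim(ri, rj, nbr_i, nbr_j, alpha=0.5):
    """Overall similarity: alpha * sim_tag + (1 - alpha) * sim_rel."""
    return alpha * sim_tag(ri, rj) + (1 - alpha) * sim_rel(nbr_i, nbr_j)

# Records r1 and r2 from the example record set R.
r1 = [("FullName", "Tom Lloyd Malik"), ("Job", "producer,Actor"), ("Address", "L.A.")]
r2 = [("Name", "Tom Malik"), (None, "Producer"), ("birthPlace", "L.A.")]
```

Under this tokenization r1 and r2 share four of six distinct tags, even though none of their attribute names agree, which is the property the tag-based representation relies on.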
Other steps and parameters are the same as in Embodiment 1 or 2.
Embodiment 4: This embodiment differs from Embodiments 1 to 3 in how the record graph is built from the records of the data space and their similarities in Step 1.2. The detailed process is:
Given the record set R of a data space, construct an undirected graph G = (R, E), called the record graph;
where R is the record set, representing the records in the data space, and E is the edge set: an edge between two records represents the similarity of that record pair.
Once the record graph is built, record pairs joined by low-weight edges have a low probability of matching, so processing the edges of the record graph reduces the number of unnecessary record comparisons.
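Building the record graph G = (R, E) can be sketched as below. The representation is an assumption made for illustration: records are given as tag sets keyed by a record id, edges are stored in a dict keyed by an ordered id pair, and pairs with zero similarity get no edge (the text does not say how such pairs are handled). The `jaccard` helper stands in for whatever overall similarity is used.

```python
from itertools import combinations

def jaccard(a, b):
    """Stand-in similarity: intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_record_graph(records, similarity=jaccard):
    """records: dict record_id -> tag set.
    Returns (nodes, edges); edges maps (u, v) with u < v to the edge weight."""
    edges = {}
    for u, v in combinations(sorted(records), 2):
        weight = similarity(records[u], records[v])
        if weight > 0:  # ASSUMPTION: zero-similarity pairs are left unconnected
            edges[(u, v)] = weight
    return set(records), edges

# Hypothetical tag sets for three records.
records = {
    "r1": {"tom", "malik", "l.a."},
    "r2": {"tom", "malik"},
    "r3": {"harry", "green"},
}
nodes, edges = build_record_graph(records)
```

All pairs are still scored once here; the saving comes later, when pruning and blocking restrict the expensive attribute-level comparison to records that remain connected.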
Other steps and parameters are the same as in one of Embodiments 1 to 3.
Embodiment 5: This embodiment differs from Embodiments 1 to 4 in how the record graph is simplified by pruning in Step 2. The detailed process is:
Edges with small weights are deleted according to given rules, that is, the graph is pruned, which reduces the number of redundant comparisons.
To simplify the description that follows, the definition of a region is given first.
Definition 4 (region): A record r appears in the record graph G as a node. The record node itself, the neighbouring records that share an edge with it, and the edges connecting them together form a region, called a point region. A point region is a subgraph of the record graph G, denoted G_r = {{r} ∪ R_r, E_r}, where R_r is the set of neighbour nodes connected to r by an edge and E_r is the set of edges connecting r to its neighbours.
Pruning has two main components: the pruning centre and the pruning rule.
Pruning centres fall into two classes. Edge-centric pruning traverses the edge set of the graph to select the globally best pairs to compare, and filters out the edges that do not satisfy the pruning rule. Node-centric pruning traverses all nodes of the graph and, for each record node, looks within its point region for the best set of pairs to compare with it, that is, the records whose edges to this record have the largest weights.
Pruning rules are classified by function into weight thresholds and cardinality thresholds. A weight threshold specifies the minimum weight of a retained edge; all edges below this weight are deleted. A cardinality threshold gives the maximum number of edges retained in the graph; the top-k edges by weight are kept. The cardinality threshold bounds the number of pairs to compare and suits applications with limited time resources. The weight threshold decides whether to trim an edge according to the matching probability of the record pair it connects, and suits applications that value effectiveness. Pruning rules are also classified by scope into global and local thresholds. A global threshold applies to the whole graph, that is, to all its edges; a local threshold applies to a subset of the graph, namely the point region of one node.
The pruning method is one of edge-centric cardinality pruning, node-centric cardinality pruning, edge-centric threshold pruning, or node-centric threshold pruning:
(1) Edge-centric cardinality pruning:
The global cardinality threshold k specifies the total number of edges the record graph retains: the k edges with the largest weights are kept. The low-weight edges can be deleted efficiently by sorting the edge set in descending order of weight.
(2) Node-centric cardinality pruning:
For each node r_i, the top-k edges connected to r_i (the k edges with the largest weights) are retained. For node r_i, take its point region G_ri = {{r_i} ∪ R_ri, E_ri} and the cardinality threshold k_ri of the current record node, traverse the edges of the subgraph, retain the top-k edges by weight (if n edges are connected to r_i, the k with the highest weights are kept and the rest are deleted), and delete the other edges connected to r_i;
where R_ri is the set of neighbour nodes connected to record node r_i by an edge, E_ri is the set of edges connecting record node r_i to its neighbours, and the point region G_ri is a subgraph of the record graph G;
In general, the cardinality threshold of each node should depend on the size of the edge set in its region (for example, k_ri = 0.1 × |E_ri|).
(3) Edge-centric threshold pruning:
Pruning is carried out globally with a weight threshold: choose a minimum edge weight w_min (obtained experimentally), traverse all edges in the graph, and delete the edges whose weight is below w_min;
After all edges have been traversed and those with weight below the preset threshold w_min deleted, the remaining edges are retained and output. In general, the weight between matching records is greater than the weight between non-matching records, so the goal in choosing w_min is to find the balance point between the two.
(4) Node-centric threshold pruning:
The scope of pruning is chosen first. With a global threshold, one unified threshold is applied to all nodes of the record graph, and the pruning process is then identical to edge-centric threshold pruning. With local thresholds, a specific threshold can be chosen for particular nodes according to user needs; in essence, this applies edge-centric threshold pruning to the point region of node r_i. The main difference from edge-centric threshold pruning is that a different threshold can be used for each node. The method first obtains the point region G_ri = {{r_i} ∪ R_ri, E_ri} of r_i, then determines the minimum edge weight for pruning the subgraph from the supplied local threshold, and finally traverses the edges in E_ri and deletes those whose weight is below the given threshold.
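The four pruning schemes can be sketched on a graph stored as a dict from node pairs to weights. Several choices here are assumptions rather than statements from the text: in node-centric cardinality pruning an edge survives if at least one of its endpoints keeps it, the per-node k is floored at 1 so that small regions are not emptied, and in node-centric threshold pruning an edge is deleted if it falls below the local threshold of either endpoint. All function names are illustrative.

```python
import heapq

def _top_k(edges, k):
    """Return the k heaviest edges of an edge dict."""
    return dict(heapq.nlargest(k, edges.items(), key=lambda item: item[1]))

def edge_cardinality_pruning(edges, k):
    """(1) Keep the k globally heaviest edges."""
    return _top_k(edges, k)

def node_cardinality_pruning(edges, k_for=lambda m: max(1, int(0.1 * m))):
    """(2) For each node keep the top-k edges of its point region.
    k_for maps the region's edge count to k (0.1 * |E_ri|, floored at 1 here)."""
    kept = {}
    for node in {n for pair in edges for n in pair}:
        region = {pair: w for pair, w in edges.items() if node in pair}
        kept.update(_top_k(region, k_for(len(region))))  # ASSUMPTION: union of kept edges
    return kept

def edge_threshold_pruning(edges, w_min):
    """(3) Delete every edge whose weight is below the global threshold w_min."""
    return {pair: w for pair, w in edges.items() if w >= w_min}

def node_threshold_pruning(edges, local, default=0.0):
    """(4) Like (3), but each node may carry its own threshold in `local`."""
    def limit(node):
        return local.get(node, default)
    return {(u, v): w for (u, v), w in edges.items()
            if w >= limit(u) and w >= limit(v)}

# Hypothetical four-node record graph.
edges = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.6, ("c", "d"): 0.4}
```

On this toy graph the two edge-centric schemes agree for k = 2 and w_min = 0.5, while node-centric cardinality pruning additionally keeps the weak edge to d, because d has no better option in its own point region.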
Other steps and parameters are the same as in one of Embodiments 1 to 4.
Embodiment 6: This embodiment differs from Embodiments 1 to 5 in how the pruned record graph is partitioned into blocks in Step 3. The detailed process is:
The pruned graph is G = {R, E}, where R is the record set and E is the edge set;
Take any record r_i, create a block b_i and place r_i in b_i. If a node r_j in the point region of r_i is connected by an edge to every node in b_i, place r_j in b_i and delete the edges between r_j and all nodes in b_i; repeat this until all neighbours of r_i have been traversed. If a node in b_i has then become isolated, with no edge connected to it in the graph, delete that node from the graph. Repeat Step 3 until the graph G is empty, at which point blocking is complete and the block set B = {b_1, b_2, ..., b_|B|} is obtained;
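The blocking procedure described above can be sketched as follows. The text says to take any record as the seed; this sketch picks the smallest remaining id and visits neighbours in sorted order purely to make the result deterministic, which is an assumption. Note that a record can appear in more than one block when it still has edges left after its first block is formed.

```python
def block_records(nodes, edges):
    """nodes: iterable of record ids; edges: iterable of (u, v) pairs of the
    pruned record graph. Returns the list of blocks (sets of record ids)."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    blocks = []
    while adj:                                   # repeat until the graph is empty
        seed = min(adj)                          # ASSUMPTION: deterministic seed choice
        block = [seed]
        for nb in sorted(adj[seed]):             # neighbours in the seed's point region
            if all(nb in adj[member] for member in block):
                for member in block:             # delete edges between nb and the block
                    adj[member].discard(nb)
                    adj[nb].discard(member)
                block.append(nb)
        for member in block:                     # drop nodes that became isolated
            if not adj[member]:
                del adj[member]
        blocks.append(set(block))
    return blocks

# A triangle a-b-c forms one block; the isolated record d forms its own block.
triangle = block_records("abcd", [("a", "b"), ("a", "c"), ("b", "c")])
# On the path a-b-c, record b ends up in two blocks.
path = block_records("abc", [("a", "b"), ("b", "c")])
```

Each pass either removes at least one edge (the seed's first neighbour always joins the block) or removes an isolated seed, so the loop terminates.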
Record comparison and linking based on attribute mapping
After blocking, the records within a block contain many repeated attribute values, which helps in mapping attributes. By examining the data in the block as a whole and applying semantic mapping, it may turn out that one or a few records differ from the rest; separating such records out improves accuracy. Integrating the information of matched records lets users obtain what they need and reduces the time spent processing record information.
Heterogeneous attribute mapping
A common approach in entity resolution is to compute attribute value similarity over unified attributes. In a multi-source heterogeneous environment such as a data space, however, no accurate attribute mapping is available, so attribute values can be used the other way round to match attributes.
Instantiation: an entity record consists of a set of attributes and values, and an attribute's value may be misspelled or null. Null values cannot be compared. If an entity is described by an attribute whose value is not null, the attribute is said to instantiate the entity. The whole attribute semantic mapping process is shown in Figs. 4a, 4b and 4c.
Other steps and parameters are the same as in one of Embodiments 1 to 5.
Embodiment 7: This embodiment differs from Embodiments 1 to 6 in how the attribute mapping cluster is built in Step 4. The detailed process is:
Step 4.1: for two attributes from different entities, compute their similarity values:
1) Attribute name similarity, denoted S_L (computed with edit distance or another suitable string similarity function), obtained by comparing the two normalized attribute names (depending on the characteristics of the data set, normalization can unify letter case and expand abbreviations to full names);
Whether this component is included in the computation can be decided from the magnitude of the attribute name similarities in the data set.
2) Attribute value similarity, denoted S_V. Two attributes may take different values on different records: for example, if attribute att1 has 3 values in the record set and attribute att2 has 4, the similarity of the two most similar values of att1 and att2 is taken as the attribute value similarity of the two attributes. All values of the two attributes are compared and the highest similarity score is retained.
Attribute match pairs are obtained from the attribute name similarity and the attribute value similarity. From the set of attribute match pairs, attribute mapping sets are computed, and the collection of attribute mapping sets is called the attribute mapping cluster.
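Step 4.1 can be sketched with a normalized edit-distance similarity. The text allows any suitable string similarity; the choice of Levenshtein distance scaled by the longer string's length, and the simple name normalization (lower-casing, removing underscores and spaces, no abbreviation expansion), are assumptions made here for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def string_sim(a, b):
    """ASSUMPTION: similarity = 1 - edit distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def normalize_name(name):
    """Hypothetical normalization: unify case, drop underscores and spaces."""
    return name.lower().replace("_", "").replace(" ", "")

def name_similarity(name1, name2):
    """S_L: similarity of the two normalized attribute names."""
    return string_sim(normalize_name(name1), normalize_name(name2))

def value_similarity(values1, values2):
    """S_V: highest similarity over all value pairs of the two attributes."""
    return max((string_sim(v.lower(), w.lower())
                for v in values1 for w in values2), default=0.0)
```

For instance, `name_similarity("birthPlace", "Place_of_birth")` is low even though both attributes hold the value L.A., which is why the value similarity S_V is needed when names come from unrelated sources.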
The attributes within one attribute mapping set match one another exactly. The method does not allow an attribute mapping set to contain many loosely related attributes, and it follows the widely used no-duplicates assumption [6] (Imen Megdiche, Oliver Teste, Cassia Trojahn. An extensible linear approach for holistic ontology matching [C]. In ISWC, Part I, LNCS vol. 9981. Springer, Kobe, Japan, 2016: 393-410): each attribute under one namespace (an entity, or a data source with semantic constraints) can match at most one attribute under another namespace, which imposes a global 1:1 matching constraint [7] (Chuncheng Xiang, Baobao Chang, Zhifang Sui. An ontology matching approach based on affinity-preserving random walks [C] // Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI'15), 2015: 1471-1477). Deriving the attribute mapping sets under the global 1:1 constraint is not straightforward, however, because an attribute is often involved in more than one match pair, and simply choosing the pair with the highest estimated matching probability can lead to conflicts.
The process of obtaining the whole attribute mapping cluster is called global attribute mapping. It takes a set of attribute match pairs as input and returns an attribute mapping cluster. Let I be the number of namespaces and J the number of attribute match pairs. For two different namespaces n_s and n_t, let M_s and M_t be the numbers of attributes under n_s and n_t, and let p_a^s and p_b^t denote the a-th attribute under n_s and the b-th attribute under n_t. The overall attribute matching is optimized by maximizing the total matching probability of all matched attribute pairs, subject to the global 1:1 constraint:
where σ(p_a^s, p_b^t) is the matching probability of the attribute pair p_a^s and p_b^t, and Θ(p_a^s, p_b^t) is an indicator function whose value is 1 when the pair p_a^s, p_b^t is selected to form an attribute mapping set and 0 otherwise.
Algorithm 1, which forms the attribute mapping cluster from the set of attribute match pairs, proceeds as follows:
Step 4.2: sort the attribute match pairs in descending order of matching probability (line 2) and process them in that order. For an attribute pair p_a, p_b: if the attribute mapping cluster N contains no attribute mapping sets N_i, N_j that contain p_a and p_b respectively, add the attribute mapping set {p_a, p_b} to N (lines 6-7). If N contains an attribute mapping set N_i that contains p_a, and that set also contains an attribute from the same namespace as p_b, delete the pair p_a ≈ p_b (lines 8-9). (A data source is a namespace; if it is not known which data source a record comes from, the record itself is a namespace. An attribute from the same namespace as p_b is one that belongs to the same record as p_b, that is, the two are attributes of one record.) Otherwise, that is, when neither of the two conditions above holds, merge N_i and N_j into a larger attribute mapping set N_k, add N_k to N, and delete N_i and N_j from N (lines 11-12). Repeat Step 4.2 until all J pairs have been processed.
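One reading of Step 4.2 is the greedy procedure sketched below. It is an interpretation, not the patent's Algorithm 1 (whose listing is not reproduced in the text): attributes are modelled as hypothetical (namespace, name) tuples, an attribute without an existing mapping set is treated as a singleton, and a pair is dropped whenever accepting it would put two attributes of one namespace into the same mapping set. This single rule covers the create, reject and merge cases described above.

```python
def build_mapping_cluster(pairs):
    """pairs: iterable of (attr_a, attr_b, probability), attr = (namespace, name).
    Returns the attribute mapping cluster as a list of sets of attributes."""
    cluster = []
    for pa, pb, _ in sorted(pairs, key=lambda p: p[2], reverse=True):
        set_a = next((s for s in cluster if pa in s), None)
        set_b = next((s for s in cluster if pb in s), None)
        if set_a is not None and set_a is set_b:
            continue                      # pa and pb are already mapped together
        merged = (set_a or {pa}) | (set_b or {pb})
        namespaces = [ns for ns, _ in merged]
        if len(namespaces) != len(set(namespaces)):
            continue                      # would break the 1:1 constraint: drop the pair
        # Replace the old sets (if any) by the merged mapping set.
        cluster = [s for s in cluster if s is not set_a and s is not set_b]
        cluster.append(merged)
    return cluster

# Hypothetical match pairs over three namespaces s1, s2, s3.
pairs = [
    (("s1", "Name"), ("s2", "FullName"), 0.9),
    (("s2", "FullName"), ("s3", "Label"), 0.8),
    (("s1", "Name"), ("s3", "Place_of_birth"), 0.5),  # conflicts with s3.Label
]
cluster = build_mapping_cluster(pairs)
```

Because pairs are processed in descending probability, the lower-probability pair is the one discarded when two pairs compete for the same attribute; the greedy result approximates, but does not guarantee, the maximum total matching probability.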
Other steps and parameters are the same as in one of Embodiments 1 to 6.
Embodiment 8: This embodiment differs from Embodiments 1 to 7 in how the goodness of the attribute mapping sets is computed in Step 5. The detailed process is:
At this point the attribute mapping cluster has been obtained. All attributes in one mapping set are semantically mapped to one another, and the information they correspond to is one category of information. Zhen Lingmin et al. [8] (ZHEN Lingmin, YANG Xiaochun, WANG Bin, et al. Entity Analysis Technology Based on Attribute Weight [J]. Computer Research and Development, 2013, 50(Suppl.): 281-289) argue that the accuracy and efficiency of entity resolution can be increased by giving higher weights to important attributes and discarding unimportant ones. The goodness of an attribute mapping set corresponds to this weight. It is used as a weight in the subsequent entity resolution to improve accuracy.
Goodness of an attribute mapping set (good()): the attribute names in an attribute mapping set may differ, but they correspond to one category of information. The degree to which that category of information, through its attribute values, helps entity resolution is called the goodness of the attribute mapping set.
The invention computes the goodness of an attribute mapping set from the following three aspects:
A. Discriminability:
If the values of an attribute mapping set vary widely and span a large range, the set is of little help to entity resolution. Values that vary within a relatively small range are more useful to entity resolution. Let R be the record set; for an attribute mapping set N_i, a discriminability goodness function is defined on R, denoted discr(N_i):
where val(r, N_i) extracts the attribute value of record r on the attributes in attribute mapping set N_i, norm() normalizes attribute values that come from different sources, and ∪ is the union symbol. (The attributes in an attribute mapping set come from different namespaces and are assumed, as a result of the computation, to refer to the same category of information. The mapping set is given a weight to increase the accuracy of entity resolution, and discriminability is one way of computing that weight. If the attribute values corresponding to the attributes of the mapping set differ greatly, the mapping set is considered less helpful and less likely to refer to one category of information, so it is given a lower weight to avoid harming accuracy.)
B. Richness:
The more values an attribute mapping set has, the richer the information it provides for entity resolution; that is, a record whose value on the attribute mapping set is not null is beneficial to entity resolution. Let R be the record set; for an attribute mapping set N_i, a richness goodness function is defined on R, denoted abund(N_i):
Here Θ(·) is an indicator function: its value is 1 if record r has a value on attribute mapping set N_i and 0 otherwise.
Discriminability and richness are summed, the mapping sets are sorted in descending order of the sum, and diversity is computed in that order;
C, diversity:
In order to increase diversity, and the repetition between different attribute mapping ensemblen is reduced, for there are the attributes of redundancy Mapping ensemblen, it is possible to reduce its goodness.One attribute mapping ensemblen of every selection needs it and determines mapping ensemblen selected to use before It is compared, if its information and mapping ensemblen before produce a large amount of repetition, reduces its diversity.If NiIt is current Mapping ensemblen, NselectedFor the mapping cluster having been selected;For giving Nselected, the diversity of Ni is denoted as div (Ni| Nselected):
Wherein, SiIt indicates by NiInstantiation set of records ends (if for attribute mapping ensemblen Ni, wherein have att1, Att2 ..., attn } total n attribute, if record r includes one of attribute, and this attribute value be not it is empty, then claim attribute to reflect It penetrates collection Ni and has instantiated this record r.), i.e.,SjIt indicates by NjThe set of records ends of instantiation, i.e.,Sv(vx,vy) it is vx、vySimilarity (using editing distance calculate or other suitable characters Similarity function of going here and there calculates), vxIt is record r in NiUnder attribute value, vyIt is record r in NjUnder attribute value;div(Ni|Nj) table Show and chooses NjAfter the calculating mapping ensemblen for calculating record similarity, whether N is chosen consideringiSimilar reflect is recorded as calculating When penetrating collection, N is considerediInformation and NjInformation multiplicity;
Goodness is combined in two steps. The first step merges discrimination and richness, which together reflect the static goodness of an attribute mapping set, and sorts the attribute mapping sets by static goodness in descending order. The second step updates the goodness by taking diversity into account. Let N be the sorted attribute mapping cluster; for a mapping set N_i ∈ N, the overall goodness of N_i is good(N_i):
comb(N_i) = α·discr(N_i) + (1-α)·abund(N_i)
Wherein, α is the weight of the discrimination goodness, γ is the weight of the static goodness, and 0 ≤ α, γ ≤ 1; comb(N_i) is the static goodness of the attribute set, which combines discrimination and richness.
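The static part of the goodness can be sketched as follows. The sketch covers only comb(N_i) and the descending sort of the first step; how γ folds diversity into good(N_i) is not assumed here. The function names and the `(discr, abund)` tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple


def static_goodness(discr: float, abund: float, alpha: float = 0.5) -> float:
    """comb(N_i) = alpha * discr(N_i) + (1 - alpha) * abund(N_i)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * discr + (1.0 - alpha) * abund


def rank_by_static_goodness(
    scores: Dict[str, Tuple[float, float]], alpha: float = 0.5
) -> List[str]:
    """Return mapping-set names sorted by comb(N_i), largest first."""
    return sorted(
        scores,
        key=lambda name: static_goodness(*scores[name], alpha=alpha),
        reverse=True,
    )
```

With α = 0.5, a mapping set with discr = 0.8 and abund = 0.4 receives a static goodness of 0.6.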
Other steps and parameters are the same as in one of specific embodiments one to seven.
Specific embodiment 9: this embodiment differs from one of specific embodiments one to eight in that, in step 6, after the goodness of each mapping set in the attribute mapping cluster has been obtained, entity resolution is carried out within each block, thereby excluding data records that were wrongly placed in the block and point to other entities. The calculation proceeds as follows:
Given a record pair r_i, r_j, where r_i has m attributes and r_j has n attributes, of which p attributes are mapped, with p ≤ min(m, n); the mapped attributes are {att_1, att_2, ..., att_p}. The similarity of r_i and r_j is:
Wherein, N_l is an attribute mapping set, and sim_content(r_i.att_l, r_j.att_l) is the similarity of the attribute values of the mapped attribute in the two records r_i and r_j; att_l is the attribute corresponding to an attribute mapping set that contains a mapping between an attribute of r_i and an attribute of r_j. The similarity of the two records is compared with a preset threshold λ; if it is greater than λ, the records are considered a match. Because records are processed sequentially, a matching record pair is merged into a new record, and the new record covers the information of the original pair. The calculation method of sim_content() and the integration process are described in detail below.
The specific solution procedure for sim_content(r_i.att_l, r_j.att_l) is as follows: record information integration based on regex-like expressions
When records are merged, a method similar to regular expressions can be used to combine the information. Like a regular expression, the merged information defines a rule for a class of values, so the two have some similarity. When a record pair is merged, the two values of a mapped attribute are combined into one regex-like expression.
Concept of the regex-like expression
Because the values of mapped attributes are similar, their identical parts can be unified and their different parts retained. This forms a regex-like expression that merges the matched information effectively.
Regex-like expression (Similar to Regular Expression, StRE): let Σ be an alphabet and ε the empty character. A regex-like expression is StRE = S[1]S[2]...S[n], where for any i (1 ≤ i ≤ n) the element S[i] = {c_{i,1}, c_{i,2}, ..., c_{i,n_i}}, and for each j (1 ≤ j ≤ n_i), c_{i,j} ∈ Σ ∪ {ε}. Regex-like expressions are referred to below simply as expressions.
R is an attribute value pair, S is the expression derived from the attribute value pair, and G is the set of instances of S, so that the values of R belong to G. For example, the expression {t, n}ight can be instantiated as tight and night.
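The definition can be made concrete with a short sketch that represents an expression as a list of character sets and enumerates its instance set G. Representing ε by the empty string and the names `from_value` and `instances` are choices made for illustration only.

```python
from itertools import product
from typing import List, Set

EPS = ""  # the empty character ε is modelled as the empty string

# An expression S[1]S[2]...S[n] is a list of elements; each element is a set of characters.
Expression = List[Set[str]]


def from_value(value: str) -> Expression:
    """The expression of a plain attribute value is the value itself."""
    return [{ch} for ch in value]


def instances(expr: Expression) -> Set[str]:
    """Enumerate G, the set of strings obtained by picking one character per element."""
    return {"".join(choice) for choice in product(*expr)}
```

For example, `instances([{"t", "n"}, {"i"}, {"g"}, {"h"}, {"t"}])` yields the two strings tight and night.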
The expression of a single attribute value is the value itself. An expression formed by merging two attribute values, or by merging an attribute value with an expression, has to be derived by calculation.
(1) Calculating the similarity of regex-like expressions. The detailed process is:
The expression of an attribute value is the value itself, so the similarity of two such expressions can be calculated with edit distance. Denote the edit distance function by D(i, j), where i and j are the lengths of the two strings a and b; the edit distance is calculated by dynamic programming. This gives the edit distance similarity function between strings, sim_edit(a, b):
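A minimal sketch of an edit distance similarity is given here. It assumes that the distance is normalised by the length of the longer string, which agrees with the worked example of this embodiment (distance 2 on two 7-character values gives 1 - 2/7 = 5/7); other normalisations would also fit that example, so the choice is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance D(|a|, |b|) computed by dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def sim_edit(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalisation by the longer length is assumed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

The values cutekid and cutkind (written without the space, as the 7-element matrix of the worked example implies) have distance 2 under this sketch.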
The similarity between an expression and an attribute value, or between two expressions, is calculated in a way similar to edit distance. Because an expression element may contain several characters or the empty character, the edit distance function is changed slightly, according to the following principles: (1) for multiple characters, two expression elements match if they contain a common character; (2) for the empty character, if the elements do not match, the element containing the empty character takes the empty character, which contributes no length and does not affect the edit distance.
The minimum edit distance of two expressions is obtained from these principles. The edit distance D(|S_1|, |S_2|) of the expressions S_1 and S_2 is as follows:
S[i] denotes the i-th element of an expression, S_1[i] the i-th element of S_1 and S_2[j] the j-th element of S_2, with the initial conditions:
Wherein, the function MU corresponds to principle (1): two expression elements match as long as they contain a common character;
The function NU corresponds to principle (2): if an expression element contains the empty character, the element is simply ignored and does not affect the edit distance calculation, so that the edit distance is kept minimal. The final edit distance of the two expressions is D(|S_1|, |S_2|);
The edit distance similarity between the expressions is then sim_edit(S_1, S_2):
Wherein, sim_edit(S_1, S_2) serves as sim_content(), and S_1 and S_2 are the two attribute values att_l in the function sim_content(r_i.att_l, r_j.att_l).
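The two principles can be sketched as a variant of the usual dynamic program in which MU and NU replace the unit costs. The boundary conditions and the normalisation by the longer expression length are assumptions, and the names `mu`, `nu`, `expr_distance` and `sim_expr` are illustrative.

```python
from typing import FrozenSet, List, Sequence

EPS = ""  # the empty character ε
Element = FrozenSet[str]


def expr(value: str) -> List[Element]:
    """A plain attribute value seen as an expression of single-character elements."""
    return [frozenset({ch}) for ch in value]


def mu(x: Element, y: Element) -> int:
    """Principle (1): elements match (cost 0) if they share a character."""
    return 0 if x & y else 1


def nu(x: Element) -> int:
    """Principle (2): an element containing ε can be skipped at no cost."""
    return 0 if EPS in x else 1


def expr_distance(s1: Sequence[Element], s2: Sequence[Element]) -> int:
    """Minimum edit distance D(|S1|, |S2|) between two expressions."""
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        m[i][0] = m[i - 1][0] + nu(s1[i - 1])
    for j in range(1, len(s2) + 1):
        m[0][j] = m[0][j - 1] + nu(s2[j - 1])
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            m[i][j] = min(
                m[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]),
                m[i - 1][j] + nu(s1[i - 1]),
                m[i][j - 1] + nu(s2[j - 1]),
            )
    return m[-1][-1]


def sim_expr(s1: Sequence[Element], s2: Sequence[Element]) -> float:
    """Similarity of two expressions; the normalisation is an assumption."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - expr_distance(s1, s2) / longest
```

Under this sketch an expression such as cut{e, ε}ki{n, ε}d is at distance 0 from each of the values it can instantiate, because the optional elements are skipped without cost.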
During comparison and merging, each attribute value of a record is treated as an expression. After the comparison has been calculated, attribute information is generated from the expressions.
(2) Generating regex-like expressions
A new expression is generated from two expressions on the principle of introducing the fewest irrelevant instances. For example, for the attribute value pair cute kid and cut kind, the expression S_1 = cut{e, ε}ki{d, n}{d, ε} yields 8 instances and introduces 6 irrelevant instances, whereas the expression S_2 = cut{e, ε}ki{n, ε}d yields 4 instances and introduces 2 irrelevant instances. S_2 is therefore better than S_1.
The present invention uses the edit distance matrix M of the expressions: starting from M[|S_1|, |S_2|] and tracing back to M[0, 0] yields the merged expression. The generation rules are as follows:
In particular:
Wherein, k is a position in the trace-back and S[k] is the element at the current position; the trace-back ends when i = j = 0. The expressions S_1 and S_2 are then merged into the regex-like expression S = ...S[k]...S[0] (with k arranged in the reverse of the trace-back order). When the three generation conditions are satisfied at the same time, the function MU has priority over the function NU, which favors merging identical characters and reduces the number of irrelevant instances introduced. NU(S_2[j]) is the function applied to the j-th element of S_2, NU(S_1[i]) is the function applied to the i-th element of S_1, and MU(S_1[i], S_2[j]) is the function applied to the i-th element of S_1 and the j-th element of S_2;
The similarity calculation and generation of regex-like expressions are illustrated with an example. Let the attribute values be attvalue_1 = "cute kid" and attvalue_2 = "cut kind", with the corresponding expressions S_1 = cute kid and S_2 = cut kind. The distance matrix obtained from the generation formula of the regex-like expression is shown in Table 1.
Table 1 Similarity calculation and generation matrix of the regex-like expressions
Each value M[i, j] in the matrix is the edit distance between the first i elements of S_1 and the first j elements of S_2. Because the expressions here are the attribute values themselves, the edit distance of the expressions equals the edit distance of the attribute values. The edit distance of S_1 and S_2 is M[7, 7] = 2, and the expression similarity is 1 - 2/7 = 5/7. Assuming the similarity of the two records exceeds the threshold, the match is confirmed and the two attribute values need to be merged. Following the generation rules, the trace-back starts from M[7, 7] in the expression matrix M. M[7, 7] = M[6, 6] + MU(S_1[6], S_2[6]), so the trace-back moves to position M[6, 6], and so on. The trace-back path is marked in bold in Table 1. The expression finally obtained is cut{e, ε}ki{ε, n}d.
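The trace-back of this example can be sketched as follows. The sketch assumes that a diagonal step merges the two elements by set union and that a skipped element is merged with ε; it prefers the diagonal (MU) step when several steps are possible, in line with the stated priority. The values are written without the space, as the 7-element matrix implies, and the names `merge` and `show` are illustrative.

```python
from typing import FrozenSet, List

EPS = ""  # the empty character ε
Element = FrozenSet[str]


def expr(value: str) -> List[Element]:
    return [frozenset({ch}) for ch in value]


def mu(x: Element, y: Element) -> int:
    return 0 if x & y else 1


def nu(x: Element) -> int:
    return 0 if EPS in x else 1


def merge(s1: List[Element], s2: List[Element]) -> List[Element]:
    """Build the distance matrix M, then trace back from M[|S1|, |S2|] to M[0, 0]."""
    n, k = len(s1), len(s2)
    m = [[0] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        m[i][0] = m[i - 1][0] + nu(s1[i - 1])
    for j in range(1, k + 1):
        m[0][j] = m[0][j - 1] + nu(s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            m[i][j] = min(m[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]),
                          m[i - 1][j] + nu(s1[i - 1]),
                          m[i][j - 1] + nu(s2[j - 1]))
    merged: List[Element] = []
    i, j = n, k
    while i > 0 or j > 0:
        if i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]):
            merged.append(s1[i - 1] | s2[j - 1])  # MU step has priority
            i, j = i - 1, j - 1
        elif i > 0 and m[i][j] == m[i - 1][j] + nu(s1[i - 1]):
            merged.append(s1[i - 1] | {EPS})      # element of S1 made optional
            i -= 1
        else:
            merged.append(s2[j - 1] | {EPS})      # element of S2 made optional
            j -= 1
    merged.reverse()  # the trace-back visits positions from last to first
    return merged


def show(e: List[Element]) -> str:
    """Render an expression, writing ε for the empty character."""
    parts = []
    for el in e:
        chars = sorted(c or "ε" for c in el)
        parts.append(chars[0] if len(chars) == 1 else "{" + ",".join(chars) + "}")
    return "".join(parts)
```

Calling `show(merge(expr("cutekid"), expr("cutkind")))` renders the merged expression with the two optional elements {e,ε} and {n,ε}, the same sets as in the expression obtained in the example.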
(3) Generating entity information
After the records in a block have been compared and merged step by step, the attribute values in an attribute mapping set have been combined into one final expression for that mapping set. While a regex-like expression is generated, the occurrence frequency of each character can be recorded. Using this frequency-annotated expression, the character with the highest frequency in each element is selected as the value of that element, so that the most frequent string is generated as the attribute value of the attribute mapping set. For example, the three attribute values Mike Doe, M.Doe and Mike D. give the expression M{i:2, .:1}{k:2, ε:1}{e:2, ε:1}D{o:2, .:1}{e:2, ε:1}, and the attribute value Mike Doe is finally obtained according to the frequencies.
With this kind of regex-like expression, erroneous characters caused by typesetting, spelling and similar problems occur less frequently than the correctly spelled characters, so taking the character with the highest frequency helps filter noise.
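The frequency-based selection can be sketched as follows, with each element of the expression stored as a character-to-count dictionary. This layout and the name `most_frequent_value` are assumptions; ties are resolved by dictionary order, which the text does not specify.

```python
from typing import Dict, List

EPS = ""  # the empty character ε

# Each element maps a character (or ε) to the number of times it was observed.
CountedExpression = List[Dict[str, int]]


def most_frequent_value(expression: CountedExpression) -> str:
    """Pick the most frequent character of every element and concatenate them."""
    chars = []
    for element in expression:
        best = max(element, key=element.get)  # first maximum wins on ties
        chars.append(best)
    return "".join(chars)


mike = [
    {"M": 3},
    {"i": 2, ".": 1},
    {"k": 2, EPS: 1},
    {"e": 2, EPS: 1},
    {"D": 3},
    {"o": 2, ".": 1},
    {"e": 2, EPS: 1},
]
```

Because the expression of the example has no element for the space, `most_frequent_value(mike)` produces the characters of Mike Doe without the space.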
Other steps and parameters are the same as in one of specific embodiments one to eight.
The beneficial effects of the present invention are verified with the following embodiment:
Embodiment one:
This embodiment is carried out according to the following steps:
Experimental data set
The blocking experiments use two data sets, referred to as D_1 and D_2. D_1 is drawn from the movie information common to DBPedia and IMDB. D_2 is drawn from a data set built from two versions of DBPedia [9] (George Papadakis, Jonathan Svirsky, Avigdor Gal, Themis Palpanas. Comparative analysis of approximate blocking techniques for entity resolution[J]. Proceedings of the VLDB Endowment, 2016, 9(9): 684-695). D_1 has 50796 records covering 22403 entities, and D_2 contains 335479 records covering 89258 entities.
Experimental evaluation standard
The evaluation criteria of [10] are used here (Chirag Nagpal, Kyle Miller, Benedikt Boecking, Artur Dubrawski. An Entity Resolution approach to isolate instances of Human Trafficking online[J]. Computer Science, 2017, 3(18): 10-18):
Blocking criteria: pair completeness (PC), reduction ratio (RR), and F value (F = 2 × PC × RR / (PC + RR));
Resolution criteria: precision (P), recall (R), and F value (F = 2 × P × R / (P + R)).
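The criteria above can be sketched in a few lines. The F value follows the formula given; the definitions of PC (share of true matching pairs that survive blocking) and RR (share of all pairwise comparisons that blocking avoids) are the usual ones from the blocking literature and are assumed here, since the text itself states only the F formula. The function names are illustrative.

```python
from typing import FrozenSet, Set

Pair = FrozenSet[int]  # an unordered pair of record identifiers


def f_value(x: float, y: float) -> float:
    """Harmonic mean, F = 2xy / (x + y); used for (PC, RR) and for (P, R)."""
    return 0.0 if x + y == 0 else 2 * x * y / (x + y)


def pair_completeness(candidates: Set[Pair], true_pairs: Set[Pair]) -> float:
    """Assumed definition: matching pairs kept by blocking / all matching pairs."""
    return len(candidates & true_pairs) / len(true_pairs) if true_pairs else 1.0


def reduction_ratio(n_candidates: int, n_all_pairs: int) -> float:
    """Assumed definition: 1 - candidate comparisons / exhaustive comparisons."""
    return 1.0 - n_candidates / n_all_pairs if n_all_pairs else 0.0
```

With 4 records (6 possible pairs), 2 candidate pairs and 1 of 2 true pairs retained, PC = 0.5, RR = 2/3 and F = 4/7.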
Experimental analysis of the blocking method
In the blocking process, the present invention describes four pruning methods. The parameter α used to calculate record similarity is set to 0.6, and pruning is then carried out. For the edge-centric threshold pruning scheme (ECWP), the F value is highest when the threshold w_min is 0.6 and 0.5 on the two data sets respectively. For the node-centric threshold pruning scheme (NCWP), the experiment is based on a uniform-distribution assumption and uses a unified weight threshold, so the procedure and thresholds are the same as for ECWP. For the edge-centric cardinality pruning scheme (ECCP), k edges in total are retained during pruning, where k changes with the total number of edges |E|; the F value is highest on both data sets when k = 0.5·|E|. For the node-centric cardinality pruning scheme (NCCP), k_{r_i} of the edges connected to each node r_i are retained during pruning; again under the uniform-distribution assumption, the F value is highest when k_{r_i} = 0.5·|E_{r_i}|. For each of the four pruning methods, the threshold at which it performs best is taken, and the results are compared in Table 2.
Table 2 PC and RR values of the pruning methods on the two data sets
It can be seen that the edge-centric pruning methods cut more useless record pairs and emphasize efficiency, which suits large-scale entity resolution tasks, especially when few matching entities are expected; the node-centric pruning methods retain more matching record pairs. The edge-centric algorithms keep the top-k edges by weight or the edges whose weight exceeds a preset threshold, whereas the node-centric algorithms ensure that every node stays connected to its most similar records and are better suited to applications that emphasize accuracy. The weight-threshold algorithms keep the record pairs with higher similarity, which safeguards the accuracy of the algorithm; the cardinality thresholds control the number of record pairs to be compared by keeping the top-k edges by weight, which can affect accuracy but guarantees the efficiency of the method.
Overall accuracy stays above 97% and the reduction ratio stays above 60%, which shows that pruning the blocking graph discards a certain number of negligible record pairs and thus secures the efficiency of the algorithm to some extent.
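The pruning schemes compared above can be sketched on a small weighted graph. The sketch assumes an undirected graph stored as an edge-to-weight dictionary; for node-centric cardinality pruning it additionally assumes that an edge survives if either endpoint retains it and that k_{r_i} is rounded up, neither of which is stated in the text. NCWP is taken to coincide with ECWP, as in the experiment. All function names are illustrative.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Edge = Tuple[str, str]
Graph = Dict[Edge, float]  # undirected edge (u, v) -> similarity weight


def ecwp(graph: Graph, w_min: float) -> Graph:
    """Edge-centric threshold pruning: drop edges with weight below w_min."""
    return {e: w for e, w in graph.items() if w >= w_min}


ncwp = ecwp  # uniform threshold for all nodes, same procedure as ECWP


def eccp(graph: Graph, k: int) -> Graph:
    """Edge-centric cardinality pruning: keep the k heaviest edges overall."""
    return dict(sorted(graph.items(), key=lambda item: item[1], reverse=True)[:k])


def nccp(graph: Graph, ratio: float = 0.5) -> Graph:
    """Node-centric cardinality pruning: each node keeps its top ratio*degree edges."""
    incident = defaultdict(list)
    for edge, w in graph.items():
        for node in edge:
            incident[node].append((w, edge))
    kept: Graph = {}
    for edges in incident.values():
        k = math.ceil(ratio * len(edges))  # rounding up is an assumption
        for w, edge in sorted(edges, reverse=True)[:k]:
            kept[edge] = w  # retained if either endpoint keeps it (assumption)
    return kept
```

For a global budget of half the edges, `eccp(graph, int(0.5 * len(graph)))` mirrors the setting k = 0.5·|E| used in the experiment.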
Experimental analysis of attribute mapping and expressions
When the goodness of the attribute mapping cluster is calculated, the experiment sets the parameters to α = 0.5 and λ = 0.4. The shortest-distance method is chosen for comparison. The results are assessed in two stages: first the entity resolution result, and second the generation of real entity information.
For the entity resolution result, the precision and recall of the two methods are shown in Figs. 5a and 5b. As the threshold rises, recall falls and precision rises markedly. On both data sets, the present method reaches its highest F value when the threshold is 0.6. On both data sets the recall of the two methods differs little, with the present method, which is based on attribute mapping and regex-like expressions, slightly ahead. Its precision is significantly higher than that of the shortest-distance method. This shows that the present method is better adapted to the record characteristics of a data space, whereas the shortest-distance method relies more on the semantics of the data and is more effective in semantically rich environments.
For the entity information generation stage, most entity resolution work focuses on the resolution result and pays little attention to generating and merging entity information. Using data set D_1, the number of generated entities and the number of actual entities are shown in Fig. 6. Entities are generated with the optimal threshold found during entity resolution. As the figure shows, the resolution results of the present method are more accurate, and the number of generated entities is closer to the number of real entities. Its accuracy also remains significantly higher than that of the nearest-neighbor method.
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the protection scope of the appended claims.

Claims (9)

1. An entity resolution method for data spaces, characterized in that the method proceeds as follows:
Step 1: construct a record graph;
Step 2: simplify the record graph using a pruning method;
Step 3: partition the pruned record graph into blocks;
Step 4: establish an attribute mapping cluster;
Step 5: calculate the goodness of the attribute mapping sets;
Step 6: after the goodness of each mapping set in the attribute mapping cluster has been obtained, carry out entity resolution within each block.
2. The entity resolution method for data spaces according to claim 1, characterized in that the record graph is constructed in step 1 as follows:
Step 1.1: calculate the similarity between two records;
Step 1.2: construct the record graph from the records of the data space and their similarities.
3. The entity resolution method for data spaces according to claim 2, characterized in that the similarity between two records is calculated in step 1.1 as follows:
Step 1.1.1: calculate the label similarity:
Records are converted into label sets by the label conversion function tag(), and the label similarity of two records is calculated, denoted sim_tag(r_i, r_j):
Wherein, T(r_i) is the standardized label set into which record r_i is converted by the label conversion function; T(r_j) is the standardized label set into which record r_j is converted by the label conversion function;
Step 1.1.2: calculate the relation similarity:
The similarities of the two records over all relations they participate in are combined, denoted sim_rel(r_i, r_j):
Wherein, Nbr(r_i) denotes the set of records connected to record r_i in relation rel, Nbr(r_j) denotes the set of records connected to record r_j in relation rel, and REL denotes the set of all relations appearing on records r_i and r_j;
Step 1.1.3: combine the label similarity and the relation similarity to obtain the comprehensive similarity sim(r_i, r_j):
sim(r_i, r_j) = α·sim_tag(r_i, r_j) + (1-α)·sim_rel(r_i, r_j)
Wherein, α denotes the weight of the label similarity.
4. The entity resolution method for data spaces according to claim 3, characterized in that the record graph is constructed in step 1.2 from the records of the data space and their similarities as follows:
Given the record set R of a data space, an undirected graph G = (R, E) is constructed, referred to as the record graph;
Wherein R is the record set, representing the records in the data space; E is the edge set, in which an edge between two records represents the similarity of that record pair.
5. The entity resolution method for data spaces according to claim 4, characterized in that the record graph is simplified in step 2 using a pruning method as follows:
The pruning method is one of edge-centric cardinality pruning, node-centric cardinality pruning, edge-centric threshold pruning, or node-centric threshold pruning;
(1) Edge-centric cardinality pruning:
A global cardinality threshold k specifies the total number of edges the record graph retains; the k edges with the largest weights are retained;
(2) Node-centric cardinality pruning:
For each node r_i, the top-k edges by weight connected to r_i are retained;
(3) Edge-centric threshold pruning:
Pruning is carried out globally using a weight threshold: a minimum edge weight w_min is chosen, all edges in the graph are traversed, and edges whose weight is below w_min are deleted;
(4) Node-centric threshold pruning:
A single unified threshold is used for all nodes of the record graph, and the pruning process is the same as in the edge-centric threshold pruning scheme.
6. The entity resolution method for data spaces according to claim 5, characterized in that the pruned record graph is partitioned into blocks in step 3 as follows:
The graph after pruning is G = {R, E}, where R is the record set and E is the edge set;
Take any record r_i, create a block b_i and place r_i in b_i. If a node r_j in the neighborhood of r_i is connected by an edge to every node in b_i, place r_j in b_i and delete the edges connecting r_j to the nodes in b_i; repeat this operation until all neighbor nodes of r_i have been traversed. If a node in b_i has then become an isolated node of the graph, with no edge attached to it, delete this node from the graph. Repeat step 3 until graph G is empty. The blocking is then complete, and the block set B = {b_1, b_2, ..., b_|B|} is obtained.
7. The entity resolution method for data spaces according to claim 6, characterized in that the attribute mapping cluster is established in step 4 as follows:
Step 4.1: for two attributes from different entities, calculate the similarity value of the attributes:
The similarity value of the attributes, denoted SV, is obtained by comparing all values of the two attributes and retaining the highest similarity score, which yields attribute matching pairs;
Let I be the number of namespaces and J the number of matched attribute pairs. For two different namespaces n_s and n_t, the numbers of attributes under n_s and n_t are denoted M_s and M_t, and the a-th attribute under n_s and the b-th attribute under n_t are denoted p_a^s and p_b^t respectively. The overall attribute matching is optimized by maximizing the total matching probability of all attribute pairs, subject to a global 1:1 constraint:
Wherein σ(p_a^s, p_b^t) is the matching probability of the attribute pair p_a^s and p_b^t; Θ(p_a^s, p_b^t) is an indicator function whose value is 1 when the attribute pair p_a^s, p_b^t is selected to form an attribute mapping set, and 0 otherwise;
Step 4.2: the attribute matching pairs are sorted in descending order of matching probability and processed in turn. For an attribute pair p_a, p_b: if the attribute mapping cluster N contains no attribute mapping sets N_i, N_j that contain p_a and p_b respectively, the attribute mapping set {p_a, p_b} is added to N; if N contains an attribute mapping set N_i that includes p_a and that also includes an attribute from the same namespace as p_b, the attribute pair p_a ≈ p_b is deleted; otherwise, N_i and N_j are merged into a larger attribute mapping set N_k, N_k is added to N, and N_i and N_j are deleted from N. Step 4.2 is repeated until J is empty.
8. The entity resolution method for data spaces according to claim 7, characterized in that the goodness of the attribute mapping sets is calculated in step 5 as follows:
The goodness of an attribute mapping set is calculated from the following three aspects:
A, discrimination:
Let R be the record set. For an attribute mapping set N_i, a discrimination goodness function is defined on R, denoted discr(N_i):
Wherein, val(r, N_i) extracts the attribute value of record r on the attribute C_i of the attribute mapping set, norm() standardizes attribute values from different sources, and ∪ is the union symbol;
B, richness:
Let R be the record set. For an attribute mapping set N_i, a richness goodness function is defined on R, denoted abund(N_i):
Here Θ() denotes an indicator function: if record r has a value on attribute set N_i, the function value is 1; if it has no value, the function value is 0;
Discrimination and richness are summed, the mapping sets are sorted by this sum in descending order, and diversity is then calculated in that order;
C, diversity:
Let N_i be the current mapping set and N_selected the cluster of mapping sets already selected. Given N_selected, the diversity of N_i is denoted div(N_i | N_selected):
Wherein, S_i denotes the set of records instantiated by attribute mapping set N_i, and S_j denotes the set of records instantiated by attribute mapping set N_j. Sv(v_x, v_y) is the similarity of v_x and v_y, where v_x is the attribute value of record r under N_i and v_y is the attribute value of record r under N_j. div(N_i | N_j) expresses that, once N_j has been chosen as a mapping set for calculating record similarity, the degree of repetition between the information of N_i and that of N_j is taken into account when deciding whether N_i should also be chosen;
Let N be the sorted attribute mapping cluster; for a mapping set N_i ∈ N, the overall goodness of N_i is good(N_i):
comb(N_i) = α·discr(N_i) + (1-α)·abund(N_i)
Wherein, α is the weight of the discrimination goodness, γ is the weight of the static goodness, and 0 ≤ α, γ ≤ 1; comb(N_i) is the static goodness of the attribute set, which combines discrimination and richness.
9. The entity resolution method for data spaces according to claim 8, characterized in that in step 6, after the goodness of each mapping set in the attribute mapping cluster has been obtained, entity resolution is carried out within each block; the calculation proceeds as follows:
Given a record pair r_i, r_j, where r_i has m attributes and r_j has n attributes, of which p attributes are mapped, with p ≤ min(m, n); the mapped attributes are {att_1, att_2, ..., att_p}. The similarity of r_i and r_j is:
Wherein, N_l is an attribute mapping set, and sim_content(r_i.att_l, r_j.att_l) is the similarity of the attribute values of the mapped attribute in the two records r_i and r_j; att_l is the attribute corresponding to an attribute mapping set;
The similarity of the two records is compared with a preset threshold λ; if it is greater than λ, the records are considered a match. Because records are processed sequentially, a matching record pair is merged into a new record, and the new record covers the information of the original pair.
CN201910435269.5A 2019-05-23 2019-05-23 Entity analysis method for data space in movie information data set Expired - Fee Related CN110147393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910435269.5A CN110147393B (en) 2019-05-23 2019-05-23 Entity analysis method for data space in movie information data set

Publications (2)

Publication Number Publication Date
CN110147393A true CN110147393A (en) 2019-08-20
CN110147393B CN110147393B (en) 2021-08-13

Family

ID=67593022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910435269.5A Expired - Fee Related CN110147393B (en) 2019-05-23 2019-05-23 Entity analysis method for data space in movie information data set

Country Status (1)

Country Link
CN (1) CN110147393B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111221995A (en) * 2019-10-10 2020-06-02 南昌市微轲联信息技术有限公司 Sequence matching method based on big data and set theory

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110145286A1 (en) * 2009-12-15 2011-06-16 Chalklabs, Llc Distributed platform for network analysis
CN102646137A (en) * 2012-04-19 2012-08-22 中国人民解放军总参谋部第六十三研究所 Automatic entity basic information generation system and method based on Markov model
CN106202502A (en) * 2016-07-20 2016-12-07 福州大学 In music information network, user interest finds method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
甄灵敏等: "基于属性权重的实体解析技术", 《计算机研究与发展》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111221995A (en) * 2019-10-10 2020-06-02 南昌市微轲联信息技术有限公司 Sequence matching method based on big data and set theory
CN111221995B (en) * 2019-10-10 2023-10-03 南昌市微轲联信息技术有限公司 Sequence matching method based on big data and set theory

Also Published As

Publication number Publication date
CN110147393B (en) 2021-08-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210813)