CN103425711B - Object value alignment method based on multiple object instances - Google Patents


Info

Publication number
CN103425711B
Authority
CN
China
Prior art keywords
value
attribute
mrow
pair
property
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210166855.2A
Other languages
Chinese (zh)
Other versions
CN103425711A (en)
Inventor
姜珊珊
郑继川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to CN201210166855.2A (patent CN103425711B)
Priority to JP2013109182A (patent JP2013246826A)
Publication of CN103425711A
Application granted
Publication of CN103425711B
Legal status: Active
Anticipated expiration


Abstract

The invention provides a method for aligning the attribute values of heterogeneous instances of an object, comprising: performing attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features; ranking all attribute-value pairs in the attribute-value pair set that belongs to an obtained domain feature; and selecting a suitable common substring from the attribute values of the ranked attribute-value pairs as the object value of the object. The invention also provides a system for aligning the attribute values of heterogeneous instances of an object.

Description

Object value alignment method based on multiple object instances
Technical field
The present invention relates to a method and system for aligning the attribute values of multiple heterogeneous instances of an object.
Background
With the spread of the Internet, people increasingly obtain resources of interest over the Internet and organize them to meet their own needs. The Internet contains web pages that describe the specifications of various products, and these pages usually state the products' attributes and attribute values explicitly. To obtain this content, one can extract the attributes and attribute values of these product objects and build an object database from the extracted information. However, when different web page providers describe the attributes and attribute values of the same object (i.e., the same product), they differ in language, wording, number of attributes, and attribute value format, and comment, ranking, and description pages for products exist on the Internet in large numbers. As a result, the same object has heterogeneous instances on the Internet, an instance being a web page or content page that describes the attributes of the object. A technique is therefore desirable that extracts, from this mass of Internet resources, the features of the heterogeneous instances of objects in a specific domain and integrates them into data that users can readily use.
Chinese patent application No. 201210032507.6, filed by the present applicant with the Chinese Patent Office on February 14, 2012, describes processing that clusters the attributes of various heterogeneous instances. Its disclosure is incorporated herein by reference in its entirety. After the heterogeneous instances of an object or product have undergone domain feature clustering in the manner disclosed in that application, the attribute values in each cluster need further processing to obtain a representative value. Specifically, the attribute values undergo value ranking and value normalization. Most prior art focuses on a specific domain; domain information is difficult to collect and requires substantial manual effort, although such methods usually yield good results. Techniques for choosing the most representative item (or items) from a heterogeneous data set appear mostly in the fields of query expansion and image processing. Because the target data sets differ, the ranking and extraction methods differ as well. US patent US8035855B, "Automatic Selection of a subset representative pages from a multi-page document", provides a method for automatically choosing the most representative pages from a multi-page document. US patent US6728704B, "Method and Apparatus for merging result lists from multiple search engines", provides a method and system for merging the result lists of multiple search engines. US patent application publication US20110145289 A1, "System and Method For Generating A Pool of Matched Content", discloses a method and system for generating a pool of matched content. However, these inventions are usually specific to a particular domain or language and lack generality. It is therefore desirable to provide a domain-independent and language-independent method that processes the clustered attribute values to obtain a representative value with acceptable accuracy.
Summary of the invention
The present invention has been made in view of the above problems in the prior art. The invention relates generally to information processing and information integration, and more particularly to a method and system for aligning the attribute values of multiple heterogeneous instances of an object, i.e., a method and system that, after the attributes of the heterogeneous instances have been normalized, select or generate one (or more) most representative attribute value from the many attribute values of the same normalized attribute.
According to one aspect of the invention, a method for aligning the attribute values of heterogeneous instances of an object is provided, comprising: performing attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features; ranking all attribute-value pairs in the attribute-value pair set that belongs to an obtained domain feature; and selecting a suitable common substring from the attribute values of the ranked attribute-value pairs as the object value of the object.
According to one embodiment of the invention, ranking all attribute-value pairs in the attribute-value pair set that belongs to an obtained domain feature comprises: computing an importance score for each attribute-value pair based on the source of the object instance to which the pair belongs; computing a distance score for each attribute-value pair based on the similarity between the attribute-value pairs in the set; computing a frequency score for each attribute-value pair based on the similarity between the attribute values of the pairs in the set; computing an evidence score for each attribute-value pair based on the similarity between its attribute value and the existing object values of other objects in the same domain as the object; and computing a total score for each attribute-value pair in the set as a weighted sum of at least two of the above scores.
According to one embodiment of the invention, computing the distance score of each attribute-value pair based on the similarity between the attribute-value pairs in the set comprises: computing, by a string comparison method, the similarity between the attribute name of any attribute-value pair and the domain feature; computing the average hybrid similarity between that attribute-value pair and the other attribute-value pairs in the set; and computing a weighted sum of the computed similarity and the average hybrid similarity to obtain the distance score of that attribute-value pair.
According to one embodiment of the invention, computing the frequency score of an attribute-value pair based on the similarity between the attribute values of the pairs in the set comprises: computing the similarity between the attribute value of any attribute-value pair in the set and the attribute values of the other attribute-value pairs; comparing each of these similarities with a predetermined threshold and counting the number of similarities greater than the threshold; and computing the ratio of the counted number to the number of attribute-value pairs in the set.
According to one embodiment of the invention, the evidence score of an attribute-value pair is computed as the average hybrid similarity between the attribute value of the pair and the existing object values of other objects in the same domain as the object.
According to one embodiment of the invention, selecting a suitable common substring from the attribute values of the ranked attribute-value pairs as the object value of the object comprises: comparing the number of attribute-value pairs in the set with a predetermined maximum-scale threshold and a predetermined minimum-scale threshold, and adaptively filtering the attribute-value pairs to eliminate noise; and performing value extraction on the attribute values of the filtered attribute-value pairs, so as to select from them a suitable common substring as the object value of the object.
According to one embodiment of the invention, comparing the number N of attribute-value pairs in the set with the predetermined maximum-scale threshold tL and minimum-scale threshold sL and adaptively filtering the attribute-value pairs to eliminate noise comprises: if N ≥ tL, retaining the top fraction x of the ranked attribute-value pairs in the set; if N ≤ sL, retaining the top fraction y of the ranked attribute-value pairs in the set; and otherwise retaining the top fraction z of the ranked attribute-value pairs in the set, where x, y, z ∈ [0,1] and y ≥ z ≥ x.
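The adaptive filtering rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `adaptive_filter`, the default thresholds `t_l=30` and `s_l=10`, the default fractions, the rounding up, and the choice to always keep at least one pair are all assumptions made for the example.

```python
import math

def adaptive_filter(ranked, t_l=30, s_l=10, x=0.25, y=0.75, z=0.5):
    """Keep the top fraction of an already ranked list of attribute-value pairs.

    ranked : pairs sorted by total score, best first.
    t_l/s_l: maximum-/minimum-scale thresholds (hypothetical defaults).
    x, y, z: fractions kept for large, small, and medium sets (hypothetical).
    """
    if not (0.0 <= x <= z <= y <= 1.0):
        raise ValueError("fractions must satisfy 0 <= x <= z <= y <= 1")
    n = len(ranked)
    if n == 0:
        return []
    if n >= t_l:        # large set: filter aggressively
        keep = x
    elif n <= s_l:      # small set: keep most pairs
        keep = y
    else:               # medium set
        keep = z
    # Assumption: round up and never return an empty list for non-empty input.
    count = max(1, math.ceil(n * keep))
    return ranked[:count]
```

Larger sets are assumed to contain more noise and more redundancy, which is why the smallest fraction is kept when N ≥ tL.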
According to one embodiment of the invention, performing value extraction on the attribute values of the filtered attribute-value pairs, so as to select from them a suitable common substring as the object value of the object, comprises: computing the average length len_avg of the attribute values of the attribute-value pairs in the set; computing a score for each word in the attribute values from the frequency with which the word occurs across all the attribute values; extracting the common substrings of the attribute value strings of the attribute-value pairs in the set, and taking the extracted common substrings whose length is less than or equal to len_avg as candidate object values; and summing the scores of all words in each candidate value to obtain the score of that candidate, the candidate with the highest score being taken as the final object value.
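The value extraction steps above can be sketched as follows. Several choices here are assumptions, since the paragraph does not fix them: values are tokenized on whitespace and lowercased, lengths are counted in words, a word's score is the fraction of values containing it, common substrings are taken pairwise as the longest common run of words, and the first value is returned when no candidate exists. The function name `extract_object_value` is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

def extract_object_value(values):
    """Pick a representative value from filtered attribute values (best first)."""
    tokens = [v.lower().split() for v in values if v.strip()]
    if not tokens:
        return ""
    # Average value length, measured in words (assumption).
    len_avg = sum(len(t) for t in tokens) / len(tokens)
    # Word score: fraction of values in which the word occurs (assumption).
    counts = Counter(w for t in tokens for w in set(t))
    word_score = {w: c / len(tokens) for w, c in counts.items()}
    # Candidates: longest common word run of each pair of values,
    # kept only if its length does not exceed the average length.
    candidates = set()
    for a, b in combinations(tokens, 2):
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if 0 < m.size <= len_avg:
            candidates.add(tuple(a[m.a:m.a + m.size]))
    if not candidates:
        return " ".join(tokens[0])  # fallback: top-ranked value (assumption)
    # Candidate score: sum of its word scores; ties broken deterministically.
    best = max(candidates, key=lambda c: (sum(word_score[w] for w in c), c))
    return " ".join(best)
```

For instance, given the values "approx. 10 megapixels", "10 megapixels", "10 megapixels effective" and "10 megapixels (approx.)", every pair shares the word run "10 megapixels", which is the only candidate and therefore the result.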
According to another aspect of the invention, a system for aligning the attribute values of heterogeneous instances of an object is provided, comprising: an attribute name normalization module, which performs attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features; a value ranking module, which ranks all attribute-value pairs in the attribute-value pair set that belongs to an obtained domain feature; and an attribute value normalization module, which selects a suitable common substring from the attribute values of the ranked attribute-value pairs as the object value of the object.
The above and other objects, features, advantages, and technical and industrial significance of the invention will be better understood by reading the following detailed description of preferred embodiments in conjunction with the accompanying drawings.
Brief description of the drawings
Figures 1A and 1B show illustrative examples of the object instances targeted by the invention.
Figure 2 is an overall schematic diagram of the method and system for aligning the attribute values of multiple heterogeneous instances of an object according to the invention.
Figure 3 is a schematic diagram of the "value alignment" processing according to the invention.
Figure 4 is an illustrative diagram of a feature matrix obtained after the "value alignment" processing according to the invention.
Figure 5 is a schematic block diagram of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the invention.
Figure 6 is an overall flowchart of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the invention.
Figure 7 is a flowchart of computing the distance score of attribute-value pairs according to the invention.
Figure 8 is a flowchart of computing the frequency score of attribute-value pairs according to the invention.
Figure 9 is an overall flowchart of normalizing the attribute values of the ranked attribute-value pairs according to the invention.
Figure 10 is a flowchart of value extraction on the attribute values of the attribute-value pairs after adaptive filtering according to the invention.
Figure 11 shows an experimental result of the method of the invention.
Detailed description
Embodiments of the invention are described below with reference to the accompanying drawings.
Figures 1A and 1B show illustrative examples of the object instances targeted by the invention and are used to explain the terms used herein so that those skilled in the art can understand the invention. The examples given here do not limit the object instances of the invention. As shown in Figures 1A and 1B, taking the camera product object "Ricoh CX5" as an example, two object instances from different websites are listed, namely instance 1 and instance 2. Instance 1 and instance 2 describe the object (Ricoh CX5) in markedly different ways, with different wording, style, and structure; the applicant therefore calls instances whose descriptions differ in this way "heterogeneous" instances of the object. As Figures 1A and 1B show, the description of each object generally contains attributes and attribute values, i.e., attribute-value pairs. An "attribute" may be a physical or functional property used to describe the object, and an "attribute value" or "value" is the specific description of an attribute. Because an object usually has attributes covering many aspects, it also has multiple attribute-value pairs. Instance 1 in Figure 1A, for example, has the following attributes: optical sensor, aperture, flash type, pixels, display size, optical zoom factor, and so on, each with a corresponding attribute value. Instance 2 in Figure 1B likewise has attribute-value pairs, which are too numerous to list here; see the figure. Note that although attributes and attribute values usually occur in pairs, some attribute values may be empty. For the object "Ricoh CX5" shown in Figure 1, "Effective Pixels"-"Approximately 10.00 million pixels" and "Weight"-"Approx. 197g" are both attribute-value pairs.
The invention uses the term "object feature" below. An "object feature" is an "attribute" obtained by integrating and clustering multiple semantically similar attributes of object instances; "object feature" or "feature" is used here to distinguish it from the "attributes" of the original object instances. For example, the feature "Resolution" can represent attributes named "Resolution", "Effective pixels", "Megapixels", and so on. Clustering, in the sense used here, groups the attributes of multiple heterogeneous object instances that have a specified degree of similarity into one feature class. To integrate and cluster the "attributes" of object instances into the "object features" referred to here, any existing method may be used, including the clustering approach disclosed in Chinese patent application No. 201210032507.6, filed by the present applicant with the Chinese Patent Office on February 14, 2012. How the clustering is performed is not the subject of this invention and is therefore not described in detail here.
The invention also uses the term "specific domain". A "specific domain" can refer to the domain to which a particular product belongs. For example, the "Ricoh CX5" mentioned above and the "Canon 5D Mark II" produced by Canon both belong to the specific domain "digital camera". "Smartphone", "navigation device", "aero engine", "economy car", and so on may each be a "specific domain" in the sense of the invention, which may also be called a "specific field". The specific products covered by each specific domain are the "objects of the specific domain".
Figure 2 is an overall schematic diagram of the method and system for aligning the attribute values of multiple heterogeneous instances of an object according to the invention. Specifically, based on the specific domain that the user wants to understand, a search is performed over the Internet to obtain object instances of that domain (web pages that describe the specifications of objects). All attributes in these object instances undergo attribute normalization, i.e., clustering, to obtain the domain features of each object in the specific domain. The domain features are thus the integrated result of clustering the attribute-value pairs of the initially obtained object instances. Based on the domain features, "value alignment" is performed to establish a simple, clear object-feature relationship that can be used to build an object database.
Figure 3 is a schematic diagram of the "value alignment" processing according to the invention. Because a domain feature is obtained by clustering multiple attributes, a given feature of an object may have many heterogeneous attribute-value pairs, as shown in Figure 3. The purpose of "value alignment" is to select one brief and correct representative attribute value from the attribute values of the heterogeneous attribute-value pairs of a feature (for example, the attribute values of feature 3 in Figure 3), or to compute such a representative attribute value from them.
"Value alignment" yields a representative attribute value for each feature of each object, and thus an object-feature matrix. Figure 4 is an illustrative diagram of a feature matrix obtained after the "value alignment" processing according to the invention, where the specific domain is "smartphone". The vertical axis lists the objects of the domain (smartphones of various brands and models), and the horizontal axis lists the domain features. Each element of the matrix is the value of an object under a feature. The structured feature matrix is one way of presenting the object database and can support further applications such as comparison and statistics.
Figure 5 is a schematic block diagram of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the invention. The value ranking module S1 ranks the attribute values according to several characteristics. The value normalization module S2 filters the ranked attribute values and performs extraction to obtain the attribute value of the object feature (to distinguish it from the original attribute values, the finally obtained attribute value is called the "object value" below). The input of the system is the object instances and the domain features obtained through attribute normalization. Its final output is the object value. The intermediate result is the ranked attribute values (hereinafter "ranked values").
Figure 6 is an overall flowchart of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the invention. Steps S11-S15 form the value ranking process performed by the value ranking module S1, and steps S21-S22 form the value normalization process performed by the value normalization module S2.
The value ranking process ranks all attribute-value pairs that correspond to each feature of an object. It comprises: computing, in step S11, the importance score of each attribute-value pair in a feature; computing, in step S12, the distance score of each attribute-value pair in the feature; computing, in step S13, the frequency score of each attribute-value pair, i.e., how frequently similar attribute values occur among the pairs in the feature; computing, in step S14, the evidence score of each attribute-value pair in the feature; and computing, in step S15, the total score as a weighted sum of the above scores. Steps S11-S14 need not be performed in any particular order. In practice, users may decide, according to their own accuracy requirements, whether to perform any one of steps S11-S14 or any combination of them, and may adjust the weights used in step S15 according to the steps selected.
Specifically, in step S11 the importance score $S_{source}$ is computed: each attribute-value pair is assigned a normalized score according to its importance. The importance of an attribute-value pair derives from how important the source of its object instance is. For example, if the object instance to which the pair belongs is ranked first in the user's search results for the specific domain, its importance score is set to 1; if it is ranked second, its importance score is set to 0.9; and so on. The concrete values of the importance score can be set through an assignment rule, according to the user's needs, when the system according to the invention is initialized.
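One possible assignment rule can be sketched as follows, assuming the score falls linearly by 0.1 per search-result rank, which reproduces the two values given above (1 for rank 1, 0.9 for rank 2). The function name `importance_score` and the `step` parameter are hypothetical; the text leaves the actual rule to the user.

```python
def importance_score(rank, step=0.1):
    """Map a 1-based search-result rank to a normalized score in [0, 1].

    Assumption: linear decay by `step` per rank, floored at 0.
    """
    if rank < 1:
        raise ValueError("rank is 1-based")
    # round() hides floating-point noise such as 0.7999999999999999.
    return max(0.0, round(1.0 - step * (rank - 1), 10))
```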
Figure 7 is a flowchart of computing the distance score of attribute-value pairs according to the invention.
In step S12, the distance score of each attribute-value pair under a feature is computed. In step S121, the similarity between the attribute name of an attribute-value pair under a feature of an object and the domain feature is computed. That is, within the set of attribute-value pairs under a feature, the similarity $S_1$ between the attribute name of any attribute-value pair and the domain feature is computed. Any string similarity measure may be used. In the experiments, the Dice distance was used for English text and the Smith-Waterman distance for Chinese text. Because these similarity computations are well known in the art, they are not described here.
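As an illustration of a string measure of the kind named above, the following sketch computes a Dice coefficient over character bigrams. The bigram tokenization, the lowercasing, and the function names are assumptions; the text does not say which variant of the Dice measure was used in the experiments.

```python
from collections import Counter

def bigrams(text):
    """Multiset of character bigrams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def dice_similarity(a, b):
    """Dice coefficient 2*|X & Y| / (|X| + |Y|) over character bigrams."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters: compare directly.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total
```

With this sketch, $S_1$ for an attribute named "Effective pixels" under the feature "Resolution" would be `dice_similarity("Effective pixels", "Resolution")`.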
In step S122, the average hybrid similarity $S_2$ between any attribute-value pair in the set under a feature of an object and the other attribute-value pairs in the set is computed. The hybrid similarity takes into account attribute name similarity, attribute value similarity, and attribute name-value cross similarity. For attribute-value pairs from the same object and from different objects, the hybrid similarity is generally computed with different matching strategies. Attribute-value pairs from the same object usually have fairly similar attribute values, so value similarity matters more when matching pairs of the same object, whereas attribute name similarity matters more when matching pairs of different objects. In this step, however, the computation covers only attribute-value pairs under the same feature of the same object. Specifically, the similarity between attribute i and attribute j is a real number between 0 and 1 inclusive, obtained as a weighted sum of three partial scores: attribute name similarity, attribute value similarity, and cross similarity. Attribute name similarity is the text similarity between attribute names; many methods for computing it exist in the prior art, so it is not described again. Attribute value similarity is obtained by computing the distance between attribute values. After attribute normalization (clustering), an attribute value usually consists of a number and a unit of measurement. Such values are not compared by text similarity; instead, the numbers are compared directly for equality after unit conversion, which gives a more exact match. If an attribute value is still a character string, the similarity between attribute values can be computed in the same way as attribute name similarity. Cross similarity measures the similarity between an attribute name and a value in order to uncover more matches. For example, the attribute-value pairs "Pixels: 18000000" and "Resolution: 18megapixels" obtain only a small similarity from attribute name matching or value matching alone, although the two pairs should in fact match; cross-matching the attribute name "Pixels" with the value "18megapixels" readily reveals the similarity. The hybrid similarity between one attribute-value pair and each other attribute-value pair in the set is the weighted sum of these three similarity values. The three weights may be equal or may differ according to the specific content of the feature, but they sum to 1. All computed hybrid similarities are then averaged to obtain the average hybrid similarity $S_2$ of the attribute-value pair within the feature.
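The hybrid similarity described above can be sketched as follows. The sketch makes several assumptions: `difflib.SequenceMatcher` stands in for whichever string measure is actually used, the cross similarity is taken as the larger of the two name-to-value comparisons, values are compared as plain strings (no unit conversion), and the three weights default to 1/3 each. All function names are hypothetical.

```python
from difflib import SequenceMatcher

def text_sim(a, b):
    """Stand-in string similarity in [0, 1] (assumption, not the patent's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def hybrid_similarity(p, q, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of name, value, and cross similarity of two (name, value) pairs."""
    w_name, w_value, w_cross = weights  # expected to sum to 1
    name_sim = text_sim(p[0], q[0])
    value_sim = text_sim(p[1], q[1])
    # Cross similarity: name of one pair against the value of the other.
    cross_sim = max(text_sim(p[0], q[1]), text_sim(p[1], q[0]))
    return w_name * name_sim + w_value * value_sim + w_cross * cross_sim

def average_hybrid_similarity(index, pairs):
    """Average hybrid similarity of pairs[index] to every other pair in the set."""
    others = [q for k, q in enumerate(pairs) if k != index]
    if not others:
        return 0.0
    return sum(hybrid_similarity(pairs[index], q) for q in others) / len(others)
```

On the example from the text, the cross comparison of "Pixels" with "18megapixels" scores higher than the name comparison of "Pixels" with "Resolution", which is the effect the cross similarity is meant to capture.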
In step S123, for any attribute-value pair in a feature, a weighted sum of the similarity $S_1$ between the attribute name and the domain feature computed in step S121 and the average hybrid similarity $S_2$ computed in step S122 gives the distance score of the attribute-value pair:
$$S_{distance} = w_1 S_1 + w_2 S_2$$
where $w_1, w_2 \in [0,1]$ and $w_1 + w_2 = 1$; usually $w_1 \ge w_2$, and in the experimental example given later, $w_1 = w_2 = 0.5$.
Figure 8 is a flowchart of computing the frequency score of attribute-value pairs according to the invention.
As shown in Figure 8, the frequency score is computed in step S13. Put simply, for any attribute-value pair in the set, this counts how many attribute-value pairs in the set have an attribute value whose similarity to the attribute value of that pair exceeds a predetermined threshold. Step S13 comprises steps S131-S133.
In step S131, the similarity between the attribute value of any attribute-value pair in the set and each of the other attribute values is computed. For non-numeric attribute values, the similarity is computed with a string similarity matching method. For numeric attribute values, it is first determined whether the units of measurement can be converted into each other. If they can, the converted numbers are compared for equality; otherwise, string similarity is still used.
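The case distinction in step S131 can be sketched as follows. The unit table `UNITS`, the regular expression, the use of `difflib.SequenceMatcher` as the string measure, and the 1.0/0.0 outcome for convertible values are assumptions made for illustration only.

```python
import math
import re
from difflib import SequenceMatcher

# Hypothetical conversion table: unit -> (dimension, factor to a base unit).
UNITS = {"g": ("mass", 1.0), "kg": ("mass", 1000.0),
         "mm": ("length", 1.0), "cm": ("length", 10.0)}

_NUMERIC = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([a-zA-Z]+)\s*$")

def parse(value):
    """Return (dimension, amount in base unit), or None if not '<number><unit>'."""
    m = _NUMERIC.match(value)
    if not m or m.group(2).lower() not in UNITS:
        return None
    dimension, factor = UNITS[m.group(2).lower()]
    return dimension, float(m.group(1)) * factor

def value_similarity(a, b):
    """Compare converted numbers when units are convertible, else compare strings."""
    pa, pb = parse(a), parse(b)
    if pa and pb and pa[0] == pb[0]:
        # Convertible units: equal after conversion -> 1.0, otherwise 0.0.
        return 1.0 if math.isclose(pa[1], pb[1], rel_tol=1e-9) else 0.0
    # Non-numeric values or incompatible units: fall back to string similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```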
In step S132, a threshold $\varepsilon \in [0,1]$ is set, usually with $\varepsilon \ge 0.5$. The threshold can be chosen according to how uniformly the attribute values of the attribute are expressed: if they are expressed fairly uniformly, it can be set somewhat higher, for example 0.7 or 0.8; if less uniformly, somewhat lower, for example 0.6. Using this threshold, the number n of similarities computed in step S131 for an attribute value that are greater than $\varepsilon$ is counted. This includes the similarity of the attribute value with itself, which is 1 and is therefore always counted in n.
In step S133, normalization is performed according to the following formula to obtain the frequency score $S_{frequency}$ of the attribute value of an attribute-value pair in a feature:
$$S_{frequency} = \frac{n}{N}$$
where n is the number counted in step S132 and N is the number of attribute-value pairs under the feature.
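Steps S131-S133 can be sketched together as follows: count the values whose similarity to the given value exceeds the threshold (including the value itself) and divide by the number of pairs. `difflib.SequenceMatcher` is an assumed stand-in for the similarity of step S131, and the function names and the default threshold of 0.5 are hypothetical.

```python
from difflib import SequenceMatcher

def text_sim(a, b):
    """Stand-in value similarity in [0, 1] (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def frequency_score(index, values, threshold=0.5, sim=text_sim):
    """Fraction n / N of values whose similarity to values[index] exceeds the threshold.

    The value itself is included in the count, since its self-similarity is 1.
    """
    n = sum(1 for v in values if sim(values[index], v) > threshold)
    return n / len(values)
```

For the values `["CMOS", "CMOS", "CCD", "cmos"]`, the first value is similar to three of the four values (itself included), so its frequency score is 0.75, while "CCD" matches only itself and scores 0.25.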
In step S14, the evidence score is computed. Within the same specific domain and under the same domain feature, the attribute values of the attribute-value pairs of different objects generally show a certain similarity. Using the object values already computed according to the invention for other objects under the same feature, the similarity between the attribute value of an attribute-value pair of the current feature and each existing object value is computed, and from these the average hybrid similarity $S_{evidence}$ with the other object values is obtained. The average hybrid similarity is computed in the same way as in step S122, so the description is not repeated here.
Finally, in ranking step S1, step S15 is performed to calculate the total score $S_{value}$ of the current attribute-value pair. Specifically, a weighted sum of the four scores above is calculated using the following formula:

$$S_{value} = w_s S_{source} + w_d S_{distance} + w_f S_{frequency} + w_e S_{evidence}$$

where $w_s, w_d, w_f, w_e \in [0,1]$, $w_s + w_d + w_f + w_e = 1$, and usually $w_d \ge w_f \ge w_e \ge w_s$.
Step S1 is performed for all attribute-value pairs of the current feature, yielding the total score $S_{value}$ of every attribute-value pair of the feature. The pairs are then sorted by total score to obtain a ranked list of attribute-value pairs.
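The weighted sum of step S15 and the subsequent ranking can be sketched in Python as follows. The default weights (0.1, 0.4, 0.3, 0.2) are hypothetical values chosen only to satisfy the stated constraints, with the distance weight largest and the source weight smallest; the function names are likewise illustrative.

```python
import math


def total_score(scores, weights=(0.1, 0.4, 0.3, 0.2)):
    """Weighted sum S_value; scores and weights are ordered
    (source, distance, frequency, evidence)."""
    if len(scores) != 4 or len(weights) != 4:
        raise ValueError("exactly four scores and four weights are required")
    if any(not 0.0 <= w <= 1.0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return sum(w * s for w, s in zip(weights, scores))


def rank_pairs(pairs_with_scores, weights=(0.1, 0.4, 0.3, 0.2)):
    """Sort (pair, four partial scores) entries by descending total score."""
    scored = [(pair, total_score(s, weights)) for pair, s in pairs_with_scores]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The returned list keeps each pair together with its total score, because the later value extraction stage needs both.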
To assign a final object value (a normalized attribute value) to the normalized feature, the attribute values of all ranked attribute-value pairs must also be normalized.
Fig. 9 is an overall flowchart of normalizing the attribute values of the ranked attribute-value pairs according to the present invention. As shown in Fig. 9, the ranked attribute-value pairs and their respective total scores are input to the value normalization module S2. In step S21, noise values among the ranked attribute values are filtered out by adaptive filtering. Then, in step S22, value extraction is performed to obtain the final object value.
Generally, noise is unavoidable in a data set. Such noise includes, for example, erroneous data and data that should not belong to the set. To make the result of the common-substring-based value extraction in step S22 more accurate, noise data must be reduced beforehand by adaptive filtering. Therefore, to perform adaptive filtering in step S21, two predetermined thresholds are provided that define the lower bound of a large-scale set and the upper bound of a small-scale set, namely the maximum-scale threshold tL and the minimum-scale threshold sL. In the experimental example described below, an integer greater than 25 and an integer less than 15 were used as thresholds tL and sL, respectively. The two thresholds can be customized according to the scale of the data set. According to an embodiment of the present invention, adaptive filtering is performed as follows.
Assuming that the number of attribute-value pairs is N:
When N ≥ tL, the top x percent of the ranked attribute-value pairs in the set are retained and the remaining attribute-value pairs are discarded;
When N ≤ sL, the top y percent of the ranked attribute-value pairs in the set are retained and the remaining attribute-value pairs are discarded;
If neither of the above two conditions is satisfied, the top z percent of the ranked attribute-value pairs in the set are retained and the remaining attribute-value pairs are discarded.
Here x, y, z ∈ [0,1] and y ≥ z ≥ x. In the experiment, x = 60%, y = 80%, and z = 70%.
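The three filtering rules can be sketched in Python as follows. The default thresholds `t_l=30` and `s_l=10` are hypothetical placeholders (the source only states that an integer above 25 and an integer below 15 were used), and rounding the retained count up is an assumption. The percentages are passed as integers so that the count is computed without floating-point rounding.

```python
def adaptive_filter(ranked_pairs, t_l=30, s_l=10, x_pct=60, y_pct=80, z_pct=70):
    """Keep the top share of an already ranked list of attribute-value pairs.

    Large sets (N >= t_l) keep x_pct percent, small sets (N <= s_l) keep
    y_pct percent, and all other sets keep z_pct percent.
    """
    n = len(ranked_pairs)
    if n >= t_l:
        pct = x_pct
    elif n <= s_l:
        pct = y_pct
    else:
        pct = z_pct
    keep = (n * pct + 99) // 100  # integer ceiling of n * pct / 100 (assumed rounding)
    return ranked_pairs[:keep]
```

Because y ≥ z ≥ x, small sets lose proportionally fewer pairs than large ones, which keeps enough values available for the subsequent extraction step.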
Fig. 10 is a flowchart of value extraction performed on the attribute values of the attribute-value pairs after adaptive filtering according to the present invention.
As shown in Fig. 10, in step S221, the average length $len_{avg}$ of all attribute values in the adaptively filtered attribute-value pair set is calculated using the following formula:

$$len_{avg} = \frac{\sum len_{value} \, S_{value}}{N}$$

where the length $len_{value}$ of a value is the number of words in the attribute value string, $S_{value}$ is the total score of each attribute-value pair, and N is the number of attribute-value pairs in the set. Then, in step S222, the score $S_{word}$ of each word occurring in the attribute values of the attribute-value pair set is calculated using the following formula:

$$S_{word} = tf_{word} \sum_{word \in value} S_{value}$$

where $tf_{word}$ is the word frequency of the word in the value set. If a word appears in multiple attribute values, the total scores $S_{value}$ of all attribute-value pairs whose values contain the word are summed and the sum is multiplied by $tf_{word}$ to obtain the score of the word. Then, in step S223, the common substrings of the attribute value strings are obtained from the attribute values of the adaptively filtered attribute-value pair set using any existing method, and those common substrings whose length is less than or equal to the average length $len_{avg}$ calculated above are taken as candidate object values (i.e., candidates for the final attribute value corresponding to the feature). Then, in step S224, the score of each candidate common substring is obtained by summing the scores $S_{word}$ of all words in it, according to the following formula:

$$S_{candidate} = \sum_{word \in candidate} S_{word}$$

Finally, in step S225, the candidate value with the highest score among the common substrings is selected as the object value.
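Steps S221 to S225 can be sketched end to end in Python as follows. Several details are assumptions made for illustration, since the source leaves them open: values are split into words on whitespace, $tf_{word}$ is taken as the raw number of occurrences of the word across all values, and a "common substring" is taken to be any contiguous word sequence that occurs in at least two values. The function name `extract_object_value` is hypothetical.

```python
from collections import Counter, defaultdict


def extract_object_value(scored_values):
    """scored_values: list of (attribute value string, total score S_value)."""
    tokenized = [(value.split(), score) for value, score in scored_values]
    n = len(tokenized)
    if n == 0:
        return None

    # S221: score-weighted average length, counted in words.
    len_avg = sum(len(words) * score for words, score in tokenized) / n

    # S222: S_word = tf_word * (sum of S_value over values containing the word).
    tf = Counter(word for words, _ in tokenized for word in words)
    score_sum = defaultdict(float)
    for words, score in tokenized:
        for word in set(words):
            score_sum[word] += score
    s_word = {word: tf[word] * score_sum[word] for word in tf}

    # S223: contiguous word sequences shared by at least two values (assumed
    # definition of "common substring"), limited to length <= len_avg.
    occurrences = Counter()
    for words, _ in tokenized:
        ngrams = {
            tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)
        }
        occurrences.update(ngrams)
    candidates = [g for g, c in occurrences.items() if c >= 2 and len(g) <= len_avg]
    if not candidates:
        return None

    # S224 and S225: sum the word scores per candidate and keep the best one.
    best = max(candidates, key=lambda g: sum(s_word[w] for w in g))
    return " ".join(best)
```

For the inputs `("Ricoh Aficio MP C2500", 0.9)`, `("Aficio MP C2500 printer", 0.8)`, and `("Ricoh MP C2500", 0.7)`, the weighted average length is just under three words, so the three-word sequence shared by the first two values is excluded and the sketch returns `"MP C2500"`.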
Fig. 11 shows experimental results of the method according to the present invention (in English and Chinese, respectively). The figure shows that the ranking of the attribute values is reasonable and that the normalized object values meet the optimization goals, namely being concise, correct, and representative.
The series of operations described in this specification can be performed by hardware, by software, or by a combination of both. When the series of operations is performed by software, the computer program can be installed in a memory of a computer built into dedicated hardware so that the computer executes the program. Alternatively, the computer program can be installed in a general-purpose computer capable of performing various types of processing so that the computer executes the program.
For example, the computer program can be stored in advance on a hard disk or in a ROM (read-only memory) serving as a recording medium. Alternatively, the computer program can be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy disk, CD-ROM (compact disc read-only memory), MO (magneto-optical) disk, DVD (digital versatile disc), magnetic disk, or semiconductor memory. Such a removable recording medium can be provided as packaged software.
The present invention has been described in detail with reference to specific embodiments. It is evident, however, that those skilled in the art can modify and substitute the embodiments without departing from the spirit of the present invention. In other words, the present invention is disclosed by way of illustration and is not to be construed restrictively. To determine the gist of the present invention, the appended claims should be considered.

Claims (9)

1. A method of aligning attribute values of heterogeneous instances of an object, comprising:
performing attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features;
ranking all attribute-value pairs in the attribute-value pair set belonging to an obtained domain feature;
comparing the number N of attribute-value pairs in the attribute-value pair set with a predetermined maximum-scale threshold tL and a predetermined minimum-scale threshold sL, and performing adaptive filtering on all attribute-value pairs to eliminate noise; and
performing value extraction on the attribute values of the filtered attribute-value pairs to select therefrom the object value of the object,
wherein performing value extraction on the attribute values of the filtered attribute-value pairs to select therefrom the object value of the object comprises:
calculating the average length $len_{avg}$ of the attribute values of the attribute-value pairs in the attribute-value pair set according to the following formula:
$$len_{avg} = \frac{\sum len_{value} \, S_{value}}{N}$$
where the length $len_{value}$ is the number of words in the attribute value string and $S_{value}$ is the total score of the attribute-value pair to which the attribute value belongs;
calculating the score $S_{word}$ of each word in the attribute values of the attribute-value pairs in the attribute-value pair set from the frequency $tf_{word}$ with which the word appears across all attribute values, according to the following formula:
$$S_{word} = tf_{word} \sum_{word \in value} S_{value}$$
where $S_{value}$ is the total score of the attribute-value pair to which an attribute value containing the word belongs;
extracting common substrings of the attribute value strings of the attribute-value pairs in the attribute-value pair set, and taking the extracted common substrings whose length is less than or equal to $len_{avg}$ as candidate values for the object value; and
summing the scores $S_{word}$ of all words in the string of each candidate value to obtain the score $S_{candidate}$ of each candidate value, and taking the candidate value with the highest score $S_{candidate}$ as the final object value.
2. The method of claim 1, wherein ranking all attribute-value pairs in the attribute-value pair set belonging to an obtained domain feature comprises:
calculating the source score $S_{source}$ of each attribute-value pair based on the importance of the source of the object instance of that attribute-value pair in the attribute-value pair set;
calculating the distance score $S_{distance}$ of each attribute-value pair based on the similarity between the attribute-value pairs in the attribute-value pair set;
calculating the frequency score $S_{frequency}$ of each attribute-value pair based on the similarity between the attribute values of the attribute-value pairs in the attribute-value pair set;
calculating the evidence score $S_{evidence}$ of each attribute-value pair based on the similarity between the attribute values of the attribute-value pairs in the attribute-value pair set and the existing object values of other objects in the same domain as the object; and
performing a weighted sum based on at least two of the scores calculated above to calculate the total score $S_{value}$ of each attribute-value pair in the attribute-value pair set.
3. The method of claim 2, wherein the total score $S_{value}$ is calculated based on all four scores using the following formula:
$$S_{value} = w_s S_{source} + w_d S_{distance} + w_f S_{frequency} + w_e S_{evidence}$$
where $w_s, w_d, w_f, w_e \in [0,1]$ and $w_s + w_d + w_f + w_e = 1$.
4. The method of claim 3, wherein $w_d \ge w_f \ge w_e \ge w_s$.
5. The method of claim 2, wherein calculating the distance score $S_{distance}$ of each attribute-value pair based on the similarity between the attribute-value pairs in the attribute-value pair set comprises:
calculating, by a string comparison method, the similarity $S_1$ between the attribute name of any one attribute-value pair and the domain feature;
calculating the average hybrid similarity $S_2$ between said any one attribute-value pair and the other attribute-value pairs in the attribute-value pair set, wherein attribute-name similarity, attribute-value similarity, and attribute-name/attribute-value cross similarity are weighted to obtain the hybrid similarity between one attribute-value pair and each of the other attribute-value pairs in the set, and all calculated hybrid similarities are averaged to obtain the average hybrid similarity $S_2$; and
performing the following weighted sum on the calculated similarity $S_1$ and the average hybrid similarity $S_2$ to obtain the distance score $S_{distance}$ of said any one attribute-value pair:
$$S_{distance} = w_1 S_1 + w_2 S_2$$
where $w_1, w_2 \in [0,1]$, $w_1 + w_2 = 1$, and $w_1 \ge w_2$.
6. The method of claim 2, wherein calculating the frequency score $S_{frequency}$ of each attribute-value pair based on the similarity between the attribute values of the attribute-value pairs in the attribute-value pair set comprises:
calculating the similarity between the attribute value of any one attribute-value pair in the attribute-value pair set and the attribute values of the other attribute-value pairs;
comparing each similarity between the attribute value of said any one attribute-value pair and the attribute values of the other attribute-value pairs with a predetermined threshold ε, and counting the number n of similarities that exceed the threshold ε; and
calculating the frequency score $S_{frequency}$ of said any one attribute-value pair based on the following formula:
$$S_{frequency} = \frac{n}{N}$$
where N is the number of attribute-value pairs in the attribute-value pair set and 1 ≥ ε ≥ 0.5.
7. The method of claim 2, wherein the evidence score $S_{evidence}$ of an attribute-value pair is calculated based on the average hybrid similarity between the attribute values of the attribute-value pairs in the attribute-value pair set and the existing object values of other objects in the same domain as the object.
8. The method of claim 1, wherein comparing the number N of attribute-value pairs in the attribute-value pair set with the predetermined maximum-scale threshold tL and minimum-scale threshold sL and performing adaptive filtering on all attribute-value pairs to eliminate noise comprises:
if N ≥ tL, retaining the top x percent of the ranked attribute-value pairs in the attribute-value pair set;
if N ≤ sL, retaining the top y percent of the ranked attribute-value pairs in the attribute-value pair set; or
if neither of the above two conditions is satisfied, retaining the top z percent of the ranked attribute-value pairs in the attribute-value pair set,
where x, y, z ∈ [0,1] and y ≥ z ≥ x.
9. A system for aligning attribute values of heterogeneous instances of an object, comprising:
an attribute name normalization module that performs attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features;
a value ranking module that ranks all attribute-value pairs in the attribute-value pair set belonging to an obtained domain feature; and
an attribute value normalization module that compares the number N of attribute-value pairs in the attribute-value pair set with a predetermined maximum-scale threshold tL and a predetermined minimum-scale threshold sL, performs adaptive filtering on all attribute-value pairs to eliminate noise, and performs value extraction on the attribute values of the filtered attribute-value pairs to select therefrom the object value of the object,
wherein performing value extraction on the attribute values of the filtered attribute-value pairs to select therefrom the object value of the object comprises:
calculating the average length $len_{avg}$ of the attribute values of the attribute-value pairs in the attribute-value pair set according to the following formula:
$$len_{avg} = \frac{\sum len_{value} \, S_{value}}{N}$$
where the length $len_{value}$ is the number of words in the attribute value string and $S_{value}$ is the total score of the attribute-value pair to which the attribute value belongs;
calculating the score $S_{word}$ of each word in the attribute values of the attribute-value pairs in the attribute-value pair set from the frequency $tf_{word}$ with which the word appears across all attribute values, according to the following formula:
$$S_{word} = tf_{word} \sum_{word \in value} S_{value}$$
where $S_{value}$ is the total score of the attribute-value pair to which an attribute value containing the word belongs;
extracting common substrings of the attribute value strings of the attribute-value pairs in the attribute-value pair set, and taking the extracted common substrings whose length is less than or equal to $len_{avg}$ as candidate values for the object value; and
summing the scores $S_{word}$ of all words in the string of each candidate value to obtain the score $S_{candidate}$ of each candidate value, and taking the candidate value with the highest score $S_{candidate}$ as the final object value.
CN201210166855.2A 2012-05-25 2012-05-25 Object value alignment schemes based on many object instances Active CN103425711B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201210166855.2A CN103425711B (en) 2012-05-25 2012-05-25 Object value alignment schemes based on many object instances
JP2013109182A JP2013246826A (en) 2012-05-25 2013-05-23 Attribute values alignment system for differently structured object instances, method and program of attribute values alignment system for differently structured object instances

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210166855.2A CN103425711B (en) 2012-05-25 2012-05-25 Object value alignment schemes based on many object instances

Publications (2)

Publication Number Publication Date
CN103425711A CN103425711A (en) 2013-12-04
CN103425711B true CN103425711B (en) 2017-08-25

Family

ID=49650466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210166855.2A Active CN103425711B (en) 2012-05-25 2012-05-25 Object value alignment schemes based on many object instances

Country Status (2)

Country Link
JP (1) JP2013246826A (en)
CN (1) CN103425711B (en)

Also Published As

Publication number Publication date
JP2013246826A (en) 2013-12-09
CN103425711A (en) 2013-12-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant