CN103425711B - Object value alignment schemes based on many object instances - Google Patents
- Publication number: CN103425711B
- Application number: CN201210166855.2A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention provides a method for aligning the attribute values of heterogeneous instances of an object, comprising: performing attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features; ranking all attribute-value pairs in the set of attribute-value pairs that belongs to an obtained domain feature; and selecting a suitable common substring from the attribute values of the ranked attribute-value pairs as the object value of the object. The invention also provides a system for aligning the attribute values of heterogeneous instances of an object.
Description
Technical field
The present invention relates to a method and system for aligning the attribute values of multiple heterogeneous instances of an object.
Background art
With the spread of the Internet, people increasingly obtain resources of interest online and organize them to meet their own needs. The Internet hosts web pages that describe the specifications of various products, and these pages usually state the products' attributes and attribute values explicitly. To make use of this content, one can extract the attributes and attribute values of the product objects and build an object database from the extracted information. However, when different web page providers describe the attributes and attribute values of the same object (that is, one product), they differ in language, wording, number of attributes, and attribute value format. Reviews, rankings, and description pages for products are also abundant online. As a result, the same object has heterogeneous instances on the Internet, an instance being a web page or content page that describes the attributes of the object. A technique is therefore desirable that extracts, from these vast and disorderly Internet resources, the features of the objects of a specific domain despite the heterogeneity of their instances, and integrates them into data that users can readily use.
Chinese Patent Application No. 201210032507.6, filed by the present applicant with the Chinese Patent Office on February 14, 2012, describes processing that clusters the attributes of heterogeneous instances. Its disclosure is incorporated herein by reference in its entirety. After the heterogeneous instances of an object or product have been clustered into domain features in the manner disclosed in that application, the clustered attribute values need further processing to obtain a representative value. Specifically, the attribute values need value ranking and value normalization. Most prior art focuses on a specific domain; domain information is hard to collect and requires substantial manual effort, although such methods generally yield good results. Techniques for choosing the most representative item (or items) from a heterogeneous data set appear mostly in query expansion and image processing. Because the target data sets differ, the ranking and extraction methods differ as well. US patent US8035855B, "Automatic selection of a subset representative pages from a multi-page document", provides a method for automatically choosing the most representative pages from a multi-page document. US patent US6728704B, "Method and Apparatus for merging result lists from multiple search engines", provides a method and system for merging result lists from multiple search engines. US patent application publication US20110145289 A1, "System and Method For Generating A Pool of Matched Content", discloses a method and system for generating a pool of matched content. However, these inventions are usually specific to a certain domain or language and are not general. It is therefore desirable to provide a method, independent of domain and language, that processes the clustered attribute values to obtain a representative value with acceptable accuracy.
Summary of the invention
The present invention has been made in view of the above problems in the prior art. It relates generally to information processing and information integration, and more particularly to a method and system for aligning the attribute values of multiple heterogeneous instances of an object; that is, after the attributes of the heterogeneous instances have been normalized, a method and system for selecting or generating the most representative attribute value (or values) from the many attribute values of the same normalized attribute.
According to one aspect of the invention, there is provided a method for aligning the attribute values of heterogeneous instances of an object, comprising: performing attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features; ranking all attribute-value pairs in the set of attribute-value pairs that belongs to an obtained domain feature; and selecting a suitable common substring from the attribute values of the ranked attribute-value pairs as the object value of the object.
According to one embodiment of the invention, ranking all attribute-value pairs in the set of attribute-value pairs that belongs to an obtained domain feature comprises: calculating an importance score for each attribute-value pair in the set based on the source of the object instance that the pair comes from; calculating a distance score for each attribute-value pair based on the similarity between the attribute-value pairs in the set; calculating a frequency score for each attribute-value pair based on the similarity between the attribute values of the pairs in the set; calculating an evidence score for each attribute-value pair based on the similarity between its attribute value and the existing object values of other objects in the same domain as the object; and computing a weighted sum of at least two of the above scores to obtain the total score of each attribute-value pair in the set.
According to one embodiment of the invention, calculating the distance score of each attribute-value pair based on the similarity between the attribute-value pairs in the set comprises: calculating, by a string comparison method, the similarity between the attribute name of any given attribute-value pair and the domain feature; calculating the average hybrid similarity between that attribute-value pair and the other attribute-value pairs in the set; and computing a weighted sum of the calculated similarity and the average hybrid similarity to obtain the distance score of that attribute-value pair.
According to one embodiment of the invention, calculating the frequency score of an attribute-value pair based on the similarity between the attribute values of the pairs in the set comprises: calculating the similarity between the attribute value of any given attribute-value pair in the set and the attribute values of the other pairs; comparing each of these similarities with a predetermined threshold and counting the number of similarities that exceed the threshold; and calculating the ratio of the counted number to the number of attribute-value pairs in the set.
According to one embodiment of the invention, the evidence score of an attribute-value pair is calculated as the average hybrid similarity between the attribute value of that pair and the existing object values of other objects in the same domain as the object.
According to one embodiment of the invention, selecting a suitable common substring from the attribute values of the ranked attribute-value pairs as the object value of the object comprises: comparing the number of attribute-value pairs in the set with a predetermined maximum-scale threshold and a predetermined minimum-scale threshold, and adaptively filtering the attribute-value pairs to eliminate noise; and performing value extraction on the attribute values of the filtered attribute-value pairs so as to select a suitable common substring as the object value of the object.
According to one embodiment of the invention, comparing the number N of attribute-value pairs in the set with the predetermined maximum-scale threshold tL and minimum-scale threshold sL and adaptively filtering the attribute-value pairs to eliminate noise comprises: if N ≥ tL, retaining the top fraction x of the ranked attribute-value pairs in the set; if N ≤ sL, retaining the top fraction y of the ranked attribute-value pairs; and, if neither condition holds, retaining the top fraction z of the ranked attribute-value pairs, where x, y, z ∈ [0,1] and y ≥ z ≥ x.
According to one embodiment of the invention, performing value extraction on the attribute values of the filtered attribute-value pairs so as to select a suitable common substring as the object value of the object comprises: calculating the average length $len_{avg}$ of the attribute values of the pairs in the set; calculating a score for each word in those attribute values from the frequency with which the word occurs across all the attribute values; extracting the common substrings of the attribute value strings and taking those whose length is less than or equal to $len_{avg}$ as candidate object values; and summing the scores of all words in each candidate to obtain the candidate's score, and taking the candidate with the highest score as the final object value.
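The value extraction steps above can be sketched roughly as follows. This is only an illustration under several assumptions that the text does not fix: values are split into words on whitespace, length is counted in words, a "common substring" is taken to be a contiguous word sequence shared by at least two values, and a word's score is the share of values that contain it. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_len):
    """Yield every contiguous word sequence of up to max_len words."""
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            yield tuple(tokens[start:start + size])


def extract_object_value(values):
    """Pick the best-scoring common word sequence as the object value."""
    token_lists = [v.lower().split() for v in values if v.strip()]
    if not token_lists:
        return None
    # Average length of the attribute values, counted in words (assumption).
    len_avg = sum(len(t) for t in token_lists) / len(token_lists)
    # Word score: share of values that contain the word (assumption).
    containing = Counter(w for t in token_lists for w in set(t))
    word_score = {w: c / len(token_lists) for w, c in containing.items()}
    # Support: number of values that contain each candidate sequence.
    support = Counter()
    for tokens in token_lists:
        support.update(set(ngrams(tokens, int(len_avg))))
    needed = min(2, len(token_lists))
    candidates = sorted(g for g, c in support.items() if c >= needed)
    if not candidates:
        return None
    # Candidate score: sum of its word scores; longer candidates win ties.
    best = max(candidates,
               key=lambda g: (sum(word_score[w] for w in g), len(g)))
    return " ".join(best)
```

Because the sketch lower-cases its input, the returned value is lower-cased as well; a fuller version would map the winner back to its original spelling.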
According to another aspect of the invention, there is provided a system for aligning the attribute values of heterogeneous instances of an object, comprising: an attribute name normalization module, which performs attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features; a value ranking module, which ranks all attribute-value pairs in the set of attribute-value pairs that belongs to an obtained domain feature; and an attribute value normalization module, which selects a suitable common substring from the attribute values of the ranked attribute-value pairs as the object value of the object.
The above and other objects, features, advantages, and technical and industrial significance of the invention will be better understood by reading the following detailed description of preferred embodiments in conjunction with the accompanying drawings.
Brief description of the drawings
Figs. 1A and 1B show illustrative examples of the object instances that the invention targets.
Fig. 2 is an overall schematic diagram of the method and system for aligning the attribute values of multiple heterogeneous instances of an object according to the invention.
Fig. 3 is a schematic diagram of the "value alignment" processing according to the invention.
Fig. 4 is an illustrative diagram of a feature matrix obtained after "value alignment" processing according to the invention.
Fig. 5 is a schematic block diagram of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the invention.
Fig. 6 is an overall flow chart of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the invention.
Fig. 7 is a flow chart of the distance score calculation for attribute-value pairs according to the invention.
Fig. 8 is a flow chart of the frequency score calculation for attribute-value pairs according to the invention.
Fig. 9 is an overall flow chart of the normalization of the attribute values of the ranked attribute-value pairs according to the invention.
Fig. 10 is a flow chart of value extraction on the attribute values of the attribute-value pairs after adaptive filtering according to the invention.
Fig. 11 shows an experimental result of the method according to the invention.
Detailed description of the embodiments
Embodiments of the invention are described below with reference to the accompanying drawings.
Figs. 1A and 1B show illustrative examples of the object instances that the invention targets. They are used here to explain the terms of the invention so that those skilled in the art can understand it; the examples do not limit the object instances of the invention. As shown in Figs. 1A and 1B, taking the camera product object "Ricoh CX5" as an example, two object instances from different websites are listed, namely instance 1 and instance 2. The descriptions that instance 1 and instance 2 give of the object (Ricoh CX5) differ markedly in wording, style, and structure, so the applicant calls instances with such differing descriptions "heterogeneous" instances of the object. As Figs. 1A and 1B show, the description of each object generally contains attributes and attribute values, that is, attribute-value pairs. An "attribute" can be a physical property or functional characteristic used to describe the object, and an "attribute value" or "value" is the specific description of that attribute. Because an object generally has attributes of many kinds, it also has multiple attribute-value pairs. Instance 1 in Fig. 1A, for example, has the attributes optical sensor, aperture, flash type, pixels, display size, and optical zoom, each with a corresponding attribute value. Instance 2 in Fig. 1B likewise has attribute-value pairs, which are too numerous to list here; see the drawing. Note that although attributes and attribute values generally occur in pairs, some attribute values may be empty. For the object "Ricoh CX5" shown in Fig. 1, "Effective Pixels" - "Approximately 10.00 million Pixels" and "Weight" - "Approx. 197g" are both attribute-value pairs.
The term "object feature" is used below. An "object feature" is obtained by integrating and clustering multiple semantically similar attributes of object instances; "object feature" or "feature" is used here to distinguish it from an "attribute" of an original object instance. For example, the feature "Resolution" can represent attributes named "Resolution", "Effective pixels", "Megapixels", and so on. Clustering, as used here, groups attributes of multiple heterogeneous object instances that have a specified degree of similarity into one feature. To integrate and cluster the attributes of object instances into the "object features" referred to here, any existing method can be used, including the clustering disclosed in Chinese Patent Application No. 201210032507.6, filed by the present applicant with the Chinese Patent Office on February 14, 2012. How the clustering is done is not the subject of the present invention and is therefore not described in detail here.
The invention also uses the term "specific domain". A "specific domain" can be the domain to which a particular product belongs. For example, the "Ricoh CX5" above and the "Canon 5D Mark II" produced by Canon both belong to the specific domain "digital camera". "Smartphone", "navigator", "aero engine", "economy car", and so on can all be "specific domains" in the sense of the invention, also called "specific fields". The specific products under each specific domain are the "objects of the specific domain".
Fig. 2 is an overall schematic diagram of the method and system for aligning the attribute values of multiple heterogeneous instances of an object according to the invention. Specifically, based on the specific domain that the user wants to learn about, a search is made over the Internet to obtain object instances of that domain (web pages that describe the specifications of objects). All attributes in these object instances undergo attribute normalization, that is, clustering, to obtain the domain features of each object of the specific domain. The domain features are the result of integrating, through clustering, the attribute-value pairs of the initially obtained object instances. Based on the obtained domain features, "value alignment" processing is performed to establish a simple, clear object-feature relationship for use in building an object database.
Fig. 3 is a schematic diagram of the "value alignment" processing according to the invention. Because a domain feature is obtained by clustering multiple attributes, a given feature of an object may have many heterogeneous attribute-value pairs, as shown in Fig. 3. The purpose of "value alignment" is to select one brief and correct representative attribute value from the attribute values of the heterogeneous attribute-value pairs of a feature (for example, the attribute values of feature 3 in Fig. 3), or to compute such a representative attribute value from them.
"Value alignment" processing yields representative attribute values for the features of each of multiple objects, and thus an object-feature matrix. Fig. 4 is an illustrative diagram of a feature matrix obtained after "value alignment" processing according to the invention, for the specific domain "smartphone". The vertical axis lists the objects of the domain (smartphones of various brands and models) and the horizontal axis lists the domain features. Each element of the matrix is the value of an object under a feature. A structured feature matrix is one way of presenting an object database and can support comparison and statistical applications.
Fig. 5 is a schematic block diagram of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the invention. Value ranking module S1 ranks the attribute values according to multiple characteristics. Value normalization module S2 filters the ranked attribute values and extracts from them the attribute value of the object feature (to distinguish it from the initial attribute values, the "object value" mentioned below is the attribute value finally obtained). The input of the system is the object instances and the domain features obtained by attribute normalization. Its final output is the object value. The intermediate result is the ranked attribute values (hereinafter "ranked values").
Fig. 6 is an overall flow chart of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the invention. Steps S11-S15 are the value ranking process performed by value ranking module S1, and steps S21-S22 are the value normalization process performed by value normalization module S2.
Value ranking is the process of ranking all the attribute-value pairs that correspond to each feature of an object. It comprises: in step S11, calculating the importance score of each attribute-value pair under a feature; in step S12, calculating the distance score of each attribute-value pair under the feature; in step S13, calculating the frequency score of each attribute-value pair, which reflects how many pairs under the feature have similar attribute values; in step S14, calculating the evidence score; and in step S15, computing a weighted sum of these scores to obtain the total score. Steps S11-S14 need not be performed in any particular order. When implementing the invention, the user can decide, according to the required data accuracy, whether to perform any one of steps S11-S14 or any combination of them, and adjust the weights used to compute the total score in step S15 according to the steps selected.
Specifically, calculating the importance score $S_{source}$ in step S11 means assigning each attribute-value pair a normalized score based on its importance. The importance of a pair derives from the importance of the source of the object instance that the pair belongs to. For example, if that object instance ranks first in the user's search results for the specific domain, its importance score is "1"; if it ranks second, its importance score is "0.9"; and so on. The specific values can be set through assignment rules when the system is initialized, according to the user's needs.
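One possible assignment rule is sketched below. The text gives only the first two values (1 for rank one, 0.9 for rank two); the linear step of 0.1, the floor at 0, and the function name are assumptions made for illustration.

```python
def importance_score(rank: int, step: float = 0.1) -> float:
    """Map a 1-based search-result rank to a normalized score in [0, 1].

    Assumed rule: the score drops by `step` per rank and never goes below 0.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return max(0.0, round(1.0 - step * (rank - 1), 10))
```

A deployment could equally use a lookup table or a source-reputation list; the only requirement stated in the text is that the result be a normalized value.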
Fig. 7 is a flow chart of the distance score calculation for attribute-value pairs according to the invention.
In step S12, the distance score of each attribute-value pair under each feature is calculated. In step S121, the similarity between the attribute name of an attribute-value pair under a feature of an object and the domain feature is calculated. That is, for any attribute-value pair in the set of attribute-value pairs under a feature, the similarity $S_1$ between its attribute name and the domain feature is calculated. Any string similarity measure can be used. In the experiments, the Dice distance was used for English text and the Smith-Waterman distance for Chinese text. Since such similarity calculations are well known in the art, they are not described here.
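As a point of reference for $S_1$, the Dice measure can be sketched as below. The text names the measure but not its unit of comparison; computing it over character bigrams, lower-casing the input, and the function names are assumptions.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count the character bigrams of a lower-cased, stripped string."""
    s = text.lower().strip()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams a| + |bigrams b|)."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Strings too short to have bigrams: fall back to exact match.
        return 1.0 if a.lower().strip() == b.lower().strip() else 0.0
    overlap = sum((ba & bb).values())  # multiset intersection
    return 2.0 * overlap / total
```

Here `dice_similarity(attribute_name, feature_name)` would play the role of $S_1$ for English text.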
In step S122, the average hybrid similarity $S_2$ between any given attribute-value pair in the set under a feature of an object and the other attribute-value pairs in the set is calculated. The hybrid similarity takes into account attribute name similarity, attribute value similarity, and the cross similarity between attribute names and attribute values. For attribute-value pairs from the same object and from different objects, the hybrid similarity generally uses different matching strategies. Attribute-value pairs from the same object usually have quite similar attribute values, so value similarity matters more when matching pairs of the same object, whereas attribute name similarity matters more when matching pairs of different objects. In this step, however, only attribute-value pairs under the same feature of the same object are considered.

Specifically, the similarity between attribute i and attribute j is a real number greater than or equal to 0 and less than or equal to 1. It is obtained as a weighted sum of three partial scores: attribute name similarity, attribute value similarity, and cross similarity. Attribute name similarity is obtained by computing the text similarity between the attribute names. Many methods for this calculation exist in the prior art, so it is not described again. Attribute value similarity is obtained by computing the distance between the attribute values. After attribute normalization (clustering), an attribute value generally consists of a number and a unit of measurement. Such attribute values are not compared by text similarity; instead, the numbers are compared for equality after unit conversion, which gives a more accurate match. If an attribute value is still a plain string, the similarity between attribute values can be computed in the same way as the attribute name similarity above. Cross similarity is obtained by measuring the similarity between an attribute name and an attribute value, which uncovers additional matches. For example, the attribute-value pairs "Pixels: 18000000" and "Resolution: 18megapixels" obtain only a small similarity from name matching or value matching alone, although the two pairs should in fact match; cross-matching the attribute name "Pixels" against the value "18megapixels" readily reveals the similarity.

The three similarity values are combined by weighting to obtain the hybrid similarity between one attribute-value pair and each of the other attribute-value pairs in the set. The three weights may be equal or may differ according to the content of the feature, but they sum to 1. All the hybrid similarities computed for a pair are averaged to obtain the average hybrid similarity $S_2$ of that attribute-value pair within the feature.
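A minimal sketch of the hybrid similarity and its average follows. It assumes equal weights, uses Python's `difflib.SequenceMatcher` ratio as a stand-in for whichever string measure is actually chosen, and takes the cross similarity as the larger of the two name-versus-value comparisons; none of these choices is prescribed by the text, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def text_sim(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hybrid_similarity(p, q, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of name, value, and cross similarity of two pairs."""
    (name_p, value_p), (name_q, value_q) = p, q
    w_name, w_value, w_cross = weights  # expected to sum to 1
    name_sim = text_sim(name_p, name_q)
    value_sim = text_sim(value_p, value_q)
    cross_sim = max(text_sim(name_p, value_q), text_sim(value_p, name_q))
    return w_name * name_sim + w_value * value_sim + w_cross * cross_sim


def average_hybrid_similarity(index, pairs):
    """Average hybrid similarity of pairs[index] to all other pairs."""
    others = [q for j, q in enumerate(pairs) if j != index]
    if not others:
        return 0.0
    total = sum(hybrid_similarity(pairs[index], q) for q in others)
    return total / len(others)
```

With the example from the text, `text_sim("Pixels", "18megapixels")` is well above what name matching or value matching gives for the same two pairs, which is the effect the cross similarity is meant to capture.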
In step S123, for any attribute-value pair under a feature, a weighted sum of the similarity $S_1$ between the attribute name and the domain feature (calculated in step S121) and the average hybrid similarity $S_2$ (calculated in step S122) gives the distance score of that attribute-value pair:
$$S_{distance} = w_1 S_1 + w_2 S_2$$
where $w_1, w_2 \in [0,1]$ and $w_1 + w_2 = 1$, usually with $w_1 \ge w_2$. In the experimental embodiment given later, $w_1 = w_2 = 0.5$.
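The weighted sum can be written directly in code. The defaults below use the experimental setting of equal weights; the function name and the check that the weights sum to 1 are additions for illustration.

```python
def distance_score(s1: float, s2: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """Distance score: w1 * S1 + w2 * S2, with w1 + w2 = 1."""
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("w1 and w2 must sum to 1")
    return w1 * s1 + w2 * s2
```

For instance, a name similarity of 0.8 and an average hybrid similarity of 0.6 give a distance score of about 0.7 under equal weights.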
Fig. 8 is a flow chart of the frequency score calculation for attribute-value pairs according to the invention.
As shown in Fig. 8, the frequency score is calculated in step S13. Put simply, it counts how many times the attribute value of any given attribute-value pair in the set has a similarity above a predetermined threshold with the attribute values of the other pairs in the set. Step S13 comprises steps S131-S133.
In step S131, the similarity between the attribute value of any given attribute-value pair in the set and each of the other attribute values is calculated. For non-numeric attribute values, a string similarity matching method is used. For numeric attribute values, it is first determined whether the units of measurement can be converted into each other. If they can, the numbers are compared for equality after conversion; otherwise string similarity is still used.
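The branching in step S131 might look like the sketch below. The conversion table, the pattern used to split a value into number and unit, the score of 0 for convertible but unequal numbers, and the use of `difflib.SequenceMatcher` as the string fallback are all assumptions; the text only states that converted numbers are compared for equality.

```python
import re
from difflib import SequenceMatcher

# Hypothetical conversion table: unit -> (base unit, factor to base unit).
UNIT_TABLE = {
    "g": ("g", 1.0), "kg": ("g", 1000.0),
    "mm": ("mm", 1.0), "cm": ("mm", 10.0),
}
_MEASURE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([a-zA-Z]+)\s*$")


def parse_measure(value: str):
    """Return (amount in base unit, base unit), or None if not parseable."""
    match = _MEASURE.match(value)
    if not match:
        return None
    unit = match.group(2).lower()
    if unit not in UNIT_TABLE:
        return None
    base, factor = UNIT_TABLE[unit]
    return float(match.group(1)) * factor, base


def value_similarity(a: str, b: str) -> float:
    """Compare numerically when units convert; otherwise compare as text."""
    pa, pb = parse_measure(a), parse_measure(b)
    if pa and pb and pa[1] == pb[1]:
        tolerance = 1e-9 * max(1.0, abs(pa[0]))
        return 1.0 if abs(pa[0] - pb[0]) <= tolerance else 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

Values such as "0.197 kg" and "197g" then count as equal even though their strings differ.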
In step S132, a threshold ε ∈ [0,1] is set, generally with ε ≥ 0.5. The threshold can be set according to how uniformly the attribute values of the attribute are expressed: if uniformity is high, it can be set somewhat higher, for example 0.7 or 0.8; if uniformity is low, it can be set somewhat lower, for example 0.6. Given this threshold, the number n of similarities computed in step S131 for an attribute value that exceed ε is counted. This includes the similarity of the attribute value with itself, which is 1 and is therefore always counted in n.
In step S133, normalization is performed according to the following formula to obtain the frequency score $S_{frequency}$ of the attribute value of an attribute-value pair under a feature:
$$S_{frequency} = \frac{n}{N}$$
where n is the number counted in step S132 and N is the number of all attribute-value pairs under the feature.
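Steps S131-S133 can be sketched together as follows, taking the frequency score to be the counted number n divided by N, the ratio described in the summary. The similarity function is a stand-in (`difflib.SequenceMatcher`), and the default threshold of 0.6 is one of the example values from the text.

```python
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def frequency_scores(values, epsilon: float = 0.6):
    """For each value, the share of values whose similarity exceeds epsilon.

    Each value is also compared with itself (similarity 1), so n >= 1.
    """
    n_total = len(values)
    scores = []
    for value in values:
        n = sum(1 for other in values if sim(value, other) > epsilon)
        scores.append(n / n_total)
    return scores
```

A value that agrees with most of the others under the feature thus scores close to 1, while an outlier scores 1/N.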
In step S14, the evidence score is calculated. Generally, within the same specific domain and under the same domain feature, the attribute values of the attribute-value pairs of different objects are similar to some degree. Based on the object values already computed according to the invention for other objects under the same feature, the similarity between the attribute value of an attribute-value pair under the current feature and each existing object value is calculated, and the average hybrid similarity with the other object values, $S_{evidence}$, is obtained. The average hybrid similarity is calculated in the same way as in step S122 and is not described again here.
Finally, in ranking process S1, step S15 is performed to calculate the total score $S_{value}$ of the current attribute-value pair. Specifically, a weighted sum of the four scores above is computed with the following formula:
$$S_{value} = w_s S_{source} + w_d S_{distance} + w_f S_{frequency} + w_e S_{evidence}$$
where $w_s, w_d, w_f, w_e \in [0,1]$ and $w_s + w_d + w_f + w_e = 1$, usually with $w_d \ge w_f \ge w_e \ge w_s$.
The ranking process S1 above is performed for all attribute-value pairs of the current feature to obtain the total score $S_{value}$ of each pair, and the pairs are then sorted by score to obtain an ordered list of attribute-value pairs.
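Step S15 and the final ordering can be sketched as below. The example weights are hypothetical; they merely respect the usual ordering given in the text (distance ≥ frequency ≥ evidence ≥ source) and sum to 1. Passing a weights dictionary with fewer keys corresponds to using only some of steps S11-S14.

```python
def total_score(scores: dict, weights: dict) -> float:
    """Weighted sum of the partial scores named in `weights`."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


def rank_pairs(scored_pairs, weights):
    """Sort (pair, scores) items by total score, highest first."""
    return sorted(scored_pairs,
                  key=lambda item: total_score(item[1], weights),
                  reverse=True)


# Hypothetical weights for illustration only.
EXAMPLE_WEIGHTS = {"distance": 0.4, "frequency": 0.3,
                   "evidence": 0.2, "source": 0.1}
```

Each item passed to `rank_pairs` is a tuple of an attribute-value pair and a dictionary holding its partial scores under the same keys as the weights.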
To assign a final object value (a normalized attribute value) to the normalized feature, the attribute values of all the ranked attribute-value pairs must also be normalized.
Fig. 9 is an overall flow chart of the normalization of the attribute values of the ranked attribute-value pairs according to the invention. As shown in Fig. 9, the input to value normalization module S2 is the ranked attribute-value pairs and their respective total scores. In step S21, noise values among the ranked attribute values are filtered out by adaptive filtering. Then, in step S22, value extraction is performed to obtain the final object value.
In general, a data set inevitably contains noise, such as erroneous data or data that should not belong to the set. To make the result of the common-substring value extraction in step S22 more accurate, noisy data must be reduced beforehand by adaptive filtering. Therefore, to perform adaptive filtering in step S21, two predetermined thresholds are provided that define the lower bound of a large set and the upper bound of a small set, namely the maximum-scale threshold tL and the minimum-scale threshold sL. In the experimental examples described below, an integer greater than 25 and an integer less than 15 were used as tL and sL, respectively. Both thresholds can be customized according to the size of the data set. According to an embodiment of the present invention, adaptive filtering is performed as follows.
Assume that the number of attribute-value pairs is N. Then:
when N ≥ tL, the top x percent of the ranked attribute-value pairs in the set are retained and the remaining attribute-value pairs are discarded;
when N ≤ sL, the top y percent of the ranked attribute-value pairs in the set are retained and the remaining attribute-value pairs are discarded;
if neither condition is met, the top z percent of the ranked attribute-value pairs in the set are retained and the remaining attribute-value pairs are discarded.
Here x, y, z ∈ [0,1] and y ≥ z ≥ x. In the experiments, x = 60%, y = 80% and z = 70%.
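The three filtering rules can be sketched in Python as shown below. The defaults `t_l=30` and `s_l=10` are hypothetical values that merely satisfy "greater than 25" and "less than 15"; the percentages follow the experimental settings, and rounding the kept count up is an assumption, since the description does not say how fractional counts are handled.

```python
def adaptive_filter(
    ranked_pairs: list,
    t_l: int = 30,      # hypothetical maximum-scale threshold tL (> 25)
    s_l: int = 10,      # hypothetical minimum-scale threshold sL (< 15)
    x_pct: int = 60,    # share kept for large sets
    y_pct: int = 80,    # share kept for small sets
    z_pct: int = 70,    # share kept otherwise
) -> list:
    """Keep the top share of an already ranked list, depending on its size N."""
    n = len(ranked_pairs)
    if n >= t_l:
        pct = x_pct
    elif n <= s_l:
        pct = y_pct
    else:
        pct = z_pct
    # Integer ceiling of n * pct / 100 (assumed rounding rule).
    keep = (n * pct + 99) // 100
    return ranked_pairs[:keep]
```

Large sets are trimmed more aggressively than small ones, which matches the constraint y ≥ z ≥ x: a small set has fewer values to spare.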
Fig. 10 is a flowchart of value extraction performed on the attribute values of the attribute-value pairs after adaptive filtering according to the present invention.
As shown in Fig. 10, in step S221 the average length $len_{avg}$ of all attribute values in the adaptively filtered set of attribute-value pairs is calculated using the following formula:
$$len_{avg} = \frac{\sum len_{value} S_{value}}{N}$$
where the length $len_{value}$ of a value is the number of words in the attribute value string, and $S_{value}$ is the total score of each attribute-value pair. Then, in step S222, a score $S_{word}$ is calculated for every word in the attribute values of the set, based on how often the word occurs in the set, using the following formula:
$$S_{word} = tf_{word} \sum_{word \in value} S_{value}$$
where $tf_{word}$ is the term frequency of the word in the value set. If a word occurs in several attribute values, the total scores $S_{value}$ of all attribute-value pairs whose values contain the word are summed, and the sum is multiplied by $tf_{word}$ to give the score of the word. Next, in step S223, the common substrings of the attribute value strings are obtained from the attribute values of the adaptively filtered set by any existing method, and those common substrings whose length is less than or equal to the average length $len_{avg}$ calculated above are taken as candidate object values (that is, candidates for the final attribute value of the feature). Then, in step S224, the score of each candidate common substring is obtained by summing the scores $S_{word}$ of all words in it:
$$S_{candidate} = \sum_{word \in candidate} S_{word}$$
Finally, in step S225, the candidate with the highest score among the common substrings is selected as the object value.
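Steps S221 to S225 can be sketched end to end as follows. Several details are assumptions made only for this illustration: values are split into words on whitespace, $tf_{word}$ is taken as the word's share of all word occurrences in the set, a "common substring" is a contiguous word sequence shared by at least two values, and ties are broken alphabetically. The function name `extract_object_value` is hypothetical.

```python
from collections import Counter


def extract_object_value(pairs: list[tuple[str, float]]) -> str:
    """pairs: (attribute value, total score S_value) after adaptive filtering."""
    if not pairs:
        return ""
    tokenized = [(value.split(), score) for value, score in pairs]
    n = len(tokenized)

    # S221: score-weighted average length, measured in words.
    len_avg = sum(len(words) * score for words, score in tokenized) / n

    # S222: S_word = tf_word * sum of S_value over the values containing the word.
    counts = Counter(word for words, _ in tokenized for word in words)
    total_words = sum(counts.values())
    word_score = {}
    for word, count in counts.items():
        tf = count / total_words  # assumed definition of tf_word
        mass = sum(score for words, score in tokenized if word in words)
        word_score[word] = tf * mass

    # S223: contiguous word sequences shared by several values, no longer than len_avg.
    support = Counter()
    for words, _ in tokenized:
        grams = {
            tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)
        }
        support.update(grams)  # each value contributes a sequence at most once
    min_support = min(2, n)
    candidates = sorted(
        g for g, c in support.items() if c >= min_support and len(g) <= len_avg
    )

    # S224 and S225: score each candidate by its words and keep the best one.
    best = max(candidates, key=lambda g: sum(word_score[w] for w in g), default=())
    return " ".join(best)
```

Because $len_{avg}$ is weighted by scores that do not exceed 1, it tends to be smaller than the plain mean length, which biases the result toward short candidates.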
Fig. 11 shows an experimental result of the method according to the present invention (for English and Chinese, respectively). The figure shows that the ranking of attribute values is reasonable and that the normalized object values meet the optimization goal, namely being concise, correct and representative.
The series of operations described in this specification can be performed by hardware, software, or a combination of both. When the series of operations is performed by software, the computer program can be installed in memory in a computer built into dedicated hardware so that the computer executes the program. Alternatively, the computer program can be installed on a general-purpose computer capable of performing various types of processing so that the computer executes the program.
For example, the computer program can be stored in advance on a hard disk or in ROM (read-only memory) serving as a recording medium. Alternatively, the computer program can be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy disk, CD-ROM (compact disc read-only memory), MO (magneto-optical) disk, DVD (digital versatile disc), magnetic disk, or semiconductor memory. Such a removable recording medium can be provided as packaged software.
The present invention has been described in detail with reference to specific embodiments. However, it is evident that those skilled in the art can modify and substitute the embodiments without departing from the spirit of the present invention. In other words, the present invention is disclosed by way of illustration and should not be construed restrictively. To determine the gist of the present invention, the appended claims should be considered.
Claims (9)
1. A method for aligning attribute values of heterogeneous instances of an object, comprising:
performing attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features;
ranking all attribute-value pairs in the set of attribute-value pairs belonging to an obtained domain feature;
comparing the number N of attribute-value pairs in the set with a predetermined maximum-scale threshold tL and a predetermined minimum-scale threshold sL, and performing adaptive filtering on all the attribute-value pairs to eliminate noise; and
performing value extraction on the attribute values of the filtered attribute-value pairs to select the object value of the object from them,
wherein performing value extraction on the attribute values of the filtered attribute-value pairs to select the object value of the object from them comprises:
calculating the average length $len_{avg}$ of the attribute values of the attribute-value pairs in the set according to the following formula:
$$len_{avg} = \frac{\sum len_{value} S_{value}}{N}$$
where the length $len_{value}$ is the number of words in the attribute value string, and $S_{value}$ is the total score of the attribute-value pair to which the attribute value belongs;
calculating the score $S_{word}$ of each word in the attribute values of the attribute-value pairs in the set from the frequency $tf_{word}$ with which the word occurs in all the attribute values, according to the following formula:
$$S_{word} = tf_{word} \sum_{word \in value} S_{value}$$
where $S_{value}$ is the total score of the attribute-value pair to which an attribute value containing the word belongs;
extracting the common substrings of the attribute value strings of the attribute-value pairs in the set, and taking the extracted common substrings whose length is less than or equal to $len_{avg}$ as candidate values for the object value; and
summing the scores $S_{word}$ of all words in the string of each candidate value to obtain the score $S_{candidate}$ of the candidate value, and taking the candidate value with the highest score $S_{candidate}$ as the final object value.
2. The method of claim 1, wherein ranking all attribute-value pairs in the set of attribute-value pairs belonging to an obtained domain feature comprises:
calculating an importance score $S_{source}$ of each attribute-value pair based on the source of the object instance of that attribute-value pair in the set;
calculating a distance score $S_{distance}$ of each attribute-value pair based on the similarity between the attribute-value pairs in the set;
calculating a frequency score $S_{frequency}$ of each attribute-value pair based on the similarity between the attribute values of the attribute-value pairs in the set;
calculating an evidence score $S_{evidence}$ of each attribute-value pair based on the similarity between the attribute values of the attribute-value pairs in the set and existing object values of other objects in the same domain as the object; and
performing a weighted sum based on at least two of the scores calculated above to calculate the total score $S_{value}$ of each attribute-value pair in the set.
3. The method of claim 2, wherein the total score $S_{value}$ can be calculated from all four scores with the following formula:
$$S_{value} = w_s S_{source} + w_d S_{distance} + w_f S_{frequency} + w_e S_{evidence}$$
where $w_s, w_d, w_f, w_e \in [0,1]$ and $w_s + w_d + w_f + w_e = 1$.
4. The method of claim 3, wherein $w_d \ge w_f \ge w_e \ge w_s$.
5. The method of claim 2, wherein calculating the distance score $S_{distance}$ of each attribute-value pair based on the similarity between the attribute-value pairs in the set comprises:
calculating the similarity $S_1$ between the attribute name of any one attribute-value pair and the domain feature by a string comparison method;
calculating the average hybrid similarity $S_2$ between said any one attribute-value pair and the other attribute-value pairs in the set, wherein the attribute-name similarity, the attribute-value similarity and the attribute-name/attribute-value cross similarity are weighted to obtain the hybrid similarity between one attribute-value pair and each other attribute-value pair in the set, and all the calculated hybrid similarities are averaged to obtain the average hybrid similarity $S_2$; and
performing the following weighted sum on the calculated similarity $S_1$ and average hybrid similarity $S_2$ to obtain the distance score $S_{distance}$ of said any one attribute-value pair:
$$S_{distance} = w_1 S_1 + w_2 S_2$$
where $w_1, w_2 \in [0,1]$, $w_1 + w_2 = 1$, and $w_1 \ge w_2$.
6. The method of claim 2, wherein calculating the frequency score $S_{frequency}$ of each attribute-value pair based on the similarity between the attribute values of the attribute-value pairs in the set comprises:
calculating the similarity between the attribute value of any one attribute-value pair in the set and the attribute values of the other attribute-value pairs;
comparing each similarity between the attribute value of said any one attribute-value pair and the attribute values of the other attribute-value pairs with a predetermined threshold ε, and counting the number n of similarities greater than the threshold ε; and
calculating the frequency score $S_{frequency}$ of said any one attribute-value pair based on the following formula:
$$S_{frequency} = \frac{n}{N}$$
where N is the number of attribute-value pairs in the set, and 1 ≥ ε ≥ 0.5.
7. The method of claim 2, wherein the evidence score $S_{evidence}$ of the attribute-value pair is calculated based on the average hybrid similarity between the attribute values of the attribute-value pairs in the set and the existing object values of other objects in the same domain as the object.
8. The method of claim 1, wherein comparing the number N of attribute-value pairs in the set with the predetermined maximum-scale threshold tL and minimum-scale threshold sL, and performing adaptive filtering on all the attribute-value pairs to eliminate noise, comprises:
if N ≥ tL, retaining the top x percent of the ranked attribute-value pairs in the set;
if N ≤ sL, retaining the top y percent of the ranked attribute-value pairs in the set; or
if neither of the above two conditions is met, retaining the top z percent of the ranked attribute-value pairs in the set,
where x, y, z ∈ [0,1] and y ≥ z ≥ x.
9. A system for aligning attribute values of heterogeneous instances of an object, comprising:
an attribute-name normalization module that performs attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features;
a value ranking module that ranks all attribute-value pairs in the set of attribute-value pairs belonging to an obtained domain feature; and
an attribute-value normalization module that compares the number N of attribute-value pairs in the set with a predetermined maximum-scale threshold tL and a predetermined minimum-scale threshold sL, performs adaptive filtering on all the attribute-value pairs to eliminate noise, and performs value extraction on the attribute values of the filtered attribute-value pairs to select the object value of the object from them,
wherein performing value extraction on the attribute values of the filtered attribute-value pairs to select the object value of the object from them comprises:
calculating the average length $len_{avg}$ of the attribute values of the attribute-value pairs in the set according to the following formula:
$$len_{avg} = \frac{\sum len_{value} S_{value}}{N}$$
where the length $len_{value}$ is the number of words in the attribute value string, and $S_{value}$ is the total score of the attribute-value pair to which the attribute value belongs;
calculating the score $S_{word}$ of each word in the attribute values of the attribute-value pairs in the set from the frequency $tf_{word}$ with which the word occurs in all the attribute values, according to the following formula:
$$S_{word} = tf_{word} \sum_{word \in value} S_{value}$$
where $S_{value}$ is the total score of the attribute-value pair to which an attribute value containing the word belongs;
extracting the common substrings of the attribute value strings of the attribute-value pairs in the set, and taking the extracted common substrings whose length is less than or equal to $len_{avg}$ as candidate values for the object value; and
summing the scores $S_{word}$ of all words in the string of each candidate value to obtain the score $S_{candidate}$ of the candidate value, and taking the candidate value with the highest score $S_{candidate}$ as the final object value.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210166855.2A CN103425711B (en) | 2012-05-25 | 2012-05-25 | Object value alignment schemes based on many object instances |
JP2013109182A JP2013246826A (en) | 2012-05-25 | 2013-05-23 | Attribute values alignment system for differently structured object instances, method and program of attribute values alignment system for differently structured object instances |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210166855.2A CN103425711B (en) | 2012-05-25 | 2012-05-25 | Object value alignment schemes based on many object instances |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103425711A CN103425711A (en) | 2013-12-04 |
CN103425711B true CN103425711B (en) | 2017-08-25 |
Family
ID=49650466
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210166855.2A Active CN103425711B (en) | 2012-05-25 | 2012-05-25 | Object value alignment schemes based on many object instances |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP2013246826A (en) |
CN (1) | CN103425711B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104778205B (en) * | 2015-03-09 | 2019-02-15 | 浙江大学 | A kind of mobile application sequence and clustering method based on Heterogeneous Information network |
CN104965869A (en) * | 2015-06-09 | 2015-10-07 | 浙江大学 | Mobile application sorting and clustering method based on heterogeneous information network |
CN106202041B (en) * | 2016-07-01 | 2019-07-09 | 北京奇虎科技有限公司 | A kind of method and apparatus of entity alignment problem in solution knowledge mapping |
CN107807939B (en) * | 2016-09-09 | 2021-12-28 | 阿里巴巴集团控股有限公司 | Data object sorting method and device |
CN110147487B (en) * | 2017-10-17 | 2023-07-04 | 阿里巴巴华南技术有限公司 | Method and system for determining object heat and processing equipment |
CN111459990B (en) * | 2020-03-31 | 2021-07-06 | 腾讯科技(深圳)有限公司 | Object processing method, system, computer readable storage medium and computer device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6829778B1 (en) * | 2000-11-09 | 2004-12-07 | Koninklijke Philips Electronics N.V. | Method and system for limiting repetitive presentations based on content filtering |
CN1716259A (en) * | 2004-05-14 | 2006-01-04 | 微软公司 | Method and system for ranking objects based on intra-type and inter-type relationships |
CN101286156A (en) * | 2007-05-29 | 2008-10-15 | 北大方正集团有限公司 | Method for removing repeated object based on metadata |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102129631B (en) * | 2010-01-13 | 2015-04-22 | 阿里巴巴集团控股有限公司 | Method, equipment and system for SPU attribute aggregation |
CN102402535A (en) * | 2010-09-13 | 2012-04-04 | 阿里巴巴集团控股有限公司 | Method and system for constructing product library |
Non-Patent Citations (3)
Title |
---|
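Research on Key Issues of Data Cleaning in Web Data Integration; Zhang Haojun; China Master's Theses Full-text Database, Information Science and Technology; 20100515; full text * |
Research and Implementation of Web Page Information Extraction Technology Based on Structural and Visual Features; Zhu Kai; China Master's Theses Full-text Database, Information Science and Technology; 20080715; full text * |
Research on Entity Unification Technology for Web Data Integration; Kong Qing; China Master's Theses Full-text Database, Information Science and Technology; 20100815; thesis Chapter 2 Section 2.1, Chapter 3 Section 3.3, Chapter 4 Section 4.3.4, Chapter 5 Sections 5.1 and 5.4 * |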
Also Published As
Publication number | Publication date |
---|---|
JP2013246826A (en) | 2013-12-09 |
CN103425711A (en) | 2013-12-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |