CN103425711A - Object value aligning method based on multiple object instances - Google Patents


Info

Publication number
CN103425711A
Authority
CN
China
Prior art keywords
value
attribute
worth
similarity
pair set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101668552A
Other languages
Chinese (zh)
Other versions
CN103425711B (en
Inventor
姜珊珊
郑继川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to CN201210166855.2A priority Critical patent/CN103425711B/en
Priority to JP2013109182A priority patent/JP2013246826A/en
Publication of CN103425711A publication Critical patent/CN103425711A/en
Application granted granted Critical
Publication of CN103425711B publication Critical patent/CN103425711B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a method for aligning the attribute values of heterogeneous instances of an object. The method comprises: performing attribute standardization on the attribute names of the attribute-value pairs of the heterogeneous instances of the same object to obtain domain features; sorting all attribute-value pairs in the attribute-value pair set belonging to an obtained domain feature; and selecting a suitable common substring from the attribute values of the sorted attribute-value pairs as the object value of the object. A system for aligning the attribute values of heterogeneous instances of an object is also provided.

Description

Object value alignment method based on multiple object instances
Technical Field
The present invention relates to a method and system for aligning the attribute values of multiple heterogeneous instances of an object.
Background
With the spread of the Internet, people increasingly obtain resources of interest online and organize them to meet their own needs. The Internet hosts web pages that give standard descriptions of various products, and these pages usually state the attributes and attribute values of the products explicitly. To obtain this content, one can extract the attributes and attribute values of these product objects and build an object database from the extracted information. However, when different web page providers state the attributes and attribute values of the same object (that is, a product), they differ in language, wording, the number of attributes, and the format of the attribute values. Reviews, rankings, and description pages for a product also exist on the Internet in large numbers. As a result, the same object has heterogeneous instances on the Internet (web pages or content pages describing the various attributes of the object). What is needed is a technique for extracting, from numerous and jumbled Internet resources, the features of the heterogeneous object instances of a specific domain that arise from these differences, so that they can be integrated into data that users can readily use.
Chinese Patent Application No. 201210032507.6, submitted by the present applicant to the Patent Office of the People's Republic of China on February 14, 2012, describes clustering the attributes of various heterogeneous instances. The entire content of that application is incorporated herein by reference. After the heterogeneous instances of an object or product have been clustered into domain features in the manner disclosed in that application, the clustered attribute values need further processing to obtain a representative value; specifically, the attribute values are sorted and standardized. Most prior art focuses on a specific domain. Domain information is difficult to collect and requires considerable manpower, although such methods usually obtain good results. Techniques for selecting the most representative item (or items) from a heterogeneous data set appear mostly in query expansion or image processing. Because the target data sets differ, the sorting and extraction methods also differ. U.S. Pat. US 8035855B, "Automatic selection of a subset representative pages from a multi-page document", provides a method for automatically selecting representative pages from a multi-page document. U.S. Pat. US 6728704B, "Method and apparatus for merging result lists from multiple search engines", provides a method and system for merging result lists from multiple search engines. U.S. Pat. 20110145289A1, "System and Method For Generating A Pool of Matched Content", discloses a method and system for generating a pool of matched content. However, these patents are usually specific to a certain domain or language and lack generality. A method is therefore needed that is independent of domain and language, processes the clustered attribute values to obtain a representative value, and yields results of acceptable precision.
Summary of the Invention
The present invention has been made in view of the above problems in the prior art. The invention relates generally to information processing and information integration techniques, and more specifically to a method and system for aligning the attribute values of multiple heterogeneous instances of an object: after the attributes of the heterogeneous instances have been standardized, the most representative attribute value (or values) is selected or generated from the many attribute values of the same standardized attribute.
According to one aspect of the present invention, a method for aligning the attribute values of heterogeneous instances of an object is provided, comprising: performing attribute standardization on the attribute names of the attribute-value pairs of the heterogeneous instances of the same object to obtain domain features; sorting all attribute-value pairs in the attribute-value pair set belonging to an obtained domain feature; and selecting a suitable common substring from the attribute values of the sorted attribute-value pairs as the object value of the object.
According to one embodiment of the present invention, sorting all attribute-value pairs in the attribute-value pair set belonging to the obtained domain feature comprises: calculating an importance score for each attribute-value pair based on the source of the object instance to which the pair belongs; calculating a distance score for each attribute-value pair based on the similarity between attribute-value pairs in the set; calculating a frequency score for each attribute-value pair based on the similarity between the attribute values of the pairs in the set; calculating an evidence score for each attribute-value pair based on the similarity between its attribute value and the existing object values of other objects in the same domain as the object; and computing a weighted sum of at least two of the above scores to obtain the total score of each attribute-value pair in the set.
According to one embodiment of the present invention, calculating the distance score of each attribute-value pair based on the similarity between attribute-value pairs in the set comprises: calculating the similarity between the attribute name of any one attribute-value pair and the domain feature by a string comparison method; calculating the average hybrid similarity between that attribute-value pair and the other attribute-value pairs in the set; and computing a weighted sum of the calculated similarity and the average hybrid similarity to obtain the distance score of that attribute-value pair.
According to one embodiment of the present invention, calculating the frequency score of an attribute-value pair based on the similarity between the attribute values of the pairs in the set comprises: calculating the similarity between the attribute value of any one attribute-value pair in the set and the attribute values of the other pairs; comparing each of these similarities with a predetermined threshold and counting the similarities that exceed the threshold; and calculating the ratio of this count to the number of attribute-value pairs in the set.
According to one embodiment of the present invention, the evidence score of an attribute-value pair is calculated from the average hybrid similarity between the attribute value of the pair and the existing object values of other objects in the same domain as the object.
According to one embodiment of the present invention, selecting a suitable common substring from the attribute values of the sorted attribute-value pairs as the object value of the object comprises: comparing the number of attribute-value pairs in the set with a predetermined maximum-scale threshold and a predetermined minimum-scale threshold, and adaptively filtering the attribute-value pairs to eliminate noise; and performing value extraction on the attribute values of the filtered attribute-value pairs to select a suitable common substring as the object value of the object.
According to one embodiment of the present invention, comparing the number N of attribute-value pairs in the set with the predetermined maximum-scale threshold tL and minimum-scale threshold sL and adaptively filtering the pairs to eliminate noise comprises: if N ≥ tL, retaining the top fraction x of the sorted attribute-value pairs; if N ≤ sL, retaining the top fraction y of the sorted attribute-value pairs; and otherwise retaining the top fraction z of the sorted attribute-value pairs, where x, y, z ∈ [0,1] and y ≥ z ≥ x.
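The adaptive filtering rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name, the default thresholds (30 and 10), the default fractions (0.3, 0.8, 0.5), and the choice to round the retained count up are all hypothetical and were picked only so that they satisfy the stated constraints.

```python
import math


def adaptive_filter(ranked_pairs, t_large=30, t_small=10, x=0.3, y=0.8, z=0.5):
    """Keep the top fraction of an already-ranked list of attribute-value pairs.

    t_large / t_small play the role of the two scale thresholds; x, y, z are
    the fractions kept for large, small, and medium sets (all hypothetical).
    """
    if not (0.0 <= x <= z <= y <= 1.0):
        raise ValueError("fractions must satisfy 0 <= x <= z <= y <= 1")
    n = len(ranked_pairs)
    if n >= t_large:      # large set: filter aggressively
        fraction = x
    elif n <= t_small:    # small set: keep most of the pairs
        fraction = y
    else:                 # medium set
        fraction = z
    keep = math.ceil(fraction * n)  # rounding up is an assumption
    return ranked_pairs[:keep]
```

A large set is cut down harder than a small one because, with y ≥ z ≥ x, the retained fraction shrinks as the set grows.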
According to one embodiment of the present invention, performing value extraction on the attribute values of the filtered attribute-value pairs to select a suitable common substring as the object value of the object comprises: calculating the average length len_avg of the attribute values of the attribute-value pairs in the set; calculating a score for each word from the frequency with which it occurs across all of these attribute values; extracting the common substrings of the attribute value strings and taking those whose length is less than or equal to len_avg as candidate object values; and summing the scores of all words in each candidate to obtain the candidate's score, the candidate with the highest score being taken as the final object value.
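The value extraction steps above can be illustrated with a short Python sketch. The source does not say whether length is measured in characters or words, or how a "common" substring is defined, so this sketch makes its own assumptions: values are split on whitespace, length is counted in words, and a candidate is any contiguous word sequence that occurs in at least two attribute values. The function name is hypothetical.

```python
from collections import Counter


def extract_object_value(values):
    """Pick the best-scoring common word sequence from a list of attribute values."""
    tokenized = [v.split() for v in values if v.strip()]
    if not tokenized:
        return None
    # Average length of the attribute values, in words (assumption).
    len_avg = sum(len(t) for t in tokenized) / len(tokenized)
    # Word score = relative frequency of the word over all values.
    freq = Counter(w for toks in tokenized for w in toks)
    total = sum(freq.values())
    word_score = {w: c / total for w, c in freq.items()}
    # Count in how many values each word sequence of length <= len_avg occurs.
    support = Counter()
    for toks in tokenized:
        seen = set()
        for i in range(len(toks)):
            for j in range(i + 1, len(toks) + 1):
                if j - i <= len_avg:
                    seen.add(tuple(toks[i:j]))
        support.update(seen)
    min_support = 2 if len(tokenized) > 1 else 1
    candidates = [c for c, k in support.items() if k >= min_support]
    if not candidates:
        return None
    # Candidate score = sum of its word scores; support breaks ties.
    best = max(candidates, key=lambda c: (sum(word_score[w] for w in c), support[c]))
    return " ".join(best)
```

For the values "approx. 197 g", "197 g", and "weight 197 g with battery", the only sequences shared by at least two values are "197", "g", and "197 g", and the last one has the highest summed word score.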
According to another aspect of the present invention, a system for aligning the attribute values of heterogeneous instances of an object is provided, comprising: an attribute name standardization module, which performs attribute standardization on the attribute names of the attribute-value pairs of the heterogeneous instances of the same object to obtain domain features; a value sorting module, which sorts all attribute-value pairs in the attribute-value pair set belonging to an obtained domain feature; and an attribute value standardization module, which selects a suitable common substring from the attribute values of the sorted attribute-value pairs as the object value of the object.
The above and other objects, features, advantages, and technical and industrial significance of the present invention will be better understood by reading the following detailed description of preferred embodiments in conjunction with the accompanying drawings.
Brief Description of the Drawings
Figs. 1A and 1B show an illustrative example of the object instances to which the present invention is applied.
Fig. 2 is a general schematic diagram of the method and system for aligning the attribute values of multiple heterogeneous instances of an object according to the present invention.
Fig. 3 is a schematic diagram of the "value alignment" processing according to the present invention.
Fig. 4 is an illustrative diagram of a feature matrix obtained after the "value alignment" processing according to the present invention.
Fig. 5 is a schematic block diagram of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the present invention.
Fig. 6 is an overall flowchart of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the present invention.
Fig. 7 is a flowchart of calculating the distance score of an attribute-value pair according to the present invention.
Fig. 8 is a flowchart of calculating the frequency score of an attribute-value pair according to the present invention.
Fig. 9 is an overall flowchart of standardizing the attribute values of the sorted attribute-value pairs according to the present invention.
Fig. 10 is a flowchart of value extraction from the attribute values of the attribute-value pairs after adaptive filtering according to the present invention.
Fig. 11 shows an experimental result of the method according to the present invention.
Detailed Description
Embodiments of the present invention are described below with reference to the accompanying drawings.
Figs. 1A and 1B show an illustrative example of the object instances to which the present invention is applied, and are used here to explain the terms of the invention so that those skilled in the art can understand it more easily. The example given here does not limit the object instances of the invention. As shown in Figs. 1A and 1B, taking the camera product "Ricoh CX5" as the object, two object instances from different websites are listed, namely instance 1 and instance 2. The descriptions of the object (Ricoh CX5) given by instance 1 and instance 2 are clearly different, with different wording, style, and structure; the applicant therefore calls instances whose descriptions differ in this way "heterogeneous" instances of the object. As Figs. 1A and 1B show, the description of each object usually contains attributes and attribute values, that is, attribute-value pairs. An "attribute" describes a physical or functional property of the object, and an "attribute value" or "value" is the specific description of the attribute. Because each object usually has many attributes, it also has multiple attribute-value pairs. Instance 1 in Fig. 1A has the following attributes: optical sensor, aperture, flash type, pixels, display size, optical zoom, and so on. Each of these attributes has a corresponding attribute value. Instance 2 in Fig. 1B likewise has such attribute-value pairs, which are not enumerated here; refer to the drawing. Note that although attributes and attribute values generally occur in pairs, some attribute values may be empty. For the object "Ricoh CX5" shown in Fig. 1, "Effective Pixels" - "Approximately 10.00 million pixels" and "Weight" - "Approx. 197g" are both attribute-value pairs.
The term "object feature" is used below. An "object feature" in the present invention is an attribute obtained by integrating and clustering semantically similar "attributes" of multiple object instances; "object feature" or "feature" is used here to distinguish it from the "attribute" of the original object instances. For example, the feature "Resolution" can represent attributes named "Resolution", "Effective pixels", "Megapixels", and so on. Clustering in the present invention means grouping attributes of multiple heterogeneous object instances that have a specified degree of similarity into one feature. To integrate and cluster the "attributes" of object instances into "object features", any existing method can be used, including the clustering approach disclosed in Chinese Patent Application No. 201210032507.6, submitted by the present applicant to the Patent Office of the People's Republic of China on February 14, 2012. How to cluster is not the subject of the present invention, however, and is not described in detail here.
The present invention also uses the term "specific domain". A "specific domain" can be the domain to which a specific product belongs. For example, the "Ricoh CX5" mentioned above and the "Canon 5D Mark II" produced by Canon both belong to the specific domain "digital camera". "Smartphone", "navigation device", "aircraft engine", "economy car", and so on can all be "specific domains" in the sense of the invention. The concrete products in each specific domain are the "objects of the specific domain".
Fig. 2 is a general schematic diagram of the method and system for aligning the attribute values of multiple heterogeneous instances of an object according to the present invention. Specifically, based on the specific domain that the user wants to understand, a search is performed over the Internet to obtain object instances of that domain (web pages giving standard descriptions of the objects). All attributes in these object instances undergo attribute standardization, that is, clustering, to obtain the domain features of each object in the domain. The resulting domain features are the integrated result of clustering the attribute-value pairs of the initially obtained object instances. Based on the obtained domain features, "value alignment" processing is performed to establish a simple, visible object-feature relationship that can be used to build an object database.
Fig. 3 is a schematic diagram of the "value alignment" processing according to the present invention. Because a domain feature is obtained by clustering multiple attributes, a single feature of an object may have many heterogeneous attribute-value pairs, as shown in Fig. 3. "Value alignment" processing either chooses a brief and correct representative attribute value from the attribute values of the heterogeneous attribute-value pairs of a feature (for example, the attribute values of feature 3 in Fig. 3), or computes such a representative attribute value from those attribute values.
By performing "value alignment" processing, the representative attribute values of the features of each of a plurality of objects can be obtained, yielding an object-feature matrix. Fig. 4 is an illustrative diagram of a feature matrix obtained after "value alignment" processing according to the present invention, where the specific domain is "smartphone". The vertical axis lists the objects in the domain (smartphones of various brands and models), and the horizontal axis lists the domain features. Each element of the matrix is the value of an object under a feature. A structured feature matrix is one way of presenting an object database and can support applications such as comparison and statistics.
Fig. 5 is a schematic block diagram of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the present invention. Value sorting module S1 sorts the attribute values according to multiple characteristics. Value standardization module S2 filters the sorted attribute values and performs extraction to obtain the attribute value of the object feature (to distinguish it from the original attribute values, the finally obtained attribute value is referred to below as the "object value"). The inputs of the system are the object instances and the domain features obtained through attribute standardization. The final output is the object value. The intermediate result is the sorted attribute values (referred to below as "ranked values").
Fig. 6 is an overall flowchart of the method for aligning the attribute values of multiple heterogeneous instances of an object according to the present invention. Steps S11-S15 constitute the value sorting process performed by value sorting module S1, and steps S21-S22 constitute the value standardization process performed by value standardization module S2.
The value sorting process sorts all attribute-value pairs that belong to each feature of an object. It comprises: in step S11, calculating the importance score of each attribute-value pair in a feature; in step S12, calculating the distance score of each attribute-value pair in the feature; in step S13, calculating the frequency score of each attribute-value pair, based on how many pairs in the feature have similar attribute values; in step S14, calculating the evidence score; and in step S15, computing a weighted sum of the above scores to obtain the total score. Steps S11-S14 need not be performed in any particular order. In practice, depending on the precision required, the user may choose to perform any one of steps S11-S14 or any combination of them, and adjust the weights used in step S15 according to the selected steps.
Specifically, in step S11 the importance score s_source is calculated: a normalized numeric score is assigned according to the importance of the attribute-value pair, which depends on the importance of the source of the object instance to which the pair belongs. For example, if that object instance ranks first in the user's search results for the specific domain, the pair is given an importance score of 1; if it ranks second, the pair is given an importance score of 0.9; and so on. The specific scoring rule can be set according to the user's needs when the system is initialized.
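One possible scoring rule consistent with the two examples above (rank 1 gives 1, rank 2 gives 0.9) is a linear decay clipped at zero. The sketch below is only an assumption about how the rule might continue beyond the second rank; the source leaves the rule to the user, and the function name and `step` parameter are hypothetical.

```python
def importance_score(rank: int, step: float = 0.1) -> float:
    """Map a 1-based search-result rank to a normalized score in [0, 1].

    Linear decay by `step` per rank is an assumed rule, not one stated
    in the source beyond its first two data points.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return max(0.0, round(1.0 - step * (rank - 1), 10))
```

With the default step, instances ranked eleventh or lower all receive a score of 0, so a smaller step may suit longer result lists.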
Fig. 7 is a flowchart of calculating the distance score of an attribute-value pair according to the present invention.
In step S12, the distance score of each attribute-value pair under a feature is calculated. Specifically, in step S121, for any attribute-value pair in the set of attribute-value pairs under a feature of an object, the similarity s_1 between the pair's attribute name and the domain feature is calculated. Any string similarity measure can be used. In the experiments, the Dice distance was used for English text and the Smith-Waterman distance for Chinese text. Because these similarity measures are well known in the art, they are not described here.
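As an illustration of the kind of string measure mentioned for step S121, the following Python sketch computes a Dice coefficient over character bigrams. The source names the Dice measure but does not say which variant was used, so the bigram tokenization, the case folding, and the function names are assumptions.

```python
from collections import Counter


def bigrams(text):
    """Return the list of lowercase character bigrams of a string."""
    t = text.lower()
    return [t[i:i + 2] for i in range(len(t) - 1)]


def dice_similarity(a, b):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) over bigram multisets."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        # Both strings are shorter than two characters.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((ca & cb).values())  # multiset intersection
    return 2.0 * overlap / total
```

For example, `dice_similarity("Effective pixels", "Pixels")` would give s_1 for the attribute name "Effective pixels" against a domain feature named "Pixels".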
In step S122, for any attribute-value pair in the attribute-value pair set under a feature of an object, the average hybrid similarity s_2 between that pair and the other pairs in the set is calculated. The hybrid similarity takes into account attribute name similarity, attribute value similarity, and attribute name-attribute value cross similarity. For attribute-value pairs from the same object and from different objects, the hybrid similarity usually uses different matching strategies. Pairs from the same object usually have more similar attribute values, so value similarity is more important when matching pairs of the same object, whereas attribute name similarity is more important when matching pairs of different objects. In this step, however, only the attribute-value pairs under the same feature of the same object are considered. The similarity between attribute-value pair i and attribute-value pair j is a real number between 0 and 1 inclusive, obtained as a weighted sum of three component scores: attribute name similarity, attribute value similarity, and cross similarity. The attribute name similarity is the text similarity between the attribute names; its calculation is covered in much prior art and is not repeated here. The attribute value similarity is obtained from the distance between the attribute values. After attribute standardization (clustering), an attribute value usually consists of a number and a unit of measurement. Such values are not compared with a text similarity measure; instead, the units are converted and the numbers are compared directly for equality, which gives a more accurate match. If the attribute value is still a plain string, the similarity between attribute values can be calculated in the same way as the attribute name similarity. The cross similarity measures the similarity between attribute names and values in order to find more matches. For example, for the pairs "Pixels: 18000000" and "Resolution: 18megapixels", matching attribute names alone or values alone yields very low similarity, although the two pairs do in fact match; cross-matching the attribute name "Pixels" with the value "18megapixels" readily reveals the similarity. The three similarity values are combined in a weighted sum to obtain the hybrid similarity between an attribute-value pair and each other pair in the set. The weights may be equal or may vary with the content of the feature, but they sum to 1. Averaging all the calculated hybrid similarities gives the average hybrid similarity s_2 of the attribute-value pair within the feature.
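The hybrid similarity of step S122 can be sketched in Python as follows. This is a simplified illustration under several assumptions: `difflib.SequenceMatcher` from the standard library stands in for whatever text measure is actually used, all three components are computed as text similarity (the unit-aware value comparison is omitted), the cross similarity is taken as the larger of the two name-to-value comparisons, and the weights default to equal thirds. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def text_sim(a, b):
    """Stand-in text similarity in [0, 1] (assumption: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hybrid_similarity(p, q, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of name, value, and cross similarity of two (name, value) pairs."""
    (name_p, value_p), (name_q, value_q) = p, q
    w_name, w_value, w_cross = weights  # expected to sum to 1
    name_sim = text_sim(name_p, name_q)
    value_sim = text_sim(value_p, value_q)
    # Cross similarity: compare each name with the other pair's value.
    cross_sim = max(text_sim(name_p, value_q), text_sim(value_p, name_q))
    return w_name * name_sim + w_value * value_sim + w_cross * cross_sim


def average_hybrid_similarity(index, pairs):
    """Mean hybrid similarity between pairs[index] and every other pair (s_2)."""
    others = [q for j, q in enumerate(pairs) if j != index]
    if not others:
        return 0.0
    return sum(hybrid_similarity(pairs[index], q) for q in others) / len(others)
```

In the example from the text, the cross term compares "Pixels" with "18megapixels", which share the substring "pixels", so it contributes where the name-only and value-only terms contribute little.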
In step S123, for any attribute-value pair in a feature, a weighted sum of the similarity $s_1$ between the attribute name and the domain feature calculated in step S121 and the average hybrid similarity $s_2$ calculated in step S122 gives the distance score of the pair:
$$s_{distance} = w_1 s_1 + w_2 s_2$$
where $w_1, w_2 \in [0,1]$ and $w_1 + w_2 = 1$; usually $w_1 \ge w_2$. In the experimental example given later, $w_1 = w_2 = 0.5$.
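As a small worked example of the distance score, the sketch below evaluates the weighted sum for given component similarities. The function name is hypothetical; the default weight of 0.5 is the value reported for the experimental example, and the second weight is derived from the constraint that the two weights sum to 1.

```python
def distance_score(s1: float, s2: float, w1: float = 0.5) -> float:
    """Distance score as w1*s1 + w2*s2 with w2 = 1 - w1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    w2 = 1.0 - w1
    return w1 * s1 + w2 * s2
```

For instance, a name similarity of 0.8 and an average hybrid similarity of 0.4 give a distance score of about 0.6 with equal weights.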
Fig. 8 is a flowchart of calculating the frequency score of an attribute-value pair according to the present invention.
As shown in Fig. 8, step S13 calculates the frequency score. Put simply, for any attribute-value pair in the set, it counts how many times the similarity between that pair's attribute value and the attribute values of the pairs in the set exceeds a predetermined threshold. Step S13 comprises steps S131-S133.
In step S131, the similarity between the attribute value of any attribute-value pair in the set and the other attribute values is calculated. For non-numeric attribute values, a string similarity method is used. For numeric attribute values, it is first determined whether the units of measurement can be converted into each other. If they can, the converted numbers are compared for equality; otherwise a string similarity measure is used.
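The comparison logic of step S131 might look like the following Python sketch. The unit table, the parsing pattern, the relative tolerance used for "equal", and the use of `difflib.SequenceMatcher` as the fallback string measure are all assumptions made for illustration; the source does not specify any of them.

```python
import re
from difflib import SequenceMatcher

# Hypothetical conversion table: unit -> (dimension, factor to a base unit).
UNITS = {
    "g": ("mass", 1.0),
    "kg": ("mass", 1000.0),
    "mm": ("length", 1.0),
    "cm": ("length", 10.0),
}


def parse_quantity(value):
    """Parse '<number> <unit>' into (dimension, amount in base unit), else None."""
    m = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*", value)
    if not m or m.group(2).lower() not in UNITS:
        return None
    dimension, factor = UNITS[m.group(2).lower()]
    return dimension, float(m.group(1)) * factor


def value_similarity(a, b):
    """1.0 or 0.0 for convertible numeric values, string similarity otherwise."""
    pa, pb = parse_quantity(a), parse_quantity(b)
    if pa and pb and pa[0] == pb[0]:
        # Units are mutually convertible: compare the converted numbers.
        tolerance = 1e-9 * max(abs(pa[1]), abs(pb[1]), 1.0)
        return 1.0 if abs(pa[1] - pb[1]) <= tolerance else 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

Under these assumptions, "197 g" and "0.197 kg" are treated as identical, while "197 g" and "200 g" are treated as different.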
In step S132, a threshold ε ∈ [0,1] is set, usually with ε ≥ 0.5. The threshold can be set according to how uniformly the attribute values of the attribute are written: if they are fairly uniform, it can be set higher, for example 0.7 or 0.8; if they are less uniform, it can be set somewhat lower, for example 0.6. Using this threshold, the number n of similarities calculated in step S131 for an attribute value that exceed ε is counted. The comparison of the attribute value with itself gives a similarity of 1 and is therefore always included in n.
In step S133, normalization is performed according to the following formula to obtain the frequency score $s_{frequency}$ of the attribute value of an attribute-value pair in a feature:
$$s_{frequency} = \frac{n}{N}$$
where $N$ is the number of attribute-value pairs under the feature.
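Steps S131-S133 can be combined into a few lines of Python. In this sketch the similarity function is passed in by the caller, and the default threshold of 0.6 is one of the example values from the text; the function name and signature are hypothetical.

```python
def frequency_score(index, values, similarity, threshold=0.6):
    """Fraction n/N of values whose similarity to values[index] exceeds the threshold.

    The value is also compared with itself, so n >= 1 whenever
    similarity(v, v) == 1 and threshold < 1.
    """
    if not values:
        raise ValueError("values must not be empty")
    target = values[index]
    n = sum(1 for v in values if similarity(target, v) > threshold)
    return n / len(values)


def exact_match(a, b):
    """Toy similarity used only for the usage example below."""
    return 1.0 if a == b else 0.0


values = ["10 MP", "10 MP", "12 MP", "10 MP"]
score_first = frequency_score(0, values, exact_match)  # 3 of the 4 values match
score_third = frequency_score(2, values, exact_match)  # only the value itself matches
```

A value that agrees with most of the other values in the set therefore scores close to 1, while an outlier scores close to 1/N.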
In step S14, the evidence score is calculated. In general, within the same specific domain and under the same domain feature, the attribute values of the attribute-value pairs of different objects are similar to some degree. Using the object values already calculated according to the invention for the same feature of other objects, the similarity between the attribute value of each attribute-value pair of the current feature and those existing object values is calculated, giving the average hybrid similarity s_evidence with the other object values. The average hybrid similarity is calculated in the same way as in step S122 and is not repeated here.
Finally, within the ranking step S1, step S15 is executed to compute the total score $s_{value}$ of the current attribute-value pair. Specifically, the four scores above are combined by the following weighted sum:
$$s_{value} = w_s s_{source} + w_d s_{distance} + w_f s_{frequency} + w_e s_{evidence}$$
where $w_s, w_d, w_f, w_e \in [0,1]$ and $w_s + w_d + w_f + w_e = 1$; usually $w_d \ge w_f \ge w_e \ge w_s$.
Step S1 is carried out for every attribute-value pair of the current feature, which yields the total score $s_{value}$ of each pair under that feature. The pairs are then sorted by score to obtain an ordered list of attribute-value pairs.
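The weighted sum and the subsequent ranking might be sketched as follows. The default weights are an assumption chosen only to satisfy the stated constraints (they sum to 1 and respect the usual ordering of the weights); the description does not give concrete values, and the function names are hypothetical.

```python
def total_score(source, distance, frequency, evidence,
                weights=(0.1, 0.4, 0.3, 0.2)):
    """Weighted sum of the four scores; weights are (w_s, w_d, w_f, w_e)."""
    w_s, w_d, w_f, w_e = weights
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_s * source + w_d * distance + w_f * frequency + w_e * evidence


def rank_pairs(pairs):
    """Rank attribute values by total score, highest first.

    `pairs` is a list of (attribute value,
    (s_source, s_distance, s_frequency, s_evidence)) tuples.
    """
    scored = [(value, total_score(*scores)) for value, scores in pairs]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With these illustrative weights, a pair that scores well only on distance outranks one that scores well only on source, which reflects the usual ordering of the weights.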
To assign a final object value (that is, a normalized attribute value) to the normalized feature, the attribute values of all ranked attribute-value pairs must also be normalized.
Fig. 9 is an overall flowchart of normalizing the attribute values of the ranked attribute-value pairs according to the present invention. As shown in Fig. 9, the value normalization module S2 takes as input the ranked attribute-value pairs and their respective total scores. In step S21, noise values among the ranked attribute values are removed by adaptive filtering. Then, in step S22, value extraction is performed to obtain the final object value.
A data set generally cannot avoid containing noise, such as erroneous data or data that should not belong to the set. To make the result of the common-substring extraction in step S22 more accurate, the data are first denoised by adaptive filtering. In step S21, two predetermined thresholds are therefore set for the adaptive filtering: a maximum-scale threshold tL, which is the lower bound on the size of a large set, and a minimum-scale threshold sL, which is the upper bound on the size of a small set. In the experimental example described later, an integer greater than 25 is used for tL and an integer less than 15 for sL. Both thresholds can be customized according to the scale of the data set. According to an embodiment of the present invention, adaptive filtering is carried out as follows.
Let N be the number of attribute-value pairs.
When N ≥ tL, the attribute values in the top fraction x of the ranked attribute-value pair set are retained and the remaining attribute values are discarded;
When N ≤ sL, the attribute values in the top fraction y of the ranked attribute-value pair set are retained and the remaining attribute values are discarded;
If neither condition is met, the attribute values in the top fraction z of the ranked attribute-value pair set are retained and the remaining attribute values are discarded.
Here x, y, z ∈ [0,1] and y ≥ z ≥ x. In the experiments, x = 60%, y = 80% and z = 70%.
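The filtering rule above can be sketched as follows. The concrete thresholds 26 and 14 are assumptions (the description only says "greater than 25" and "less than 15"), as is rounding the number of retained values up; the function name is hypothetical.

```python
import math


def adaptive_filter(ranked_pairs, t_l=26, s_l=14, x=0.6, y=0.8, z=0.7):
    """Keep the top fraction of an already ranked list, depending on its size."""
    n = len(ranked_pairs)
    if n == 0:
        return []
    if n >= t_l:        # large set: keep the smallest fraction x
        fraction = x
    elif n <= s_l:      # small set: keep the largest fraction y
        fraction = y
    else:               # medium-sized set: keep fraction z
        fraction = z
    # Round up (assumption); the small offset guards against float error.
    keep = max(1, math.ceil(n * fraction - 1e-9))
    return ranked_pairs[:keep]
```

The design intent is that a small set loses less data, because every value carries relatively more information, while a large set can afford to drop a larger share of its lowest-ranked values.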
Fig. 10 is a flowchart of extracting a value from the attribute values of the attribute-value pairs after adaptive filtering according to the present invention.
As shown in Fig. 10, in step S221 the average length $len_{avg}$ of all attribute values in the adaptively filtered attribute-value pair set is computed with the following formula:
$$len_{avg} = \frac{\sum len_{value} \, s_{value}}{N}$$
where the length $len_{value}$ of a value is the number of words in the attribute-value string, and $s_{value}$ is the total score of the corresponding attribute-value pair. Then, in step S222, the following formula is used to compute a score $s_{word}$ for every word in the attribute values of the set, based on how often the word occurs in the set:
$$s_{word} = tf_{word} \sum_{word \in value} s_{value}$$
where $tf_{word}$ is the frequency of the word in the value set. If a word occurs in several attribute values, the total scores $s_{value}$ of all attribute-value pairs containing the word are summed and the sum is multiplied by $tf_{word}$ to give the score of the word. Next, in step S223, any existing method is used to obtain the common substrings of the attribute-value strings in the adaptively filtered set, and those common substrings whose length is less than or equal to the average length $len_{avg}$ computed above are taken as candidate object values (that is, candidates for the final attribute value of the feature). Then, in step S224, the score of each candidate common substring is obtained by summing the scores $s_{word}$ of all words in it according to the following formula:
$$s_{candidate} = \sum_{word \in candidate} s_{word}$$
Finally, in step S225, the candidate common substring with the highest score is chosen as the object value.
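Steps S221 to S225 can be put together in a short sketch. Several points are assumptions rather than part of the description: words are obtained by lower-casing and splitting on whitespace, $tf_{word}$ is taken as the total number of occurrences of a word in the set, and a "common substring" is approximated as a contiguous word sequence shared by at least two attribute values (the description allows any existing method). The function names are hypothetical.

```python
from collections import Counter


def tokenize(value):
    # Assumption: words are lower-cased, whitespace-separated tokens.
    return value.lower().split()


def extract_object_value(scored_values):
    """scored_values: list of (attribute value string, total score s_value)."""
    tokens = [(tokenize(value), score) for value, score in scored_values]
    n = len(tokens)
    if n == 0:
        return None

    # S221: score-weighted average length, measured in words.
    len_avg = sum(len(words) * score for words, score in tokens) / n

    # S222: s_word = tf_word * (sum of s_value over values containing the word).
    tf = Counter(word for words, _ in tokens for word in words)
    score_sum = Counter()
    for words, score in tokens:
        for word in set(words):
            score_sum[word] += score
    s_word = {word: tf[word] * score_sum[word] for word in tf}

    # S223: candidate common substrings = word sequences shared by >= 2 values
    # whose length does not exceed len_avg.
    support = Counter()
    for words, _ in tokens:
        grams = {tuple(words[i:j])
                 for i in range(len(words))
                 for j in range(i + 1, len(words) + 1)}
        support.update(grams)
    candidates = [g for g, count in support.items()
                  if count >= 2 and len(g) <= len_avg]
    if not candidates:
        return None

    # S224 and S225: score each candidate and return the best one.
    best = max(candidates, key=lambda g: sum(s_word[w] for w in g))
    return " ".join(best)
```

For the scored values ("Canon EOS 5D", 0.9), ("Canon EOS 5D camera", 0.8) and ("EOS 5D", 0.7), the weighted average length is about 2.4 words, so candidates are limited to two words, and the sketch selects "eos 5d" because both words occur in all three values.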
Fig. 11 shows an experimental result of the method according to the present invention (for English and Chinese respectively). The figure shows that the ranking of the attribute values is reasonable and that the normalized object values meet the optimization goals of being brief, correct and representative.
The series of operations described in the specification can be carried out by hardware, software, or a combination of both. When the series of operations is carried out by software, the computer program can be installed in the memory of a computer built into dedicated hardware so that the computer executes the program. Alternatively, the computer program can be installed in a general-purpose computer capable of various types of processing so that the computer executes the program.
For example, the computer program can be stored in advance on a hard disk or in a ROM (read-only memory) serving as the recording medium. Alternatively, the computer program can be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy disk, a CD-ROM (compact disc read-only memory), an MO (magneto-optical) disc, a DVD (digital versatile disc), a magnetic disk or a semiconductor memory. Such a removable recording medium can be provided as packaged software.
The present invention has been described in detail with reference to specific embodiments. It is clear, however, that those skilled in the art can modify and substitute the embodiments without departing from the spirit of the present invention. In other words, the present invention has been disclosed by way of illustration and should not be construed restrictively. The appended claims should be considered in determining the gist of the present invention.

Claims (11)

1. A method for aligning the attribute values of heterogeneous instances of an object, comprising:
performing attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features;
ranking all attribute-value pairs in the attribute-value pair set belonging to an obtained domain feature; and
selecting, from the attribute values of all ranked attribute-value pairs, a suitable common substring as the object value of the object.
2. The method of claim 1, wherein ranking all attribute-value pairs in the attribute-value pair set belonging to the obtained domain feature comprises:
calculating an importance score $s_{source}$ of each attribute-value pair in the set based on the source of the object instance to which the pair belongs;
calculating a distance score $s_{distance}$ of each attribute-value pair based on the similarity between the attribute-value pairs in the set;
calculating a frequency score $s_{frequency}$ of each attribute-value pair based on the similarity between the attribute values of the attribute-value pairs in the set;
calculating an evidence score $s_{evidence}$ of each attribute-value pair based on the similarity between its attribute value and the existing object values of other objects in the same domain as the object; and
performing a weighted sum of at least two of the scores calculated above to obtain the total score $s_{value}$ of each attribute-value pair in the set.
3. The method of claim 2, wherein the total score $s_{value}$ is calculated from all four scores by the following formula:
$$s_{value} = w_s s_{source} + w_d s_{distance} + w_f s_{frequency} + w_e s_{evidence}$$
where $w_s, w_d, w_f, w_e \in [0,1]$ and $w_s + w_d + w_f + w_e = 1$.
4. The method of claim 3, wherein $w_d \ge w_f \ge w_e \ge w_s$.
5. The method of claim 2, wherein calculating the distance score $s_{distance}$ of each attribute-value pair based on the similarity between the attribute-value pairs in the set comprises:
calculating, by a string-comparison method, the similarity $s_1$ between the attribute name of any one attribute-value pair and the domain feature;
calculating the average hybrid similarity $s_2$ between said attribute-value pair and the other attribute-value pairs in the set; and
obtaining the distance score $s_{distance}$ of said attribute-value pair as the following weighted sum of the similarity $s_1$ and the average hybrid similarity $s_2$:
$$s_{distance} = w_1 s_1 + w_2 s_2$$
where $w_1, w_2 \in [0,1]$, $w_1 + w_2 = 1$, and $w_1 \ge w_2$.
6. The method of claim 2, wherein calculating the frequency score $s_{frequency}$ of each attribute-value pair based on the similarity between the attribute values of the attribute-value pairs in the set comprises:
calculating the similarity between the attribute value of any one attribute-value pair in the set and the attribute values of the other attribute-value pairs;
comparing each of these similarities with a predetermined threshold ε and counting the number n of similarities greater than ε; and
calculating the frequency score $s_{frequency}$ of said attribute-value pair by the following formula:
$$s_{frequency} = \frac{n}{N}$$
where N is the number of attribute-value pairs in the set and 1 ≥ ε ≥ 0.5.
7. The method of claim 2, wherein the evidence score $s_{evidence}$ of the attribute-value pair is calculated based on the average hybrid similarity between the attribute value of the attribute-value pair in the set and the existing object values of other objects in the same domain as the object.
8. The method of any one of the preceding claims, wherein selecting, from the attribute values of all ranked attribute-value pairs, a suitable common substring as the object value of the object comprises:
comparing the number N of attribute-value pairs in the set with a predetermined maximum-scale threshold tL and a predetermined minimum-scale threshold sL, and adaptively filtering all attribute-value pairs to eliminate noise; and
performing value extraction on the attribute values of the filtered attribute-value pairs, thereby selecting a suitable common substring as the object value of the object.
9. The method of claim 8, wherein comparing the number N of attribute-value pairs in the set with the predetermined maximum-scale threshold tL and minimum-scale threshold sL and adaptively filtering all attribute-value pairs to eliminate noise comprises:
if N ≥ tL, retaining the top fraction x of the ranked attribute-value pairs;
if N ≤ sL, retaining the top fraction y of the ranked attribute-value pairs; or
if neither of the two preceding conditions is met, retaining the top fraction z of the ranked attribute-value pairs,
where x, y, z ∈ [0,1] and y ≥ z ≥ x.
10. The method of claim 8, wherein performing value extraction on the attribute values of the filtered attribute-value pairs, thereby selecting a suitable common substring as the object value of the object, comprises:
calculating the average length $len_{avg}$ of the attribute values of the attribute-value pairs in the set according to the following formula:
$$len_{avg} = \frac{\sum len_{value} \, s_{value}}{N}$$
where the value length $len_{value}$ is the number of words in the attribute-value string and $s_{value}$ is the total score of the attribute-value pair to which the attribute value belongs;
calculating the score $s_{word}$ of each word from the frequency $tf_{word}$ with which the word occurs in all attribute values of the attribute-value pairs in the set, according to the following formula:
$$s_{word} = tf_{word} \sum_{word \in value} s_{value}$$
where $s_{value}$ is the total score of the attribute-value pair whose attribute value contains the word;
extracting the common substrings of the attribute-value strings of the attribute-value pairs in the set, and taking those extracted common substrings whose length is less than or equal to $len_{avg}$ as candidate values for the object value; and
summing the scores $s_{word}$ of all words in the string of each candidate value to obtain the score $s_{candidate}$ of each candidate value, and taking the candidate value with the highest score $s_{candidate}$ as the final object value.
11. A system for aligning the attribute values of heterogeneous instances of an object, comprising:
an attribute-name normalization module, which performs attribute normalization on the attribute names of the attribute-value pairs of heterogeneous instances of the same object to obtain domain features;
a value ranking module, which ranks all attribute-value pairs in the attribute-value pair set belonging to an obtained domain feature; and
an attribute-value normalization module, which selects, from the attribute values of all ranked attribute-value pairs, a suitable common substring as the object value of the object.
CN201210166855.2A 2012-05-25 2012-05-25 Object value alignment schemes based on many object instances Active CN103425711B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201210166855.2A CN103425711B (en) 2012-05-25 2012-05-25 Object value alignment schemes based on many object instances
JP2013109182A JP2013246826A (en) 2012-05-25 2013-05-23 Attribute values alignment system for differently structured object instances, method and program of attribute values alignment system for differently structured object instances

Publications (2)

Publication Number Publication Date
CN103425711A true CN103425711A (en) 2013-12-04
CN103425711B CN103425711B (en) 2017-08-25

Family

ID=49650466

Country Status (2)

Country Link
JP (1) JP2013246826A (en)
CN (1) CN103425711B (en)

Also Published As

Publication number Publication date
CN103425711B (en) 2017-08-25
JP2013246826A (en) 2013-12-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant