CN102750289A

CN102750289A - Tag group classifying method and equipment as well as data mixing method and equipment

Info

Publication number: CN102750289A
Application number: CN2011101015142A
Authority: CN
Inventors: 张军; 钟朝亮; 王主龙; 大木宪二; 田中昌弘; 粂照宣; 松尾昭彦
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-04-19
Filing date: 2011-04-19
Publication date: 2012-10-24
Anticipated expiration: 2031-04-19
Also published as: CN102750289B; JP5928091B2; JP2012226740A

Abstract

The invention discloses a tag group classifying method and equipment as well as a data mixing method and equipment, wherein a tag group comprises at least one tag and corresponding data defined by the at least one tag. The classifying method comprises: determining, from a plurality of synonymous tag sets, the synonymous tag set to which each tag of a tag group belongs; generating a feature vector corresponding to the tag group, wherein the elements of the feature vector correspond respectively to different synonymous tag sets in the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element; calculating the similarity between the feature vector and the core feature vector of each class in at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of all tag groups classified into the class; and classifying the tag group, according to the calculated similarity, into a close class among the at least one class.

Description

Tag group classifying method and equipment, and data mixing method and equipment

Technical field

The present invention relates to data processing, and more specifically to a method and equipment for classifying tag groups and to a method and equipment for mixing data.

Background art

At present, there are various data format standards for describing data, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation) and CSV (Comma Separated Values). Each data format standard defines its own tags for describing the meaning of data content. For list-type data, for example a news list containing several news items, a group of tags can be defined to describe the news content: title (title), pubdate (publication time), author (author), and so on. Similarly, for a schedule list containing several schedule entries, a group of tags can be defined to describe the schedule content: starttime (start time), endtime (end time), attendees (participants), location (place), and so on. With such a group of tags, data content can be published or accessed conveniently.

However, for data content with the same or similar meaning, different data format standards may use different tags. For example, for the data content "the person who created the data", different data format standards may use different tags such as "author (author)", "writer (writer)" or "creator (creator)". There is therefore a need to identify data content of the same or similar meaning that is described with different tags, and to describe that data content with unified tags, thereby mixing data content of the same or similar meaning.

In the prior art, whether a plurality of pieces of data content are the same or similar is judged by directly comparing the pieces of data content themselves. Because the volume of the data content itself is large, directly comparing the data content often incurs a large amount of computation, and the accuracy of the judgment is also poor.

The prior art also includes techniques that judge whether the data content described by two tags is the same or similar by comparing whether the two tags are the same or similar. In actual use, however, there are many data format standards, and the tags they adopt vary widely. If only individual tags are compared with each other, it is difficult to take the various features of the various tags into account comprehensively, so the accuracy of the judgment is again poor.

Moreover, as stated above, for a news list containing several news items, for example, a group of tags (hereinafter called a "tag group") can be defined to describe the content of one news item: title (title), pubdate (publication time), author (author), and so on. It can be seen that a piece of data content is generally defined by a tag group containing several tags that describe it. Therefore, to judge whether pieces of data content have the same or similar meaning, one should judge comprehensively whether the tag groups describing them are the same or similar. If only individual tags are compared with each other, it is difficult to judge whether data content described by tag groups containing several tags has the same or similar meaning.

Summary of the invention

In view of the above problems, the applicant recognized that data content with the same or similar meaning should be identified by comparing whether tag groups are the same or similar. The core idea of the invention is that, to compare whether tag groups are the same or similar, the same or similar tag groups can first be grouped into the same class, and a newly discovered tag group can then be compared with the classes of tag groups already formed. Because all tag groups in the same class are the same or similar, a class of tag groups comprehensively reflects the various features of its tag groups. Thus, by comparing a tag group with classes of tag groups, whether tag groups are the same or similar can be judged more accurately.

According to one embodiment of the present invention, a method of classifying a tag group is provided, wherein the tag group comprises at least one tag and corresponding data defined by the at least one tag. The method comprises: determining, among a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs; generating a feature vector corresponding to the tag group, in which each element corresponds to a different synonymous tag set of the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element; calculating the similarity between the feature vector and the core feature vector of each class in at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class; and classifying the tag group, according to the calculated similarity, into a close class among the at least one class.

The classifying step comprises: determining, according to whether the calculated similarity between the tag group and each class in the at least one class exceeds a predetermined threshold, whether that class is a close class; and, if there is no close class among the at least one class, classifying the tag group into a new class.

In the classifying step, if there are a plurality of close classes, the tag group is classified into the class corresponding to the maximum calculated similarity.

The similarity includes cosine similarity.

According to another embodiment of the present invention, equipment for classifying a tag group is provided, wherein the tag group comprises at least one tag and corresponding data defined by the at least one tag. The equipment comprises: a synonymous tag set determining unit for determining, among a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs; a feature vector generating unit for generating a feature vector corresponding to the tag group, in which each element corresponds to a different synonymous tag set of the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element; a similarity calculating unit for calculating the similarity between the feature vector and the core feature vector of each class in at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class; and a tag group classifying unit for classifying the tag group, according to the calculated similarity, into a close class among the at least one class.

The tag group classifying unit comprises a class determining unit for determining, according to whether the calculated similarity between the tag group and each class in the at least one class exceeds a predetermined threshold, whether that class is a close class; and, if there is no close class among the at least one class, classifying the tag group into a new class.

The class determining unit is further used for: if there are a plurality of close classes, classifying the tag group into the class corresponding to the maximum calculated similarity.

The similarity includes cosine similarity.

According to another embodiment of the present invention, a method of mixing data based on tag groups is provided, the method comprising: classifying tag groups into at least one class using the above method of classifying a tag group; and replacing each tag of each tag group in the same class with the designated tag in the synonymous tag set to which that tag belongs.
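The replacement step of the mixing method can be illustrated with a minimal sketch. The names `TAG_TO_SET`, `DESIGNATED`, `unify_tags` and `mix` are hypothetical, as is the choice of which tag is the designated tag of each synonymous tag set; the source does not say how the designated tag is chosen. The sketch also assumes the records passed in already belong to the same class.

```python
# Hypothetical lookup: tag -> id of the synonymous tag set it belongs to.
TAG_TO_SET = {
    "author": "S1", "creator": "S1", "writer": "S1",
    "pubdate": "S2", "publishdate": "S2",
    "url": "S3", "link": "S3",
    "summary": "S4", "description": "S4",
    "title": "S5",
}

# Assumption: one designated (unified) tag per synonymous tag set.
DESIGNATED = {"S1": "author", "S2": "pubdate", "S3": "link",
              "S4": "summary", "S5": "title"}


def unify_tags(record):
    """Replace every tag of one data entry with its designated tag.

    Tags that belong to no known synonymous tag set are kept unchanged
    (an assumption; the source does not cover this case).
    """
    unified = {}
    for tag, value in record.items():
        set_id = TAG_TO_SET.get(tag.lower())
        unified[DESIGNATED.get(set_id, tag)] = value
    return unified


def mix(records):
    """Mix the data entries of one class under unified tags."""
    return [unify_tags(record) for record in records]


news = [
    {"title": "A", "author": "x", "pubdate": "2011-04-19"},
    {"title": "B", "writer": "y", "link": "http://example.com"},
]
mixed = mix(news)
```

After mixing, both entries expose the creator under the same tag, `author`, so they can be merged into one list.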

According to another embodiment of the present invention, equipment for mixing data based on tag groups is provided, the equipment comprising: a classifying unit for classifying tag groups into at least one class using the above equipment for classifying a tag group; and a replacing unit for replacing each tag of each tag group in the same class with the designated tag in the synonymous tag set to which that tag belongs.

By comparing the similarity between the feature vector of a tag group and the core feature vector of a class of tag groups, the present invention can judge more accurately and more effectively whether tag groups are the same or similar, and can therefore mix the same or similar data more accurately and more effectively.

Description of the drawings

The above and other objects, features and advantages of the present invention will be understood more easily from the following description of embodiments of the invention taken in conjunction with the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals.

Fig. 1 is a flowchart showing a method of classifying a tag group according to an embodiment of the present invention;

Fig. 2 is a flowchart showing the detailed flow of the classifying step in the method of classifying a tag group according to an embodiment of the present invention;

Fig. 3 is a block diagram showing equipment for classifying a tag group according to another embodiment of the present invention;

Fig. 4 is a flowchart showing a method of mixing data based on tag groups according to another embodiment of the present invention;

Fig. 5 is a block diagram showing equipment for mixing data based on tag groups according to another embodiment of the present invention; and

Fig. 6 is a block diagram showing an exemplary structure of a computer in which the present invention can be implemented.

Detailed description of the embodiments

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "comprising", when used in this specification, specifies the presence of the stated features, integers, steps, operations, units and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, units and/or components, and/or combinations thereof.

Embodiments of the invention are described below with reference to the accompanying drawings. It should be noted that, for the sake of clarity, representations and descriptions of components and processes that are unrelated to the invention or known to those of ordinary skill in the art have been omitted from the drawings and the description. Each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the computer or other programmable data processing apparatus, create means for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means that implement the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable data processing apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide processes for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

It should be understood that the flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment or portion of code that comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or may sometimes be executed in the reverse order, depending on the functionality involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.

A method of classifying a tag group according to an embodiment of the present invention is described below with reference to Fig. 1. Fig. 1 is a flowchart showing the method of classifying a tag group according to an embodiment of the present invention.

As shown in Fig. 1, the method starts at step 100. Then, in step 102, the synonymous tag set to which each tag of the tag group belongs is determined among a plurality of synonymous tag sets.

A synonymous tag set (S) is a set consisting of a group of tags that have the same or similar meaning (that is, synonyms). As an example, the following synonymous tag sets may exist:

S₁: author (author), creator (creator), writer (writer)

S₂: pubdate (announcement time), publishdate (issuing time)

S₃: URL (uniform resource locator), link (link)

S₄: summary (summary), description (general introduction)

S₅: event (incident), title (title), what (what)

S₆: starttime (start time), when (when)

S₇: where (where), location (place)

……

S_n: who (who), attendees (participator)

where n is an integer greater than or equal to 1.

The above synonymous tag sets are only examples; other synonymous tag sets may exist as required. Which tags express the same or similar meaning can be determined in advance from experience in actual use. In addition, newly discovered tags with the same or similar meaning can be added to the synonymous tag sets continually during use, so that the sets are updated dynamically. The synonymous tag sets may be provided in the form of, for example, a synonym dictionary. Those skilled in the art will understand that they may also be provided in other ways, for example in a database.
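As an illustration only, the example synonymous tag sets could be held in an in-memory synonym dictionary as sketched below. The names `SYNONYM_SETS`, `build_tag_index` and `add_synonym` are hypothetical, the tag `byline` is an invented example of a newly discovered tag, and lowercasing tags before lookup is an assumption not stated in the source.

```python
# Hypothetical synonym dictionary holding the example sets S1..S7 and S_n.
SYNONYM_SETS = {
    "S1": {"author", "creator", "writer"},
    "S2": {"pubdate", "publishdate"},
    "S3": {"url", "link"},
    "S4": {"summary", "description"},
    "S5": {"event", "title", "what"},
    "S6": {"starttime", "when"},
    "S7": {"where", "location"},
    "Sn": {"who", "attendees"},
}


def build_tag_index(synonym_sets):
    """Invert 'set id -> tags' into 'tag -> set id' for direct lookup."""
    index = {}
    for set_id, tags in synonym_sets.items():
        for tag in tags:
            index[tag.lower()] = set_id
    return index


def add_synonym(synonym_sets, index, set_id, tag):
    """Dynamic update: add a newly discovered tag to a (possibly new) set."""
    synonym_sets.setdefault(set_id, set()).add(tag.lower())
    index[tag.lower()] = set_id


TAG_INDEX = build_tag_index(SYNONYM_SETS)
add_synonym(SYNONYM_SETS, TAG_INDEX, "S1", "byline")  # invented example tag
```

Keeping the inverted index next to the sets makes the lookup in step 102 a single dictionary access per tag.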

A tag group (T) is a set of tags, each of which defines the corresponding data in one data entry. As an example, the following tag groups may exist:

T₁: title (title), author (author), pubdate (announcement time), summary (summary)

T₂: title (title), publishdate (issuing time), creator (founder), description (general introduction), URL (uniform resource locator)

T₃: title (title), link (link), writer (writer), description (general introduction)

T₄: title (title), link (link), writer (writer), description (general introduction)

T₅: event (incident), starttime (start time), endtime (concluding time), location (place), attendees (participator)

T₆: title (title), starttime (start time), duration (duration), where (where), attendees (participator)

……

T_p: what (what), where (where), who (who), when (when)

where p is an integer greater than or equal to 1.

The above tag groups are only examples; other tag groups may exist in actual use. For example, different data format standards (such as XML, JSON or CSV) may define different tag groups, and data publishers may also define their own tag groups according to their needs.

For a new tag group, the synonymous tag set to which each tag in the new tag group belongs can be determined according to the synonymous tag sets above. For example, for tag group T₁, the following can be determined in the order of the tags in T₁: the tag "title (title)" belongs to synonymous tag set S₅ (that is, the number of tags in T₁ belonging to S₅ is 1), the tag "author (author)" belongs to S₁ (the number of tags in T₁ belonging to S₁ is 1), the tag "pubdate (announcement time)" belongs to S₂ (the number of tags in T₁ belonging to S₂ is 1), and the tag "summary (summary)" belongs to S₄ (the number of tags in T₁ belonging to S₄ is 1). Alternatively, for tag group T₁, the following can be determined in the order of synonymous tag sets S₁ to S_n: the number of tags in T₁ belonging to S₁ is 1, to S₂ is 1, to S₃ is 0, to S₄ is 1, to S₅ is 1, to S₆ is 0, and to each of S₇ to S_n is 0. In the same way, it can be determined to which of the synonymous tag sets S₁ to S_n each tag in each of tag groups T₂ to T_p belongs.
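The determination in step 102 can be sketched as follows for tag group T₁. The names `TAG_TO_SET`, `assign_sets` and `count_per_set` are hypothetical. Returning `None` for a tag that belongs to no synonymous tag set (such as `endtime` in T₅) is an assumption, since the source does not describe that case.

```python
from collections import Counter

# Tag -> synonymous tag set id, taken from the example sets S1..S7 and S_n.
TAG_TO_SET = {
    "author": "S1", "creator": "S1", "writer": "S1",
    "pubdate": "S2", "publishdate": "S2",
    "url": "S3", "link": "S3",
    "summary": "S4", "description": "S4",
    "event": "S5", "title": "S5", "what": "S5",
    "starttime": "S6", "when": "S6",
    "where": "S7", "location": "S7",
    "who": "Sn", "attendees": "Sn",
}


def assign_sets(tag_group):
    """Tag order: pair each tag with its synonymous tag set id (or None)."""
    return [(tag, TAG_TO_SET.get(tag.lower())) for tag in tag_group]


def count_per_set(tag_group):
    """Set order: count how many tags of the group fall into each set."""
    return Counter(s for _, s in assign_sets(tag_group) if s is not None)


T1 = ["title", "author", "pubdate", "summary"]
print(assign_sets(T1))
print(count_per_set(T1)["S3"])  # no tag of T1 belongs to S3, so this is 0
```

`Counter` returns 0 for a set with no matching tag, which matches the zero counts listed for S₃, S₆ and S₇ to S_n.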

The method then proceeds to step 104. In step 104, a feature vector corresponding to the tag group is generated. In the generated feature vector, each element corresponds to a different synonymous tag set of the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element.

Based on the result determined in step 102 above, the feature vector corresponding to the tag group can be generated. For example, for tag group T₁, corresponding to the result determined in the order of the tags in T₁, feature vector A corresponding to T₁ can be generated: (S₅: 1, S₁: 1, S₂: 1, S₄: 1), where the part of each element before the colon denotes the synonymous tag set corresponding to the element, and the part after the colon denotes the number of tags in the tag group that belong to that synonymous tag set. For example, in the first element "S₅: 1" of feature vector A, "S₅" indicates that the element corresponds to synonymous tag set S₅, and "1" indicates that the number of tags in T₁ belonging to S₅ is 1. Alternatively, corresponding to the result determined in the order of synonymous tag sets S₁ to S_n, feature vector A′ corresponding to T₁ can be generated: (S₁: 1, S₂: 1, S₃: 0, S₄: 1, S₅: 1, S₆: 0, S₇: 0, ..., S_n: 0), where the parts of each element have the same meaning as in feature vector A and are not described again here. In the same way, the feature vectors corresponding to each of tag groups T₁ to T_p can be generated.
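Both forms of the feature vector can be sketched in a few lines. The names `TAG_TO_SET`, `SET_ORDER`, `sparse_feature_vector` and `dense_feature_vector` are hypothetical; the set list is truncated to the eight example sets, and tags outside every synonymous tag set are simply ignored, which is an assumption.

```python
TAG_TO_SET = {
    "author": "S1", "creator": "S1", "writer": "S1",
    "pubdate": "S2", "publishdate": "S2",
    "url": "S3", "link": "S3",
    "summary": "S4", "description": "S4",
    "event": "S5", "title": "S5", "what": "S5",
    "starttime": "S6", "when": "S6",
    "where": "S7", "location": "S7",
    "who": "Sn", "attendees": "Sn",
}
SET_ORDER = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "Sn"]


def sparse_feature_vector(tag_group):
    """Form A: 'set id: count' pairs in the order the tags appear."""
    vector = {}
    for tag in tag_group:
        set_id = TAG_TO_SET.get(tag.lower())
        if set_id is not None:
            vector[set_id] = vector.get(set_id, 0) + 1
    return vector


def dense_feature_vector(tag_group):
    """Form A': one count per synonymous tag set, in the order S1..S_n."""
    sparse = sparse_feature_vector(tag_group)
    return [sparse.get(set_id, 0) for set_id in SET_ORDER]


T1 = ["title", "author", "pubdate", "summary"]
A = sparse_feature_vector(T1)       # {'S5': 1, 'S1': 1, 'S2': 1, 'S4': 1}
A_prime = dense_feature_vector(T1)  # [1, 1, 0, 1, 1, 0, 0, 0]
```

The dense form has a fixed element order, so it is the convenient one for comparing vectors element by element.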

The method then proceeds to step 106. In step 106, the similarity between the feature vector and the core feature vector of each class in at least one class is calculated, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class.

A class is a set consisting of a group of tag groups that are the same as or similar to one another; that is, the tag groups belonging to the same class are mutually the same or similar. Whether tag groups are the same or similar can be judged, for example, according to the cosine similarity between them. The calculation of the cosine similarity between tag groups is described below.

Suppose that, according to step 104 above, feature vector A corresponding to tag group T₁ and feature vector B corresponding to tag group T₂ have been generated. Feature vector A can be expressed as (S₁: f_A1, S₂: f_A2, ..., S_n: f_An), abbreviated as (f_A1, f_A2, ..., f_An); feature vector B can be expressed as (S₁: f_B1, S₂: f_B2, ..., S_n: f_Bn), abbreviated as (f_B1, f_B2, ..., f_Bn). Here S_n denotes the synonymous tag set corresponding to the n-th element of feature vector A or B, f_An denotes the number of tags in tag group T₁ that belong to S_n, and f_Bn denotes the number of tags in tag group T₂ that belong to S_n. The cosine similarity between feature vector A corresponding to T₁ and feature vector B corresponding to T₂ can be calculated using formula (1):

$$\mathrm{Similarity}(A, B) = \frac{\sum_{k=1}^{n} f_{Ak}\, f_{Bk}}{\sqrt{\left(\sum_{k=1}^{n} f_{Ak}\, f_{Ak}\right)\left(\sum_{k=1}^{n} f_{Bk}\, f_{Bk}\right)}} \tag{1}$$

where 1 ≤ k ≤ n, and n is an integer greater than or equal to 1.
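Formula (1) translates directly into code. In the sketch below, the function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption, because the formula is undefined in that case. The two vectors are the dense feature vectors of the example tag groups T₁ and T₂ over the sets S₁ to S₇ and S_n.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors, per formula (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # Assumption: treat an all-zero vector as dissimilar to everything.
    return dot / norm if norm else 0.0


A = [1, 1, 0, 1, 1, 0, 0, 0]  # T1: author, pubdate, summary, title
B = [1, 1, 1, 1, 1, 0, 0, 0]  # T2: creator, publishdate, URL, description, title
similarity = cosine_similarity(A, B)  # 4 / sqrt(4 * 5)
```

T₁ and T₂ share four of the five synonymous tag sets that T₂ uses, so the value is close to but below 1.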

For a class formed by a group of tag groups, the core feature vector of the class can be obtained, for example, by summing the corresponding elements of the feature vectors of all tag groups in the class. For example, suppose tag groups T₁ to T_m (m is an integer greater than or equal to 1) have been classified into class C, and their feature vectors are A₁ to A_m respectively. The core feature vector A_C of class C can then be expressed by formula (2):

$$A_C = \left(\sum_{j=1}^{m} f_{A_j 1},\ \sum_{j=1}^{m} f_{A_j 2},\ \ldots,\ \sum_{j=1}^{m} f_{A_j n}\right) \tag{2}$$

where 1 ≤ j ≤ m, and m is an integer greater than or equal to 1.
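As a small worked instance of formula (2), the sketch below sums the dense feature vectors of the example tag groups T₁, T₂ and T₃, assuming for illustration that all three have been classified into one class. The function name `core_feature_vector` is hypothetical.

```python
def core_feature_vector(vectors):
    """Element-wise sum of the feature vectors of a class, per formula (2)."""
    return [sum(column) for column in zip(*vectors)]


# Dense feature vectors over (S1, S2, S3, S4, S5, S6, S7, S_n).
A1 = [1, 1, 0, 1, 1, 0, 0, 0]  # T1
A2 = [1, 1, 1, 1, 1, 0, 0, 0]  # T2
A3 = [1, 0, 1, 1, 1, 0, 0, 0]  # T3
A_C = core_feature_vector([A1, A2, A3])  # [3, 2, 2, 3, 3, 0, 0, 0]
```

Each element of `A_C` counts how often the corresponding synonymous tag set is used across the class, so sets shared by all members dominate the vector.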

After the core feature vector A_C of class C has been calculated according to formula (2), formula (1) can be used to calculate the similarity between the feature vector A_NE of a new tag group T_NE and the core feature vector A_C of class C. If there are a plurality of classes, the similarity between the feature vector A_NE of the new tag group T_NE and the core feature vector of each of the classes is calculated.

The method then proceeds to step 108. In step 108, the tag group is classified, based on the calculated similarity, into a close class among the at least one class.

The value of the cosine similarity, calculated according to formula (1), between the feature vector of a tag group and the core feature vector of a class represents the degree of similarity between the tag group and the class: the larger the value, the more similar they are. Therefore, whether a tag group is similar to a class can be judged from the calculated similarity, and the tag group can be classified into a close (that is, similar) class.

Finally, the method proceeds to step 110, where it ends.

The overall flow of the method of classifying a tag group according to an embodiment of the present invention has been described above. The detailed flow of the classifying step in that method is described below with reference to Fig. 2. Fig. 2 is a flowchart showing the detailed flow of the classifying step in the method of classifying a tag group according to an embodiment of the present invention.

As shown in Fig. 2, after the similarities between the feature vector of the tag group and the core feature vector of each of the classes have been calculated according to step 106 above, the method proceeds to step 200. In step 200, the calculated similarity between the tag group and each class in the at least one class is compared with a predetermined threshold. The predetermined threshold can be set in advance as required and can be adjusted in actual use. By adjusting the threshold, the precision with which tag groups are classified can be controlled.

Suppose there are currently three classes of tag groups, denoted C₁, C₂ and C₃, whose core feature vectors are A₁, A₂ and A₃ respectively. When a new tag group T_NE is found, its feature vector A_NE is determined, and the similarities between A_NE and the core feature vectors A₁, A₂ and A₃ are calculated. For example, when cosine similarity is used, the calculated similarity values might be 0.92, 0.85 and 0.79 respectively. These values are then each compared with the predetermined threshold.

The method then proceeds to step 202. In step 202, it is judged whether the calculated similarity between the tag group and any class in the at least one class exceeds the predetermined threshold. If the result of step 202 is "No", that is, the tag group is dissimilar to every class, the method proceeds to step 206. In step 206, the tag group is classified into a new class, so that the new class contains the tag group.

In the above example, suppose the predetermined threshold is 0.93. Because none of the three calculated similarity values 0.92, 0.85 and 0.79 exceeds the threshold 0.93, the new tag group T_NE is dissimilar to all of the current classes C₁, C₂ and C₃. In this case, a new class C₄ can be created and the new tag group T_NE classified into it, so that the new class C₄ contains the new tag group T_NE.

If the result of step 202 is "Yes", the method proceeds to step 204. In step 204, it is judged whether more than one class has a similarity greater than the predetermined threshold, that is, whether the similarities between the tag group and a plurality of classes are all greater than the predetermined threshold. If the result of step 204 is "No", meaning that the similarity between the tag group and only one class is greater than the predetermined threshold (the number of similarities greater than the threshold is 1), the method proceeds to step 210. In step 210, the tag group is classified into the class corresponding to the single calculated similarity that exceeds the predetermined threshold.

In the above example, suppose the predetermined threshold is 0.90. Because only the similarity value 0.92 among the three calculated values 0.92, 0.85 and 0.79 exceeds the threshold 0.90, the new tag group T_NE is classified into class C₁, which corresponds to the similarity value 0.92.

If the result of step 204 is "Yes", meaning that the similarities between the tag group and a plurality of classes are greater than the predetermined threshold (the number of similarities greater than the threshold is more than one), the method proceeds to step 208. In step 208, the maximum of the similarities greater than the predetermined threshold is selected, and the tag group is classified into the class corresponding to the selected maximum similarity.

In the above example, suppose the predetermined threshold is 0.80. Among the three calculated similarity values 0.92, 0.85 and 0.79, the values 0.92 and 0.85 both exceed the threshold 0.80, so the larger of the two, 0.92, is selected. The new tag group T_NE is then classified into class C₁, which corresponds to the maximum similarity value 0.92.

After step 206, 208 or 210, the method proceeds to step 212, where it ends.
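The decision flow of Fig. 2 can be condensed into a short sketch. The function name `choose_class` is hypothetical; it takes the similarities already calculated in step 106 and returns the index of the chosen class, or `None` when a new class should be created. Steps 208 and 210 are merged here, because picking the maximum of the similarities above the threshold also covers the case where only one exceeds it. Resolving exact ties in favour of the earlier class is an assumption.

```python
def choose_class(similarities, threshold):
    """Return the index of the closest class, or None if no class is close.

    similarities[i] is the similarity between the new tag group and class i.
    """
    # Steps 200/202: keep the classes whose similarity exceeds the threshold.
    close = [i for i, s in enumerate(similarities) if s > threshold]
    if not close:
        return None  # step 206: the caller creates a new class
    # Steps 208/210: the close class with the maximum similarity.
    return max(close, key=lambda i: similarities[i])


similarities = [0.92, 0.85, 0.79]  # against C1, C2, C3 in the example
for threshold in (0.93, 0.90, 0.80):
    print(threshold, choose_class(similarities, threshold))
```

With the thresholds of the three worked examples, the sketch yields a new class for 0.93 and class C₁ (index 0) for both 0.90 and 0.80.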

In the above, cosine similarity is used to calculate the similarity between tag groups and between a tag group and a class of tag groups. However, those skilled in the art will understand that other similarity calculation methods may also be used, as long as they can calculate the similarity between tag groups or between a tag group and a class of tag groups.

In the foregoing, the number of tag groups included in a class increases dynamically. Each time a tag group is classified into a class by the above classification method, the number of tag groups in that class increases by one. Preferably, after a new tag group is classified into a class, the core feature vector of that class is recalculated using formula (2) above from the new tag group and all tag groups previously included in the class, and the recalculated vector is used as the new core feature vector of the class. Thereafter, when another tag group is to be classified, its similarity is computed against the new core feature vector of the class. The method of this embodiment can therefore take the features of the various tag groups into account together, so that whether tag groups are identical or similar can be determined more accurately and more effectively.
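The recalculation can be sketched as follows, assuming (as stated in the claims) that each element of a core feature vector is the sum of the corresponding elements of the feature vectors of all tag groups in the class. The function names and the sample vectors are hypothetical.

```python
from typing import List


def core_feature_vector(feature_vectors: List[List[int]]) -> List[int]:
    """Element-wise sum of the feature vectors of all tag groups in a class."""
    return [sum(column) for column in zip(*feature_vectors)]


def add_to_class(core: List[int], new_vector: List[int]) -> List[int]:
    """Update a core feature vector after one more tag group joins the class."""
    return [c + v for c, v in zip(core, new_vector)]


members = [[1, 0, 2], [0, 1, 1]]          # hypothetical feature vectors
core = core_feature_vector(members)        # [1, 1, 3]
updated = add_to_class(core, [1, 0, 0])    # [2, 1, 3]
```

Since the core vector is a plain sum, adding the new tag group's vector to the existing core vector gives the same result as recomputing the sum over all members.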

Equipment for classifying tag groups according to another embodiment of the present invention is described below with reference to Fig. 3. Fig. 3 is a block diagram showing the equipment for classifying tag groups according to this embodiment.

As shown in Fig. 3, the equipment 312 for classifying tag groups mainly comprises a synonymous tag set determining unit 300, a feature vector generating unit 302, a similarity calculating unit 304 and a tag group classifying unit 306. The synonymous tag set determining unit 300 determines, from the plurality of synonymous tag sets stored in the synonymous tag set database 308, the synonymous tag set to which each tag of the input tag group belongs. The feature vector generating unit 302 generates a feature vector corresponding to the input tag group; in the generated feature vector, the elements correspond respectively to different synonymous tag sets among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element. The similarity calculating unit 304 calculates the similarity between the feature vector and the core feature vector of each class of the at least one class stored in the class set database 310, where the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of all tag groups classified into that class. The tag group classifying unit 306 classifies the input tag group, according to the calculated similarity, into a close class among the at least one class stored in the class set database 310.
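The work of the synonymous tag set determining unit and the feature vector generating unit can be sketched as follows. The synonymous tag sets, the tag names and the function names are hypothetical, and ignoring a tag that belongs to no synonymous tag set is an assumption of this example rather than something the description specifies.

```python
from typing import Dict, List


def build_tag_index(synonym_sets: List[List[str]]) -> Dict[str, int]:
    """Map each tag to the position of the synonymous tag set it belongs to."""
    index: Dict[str, int] = {}
    for position, synonym_set in enumerate(synonym_sets):
        for tag in synonym_set:
            index[tag] = position
    return index


def feature_vector(tags: List[str], synonym_sets: List[List[str]]) -> List[int]:
    """Count, per synonymous tag set, how many tags of the tag group belong to it."""
    index = build_tag_index(synonym_sets)
    vector = [0] * len(synonym_sets)
    for tag in tags:
        position = index.get(tag)
        if position is not None:  # assumption: unknown tags are ignored
            vector[position] += 1
    return vector


# Hypothetical synonymous tag sets and tag group.
synonym_sets = [["title", "name"], ["author", "creator"], ["date", "time"]]
vec = feature_vector(["name", "creator", "title"], synonym_sets)  # [2, 1, 0]
```

Here "name" and "title" fall into the same synonymous tag set, so the first element is 2, and no tag maps to the third set, so its element stays 0.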

The tag group classifying unit 306 includes a class determining unit 3062. The class determining unit 3062 determines whether each class of the at least one class is a close class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold. If no close class exists among the at least one class, the class determining unit 3062 classifies the tag group into a new class. If there are a plurality of close classes, the class determining unit 3062 classifies the tag group into the class corresponding to the largest calculated similarity.

Those skilled in the art will understand that the plurality of synonymous tag sets may also be provided in other ways, for example by a synonymous tag set dictionary, and that the classes may likewise be provided in other ways. The synonymous tag set database 308 and the class set database 310 are stored in a storage unit 314. The storage unit 314 is, for example, a magnetic disk, a flash memory, a removable memory or the like. The storage unit 314 may be included in the equipment 312 for classifying tag groups, or may be located outside the equipment 312 and attached to it in a wired or wireless manner.

Cosine similarity can be used to calculate the similarity between one tag group and another, and the similarity between a tag group and a class made up of tag groups. However, those skilled in the art will understand that other similarity calculation methods may also be used, as long as they can calculate the similarity between tag groups or the similarity between a tag group and a class made up of tag groups.

The equipment 312 for classifying tag groups corresponds to the method for classifying tag groups described above. A detailed description of it is therefore omitted here.

A method for mixing data based on tag groups is described below with reference to Fig. 4. Fig. 4 is a flowchart showing the method for mixing data based on tag groups.

As shown in Fig. 4, the method starts at step 400 and then proceeds to step 402. In step 402, the tag groups are classified into at least one class using the method for classifying tag groups described above. In this way, tag groups that conform to different data format standards, different user-defined tag groups and the like can be dynamically divided into different classes according to their similarity to one another, and the tag groups within each class are similar to one another.

The method then proceeds to step 404. In step 404, each tag of each tag group in the same class is replaced with a designated tag in the synonymous tag set to which it belongs. After the tag groups have been divided into different classes in step 402, each tag of each tag group in the same class can be replaced with a unified tag, so that the similar tag groups in the same class are unified into an identical tag group. The data previously described by each of the similar tag groups is then re-described with the resulting identical tag group, thereby mixing data that has similar content meaning.

The replacement of each tag of each tag group in the same class can be carried out in various ways. For example, each tag of each tag group in the same class can be replaced with a designated tag in the synonymous tag set to which it belongs, where the designated tag may be, for example, the first tag or the last tag in that synonymous tag set. Alternatively, the usage frequency of each synonymous tag in the synonymous tag set to which each tag belongs may be counted over all tag groups in the same class, and the synonymous tag with the highest usage frequency is used as the designated tag. Those skilled in the art will understand that other methods may also be used to replace each tag of each tag group in the same class, as long as the designated tags after replacement define the corresponding data in a unified manner.
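Both ways of choosing the designated tag can be sketched as follows. All names and the sample data are hypothetical, and the tie-breaking rule in the frequency-based variant (the earlier tag in the synonymous tag set wins) is an assumption of this example.

```python
from collections import Counter
from typing import Dict, List


def designated_by_position(synonym_sets: List[List[str]], last: bool = False) -> Dict[str, str]:
    """Designate the first (or last) tag of each synonymous tag set."""
    mapping: Dict[str, str] = {}
    for synonym_set in synonym_sets:
        target = synonym_set[-1] if last else synonym_set[0]
        for tag in synonym_set:
            mapping[tag] = target
    return mapping


def designated_by_frequency(class_tag_groups: List[List[str]],
                            synonym_sets: List[List[str]]) -> Dict[str, str]:
    """Designate the tag used most often across all tag groups of one class."""
    counts = Counter(tag for group in class_tag_groups for tag in group)
    mapping: Dict[str, str] = {}
    for synonym_set in synonym_sets:
        # max() keeps the first maximal item, so ties go to the earlier tag.
        target = max(synonym_set, key=lambda tag: counts[tag])
        for tag in synonym_set:
            mapping[tag] = target
    return mapping


def unify(class_tag_groups: List[List[str]], mapping: Dict[str, str]) -> List[List[str]]:
    """Replace every tag with its designated tag; unknown tags are kept as they are."""
    return [[mapping.get(tag, tag) for tag in group] for group in class_tag_groups]


synonym_sets = [["title", "name"], ["author", "creator"]]
groups = [["name", "author"], ["name", "creator"], ["title", "creator"]]
by_freq = unify(groups, designated_by_frequency(groups, synonym_sets))
by_first = unify(groups, designated_by_position(synonym_sets))
```

After either replacement, all three tag groups in the class carry the same pair of tags, so their data can be described uniformly.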

The method then proceeds to step 404. In step 404, the method ends.

Equipment for mixing data based on tag groups is described below with reference to Fig. 5. Fig. 5 is a block diagram showing the equipment for mixing data based on tag groups.

As shown in Fig. 5, the equipment 501 for mixing data based on tag groups mainly comprises a classifying unit 503 and a replacing unit 505. The classifying unit 503 uses the equipment for classifying tag groups described above to classify the tag groups in the input data into at least one class. The replacing unit 505 replaces each tag of each tag group in the same class with a designated tag in the synonymous tag set to which it belongs, so that the similar tag groups in the same class are unified into an identical tag group. The input data is then re-described with the resulting identical tag group, thereby mixing data that has similar content meaning.

The equipment 501 for mixing data based on tag groups corresponds to the method for mixing data based on tag groups described above. A detailed description of it is therefore omitted here.

Fig. 6 is a block diagram showing an exemplary configuration of a computer in which the equipment and methods of the present invention are implemented.

In Fig. 6, a central processing unit (CPU) 601 performs various processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as required, the data needed when the CPU 601 performs the various processing.

The CPU 601, the ROM 602 and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.

The following components are connected to the input/output interface 605: an input section 606, including a keyboard, a mouse and the like; an output section 607, including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and the like; the storage section 608, including a hard disk and the like; and a communication section 609, including a network interface card such as a LAN card, a modem and the like. The communication section 609 performs communication processing via a network such as the Internet.

A drive 610 is also connected to the input/output interface 605 as required. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 610 as required, so that a computer program read from it is installed into the storage section 608 as required.

When the above steps and processing are implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.

Those skilled in the art will understand that the storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the equipment in order to provide the program to the user. Examples of the removable medium 611 include magnetic disks, optical disks (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical disks (including MiniDisc (MD)) and semiconductor memories. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608 or the like, in which the program is stored and which is distributed to the user together with the equipment that contains it.

The present invention has been described in the foregoing specification with reference to specific embodiments. However, those of ordinary skill in the art will understand that various modifications and changes can be made without departing from the scope of the present invention as defined by the claims.

Claims

1. A method for classifying a tag group, wherein the tag group comprises at least one tag and corresponding data defined by the at least one tag, the method comprising:

determining, among a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs;

generating a feature vector corresponding to the tag group, wherein, in the generated feature vector, the elements correspond respectively to different synonymous tag sets among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to the element;

calculating a similarity between the feature vector and a core feature vector of each class of at least one class, wherein the value of each element of the core feature vector of the class is the sum of the values of the corresponding elements in the feature vectors of all tag groups classified into the class; and

classifying the tag group, based on the calculated similarity, into a close class among the at least one class.

2. The method according to claim 1, wherein the classifying step comprises:

determining whether each class of the at least one class is the close class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold; and

if no close class exists among the at least one class, classifying the tag group into a new class.

3. The method according to claim 2, wherein, if there are a plurality of close classes, the tag group is classified into the class corresponding to the largest calculated similarity.

4. The method according to any one of claims 1 to 3, wherein the similarity comprises cosine similarity.

5. Equipment for classifying a tag group, wherein the tag group comprises at least one tag and corresponding data defined by the at least one tag, the equipment comprising:

a synonymous tag set determining unit, configured to determine, among a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs;

a feature vector generating unit, configured to generate a feature vector corresponding to the tag group, wherein, in the generated feature vector, the elements correspond respectively to different synonymous tag sets among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to the element;

a similarity calculating unit, configured to calculate a similarity between the feature vector and a core feature vector of each class of at least one class, wherein the value of each element of the core feature vector of the class is the sum of the values of the corresponding elements in the feature vectors of all tag groups classified into the class; and

a tag group classifying unit, configured to classify the tag group, according to the calculated similarity, into a close class among the at least one class.

6. The equipment according to claim 5, wherein the tag group classifying unit comprises:

a class determining unit, configured to determine whether each class of the at least one class is the close class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold, and, if no close class exists among the at least one class, to classify the tag group into a new class.

7. The equipment according to claim 6, wherein the class determining unit is further configured to: if there are a plurality of close classes, classify the tag group into the class corresponding to the largest calculated similarity.

8. The equipment according to any one of claims 5 to 7, wherein the similarity comprises cosine similarity.

9. A method for mixing data based on tag groups, the method comprising:

classifying tag groups into at least one class using the method for classifying a tag group according to any one of claims 1 to 4; and

replacing each tag of each tag group in the same class with a designated tag in the synonymous tag set to which it belongs.

10. Equipment for mixing data based on tag groups, the equipment comprising:

a classifying unit, configured to classify tag groups into at least one class using the equipment for classifying a tag group according to any one of claims 5 to 8; and

a replacing unit, configured to replace each tag of each tag group in the same class with a designated tag in the synonymous tag set to which it belongs.