CN102750289B

CN102750289B - Method and apparatus for mixing data based on tag sets

Info

Publication number: CN102750289B
Application number: CN201110101514.2A
Authority: CN
Inventors: 张军; 钟朝亮; 王主龙; 大木宪二; 田中昌弘; 粂照宣; 松尾昭彦
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-04-19
Filing date: 2011-04-19
Publication date: 2015-08-05
Anticipated expiration: 2031-04-19
Also published as: JP2012226740A; CN102750289A; JP5928091B2

Abstract

A method and apparatus for mixing data based on tag sets are disclosed. The method comprises: determining, among a plurality of synonym tag sets, the synonym tag set to which each tag of a tag set belongs; generating a feature vector corresponding to the tag set, in which each element corresponds to a different one of the plurality of synonym tag sets and the value of each element is the number of tags in the tag set that belong to the corresponding synonym tag set; calculating the similarity between the feature vector and the core feature vector of each of at least one class, wherein the value of each element of a class's core feature vector is the sum of the values of the corresponding elements in the feature vectors of the tag sets classified into that class; classifying the tag set, according to the calculated similarity, into a close class among the at least one class; and replacing each tag of each tag set in the same class with the designated tag of the synonym tag set to which that tag belongs.

Description

Method and apparatus for mixing data based on tag sets

Technical field

The present invention relates to data processing, and more specifically to a method and apparatus for classifying tag sets and to a method and apparatus for mixing data.

Background art

At present, there are various data format standards for describing data, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation) and CSV (Comma-Separated Values). Each data format standard defines tags that describe the meaning of the data content. For example, for list-type data such as a news list containing several news items, a group of tags describing the news content can be defined: title, pubdate (publication time), author, and so on. As another example, for a schedule table containing several schedule entries, a group of tags describing the schedule content can be defined: starttime (start time), endtime (end time), attendees (participants), location, and so on. With such a group of tags, data content can be published or accessed conveniently.

However, for data content with the same or a similar meaning, different data format standards may use different tags. For example, for the data content "the person who created the data", different data format standards may use different tags such as "author", "writer" or "creator". There is therefore a need to identify data content of the same or similar meaning that is described with different tags, and to describe that content with unified tags, thereby completing the mixing of data content of the same or similar meaning.

In the prior art, whether multiple pieces of data content are the same or similar is judged by directly comparing the data content itself. Because the volume of the data content itself is large, direct comparison often requires a large amount of computation, and the accuracy of the judgment is poor.

The prior art also includes techniques that judge whether the data content described by two tags is the same or similar by comparing whether the two tags are the same or similar. In actual use, however, there are many different data format standards, and the tags they use vary widely. If only individual tags are compared with each other, it is difficult to take the various features of the various tags into account, so the accuracy of the judgment is again poor.

Moreover, as mentioned above, for a news list containing several news items, a group of tags (hereinafter a "tag set") describing the content of a news item can be defined: title, pubdate (publication time), author, and so on. A piece of data content is thus generally defined by a tag set comprising several tags that describe it. Therefore, to judge whether multiple pieces of data content have the same or a similar meaning, one should judge comprehensively whether the tag sets describing them are the same or similar. If only individual tags are compared, it is difficult to judge whether the data content described by tag sets comprising several tags has the same or a similar meaning.

Summary of the invention

In view of the above problems, the applicant has recognized that data content with the same or a similar meaning can be identified by comparing whether multiple tag sets are the same or similar. The core idea of the present invention is that, in order to compare whether tag sets are the same or similar, the same or similar tag sets are first grouped into one class, and a newly discovered tag set is then compared with the classes of already-classified tag sets. Because all the tag sets in one class are the same or similar, the class as a whole reflects the various features of its tag sets. Comparing a tag set with classes of tag sets therefore allows the sameness or similarity between tag sets to be judged more accurately.

According to one embodiment of the present invention, a method for classifying a tag set is provided, wherein the tag set comprises at least one tag and the corresponding data is defined by the at least one tag. The method comprises: determining, among a plurality of synonym tag sets, the synonym tag set to which each tag of the tag set belongs, wherein a synonym tag set is a set consisting of a group of tags with the same or a similar meaning; generating a feature vector corresponding to the tag set, in which each element corresponds to a different one of the plurality of synonym tag sets and the value of each element is the number of tags in the tag set that belong to the corresponding synonym tag set; calculating the similarity between the feature vector and the core feature vector of each of at least one class, wherein the value of each element of a class's core feature vector is the sum of the values of the corresponding elements in the feature vectors of the tag sets classified into that class; and classifying the tag set, according to the calculated similarity, into a close class among the at least one class.

The classifying step comprises: determining, for each of the at least one class, whether it is a close class according to whether the calculated similarity between the tag set and that class exceeds a predetermined threshold; and, if there is no close class among the at least one class, classifying the tag set into a new class.

In the classifying step, if there are multiple close classes, the tag set is classified into the class corresponding to the largest calculated similarity.

The similarity includes cosine similarity.

According to another embodiment of the present invention, an apparatus for classifying a tag set is provided, wherein the tag set comprises at least one tag and the corresponding data is defined by the at least one tag. The apparatus comprises: a synonym tag set determining unit for determining, among a plurality of synonym tag sets, the synonym tag set to which each tag of the tag set belongs, wherein a synonym tag set is a set consisting of a group of tags with the same or a similar meaning; a feature vector generation unit for generating a feature vector corresponding to the tag set, in which each element corresponds to a different one of the plurality of synonym tag sets and the value of each element is the number of tags in the tag set that belong to the corresponding synonym tag set; a similarity calculation unit for calculating the similarity between the feature vector and the core feature vector of each of at least one class, wherein the value of each element of a class's core feature vector is the sum of the values of the corresponding elements in the feature vectors of the tag sets classified into that class; and a tag set classification unit for classifying the tag set, according to the calculated similarity, into a close class among the at least one class.

The tag set classification unit comprises a class determining unit for determining, for each of the at least one class, whether it is a close class according to whether the calculated similarity between the tag set and that class exceeds a predetermined threshold; if there is no close class among the at least one class, the tag set is classified into a new class.

The class determining unit is further configured to classify the tag set, if there are multiple close classes, into the class corresponding to the largest calculated similarity.

The similarity includes cosine similarity.

According to another embodiment of the present invention, a method for mixing data based on tag sets is provided. The method comprises: classifying tag sets into at least one class using the above method for classifying a tag set; and replacing each tag of each tag set in the same class with the designated tag of the synonym tag set to which that tag belongs.

According to another embodiment of the present invention, an apparatus for mixing data based on tag sets is provided. The apparatus comprises: a classification unit for classifying tag sets into at least one class using the above apparatus for classifying a tag set; and a replacement unit for replacing each tag of each tag set in the same class with the designated tag of the synonym tag set to which that tag belongs.
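The replacement step can be illustrated with a minimal Python sketch. It is not taken from the patent: the mapping `DESIGNATED`, the choice of which member of each synonym set is the "designated" tag, and the function name `unify_record` are all assumptions made for illustration, and the records are assumed to have already been classified into the same class.

```python
# Hypothetical table: designated tag -> members of its synonym set.
# Which member is "designated" is an assumption; the text does not say.
DESIGNATED = {
    "author": ("author", "creator", "writer"),
    "pubdate": ("pubdate", "publishdate"),
    "URL": ("url", "link"),
    "summary": ("summary", "description"),
    "title": ("event", "title", "what"),
}

# Reverse index: lower-cased tag -> designated tag of its synonym set.
TAG_TO_DESIGNATED = {
    member: designated
    for designated, members in DESIGNATED.items()
    for member in members
}


def unify_record(record):
    """Rename every known tag (key) to the designated tag of its synonym set.

    Unknown tags are kept unchanged. If two keys of one record map to the
    same designated tag, the later one overwrites the earlier one; a real
    system would need an explicit policy for that case.
    """
    return {
        TAG_TO_DESIGNATED.get(tag.lower(), tag): value
        for tag, value in record.items()
    }


# Two news records assumed to belong to the same class.
record_a = {"title": "Hello", "author": "Ann", "pubdate": "2011-04-19",
            "summary": "First item"}
record_b = {"title": "World", "link": "http://example.com", "writer": "Bob",
            "description": "Second item"}

unified_a = unify_record(record_a)
unified_b = unify_record(record_b)
```

After this step both records use the same tag names for fields of the same meaning, so they can be merged into a single list.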

By comparing the similarity between the feature vector of a tag set and the core feature vector of a class of tag sets, the present invention can judge more accurately and more efficiently whether tag sets are the same or similar, and can therefore mix the same or similar data more accurately and more efficiently.

Brief description of the drawings

The above and other objects, features and advantages of the present invention will be understood more easily from the following description of embodiments with reference to the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals.

Fig. 1 is a flowchart illustrating a method for classifying a tag set according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating the detailed flow of the classifying step in the method for classifying a tag set according to an embodiment of the present invention;

Fig. 3 is a block diagram illustrating an apparatus for classifying a tag set according to another embodiment of the present invention;

Fig. 4 is a flowchart illustrating a method for mixing data based on tag sets according to another embodiment of the present invention;

Fig. 5 is a block diagram illustrating an apparatus for mixing data based on tag sets according to another embodiment of the present invention;

Fig. 6 is a block diagram illustrating an example structure of a computer in which the present invention can be implemented.

Detailed description of the embodiments

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will further be understood that the word "comprise", when used in this specification, specifies the presence of the stated features, integers, steps, operations, units and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, units and/or components, and/or combinations thereof.

Embodiments of the present invention are described below with reference to the accompanying drawings. Note that, for clarity, representations and descriptions of components and processes that are unrelated to the invention or known to those of ordinary skill in the art are omitted from the drawings and the description. Each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the computer or other programmable data processing apparatus, create means for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means that implement the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable data processing apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide processes for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

It should be understood that the flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code that comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

A method for classifying a tag set according to an embodiment of the present invention is described below with reference to Fig. 1. Fig. 1 is a flowchart illustrating the method.

As shown in Fig. 1, the method starts at step 100. Then, in step 102, the synonym tag set to which each tag of the tag set belongs is determined among a plurality of synonym tag sets.

A synonym tag set (S) is a set consisting of a group of tags with the same or a similar meaning (that is, synonyms). As an example, the following synonym tag sets may exist:

S₁: author, creator, writer

S₂: pubdate, publishdate

S₃: URL, link

S₄: summary, description

S₅: event, title, what

S₆: starttime, when

S₇: where, location

S_n: who, attendees

where n is an integer greater than or equal to 1.

The above synonym tag sets are only examples; other synonym tag sets may exist as required. Which tags represent the same or a similar meaning can be predefined based on practical experience. In addition, newly discovered tags with the same or a similar meaning can be added to the synonym tag sets during use, so that the sets are updated dynamically. The synonym tag sets can be provided, for example, in the form of a synonym dictionary. Those skilled in the art will understand that they can also be provided in other ways, such as in a database.

A tag set (T) is a set consisting of a group of tags, each of which defines the corresponding data in a data entry. As an example, the following tag sets may exist:

T₁: title, author, pubdate, summary

T₂: title, publishdate, creator, description, URL

T₃: title, link, writer, description

T₄: title, link, writer, description

T₅: event, starttime, endtime, location, attendees

T₆: title, starttime, duration, where, attendees

T_p: what, where, who, when

where p is an integer greater than or equal to 1.

The above tag sets are only examples; other tag sets may exist in actual use. For example, different data format standards (such as XML, JSON or CSV) may define different tag sets, and data publishers may also define their own tag sets according to their needs.

For a new tag set, the synonym tag set to which each tag in the new tag set belongs can be determined from the above synonym tag sets. For example, for tag set T₁, the following can be determined in the order of the tags in T₁: the tag "title" belongs to synonym tag set S₅ (so the number of tags in T₁ belonging to S₅ is 1), the tag "author" belongs to S₁ (so the number of tags in T₁ belonging to S₁ is 1), the tag "pubdate" belongs to S₂ (so the number of tags in T₁ belonging to S₂ is 1), and the tag "summary" belongs to S₄ (so the number of tags in T₁ belonging to S₄ is 1). Alternatively, for tag set T₁, the following can be determined in the order of synonym tag sets S₁ to S_n: the number of tags in T₁ belonging to S₁ is 1, to S₂ is 1, to S₃ is 0, to S₄ is 1, to S₅ is 1, to S₆ is 0, and to each of S₇ through S_n is 0. In the same way, it can be determined to which of synonym tag sets S₁ to S_n each tag in each of tag sets T₂ to T_p belongs.
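The determination in step 102 can be sketched in Python as a dictionary lookup. This is only an illustration under stated assumptions: the dictionary `SYNONYM_TAG_SETS` is built from the example sets above (with the last set labelled "Sn"), tags are compared case-insensitively, and the helper name `count_by_synonym_set` is hypothetical. How tags that appear in no synonym tag set (such as "endtime") should be treated is not specified in the text, so the sketch simply reports them.

```python
from collections import Counter

# Example synonym tag sets from the text; lower-cased for matching.
SYNONYM_TAG_SETS = {
    "S1": {"author", "creator", "writer"},
    "S2": {"pubdate", "publishdate"},
    "S3": {"url", "link"},
    "S4": {"summary", "description"},
    "S5": {"event", "title", "what"},
    "S6": {"starttime", "when"},
    "S7": {"where", "location"},
    "Sn": {"who", "attendees"},
}

# Reverse index: tag -> name of the synonym tag set it belongs to.
TAG_TO_SET = {
    tag: name for name, tags in SYNONYM_TAG_SETS.items() for tag in tags
}


def count_by_synonym_set(tag_set):
    """Return (counts per synonym tag set, tags found in no synonym set)."""
    counts = Counter()
    unknown = []
    for tag in tag_set:
        name = TAG_TO_SET.get(tag.lower())
        if name is None:
            unknown.append(tag)
        else:
            counts[name] += 1
    return counts, unknown


T1 = ["title", "author", "pubdate", "summary"]
T5 = ["event", "starttime", "endtime", "location", "attendees"]

counts_t1, unknown_t1 = count_by_synonym_set(T1)
counts_t5, unknown_t5 = count_by_synonym_set(T5)
```

For T₁ every tag is found, giving a count of 1 for S₁, S₂, S₄ and S₅, which matches the counts listed in the text.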

Then, the method proceeds to step 104. In step 104, a feature vector corresponding to the tag set is generated. Each element of the generated feature vector corresponds to a different one of the plurality of synonym tag sets, and the value of each element is the number of tags in the tag set that belong to the corresponding synonym tag set.

From the determination result of step 102, the feature vector corresponding to the tag set can be generated. For example, for tag set T₁, based on the result determined in the order of the tags in T₁, a feature vector A corresponding to T₁ can be generated: (S₅: 1, S₁: 1, S₂: 1, S₄: 1), where the part of each element before the colon denotes the synonym tag set corresponding to that element and the part after the colon denotes the number of tags in T₁ that belong to that synonym tag set. For example, in the first element "S₅: 1" of feature vector A, "S₅" indicates that the element corresponds to synonym tag set S₅, and "1" indicates that the number of tags in T₁ belonging to S₅ is 1. Alternatively, for tag set T₁, based on the result determined in the order of synonym tag sets S₁ to S_n, a feature vector A′ corresponding to T₁ can be generated: (S₁: 1, S₂: 1, S₃: 0, S₄: 1, S₅: 1, S₆: 0, S₇: 0, ..., S_n: 0), where the parts of each element have the same meaning as in feature vector A and are not described again. In the same way, the feature vectors corresponding to each of tag sets T₁ to T_p can be generated.
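Both forms of the feature vector can be sketched in Python as follows. The code is an illustration, not the patent's implementation: `ORDERED_SETS` holds the example synonym tag sets in a fixed order (the last one labelled "Sn"), the function names are hypothetical, and tags that belong to no synonym tag set are ignored, which is an assumption.

```python
# Example synonym tag sets in a fixed order; lower-cased for matching.
ORDERED_SETS = [
    ("S1", {"author", "creator", "writer"}),
    ("S2", {"pubdate", "publishdate"}),
    ("S3", {"url", "link"}),
    ("S4", {"summary", "description"}),
    ("S5", {"event", "title", "what"}),
    ("S6", {"starttime", "when"}),
    ("S7", {"where", "location"}),
    ("Sn", {"who", "attendees"}),
]


def find_set_index(tag):
    """Return the position of the synonym tag set containing tag, or None."""
    for index, (_, members) in enumerate(ORDERED_SETS):
        if tag.lower() in members:
            return index
    return None


def dense_feature_vector(tag_set):
    """Vector in synonym-set order (form A'): one count per synonym set."""
    vector = [0] * len(ORDERED_SETS)
    for tag in tag_set:
        index = find_set_index(tag)
        if index is not None:  # assumption: unknown tags are skipped
            vector[index] += 1
    return vector


def sparse_feature_vector(tag_set):
    """Vector in tag order (form A): (synonym set, count) pairs."""
    counts = {}  # dicts keep insertion order, so first-seen order is kept
    for tag in tag_set:
        index = find_set_index(tag)
        if index is not None:
            name = ORDERED_SETS[index][0]
            counts[name] = counts.get(name, 0) + 1
    return list(counts.items())


T1 = ["title", "author", "pubdate", "summary"]
T2 = ["title", "publishdate", "creator", "description", "URL"]

A = sparse_feature_vector(T1)
A_prime = dense_feature_vector(T1)
B_prime = dense_feature_vector(T2)
```

The dense form is the one that lends itself to the similarity calculation, because every vector then has the same length and element order.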

Then, the method proceeds to step 106. In step 106, the similarity between the feature vector and the core feature vector of each of at least one class is calculated, wherein the value of each element of a class's core feature vector is the sum of the values of the corresponding elements in the feature vectors of the tag sets classified into that class.

A class is a set consisting of a group of tag sets that are the same as or similar to one another; that is, the tag sets belonging to one class are mutually the same or similar. Whether tag sets are the same or similar can be judged, for example, from the cosine similarity between them. The process of calculating the cosine similarity between tag sets is described below.

Suppose that, according to step 104, a feature vector A corresponding to tag set T₁ and a feature vector B corresponding to tag set T₂ have been generated. Feature vector A can be expressed as (S₁: f_a1, S₂: f_a2, ..., S_n: f_an), abbreviated as (f_a1, f_a2, ..., f_an); feature vector B can be expressed as (S₁: f_b1, S₂: f_b2, ..., S_n: f_bn), abbreviated as (f_b1, f_b2, ..., f_bn). Here S_n denotes the synonym tag set corresponding to the n-th element of feature vector A or B, f_an denotes the number of tags in T₁ that belong to the synonym tag set S_n corresponding to the n-th element of A, and f_bn denotes the number of tags in T₂ that belong to the synonym tag set S_n corresponding to the n-th element of B. The cosine similarity between feature vector A (corresponding to T₁) and feature vector B (corresponding to T₂) can be calculated with the following formula (1):

$$\mathrm{Similarity}(A, B) = \frac{\sum_{k=1}^{n} f_{ak}\, f_{bk}}{\sqrt{\left(\sum_{k=1}^{n} f_{ak}\, f_{ak}\right)\left(\sum_{k=1}^{n} f_{bk}\, f_{bk}\right)}} \tag{1}$$

where 1 ≤ k ≤ n and n is an integer greater than or equal to 1.
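Formula (1) translates directly into a few lines of Python. The sketch below assumes dense feature vectors of equal length; the function name `cosine_similarity` is illustrative, and returning 0.0 when one vector is all zeros is an assumption, since the text does not cover that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length count vectors, as in formula (1)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / norm


# Dense feature vectors of the example tag sets T1 and T2 (order S1..Sn).
A = [1, 1, 0, 1, 1, 0, 0, 0]
B = [1, 1, 1, 1, 1, 0, 0, 0]

similarity_ab = cosine_similarity(A, B)  # 4 / sqrt(4 * 5)
```

With these two example vectors the numerator is 4 and the denominator is the square root of 4 × 5, so the similarity is roughly 0.89.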

For a class consisting of a group of tag sets, the core feature vector corresponding to the class can be obtained, for example, by accumulating the corresponding elements of the feature vectors of the tag sets in the class. For example, suppose that class C contains tag sets T₁ to T_m (m is an integer greater than or equal to 1) that have been classified into it, and that the feature vectors corresponding to T₁ to T_m are A₁ to A_m respectively. The core feature vector A_c corresponding to class C can then be expressed by the following formula (2):

$$A_c = \left(\sum_{j=1}^{m} f_{a_j 1},\ \sum_{j=1}^{m} f_{a_j 2},\ \ldots,\ \sum_{j=1}^{m} f_{a_j n}\right) \tag{2}$$

where 1 ≤ j ≤ m and m is an integer greater than or equal to 1.
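Formula (2) is an element-wise sum over the feature vectors of a class. A minimal Python sketch, assuming dense vectors of equal length and using the hypothetical function name `core_feature_vector`:

```python
def core_feature_vector(feature_vectors):
    """Element-wise sum of the feature vectors in one class, as in formula (2)."""
    if not feature_vectors:
        raise ValueError("a class must contain at least one tag set")
    # zip(*...) groups the k-th elements of all vectors together.
    return [sum(column) for column in zip(*feature_vectors)]


# Dense feature vectors of example tag sets T1, T2 and T3 (order S1..Sn),
# assumed here to have been classified into the same class C.
A1 = [1, 1, 0, 1, 1, 0, 0, 0]
A2 = [1, 1, 1, 1, 1, 0, 0, 0]
A3 = [1, 0, 1, 1, 1, 0, 0, 0]

A_c = core_feature_vector([A1, A2, A3])
```

Because the vector is a sum rather than an average, elements that occur in many member tag sets dominate its direction, which is what the cosine comparison then picks up.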

After the core feature vector A_c corresponding to class C has been calculated according to formula (2), formula (1) can be used to calculate the similarity between the feature vector A_NE corresponding to a new tag set T_NE and the core feature vector A_c corresponding to class C. If there are multiple classes, the similarity between the feature vector A_NE of the new tag set T_NE and the core feature vector of each of the classes is calculated separately.

Then, the method proceeds to step 108. In step 108, the tag set is classified, according to the calculated similarity, into a close class among the at least one class.

The value of the cosine similarity calculated by formula (1) between the feature vector of a tag set and the core feature vector of a class indicates how similar the tag set and the class are: the larger the cosine similarity, the more similar they are. Whether a tag set is similar to a class can therefore be judged from the calculated similarity, and the tag set can be classified into a close (that is, similar) class.

Finally, the method proceeds to step 110, where it ends.

The overall flow of the method for classifying a tag set according to an embodiment of the present invention has been described above. The detailed flow of the classifying step in that method is described below with reference to Fig. 2. Fig. 2 is a flowchart illustrating the detailed flow of the classifying step.

As shown in Fig. 2, after the similarity between the feature vector of the tag set and the core feature vector of each of the classes has been calculated according to step 106, the method proceeds to step 200. In step 200, the calculated similarity between the tag set and each of the at least one class is compared with a predetermined threshold. The threshold can be preset as required and adjusted as required in actual use. Adjusting the threshold controls the precision of tag set classification.

Suppose that three classes consisting of tag sets currently exist, denoted C₁, C₂ and C₃, with core feature vectors A₁, A₂ and A₃ respectively. When a new tag set T_NE is discovered, its feature vector A_NE is determined, and the similarities between A_NE and the core feature vectors A₁, A₂ and A₃ are calculated. For example, when cosine similarity is used, the calculated similarities might be 0.92, 0.85 and 0.79 respectively. These values are then each compared with the predetermined threshold.

Then, the method proceeds to step 202. In step 202, it is judged whether the calculated similarity between the tag set and any of the at least one class exceeds the predetermined threshold. If the result of step 202 is "No", that is, the tag set is dissimilar to all classes, the method proceeds to step 206. In step 206, the tag set is classified into a new class, so that the new class contains this tag set.

In the above example, suppose the predetermined threshold is 0.93. Because none of the three calculated similarities 0.92, 0.85 and 0.79 exceeds 0.93, the new tag set T_NE is dissimilar to all of the current classes C₁, C₂ and C₃. A new class C₄ can then be created, and the new tag set T_NE is classified into it, so that C₄ contains T_NE.

If the result of step 202 is "Yes", the method proceeds to step 204. In step 204, it is judged whether more than one class has a similarity greater than the predetermined threshold, that is, whether the similarities between the tag set and multiple classes all exceed the threshold. If the result of step 204 is "No", meaning that the similarity between the tag set and exactly one class exceeds the threshold, the method proceeds to step 210. In step 210, the tag set is classified into the one class whose calculated similarity exceeds the threshold.

In the above example, suppose the predetermined threshold is 0.90. Of the three calculated similarities 0.92, 0.85 and 0.79, only 0.92 exceeds 0.90, so the new tag set T_NE is classified into class C₁, which corresponds to the similarity 0.92.

If the result of step 204 is "Yes", meaning that the similarities between the tag set and multiple classes exceed the threshold, the method proceeds to step 208. In step 208, the largest of the similarities exceeding the threshold is selected, and the tag set is classified into the class corresponding to that largest similarity.

In the above example, suppose the predetermined threshold is 0.80. Of the three calculated similarities 0.92, 0.85 and 0.79, both 0.92 and 0.85 exceed 0.80, so the larger of the two, 0.92, is selected. The new tag set T_NE is then classified into class C₁, which corresponds to the largest similarity 0.92.

After step 206, 208 or 210, the method proceeds to step 212, where it ends.
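The decision logic of steps 200 to 210 can be condensed into a short Python sketch. The function name `choose_class` and the use of `None` to signal "create a new class" are assumptions for illustration; "exceeds" is read as strictly greater than the threshold, and ties between equally similar classes are resolved in favour of the first one, which the text does not specify.

```python
def choose_class(similarities, threshold):
    """Pick the class for a tag set from its per-class similarities.

    similarities: dict mapping class name -> similarity to that class.
    Returns the chosen class name, or None if a new class is needed.
    """
    # Steps 200/202: keep only classes whose similarity exceeds the threshold.
    close = {name: s for name, s in similarities.items() if s > threshold}
    if not close:
        return None  # step 206: no close class, so a new class is created
    # Steps 204/208/210: one close class, or the most similar of several.
    return max(close, key=close.get)


# The similarities used in the worked example of the text.
similarities = {"C1": 0.92, "C2": 0.85, "C3": 0.79}

result_093 = choose_class(similarities, 0.93)
result_090 = choose_class(similarities, 0.90)
result_080 = choose_class(similarities, 0.80)
```

The three calls reproduce the three cases of the worked example: a new class for threshold 0.93, and class C₁ for thresholds 0.90 and 0.80.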

In the above description, cosine similarity is used to calculate the similarity between tag sets and between a tag set and a class consisting of tag sets. However, those skilled in the art will understand that other similarity measures can also be used, as long as they can calculate the similarity between tag sets or between a tag set and a class consisting of tag sets.

In the above description, the number of tag sets contained in a class grows dynamically. Each time a tag set is classified into a class by the above method, the number of tag sets in that class increases by one. Preferably, after a new tag set has been classified into a class, the core feature vector of that class is recalculated with formula (2) from the new tag set and all tag sets previously contained in the class, and the recalculated vector is used as the new core feature vector of the class. When another tag set is classified later, its similarity is compared against this new core feature vector. The method of this embodiment can therefore take the various features of various tag sets into account, and judge more accurately and more efficiently whether tag sets are the same or similar.
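The following Python sketch puts the pieces together as an incremental classifier. It is an illustration under assumptions rather than the patent's implementation: the class name `TagSetClassifier` and its attributes are hypothetical, input vectors are assumed to be dense and of equal length, and the core feature vector is updated by adding the new vector element-wise, which gives the same result as recomputing the sum of formula (2) over all members.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors, as in formula (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TagSetClassifier:
    def __init__(self, threshold):
        self.threshold = threshold
        self.cores = []  # one core feature vector per class

    def classify(self, vector):
        """Assign vector to a class index and update that class's core."""
        best_index, best_similarity = None, self.threshold
        for index, core in enumerate(self.cores):
            similarity = cosine(vector, core)
            if similarity > best_similarity:  # must exceed the threshold
                best_index, best_similarity = index, similarity
        if best_index is None:
            # No close class: open a new class containing only this vector.
            self.cores.append(list(vector))
            return len(self.cores) - 1
        # Adding the new vector equals recomputing the sum of formula (2).
        self.cores[best_index] = [
            c + v for c, v in zip(self.cores[best_index], vector)
        ]
        return best_index


# Dense feature vectors (order S1..Sn) of example tag sets T1, T2, T5, T6.
vectors = [
    [1, 1, 0, 1, 1, 0, 0, 0],  # T1
    [1, 1, 1, 1, 1, 0, 0, 0],  # T2
    [0, 0, 0, 0, 1, 1, 1, 1],  # T5 (endtime is in no synonym set)
    [0, 0, 0, 0, 1, 1, 1, 1],  # T6 (duration is in no synonym set)
]

classifier = TagSetClassifier(threshold=0.8)
labels = [classifier.classify(v) for v in vectors]
```

With a threshold of 0.8 the two news-like tag sets end up in one class and the two schedule-like tag sets in another. The threshold value is arbitrary here; the text leaves it to be tuned in actual use.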

An apparatus for classifying tag groups according to another embodiment of the present invention is described below with reference to Fig. 3. Fig. 3 is a block diagram illustrating the apparatus for classifying tag groups according to another embodiment of the present invention.

As shown in Fig. 3, the apparatus 312 for classifying tag groups mainly comprises a synonym tag set determining unit 300, a feature vector generation unit 302, a similarity calculation unit 304 and a tag group classification unit 306. The synonym tag set determining unit 300 determines the synonym tag set to which each tag of an input tag group belongs, according to the multiple synonym tag sets stored in the synonym tag set database 308. The feature vector generation unit 302 generates a feature vector corresponding to the input tag group; each element of the generated feature vector corresponds to a different synonym tag set among the multiple synonym tag sets, and the value of each element is the number of tags in the tag group that belong to the synonym tag set corresponding to that element. The similarity calculation unit 304 calculates the similarity between the feature vector and the core feature vector of each of the at least one class stored in the class set database 310, where the value of each element of a class's core feature vector is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class. The tag group classification unit 306 classifies the input tag group, according to the calculated similarity, into a close class among the at least one class stored in the class set database 310.
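The feature vector generation described here can be illustrated with a minimal sketch. The synonym tag sets below are invented for the example (only tags such as title, pubdate, author and location appear in this document), and the choices to count each tag toward at most one synonym tag set and to ignore unknown tags are assumptions.

```python
from typing import List, Sequence, Set

def build_feature_vector(tag_group: Sequence[str], synonym_sets: List[Set[str]]) -> List[int]:
    """Count, per synonym tag set, how many tags of the tag group belong to it."""
    vector = [0] * len(synonym_sets)
    for tag in tag_group:
        for index, synonyms in enumerate(synonym_sets):
            if tag in synonyms:
                vector[index] += 1
                break  # assumption: each tag belongs to at most one synonym tag set
    return vector

# Hypothetical synonym tag sets; element i of the vector corresponds to set i.
SYNONYM_SETS = [
    {"title", "headline", "subject"},
    {"pubdate", "published", "date"},
    {"author", "creator"},
    {"location", "place"},
]

news_vector = build_feature_vector(["headline", "published", "creator"], SYNONYM_SETS)
print(news_vector)  # [1, 1, 1, 0]
```

Tags that match no synonym tag set are skipped in this sketch; a real system might instead extend the synonym tag set database.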

The tag group classification unit 306 comprises a class determining unit 3062. The class determining unit 3062 determines whether each of the at least one class is a close class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold. If there is no close class among the at least one class, the class determining unit 3062 classifies the tag group into a new class. If there are multiple close classes, the class determining unit 3062 classifies the tag group into the class corresponding to the maximum calculated similarity.

Those skilled in the art will understand that the above multiple synonym tag sets may also be provided in other ways, such as a synonym tag set dictionary, and that the above classes may also be provided in other ways. The synonym tag set database 308 and the class set database 310 are stored in a storage unit 314. The storage unit 314 is, for example, a magnetic disk, a flash memory or a removable memory. The storage unit 314 may be included in the apparatus 312 for classifying tag groups, or may be located outside the apparatus 312 and attached to it in a wired or wireless manner.

Cosine similarity can be used to calculate the similarity between two tag groups and the similarity between a tag group and a class formed of tag groups. However, those skilled in the art will understand that other similarity calculation methods may also be adopted, as long as they can calculate the similarity between two tag groups or between a tag group and a class formed of tag groups.

The above apparatus 312 for classifying tag groups is in effect the apparatus corresponding to the above method of classifying tag groups. A detailed description of it is therefore omitted here.

A method for mixing data based on tag groups is described below with reference to Fig. 4. Fig. 4 is a flowchart illustrating the method for mixing data based on tag groups.

As shown in Fig. 4, the method starts at step 400 and then proceeds to step 402. In step 402, tag groups are classified into at least one class using the above method of classifying tag groups. With that method, tag groups conforming to different data format standards, user-defined tag groups and the like can be dynamically divided into different classes according to their similarity to one another, and the tag groups within each class are similar to one another.
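The dynamic division into classes in step 402 can be sketched as a single pass over the feature vectors of the tag groups. This is a simplified illustration under stated assumptions: cosine similarity is used, a class's core feature vector is the element-wise sum of its members, and the names `cosine` and `classify_all`, the threshold 0.8 and the vectors are all hypothetical.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_all(feature_vectors: List[List[int]], threshold: float) -> List[int]:
    """Assign each feature vector to a class index, creating classes as needed."""
    cores: List[List[int]] = []  # core feature vector of each class
    labels: List[int] = []
    for vector in feature_vectors:
        sims = [cosine(vector, core) for core in cores]
        # Index of the most similar existing class, or None if there are no classes yet.
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] > threshold:
            # Close class found: add the vector to that class's core feature vector.
            cores[best] = [c + x for c, x in zip(cores[best], vector)]
        else:
            # No close class: open a new class whose core is this vector.
            cores.append(list(vector))
            best = len(cores) - 1
        labels.append(best)
    return labels

vectors = [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 1]]
labels = classify_all(vectors, threshold=0.8)
print(labels)  # [0, 0, 1, 1]
```

The result depends on the order in which tag groups arrive, because each core feature vector changes as members are added.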

Then, the method proceeds to step 404. In step 404, each tag of each tag group in the same class is replaced with the designated tag in the synonym tag set to which it belongs. After the tag groups have been divided into different classes in step 402, the tags of the tag groups in the same class can each be replaced with a unified tag, so that the similar tag groups in the same class are unified into an identical tag group. The data previously described by the similar tag groups is then re-described with the resulting identical tag group, which achieves the mixing of data with similar content meaning.
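As an illustration of step 404, the sketch below renames the tags of records from two sources to designated tags so that the records can be merged into one list. The mapping `DESIGNATED`, the function `unify_record` and the sample records are hypothetical, and representing a data item as a Python dictionary keyed by tag is an assumption.

```python
from typing import Any, Dict, List

def unify_record(record: Dict[str, Any], designated: Dict[str, str]) -> Dict[str, Any]:
    """Rename each tag (key) of a record to its designated tag; unmapped tags are kept."""
    # Note: if two tags of one record map to the same designated tag, the later value wins.
    return {designated.get(tag, tag): value for tag, value in record.items()}

# Hypothetical mapping from each tag to the designated tag of its synonym tag set.
DESIGNATED = {
    "title": "title", "headline": "title",
    "pubdate": "pubdate", "published": "pubdate",
    "author": "author", "creator": "author",
}

# Two sources in the same class that describe similar content with different tags.
feed_a = [{"title": "News A", "pubdate": "2011-04-19", "author": "X"}]
feed_b = [{"headline": "News B", "published": "2011-04-20", "creator": "Y"}]

mixed: List[Dict[str, Any]] = [unify_record(r, DESIGNATED) for r in feed_a + feed_b]
print(mixed[1])  # {'title': 'News B', 'pubdate': '2011-04-20', 'author': 'Y'}
```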

The replacement of the tags of the tag groups in the same class can be carried out in various ways. For example, each tag of each tag group in the same class can be replaced with a designated tag in the synonym tag set to which it belongs, where the designated tag may be, for example, the first tag or the last tag in that synonym tag set. Alternatively, the usage frequency of each synonym tag in the synonym tag set can be counted over all tag groups in the same class, and the synonym tag with the highest usage frequency can be used as the designated tag. Those skilled in the art will understand that other methods may also be adopted for the replacement, as long as the designated tags after replacement define the corresponding data uniformly.
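The two ways of choosing the designated tag mentioned above (a fixed position in the synonym tag set, or the highest usage frequency within the class) can be sketched as follows. The function names, the ordered-list representation of a synonym tag set and the tie-breaking rule are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Sequence

def designated_first(synonym_set: Sequence[str]) -> str:
    """Use the first tag of the synonym tag set as the designated tag."""
    return synonym_set[0]

def designated_by_frequency(synonym_set: Sequence[str],
                            tag_groups: Iterable[Sequence[str]]) -> str:
    """Use the synonym tag that occurs most often in the class's tag groups."""
    members = set(synonym_set)
    counts = Counter(tag for group in tag_groups for tag in group if tag in members)
    # Assumption: ties (including "never used") go to the earliest tag in the set.
    return max(synonym_set, key=lambda tag: counts[tag])

# Hypothetical synonym tag set and the tag groups of one class.
title_synonyms = ["title", "headline", "subject"]
class_tag_groups = [
    ["headline", "pubdate"],
    ["headline", "author"],
    ["title", "date"],
]

print(designated_first(title_synonyms))                           # title
print(designated_by_frequency(title_synonyms, class_tag_groups))  # headline
```

Either policy yields one tag per synonym tag set, which is what allows the replaced tags to define the corresponding data uniformly.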

After step 404 is completed, the method ends.

An apparatus for mixing data based on tag groups is described below with reference to Fig. 5. Fig. 5 is a block diagram illustrating the apparatus for mixing data based on tag groups.

As shown in Fig. 5, the apparatus 501 for mixing data based on tag groups mainly comprises a classification unit 503 and a replacement unit 505. The classification unit 503 uses the above apparatus for classifying tag groups to classify the tag groups in the input data into at least one class. The replacement unit 505 replaces each tag of each tag group in the same class with the designated tag in the synonym tag set to which it belongs, so that the similar tag groups in the same class are unified into an identical tag group, and the input data is re-described with the resulting identical tag group, which achieves the mixing of data with similar content meaning.

The above apparatus 501 for mixing data based on tag groups is in effect the apparatus corresponding to the above method for mixing data based on tag groups. A detailed description of it is therefore omitted here.

Fig. 6 is a block diagram showing an example configuration of a computer in which the apparatus and method of the present invention are implemented.

In Fig. 6, a central processing unit (CPU) 601 performs various processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, the data required when the CPU 601 performs the various processes.

The CPU 601, the ROM 602 and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.

The following components are connected to the input/output interface 605: an input section 606, including a keyboard, a mouse and the like; an output section 607, including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and the like; the storage section 608, including a hard disk and the like; and a communication section 609, including a network interface card such as a LAN card, a modem and the like. The communication section 609 performs communication processing via a network such as the Internet.

A drive 610 is also connected to the input/output interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory is mounted on the drive 610 as needed, and a computer program read from it is installed into the storage section 608 as needed.

When the above steps and processes are implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.

Those skilled in the art will understand that this storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the apparatus in order to provide the program to the user. Examples of the removable medium 611 include a magnetic disk, an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608 or the like, in which the program is stored and which is distributed to the user together with the apparatus containing it.

The present invention has been described above with reference to specific embodiments. However, those of ordinary skill in the art will understand that various modifications and changes can be made without departing from the scope of the present invention as defined by the claims.

Claims

1. A method for mixing data based on tag groups, wherein a tag group comprises at least one tag and corresponding data defined by the at least one tag, the method comprising:

determining, among multiple synonym tag sets, the synonym tag set to which each tag of the tag group belongs, wherein a synonym tag set is a set consisting of a group of tags having the same or similar meaning;

generating a feature vector corresponding to the tag group, wherein each element of the generated feature vector corresponds to a different synonym tag set among the multiple synonym tag sets, and the value of each element is the number of tags in the tag group that belong to the synonym tag set corresponding to the element;

calculating the similarity between the feature vector and the core feature vector of each class of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into the class;

classifying the tag group, according to the calculated similarity, into a close class among the at least one class; and

replacing each tag of each tag group in the same class with the designated tag in the synonym tag set to which it belongs.

2. The method according to claim 1, wherein the classifying step comprises:

determining whether each class of the at least one class is the close class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold; and

if there is no close class among the at least one class, classifying the tag group into a new class.

3. The method according to claim 2, wherein, if there are multiple close classes, the tag group is classified into the class corresponding to the maximum calculated similarity.

4. The method according to any one of claims 1 to 3, wherein the similarity comprises cosine similarity.

5. An apparatus for mixing data based on tag groups, wherein a tag group comprises at least one tag and corresponding data defined by the at least one tag, the apparatus comprising:

a synonym tag set determining unit configured to determine, among multiple synonym tag sets, the synonym tag set to which each tag of the tag group belongs, wherein a synonym tag set is a set consisting of a group of tags having the same or similar meaning;

a feature vector generation unit configured to generate a feature vector corresponding to the tag group, wherein each element of the generated feature vector corresponds to a different synonym tag set among the multiple synonym tag sets, and the value of each element is the number of tags in the tag group that belong to the synonym tag set corresponding to the element;

a similarity calculation unit configured to calculate the similarity between the feature vector and the core feature vector of each class of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into the class;

a tag group classification unit configured to classify the tag group, according to the calculated similarity, into a close class among the at least one class; and

a replacement unit configured to replace each tag of each tag group in the same class with the designated tag in the synonym tag set to which it belongs.

6. The apparatus according to claim 5, wherein the tag group classification unit comprises:

a class determining unit configured to determine whether each class of the at least one class is the close class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold, and, if there is no close class among the at least one class, to classify the tag group into a new class.

7. The apparatus according to claim 6, wherein the class determining unit is further configured to: if there are multiple close classes, classify the tag group into the class corresponding to the maximum calculated similarity.

8. The apparatus according to any one of claims 5 to 7, wherein the similarity comprises cosine similarity.