CN102750289B - Method and apparatus for mixing data based on tag sets - Google Patents


Publication number
CN102750289B
Authority
CN
China
Prior art keywords
tags
class
synonym
label
tally
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110101514.2A
Other languages
Chinese (zh)
Other versions
CN102750289A (en
Inventor
张军
钟朝亮
王主龙
大木宪二
田中昌弘
粂照宣
松尾昭彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201110101514.2A (CN102750289B)
Priority to JP2012079208A (JP5928091B2)
Publication of CN102750289A
Application granted
Publication of CN102750289B
Expired - Fee Related
Anticipated expiration


Abstract

A method and apparatus for mixing data based on tag sets are disclosed. The method comprises: determining, among a plurality of synonym tag sets, the synonym tag set to which each tag of a tag set belongs; generating a feature vector corresponding to the tag set, wherein each element of the generated feature vector corresponds to a different one of the plurality of synonym tag sets, and the value of each element is the number of tags in the tag set that belong to the synonym tag set corresponding to that element; calculating the similarity between the feature vector and the core feature vector of each of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag sets classified into that class; classifying the tag set, according to the calculated similarity, into a similar class among the at least one class; and replacing each tag of each tag set in the same class with a designated tag in the synonym tag set to which that tag belongs.

Description

Method and apparatus for mixing data based on tag sets
Technical field
The present invention relates to data processing, and more specifically to a method and device for classifying tag sets and a method and device for mixing data.
Background
At present, there are various data format standards for describing data, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and CSV (Comma-Separated Values). Each data format standard defines tags that describe the meaning of the data content. For example, for list-type data such as a news list containing several news items, a group of tags describing the news content can be defined: title, pubdate (publication time), author, and so on. Similarly, for a schedule table containing several schedule entries, a group of tags describing the schedule content can be defined: starttime (start time), endtime (end time), attendees (participants), location (place), and so on. With such a group of tags, data content can be published or accessed conveniently.
However, for data content with the same or similar meaning, different data format standards may use different tags. For example, for the data content "the person who created the data", different data format standards may use different tags such as "author", "writer", or "creator". There is therefore a need to identify data content of the same or similar meaning that is described with different tags, and to describe that content with unified tags, thereby completing the mixing of data content of the same or similar meaning.
In the prior art, whether multiple pieces of data content are the same or similar is judged by directly comparing the data content itself. Because the volume of the data content itself is large, direct comparison often requires a large amount of computation, and the accuracy of the judgment is also poor.
In addition, the prior art includes techniques that judge whether the data content described by two tags is the same or similar by comparing whether the two tags are the same or similar. In practice, however, there are many different data format standards, and the tags they use vary widely. If only individual tags are compared with one another, it is difficult to take the various features of the various tags into account, so the accuracy of the judgment is again poor.
Moreover, as mentioned above, for a news list containing several news items, for example, a group of tags (hereinafter referred to as a "tag set") describing the content of a news item can be defined: title, pubdate (publication time), author, and so on. A piece of data content is thus generally defined by a tag set comprising several tags that describe it. Therefore, to judge whether multiple pieces of data content have the same or similar meaning, one should judge comprehensively whether the tag sets that describe them are the same or similar. If only individual tags are compared, it is difficult to judge whether data content described by tag sets comprising several tags has the same or similar meaning.
Summary of the invention
In view of the above problems, the applicant recognized that data content with the same or similar meaning can be identified by comparing whether multiple tag sets are the same or similar. The core idea of the present invention is as follows: to compare whether multiple tag sets are the same or similar, identical or similar tag sets are first grouped into the same class, and a newly discovered tag set is then compared with the classes of already classified tag sets. Because all tag sets in the same class are the same or similar, a class of tag sets combines the various features of its member tag sets. Comparing a tag set with classes of tag sets therefore allows a more accurate judgment of whether tag sets are the same or similar.
According to one embodiment of the present invention, a method for classifying a tag set is provided, wherein the tag set comprises at least one tag and the at least one tag defines corresponding data. The method comprises: determining, among a plurality of synonym tag sets, the synonym tag set to which each tag of the tag set belongs, wherein a synonym tag set is a set composed of a group of tags with the same or similar meaning; generating a feature vector corresponding to the tag set, wherein each element of the generated feature vector corresponds to a different one of the plurality of synonym tag sets, and the value of each element is the number of tags in the tag set that belong to the synonym tag set corresponding to that element; calculating the similarity between the feature vector and the core feature vector of each of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag sets classified into that class; and classifying the tag set, according to the calculated similarity, into a similar class among the at least one class.
The classifying step comprises: determining, for each of the at least one class, whether it is a similar class according to whether the calculated similarity between the tag set and that class exceeds a predetermined threshold; and, if there is no similar class among the at least one class, classifying the tag set into a new class.
In the classifying step, if there are multiple similar classes, the tag set is classified into the class corresponding to the maximum calculated similarity.
The similarity comprises cosine similarity.
According to another embodiment of the present invention, a device for classifying a tag set is provided, wherein the tag set comprises at least one tag and the at least one tag defines corresponding data. The device comprises: a synonym tag set determining unit for determining, among a plurality of synonym tag sets, the synonym tag set to which each tag of the tag set belongs, wherein a synonym tag set is a set composed of a group of tags with the same or similar meaning; a feature vector generation unit for generating a feature vector corresponding to the tag set, wherein each element of the generated feature vector corresponds to a different one of the plurality of synonym tag sets, and the value of each element is the number of tags in the tag set that belong to the synonym tag set corresponding to that element; a similarity calculation unit for calculating the similarity between the feature vector and the core feature vector of each of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag sets classified into that class; and a tag set classification unit for classifying the tag set, according to the calculated similarity, into a similar class among the at least one class.
The tag set classification unit comprises a class determining unit for determining, for each of the at least one class, whether it is a similar class according to whether the calculated similarity between the tag set and that class exceeds a predetermined threshold, and, if there is no similar class among the at least one class, classifying the tag set into a new class.
The class determining unit is further configured to classify the tag set into the class corresponding to the maximum calculated similarity if there are multiple similar classes.
The similarity comprises cosine similarity.
According to another embodiment of the present invention, a method for mixing data based on tag sets is provided, comprising: classifying tag sets into at least one class using the above method for classifying a tag set; and replacing each tag of each tag set in the same class with a designated tag in the synonym tag set to which that tag belongs.
According to another embodiment of the present invention, a device for mixing data based on tag sets is provided, comprising: a classification unit for classifying tag sets into at least one class using the above device for classifying a tag set; and a replacement unit for replacing each tag of each tag set in the same class with a designated tag in the synonym tag set to which that tag belongs.
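The replacement step can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the source does not say how the designated tag of a synonym tag set is chosen, so the sketch assumes it is simply the first tag listed. The names `SYNONYM_SETS`, `build_replacement_map`, and `unify_record`, and the sample records, are hypothetical.

```python
# Hypothetical synonym tag sets. ASSUMPTION: the first tag of each list is
# treated as the designated tag; the source does not specify this choice.
SYNONYM_SETS = [
    ["author", "creator", "writer"],
    ["pubdate", "publishdate"],
    ["link", "url"],
    ["summary", "description"],
    ["title", "event", "what"],
]


def build_replacement_map(synonym_sets):
    """Map every tag to the designated tag of its synonym tag set."""
    return {tag: tags[0] for tags in synonym_sets for tag in tags}


def unify_record(record, replacement_map):
    """Return a copy of a record with each tag replaced by its designated tag.

    Tags that belong to no synonym tag set are kept unchanged.
    """
    return {replacement_map.get(tag, tag): value for tag, value in record.items()}


REPLACEMENTS = build_replacement_map(SYNONYM_SETS)

# Two illustrative records assumed to have been classified into the same class.
record_a = {"title": "Example", "author": "A. Writer", "pubdate": "2011-01-01"}
record_b = {"title": "Example", "creator": "A. Writer", "publishdate": "2011-01-01"}

unified = [unify_record(r, REPLACEMENTS) for r in (record_a, record_b)]
```

After replacement both records use the same tags, so they can be merged or queried together.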
By comparing the similarity between the feature vector of a tag set and the core feature vector of a class of tag sets, the present invention can judge more accurately and more efficiently whether tag sets are the same or similar, and can in turn mix the same or similar data more accurately and more efficiently.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will be more readily understood from the following description of embodiments with reference to the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals.
Fig. 1 is a flowchart of a method for classifying a tag set according to an embodiment of the present invention;
Fig. 2 is a flowchart of the detailed flow of the classifying step in the method for classifying a tag set according to an embodiment of the present invention;
Fig. 3 is a block diagram of a device for classifying a tag set according to another embodiment of the present invention;
Fig. 4 is a flowchart of a method for mixing data based on tag sets according to another embodiment of the present invention;
Fig. 5 is a block diagram of a device for mixing data based on tag sets according to another embodiment of the present invention;
Fig. 6 is a block diagram of an example structure of a computer in which the present invention is implemented.
Detailed description of the embodiments
The terms used herein serve only to describe specific embodiments and are not intended to limit the present invention. The singular forms "a" and "the" as used herein are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the word "comprise", when used in this specification, specifies the presence of the stated features, integers, steps, operations, units, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, units, and/or components, and/or combinations thereof.
Embodiments of the present invention are described below with reference to the accompanying drawings. Note that, for clarity, the drawings and the description omit the representation and description of components and processes that are unrelated to the present invention and known to those of ordinary skill in the art. Each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the computer or other programmable data processing apparatus, create means for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.
These computer program instructions can also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means that implement the functions/operations specified in the blocks of the flowcharts and/or block diagrams.
The computer program instructions can also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable data processing apparatus, producing a computer-implemented process such that the instructions executed on the computer or other programmable apparatus provide processes for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.
It should be understood that the flowcharts and block diagrams in the drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram can represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks can occur in an order different from that noted in the drawings. For example, two blocks shown in succession can in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
A method for classifying a tag set according to an embodiment of the present invention is described below with reference to Fig. 1. Fig. 1 is a flowchart of the method for classifying a tag set according to an embodiment of the present invention.
As shown in Fig. 1, the method starts at step 100. Then, in step 102, the synonym tag set to which each tag of the tag set belongs is determined among a plurality of synonym tag sets.
A synonym tag set (S) is a set composed of a group of tags with the same or similar meaning (that is, synonyms). As an example, the following synonym tag sets may exist:
S1: author, creator, writer
S2: pubdate (publication time), publishdate (publication time)
S3: URL (uniform resource locator), link
S4: summary, description
S5: event, title, what
S6: starttime (start time), when
S7: where, location (place)
Sn: who, attendees (participants)
where n is an integer greater than or equal to 1.
The above synonym tag sets are only examples; other synonym tag sets may exist as needed. Which tags represent the same or similar meaning can be predefined based on practical experience. In addition, newly discovered tags with the same or similar meaning can be added to the synonym tag sets during use, so that the synonym tag sets are updated dynamically. The synonym tag sets can be provided, for example, in the form of a synonym dictionary. Those skilled in the art will understand that they can also be provided in other ways, such as in a database.
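One possible way to hold such a synonym dictionary in memory is sketched below. The structure and the names `SYNONYM_SETS`, `build_tag_index`, and `add_synonym` are assumptions made for illustration, as is the choice to compare tags case-insensitively; the source only says that a dictionary or database may be used.

```python
# Synonym tag sets taken from the S1..S7 and Sn examples above.
# ASSUMPTION: tags are stored and compared in lower case.
SYNONYM_SETS = {
    "S1": ["author", "creator", "writer"],
    "S2": ["pubdate", "publishdate"],
    "S3": ["url", "link"],
    "S4": ["summary", "description"],
    "S5": ["event", "title", "what"],
    "S6": ["starttime", "when"],
    "S7": ["where", "location"],
    "Sn": ["who", "attendees"],
}


def build_tag_index(synonym_sets):
    """Build a reverse index: tag -> name of the synonym tag set it belongs to."""
    index = {}
    for set_name, tags in synonym_sets.items():
        for tag in tags:
            index[tag.lower()] = set_name
    return index


def add_synonym(synonym_sets, set_name, new_tag):
    """Add a newly discovered tag to a synonym tag set (dynamic update)."""
    tags = synonym_sets.setdefault(set_name, [])
    if new_tag.lower() not in tags:
        tags.append(new_tag.lower())


TAG_INDEX = build_tag_index(SYNONYM_SETS)
```

The reverse index makes the lookup in step 102 a single dictionary access per tag; it has to be rebuilt (or updated) after `add_synonym` is called.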
A tag set (T) is a set composed of a group of tags, each of which defines corresponding data in a data entry. As an example, the following tag sets may exist:
T1: title, author, pubdate (publication time), summary
T2: title, publishdate (publication time), creator, description, URL (uniform resource locator)
T3: title, link, writer, description
T4: title, link, writer, description
T5: event, starttime (start time), endtime (end time), location (place), attendees (participants)
T6: title, starttime (start time), duration, where, attendees (participants)
Tp: what, where, who, when
where p is an integer greater than or equal to 1.
The above tag sets are only examples; other tag sets may exist in actual use. For example, different data format standards (such as XML, JSON, or CSV) can define different tag sets, and data publishers can also define their own tag sets according to their needs.
For a new tag set, the synonym tag set to which each tag in the new tag set belongs can be determined from the above synonym tag sets. For example, for tag set T1, the following can be determined in the order of the tags in T1: the tag "title" belongs to synonym tag set S5 (that is, the number of tags in T1 belonging to S5 is 1), the tag "author" belongs to synonym tag set S1 (the number of tags in T1 belonging to S1 is 1), the tag "pubdate" belongs to synonym tag set S2 (the number of tags in T1 belonging to S2 is 1), and the tag "summary" belongs to synonym tag set S4 (the number of tags in T1 belonging to S4 is 1). Alternatively, for tag set T1, the following can be determined in the order of synonym tag sets S1 to Sn: the number of tags in T1 belonging to S1 is 1, to S2 is 1, to S3 is 0, to S4 is 1, to S5 is 1, to S6 is 0, and to each of S7 through Sn is 0. In the same way, it can be determined for each of tag sets T2 to Tp which of synonym tag sets S1 to Sn each of its tags belongs to.
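The determination in step 102 can be sketched as a lookup followed by a count. This is only an illustration under stated assumptions: the index `TAG_INDEX` mirrors the example synonym tag sets, the function name `count_by_synonym_set` is hypothetical, and tags that belong to no synonym tag set (such as "endtime" in T5) are simply skipped, which the source does not address.

```python
from collections import Counter

# Tag -> synonym tag set name, following the S1..S7 and Sn examples above.
TAG_INDEX = {
    "author": "S1", "creator": "S1", "writer": "S1",
    "pubdate": "S2", "publishdate": "S2",
    "url": "S3", "link": "S3",
    "summary": "S4", "description": "S4",
    "event": "S5", "title": "S5", "what": "S5",
    "starttime": "S6", "when": "S6",
    "where": "S7", "location": "S7",
    "who": "Sn", "attendees": "Sn",
}


def count_by_synonym_set(tag_set, tag_index=TAG_INDEX):
    """Count how many tags of tag_set fall into each synonym tag set."""
    counts = Counter()
    for tag in tag_set:
        set_name = tag_index.get(tag.lower())
        if set_name is not None:  # ASSUMPTION: unknown tags are ignored
            counts[set_name] += 1
    return counts


T1 = ["title", "author", "pubdate", "summary"]
T5 = ["event", "starttime", "endtime", "location", "attendees"]

t1_counts = count_by_synonym_set(T1)
t5_counts = count_by_synonym_set(T5)
```

Because `Counter` returns 0 for missing keys, `t1_counts["S3"]` evaluates to 0, matching the "number of tags is 0" cases in the text.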
The method then proceeds to step 104. In step 104, a feature vector corresponding to the tag set is generated. Each element of the generated feature vector corresponds to a different one of the plurality of synonym tag sets, and the value of each element is the number of tags in the tag set that belong to the synonym tag set corresponding to that element.
The feature vector corresponding to the tag set can be generated from the determination result of step 102. For example, for tag set T1, the determination result obtained in the order of the tags in T1 yields the feature vector A corresponding to T1: (S5: 1, S1: 1, S2: 1, S4: 1), where the part of each element before the colon is the synonym tag set corresponding to that element, and the part after the colon is the number of tags in T1 that belong to that synonym tag set. For instance, in the first element "S5: 1" of feature vector A, "S5" indicates that the element corresponds to synonym tag set S5, and "1" indicates that the number of tags in T1 belonging to S5 is 1. Alternatively, for tag set T1, the determination result obtained in the order of synonym tag sets S1 to Sn yields the feature vector A' corresponding to T1: (S1: 1, S2: 1, S3: 0, S4: 1, S5: 1, S6: 0, S7: 0, ..., Sn: 0), where the parts of each element have the same meaning as in feature vector A and are not described again. In the same way, a feature vector can be generated for each of tag sets T1 to Tp.
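The dense form A' can be sketched as follows, assuming a fixed ordering of the synonym tag sets. The names `SET_ORDER` and `feature_vector` are hypothetical, and the example uses only the eight synonym tag sets listed earlier.

```python
# Fixed order of synonym tag sets; each position is one vector element.
SET_ORDER = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "Sn"]


def feature_vector(counts, set_order=SET_ORDER):
    """Turn per-synonym-set counts into a dense feature vector (the A' form)."""
    return [counts.get(name, 0) for name in set_order]


# Counts for T1 as determined in step 102 (the sparse form A).
t1_counts = {"S5": 1, "S1": 1, "S2": 1, "S4": 1}

A_prime = feature_vector(t1_counts)
```

The sparse form A and the dense form A' carry the same information; the dense form is convenient because vectors of equal length can be compared element by element.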
The method then proceeds to step 106. In step 106, the similarity between the feature vector and the core feature vector of each of at least one class is calculated, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag sets classified into that class.
A class is a set composed of a group of tag sets that are the same as or similar to one another; that is, the tag sets belonging to the same class are the same or similar. Whether tag sets are the same or similar can be judged, for example, from the cosine similarity between them. The calculation of the cosine similarity between tag sets is described below.
Suppose that, according to step 104, feature vector A has been generated for tag set T1 and feature vector B for tag set T2. Feature vector A can be expressed as (S1: f_a1, S2: f_a2, ..., Sn: f_an), abbreviated as (f_a1, f_a2, ..., f_an); feature vector B can be expressed as (S1: f_b1, S2: f_b2, ..., Sn: f_bn), abbreviated as (f_b1, f_b2, ..., f_bn). Here Sn denotes the synonym tag set corresponding to the n-th element of feature vector A or B, f_an denotes the number of tags in T1 that belong to the synonym tag set Sn corresponding to the n-th element of A, and f_bn denotes the number of tags in T2 that belong to the synonym tag set Sn corresponding to the n-th element of B. The cosine similarity between feature vector A of tag set T1 and feature vector B of tag set T2 can be calculated with the following formula (1):
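$$\mathrm{Similarity}(A, B) = \frac{\sum_{k} f_{ak} \times f_{bk}}{\sqrt{\left(\sum_{k} f_{ak} \times f_{ak}\right) \times \left(\sum_{k} f_{bk} \times f_{bk}\right)}} \qquad \text{formula (1)}$$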
where 1 ≤ k ≤ n and n is an integer greater than or equal to 1.
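Formula (1) can be sketched directly in code. The function name `cosine_similarity` is hypothetical, and returning 0.0 when either vector is all zeros is an assumption; the source does not cover that case. The example vectors are the dense feature vectors of T1 and T2 over the eight example synonym tag sets.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors, per formula (1)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # ASSUMPTION: a tag set with no known tags matches nothing
    return dot / (norm_a * norm_b)


# T1: title, author, pubdate, summary            -> S1, S2, S4, S5
# T2: title, publishdate, creator, description, URL -> S1, S2, S3, S4, S5
A = [1, 1, 0, 1, 1, 0, 0, 0]
B = [1, 1, 1, 1, 1, 0, 0, 0]

similarity = cosine_similarity(A, B)  # 4 / sqrt(4 * 5), roughly 0.894
```

Since all element values are non-negative counts, the result lies between 0 and 1, with 1 meaning the two vectors point in the same direction.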
For a class composed of a group of tag sets, the core feature vector corresponding to the class can be obtained, for example, by summing the corresponding elements of the feature vectors of the tag sets in the class. For example, suppose class C contains tag sets T1 to Tm classified into it (m is an integer greater than or equal to 1), and the feature vectors corresponding to T1 to Tm are A_1 to A_m respectively. The core feature vector A_C corresponding to class C can then be expressed with the following formula (2):
$$A_C = \left(\sum_{j} f_{a_j 1}, \sum_{j} f_{a_j 2}, \ldots, \sum_{j} f_{a_j n}\right) \qquad \text{formula (2)}$$
where 1 ≤ j ≤ m and m is an integer greater than or equal to 1.
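Formula (2) amounts to an element-wise sum over the member feature vectors, which might be sketched as below. The function name `core_feature_vector` is hypothetical; the two member vectors are the dense feature vectors of T1 and T2 from the earlier examples, assumed here to be in the same class.

```python
def core_feature_vector(member_vectors):
    """Element-wise sum of the feature vectors of all tag sets in a class, per formula (2)."""
    if not member_vectors:
        raise ValueError("a class must contain at least one tag set")
    # zip(*...) groups the k-th elements of all member vectors together.
    return [sum(column) for column in zip(*member_vectors)]


A1 = [1, 1, 0, 1, 1, 0, 0, 0]  # feature vector of T1
A2 = [1, 1, 1, 1, 1, 0, 0, 0]  # feature vector of T2

A_C = core_feature_vector([A1, A2])
```

Because cosine similarity ignores vector length, using the sum rather than the mean of the member vectors gives the same similarity values.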
After the core feature vector A_C corresponding to class C has been calculated according to formula (2), formula (1) can be used to calculate the similarity between the feature vector A_NE of a new tag set T_NE and the core feature vector A_C of class C. If there are multiple classes, the similarity between the feature vector A_NE of the new tag set T_NE and the core feature vector of each of the classes is calculated separately.
The method then proceeds to step 108. In step 108, the tag set is classified, according to the calculated similarity, into a similar class among the at least one class.
The value of the cosine similarity calculated with formula (1) between the feature vector of a tag set and the core feature vector of a class indicates how similar the tag set and the class are: the larger the cosine similarity, the more similar they are. Therefore, the calculated similarity can be used to judge whether the tag set is similar to a class, and the tag set can be classified into a close (that is, similar) class.
Finally, the method proceeds to step 110, where it ends.
The overall flow of the method for classifying a tag set according to an embodiment of the present invention has been described above. The detailed flow of the classifying step in that method is described below with reference to Fig. 2. Fig. 2 is a flowchart of the detailed flow of the classifying step in the method for classifying a tag set according to an embodiment of the present invention.
As shown in Fig. 2, after the similarity between the feature vector of the tag set and the core feature vector of each class has been calculated according to step 106, the method proceeds to step 200. In step 200, the calculated similarity between the tag set and each of the at least one class is compared with a predetermined threshold. The threshold can be preset as needed and adjusted in actual use. Adjusting the threshold controls the precision of the tag set classification.
Suppose there are currently 3 classes composed of tag sets, denoted C1, C2, and C3, whose core feature vectors are A1, A2, and A3 respectively. When a new tag set T_NE is discovered, its feature vector A_NE is determined, and the similarity between A_NE and each of the core feature vectors A1, A2, and A3 is calculated. For example, with cosine similarity, the calculated similarity values might be 0.92, 0.85, and 0.79 respectively. These values are then each compared with the predetermined threshold.
The method then proceeds to step 202. In step 202, it is judged whether the calculated similarity between the tag set and any of the at least one class exceeds the predetermined threshold. If the result of step 202 is "No", that is, the tag set is not similar to any class, the method proceeds to step 206. In step 206, the tag set is classified into a new class, so that the new class contains the tag set.
In the above example, suppose the predetermined threshold is 0.93. Because none of the 3 calculated similarity values 0.92, 0.85, and 0.79 exceeds the threshold 0.93, the new tag set T_NE is not similar to any of the current classes C1, C2, and C3. A new class C4 can then be created and the new tag set T_NE classified into it, so that C4 contains T_NE.
If the result of step 202 is "Yes", the method proceeds to step 204. In step 204, it is judged whether more than one class has a similarity greater than the predetermined threshold, that is, whether the similarities between the tag set and multiple classes all exceed the threshold. If the result of step 204 is "No", the similarity between the tag set and only one class exceeds the threshold (the number of similarities greater than the threshold is 1), and the method proceeds to step 210. In step 210, the tag set is classified into the class corresponding to the only calculated similarity that exceeds the threshold.
In the above example, suppose the predetermined threshold is 0.90. Of the 3 calculated similarity values 0.92, 0.85, and 0.79, only 0.92 exceeds the threshold 0.90, so the new tag set T_NE is classified into class C1, which corresponds to the similarity value 0.92.
If the result of step 204 is "Yes", the similarities between the tag set and multiple classes exceed the threshold (the number of similarities greater than the threshold is more than one), and the method proceeds to step 208. In step 208, the maximum of the similarities that exceed the threshold is selected, and the tag set is classified into the class corresponding to that maximum similarity.
In the above example, suppose the predetermined threshold is 0.80. Of the 3 calculated similarity values 0.92, 0.85, and 0.79, both 0.92 and 0.85 exceed the threshold 0.80, so the larger of the two, 0.92, is selected. The new tag set T_NE is then classified into class C1, which corresponds to the maximum similarity value 0.92.
After steps 206, 208, and 210, the method proceeds to step 212, where it ends.
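The decision logic of Fig. 2 (steps 200 to 210) can be condensed into a few lines. This is a sketch under assumptions: the function name `choose_class` is hypothetical, "exceeds" is read as strictly greater than the threshold, and `None` is used to signal that a new class should be created.

```python
def choose_class(similarities, threshold):
    """Pick the class for a tag set from its per-class similarities.

    Returns the name of the most similar class whose similarity exceeds the
    threshold, or None if no class qualifies (a new class is then needed).
    """
    # Steps 200/202: keep only classes whose similarity exceeds the threshold.
    candidates = {name: s for name, s in similarities.items() if s > threshold}
    if not candidates:
        return None  # step 206: classify into a new class
    # Steps 204/208/210: one candidate or several, the maximum wins either way.
    return max(candidates, key=candidates.get)


# Similarities of the new tag set T_NE to classes C1, C2, C3 from the example.
similarities = {"C1": 0.92, "C2": 0.85, "C3": 0.79}

result_093 = choose_class(similarities, 0.93)  # no class exceeds 0.93
result_090 = choose_class(similarities, 0.90)  # only C1 exceeds 0.90
result_080 = choose_class(similarities, 0.80)  # C1 and C2 exceed 0.80, C1 is larger
```

Steps 208 and 210 collapse into a single `max` call because the unique candidate is trivially the maximum.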
Hereinbefore, utilize cosine similarity to calculate similarity between set of tags and set of tags and set of tags and by the similarity between the class that set of tags is formed.But, it will be understood by those skilled in the art that the similarity calculating method that can also adopt other, as long as the similarity that can calculate between set of tags and set of tags or set of tags and by the similarity between the class that set of tags is formed.
In the description above, the number of tag groups contained in a class increases dynamically. Each time a tag group is classified into a class by the classification method described above, the number of tag groups in that class increases by one. Preferably, after a new tag group has been classified into a class, the core feature vector of that class is recalculated with formula (2) above from the new tag group and all tag groups already contained in the class, and the recalculated vector is used as the new core feature vector of the class. When another tag group is classified later, its similarity is compared against this new core feature vector. The method of this embodiment can therefore take the features of the various tag groups into account, so that whether tag groups are identical or similar can be judged more accurately and more effectively.
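The update of the core feature vector can be sketched as below, assuming (as the abstract and claim 1 state) that each element of the core feature vector is the sum of the corresponding elements of the feature vectors of the tag groups in the class. The function names and the sample vectors are hypothetical.

```python
from typing import List, Sequence


def core_vector(feature_vectors: Sequence[Sequence[int]]) -> List[int]:
    """Element-wise sum over the feature vectors of all tag groups in a class."""
    return [sum(column) for column in zip(*feature_vectors)]


def add_to_core_vector(core: Sequence[int], new_vector: Sequence[int]) -> List[int]:
    """Update the core vector after one more tag group joins the class."""
    if len(core) != len(new_vector):
        raise ValueError("vectors must have the same length")
    return [c + v for c, v in zip(core, new_vector)]


members = [[1, 0, 2], [0, 1, 1]]        # feature vectors already in the class
core = core_vector(members)             # [1, 1, 3]
new = [1, 1, 0]                         # feature vector of the new tag group
updated = add_to_core_vector(core, new) # [2, 2, 3]
```

Because the core vector is a plain sum, adding the new vector to the stored core vector gives the same result as recomputing the sum over all members, so the class does not need to keep every member vector just to stay up to date.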
An apparatus for classifying tag groups according to another embodiment of the present invention is described below with reference to Fig. 3. Fig. 3 is a block diagram of the apparatus for classifying tag groups according to this embodiment.
As shown in Fig. 3, the apparatus 312 for classifying tag groups mainly comprises a synonym tag set determining unit 300, a feature vector generating unit 302, a similarity calculating unit 304 and a tag group classifying unit 306. The synonym tag set determining unit 300 determines, among the multiple synonym tag sets stored in the synonym tag set database 308, the synonym tag set to which each tag of the input tag group belongs. The feature vector generating unit 302 generates the feature vector corresponding to the input tag group; each element of the generated feature vector corresponds to a different one of the multiple synonym tag sets, and the value of each element is the number of tags in the tag group that belong to the synonym tag set corresponding to that element. The similarity calculating unit 304 calculates the similarity between the feature vector and the core feature vector of each of the at least one class stored in the class set database 310, where the value of each element of a class's core feature vector is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class. The tag group classifying unit 306 classifies the input tag group, according to the calculated similarities, into a close class among the at least one class stored in the class set database 310.
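The work of the synonym tag set determining unit and the feature vector generating unit can be illustrated with the short sketch below. The synonym tag sets and tags are invented sample data, the function name is hypothetical, and ignoring tags that belong to no synonym tag set is an assumption of this sketch rather than something the source specifies.

```python
from typing import List, Sequence, Set


def build_feature_vector(tags: Sequence[str],
                         synonym_sets: Sequence[Set[str]]) -> List[int]:
    """Count, for each synonym tag set, how many tags of the group belong to it."""
    vector = [0] * len(synonym_sets)
    for tag in tags:
        for i, synonyms in enumerate(synonym_sets):
            if tag in synonyms:
                vector[i] += 1
                break  # each tag is counted for one synonym tag set only
    return vector


# Hypothetical synonym tag sets, one vector element per set.
synonym_sets = [
    {"title", "name", "headline"},
    {"author", "writer", "creator"},
    {"date", "time"},
]

print(build_feature_vector(["title", "writer", "name"], synonym_sets))  # [2, 1, 0]
```

Every tag group is mapped to a vector of the same length, which is what allows its similarity to be compared against the core feature vector of any class.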
The tag group classifying unit 306 includes a class determining unit 3062. The class determining unit 3062 determines whether each of the at least one class is a close class according to whether the calculated similarity between the tag group and that class exceeds the predetermined threshold. If none of the at least one class is a close class, the class determining unit 3062 classifies the tag group into a new class. If there are several close classes, the class determining unit 3062 classifies the tag group into the class corresponding to the maximum calculated similarity.
Those skilled in the art will understand that the multiple synonym tag sets may also be provided in other ways, for example as a synonym tag set dictionary, and that the classes may likewise be provided in other ways. The synonym tag set database 308 and the class set database 310 are stored in a storage unit 314. The storage unit 314 is, for example, a magnetic disk, a flash memory or a removable memory. The storage unit 314 may be included in the apparatus 312 for classifying tag groups, or may be located outside the apparatus 312 and connected to it by wire or wirelessly.
Cosine similarity can be used to calculate the similarity between two tag groups and the similarity between a tag group and a class formed of tag groups. However, those skilled in the art will understand that other similarity measures may also be used, as long as they can calculate the similarity between two tag groups or between a tag group and a class formed of tag groups.
The apparatus 312 for classifying tag groups corresponds to the tag group classification method described above. A detailed description of it is therefore omitted here.
A method for mixing data based on tag groups is described below with reference to Fig. 4. Fig. 4 is a flowchart of the method for mixing data based on tag groups.
As shown in Fig. 4, the method starts at step 400 and then proceeds to step 402. In step 402, the tag groups are classified into at least one class using the tag group classification method described above. In this way, tag groups that conform to different data format standards, different user-defined tag groups and so on can be divided dynamically into different classes according to their similarity to one another, and the tag groups within each class are similar to one another.
The method then proceeds to step 404. In step 404, each tag of each tag group in the same class is replaced with the designated tag of the synonym tag set to which it belongs. After the tag groups have been divided into classes in step 402, each tag of each tag group in the same class can be replaced with a unified tag, so that the similar tag groups in the same class are unified into an identical tag group. The data previously described by the similar tag groups is then re-described with the resulting identical tag group, which achieves the mixing of data with similar content meaning.
The replacement of the tags of the tag groups in the same class can be carried out in various ways. For example, each tag of each tag group in the same class can be replaced with a designated tag in the synonym tag set to which it belongs; the designated tag may be, for example, the first tag or the last tag in that synonym tag set. Alternatively, the usage frequency of each synonym tag in the synonym tag sets to which the tags belong may be counted over all tag groups in the same class, and the most frequently used synonym tag taken as the designated tag. Those skilled in the art will understand that other methods may also be used for the replacement, as long as the designated tags after replacement define the corresponding data uniformly.
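The frequency-based choice of the designated tag can be sketched as follows. The function names and sample data are hypothetical, each synonym tag set is given as an ordered list, and ties are resolved in favour of the tag listed first in the synonym tag set, which is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List, Sequence


def designated_tags(tag_groups: Sequence[Sequence[str]],
                    synonym_sets: Sequence[Sequence[str]]) -> Dict[str, str]:
    """Map every tag to the most frequently used tag of its synonym tag set."""
    # Usage frequency of each tag over all tag groups of one class.
    counts = Counter(tag for group in tag_groups for tag in group)
    mapping: Dict[str, str] = {}
    for synonyms in synonym_sets:
        used = [t for t in synonyms if counts[t] > 0]
        if not used:
            continue  # this synonym tag set does not occur in the class
        best = max(used, key=lambda t: counts[t])  # first listed tag wins ties
        for t in synonyms:
            mapping[t] = best
    return mapping


def replace_tags(group: Sequence[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each tag by its designated tag; unknown tags are kept as they are."""
    return [mapping.get(tag, tag) for tag in group]


class_groups = [["title", "author"], ["name", "writer"], ["title", "creator"]]
synonym_sets = [["title", "name", "headline"], ["author", "writer", "creator"]]
mapping = designated_tags(class_groups, synonym_sets)
print([replace_tags(g, mapping) for g in class_groups])
# [['title', 'author'], ['title', 'author'], ['title', 'author']]
```

Choosing the first or last tag of each synonym tag set instead would only require replacing the `max(...)` line with `synonyms[0]` or `synonyms[-1]`.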
After step 404, the method ends.
An apparatus for mixing data based on tag groups is described below with reference to Fig. 5. Fig. 5 is a block diagram of the apparatus for mixing data based on tag groups.
As shown in Fig. 5, the apparatus 501 for mixing data based on tag groups mainly comprises a classifying unit 503 and a replacing unit 505. The classifying unit 503 uses the tag group classifying apparatus described above to classify the tag groups in the input data into at least one class. The replacing unit 505 replaces each tag of each tag group in the same class with the designated tag of the synonym tag set to which it belongs, so that the similar tag groups in the same class are unified into an identical tag group, and the input data is re-described with the resulting identical tag group, which achieves the mixing of data with similar content meaning.
The apparatus 501 for mixing data based on tag groups corresponds to the method for mixing data based on tag groups described above. A detailed description of it is therefore omitted here.
Fig. 6 is a block diagram showing an example configuration of a computer in which the apparatus and method of the present invention are implemented.
In Fig. 6, a central processing unit (CPU) 601 performs various processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, the data required when the CPU 601 performs the various processes.
CPU 601, ROM 602 and RAM 603 are connected to each other via bus 604.Input/output interface 605 is also connected to bus 604.
The following components are connected to the input/output interface 605: an input section 606, including a keyboard, a mouse, etc.; an output section 607, including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a speaker, etc.; the storage section 608, including a hard disk, etc.; and a communication section 609, including a network interface card such as a LAN card, a modem, etc. The communication section 609 performs communication processing via a network such as the Internet.
A drive 610 is also connected to the input/output interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 610 as needed, and a computer program read from it is installed into the storage section 608 as needed.
When the steps and processes described above are implemented in software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 611 shown in Fig. 6, in which the program is stored and which is distributed separately from the apparatus in order to provide the program to the user. Examples of the removable medium 611 include a magnetic disk, an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, in which the program is stored and which is distributed to the user together with the apparatus that contains it.
The present invention has been described above with reference to specific embodiments. However, those of ordinary skill in the art will understand that various modifications and changes can be made without departing from the scope of the present invention as defined by the claims.

Claims (8)

1. A method for mixing data based on a tag group, wherein the tag group comprises at least one tag and the corresponding data defined by the at least one tag, the method comprising:
determining, among multiple synonym tag sets, the synonym tag set to which each tag of the tag group belongs, wherein a synonym tag set is a set consisting of a group of tags having the same or similar meaning;
generating a feature vector corresponding to the tag group, wherein each element of the generated feature vector corresponds to a different one of the multiple synonym tag sets, and the value of each element is the number of tags in the tag group that belong to the synonym tag set corresponding to that element;
calculating the similarity between the feature vector and the core feature vector of each class of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class;
classifying the tag group, according to the calculated similarity, into a close class among the at least one class; and
replacing each tag of each tag group in the same class with the designated tag in the synonym tag set to which it belongs.
2. The method according to claim 1, wherein the classifying step comprises:
determining whether each class of the at least one class is the close class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold; and
if there is no close class among the at least one class, classifying the tag group into a new class.
3. The method according to claim 2, wherein, if there are multiple close classes, the tag group is classified into the class corresponding to the maximum calculated similarity.
4. The method according to any one of claims 1 to 3, wherein the similarity comprises cosine similarity.
5. An apparatus for mixing data based on a tag group, wherein the tag group comprises at least one tag and the corresponding data defined by the at least one tag, the apparatus comprising:
a synonym tag set determining unit for determining, among multiple synonym tag sets, the synonym tag set to which each tag of the tag group belongs, wherein a synonym tag set is a set consisting of a group of tags having the same or similar meaning;
a feature vector generating unit for generating a feature vector corresponding to the tag group, wherein each element of the generated feature vector corresponds to a different one of the multiple synonym tag sets, and the value of each element is the number of tags in the tag group that belong to the synonym tag set corresponding to that element;
a similarity calculating unit for calculating the similarity between the feature vector and the core feature vector of each class of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class;
a tag group classifying unit for classifying the tag group, according to the calculated similarity, into a close class among the at least one class; and
a replacing unit for replacing each tag of each tag group in the same class with the designated tag in the synonym tag set to which it belongs.
6. The apparatus according to claim 5, wherein the tag group classifying unit comprises:
a class determining unit for determining whether each class of the at least one class is the close class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold, and, if there is no close class among the at least one class, classifying the tag group into a new class.
7. The apparatus according to claim 6, wherein the class determining unit is further configured to: if there are multiple close classes, classify the tag group into the class corresponding to the maximum calculated similarity.
8. The apparatus according to any one of claims 5 to 7, wherein the similarity comprises cosine similarity.
CN201110101514.2A 2011-04-19 2011-04-19 Based on the method and apparatus that set of tags mixes data Expired - Fee Related CN102750289B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201110101514.2A CN102750289B (en) 2011-04-19 2011-04-19 Based on the method and apparatus that set of tags mixes data
JP2012079208A JP5928091B2 (en) 2011-04-19 2012-03-30 Tag group classification method, apparatus, and data mashup method, apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110101514.2A CN102750289B (en) 2011-04-19 2011-04-19 Based on the method and apparatus that set of tags mixes data

Publications (2)

Publication Number Publication Date
CN102750289A CN102750289A (en) 2012-10-24
CN102750289B true CN102750289B (en) 2015-08-05

Family

ID=47030481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110101514.2A Expired - Fee Related CN102750289B (en) 2011-04-19 2011-04-19 Based on the method and apparatus that set of tags mixes data

Country Status (2)

Country Link
JP (1) JP5928091B2 (en)
CN (1) CN102750289B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7464351B2 (en) * 2014-08-27 2024-04-09 マシューズ インターナショナル コーポレイション Media generation system and method for implementing the same
CN106202090B (en) * 2015-05-04 2020-02-07 阿里巴巴集团控股有限公司 Information processing method, information searching method, information processing device, information searching device and server
JP6366852B2 (en) * 2016-02-29 2018-08-01 三菱電機株式会社 Equipment classification device
CN107229615A (en) * 2017-07-01 2017-10-03 王亚迪 A kind of network individual or colony value see automatic discriminating conduct
US11663184B2 (en) 2017-07-07 2023-05-30 Nec Corporation Information processing method of grouping data, information processing system for grouping data, and non-transitory computer readable storage medium
CN110309294B (en) * 2018-03-01 2022-03-15 阿里巴巴(中国)有限公司 Content set label determination method and device
CN111143346B (en) * 2018-11-02 2023-08-25 北京字节跳动网络技术有限公司 Tag group variability determination method and device, electronic equipment and readable medium
CN110245265B (en) * 2019-06-24 2021-11-02 北京奇艺世纪科技有限公司 Object classification method and device, storage medium and computer equipment
CN112434722B (en) * 2020-10-23 2024-03-19 浙江智慧视频安防创新中心有限公司 Label smooth calculation method and device based on category similarity, electronic equipment and medium
CN113010737A (en) * 2021-03-25 2021-06-22 腾讯科技(深圳)有限公司 Video tag classification method and device and storage medium
CN114529772B (en) * 2022-04-19 2022-07-15 广东唯仁医疗科技有限公司 OCT three-dimensional image classification method, system, computer device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101055585A (en) * 2006-04-13 2007-10-17 Lg电子株式会社 System and method for clustering documents
CN101114295A (en) * 2007-08-11 2008-01-30 腾讯科技(深圳)有限公司 Method for searching on-line advertisement resource and device thereof
CN101984437A (en) * 2010-11-23 2011-03-09 亿览在线网络技术(北京)有限公司 Music resource individual recommendation method and system thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008084192A (en) * 2006-09-28 2008-04-10 Toshiba Corp Structured document retrieval device, structured document retrieval method and structured document retrieval program
JP4745419B2 (en) * 2009-05-15 2011-08-10 株式会社東芝 Document classification apparatus and program

Also Published As

Publication number Publication date
JP2012226740A (en) 2012-11-15
CN102750289A (en) 2012-10-24
JP5928091B2 (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN102750289B (en) Based on the method and apparatus that set of tags mixes data
Teunter et al. Dynamic lot sizing with product returns and remanufacturing
Logendran et al. Group scheduling in flexible flow shops
CN111428451B (en) Text online editing method and device, electronic equipment and storage medium
Dias et al. Layout and process optimisation: using computer-aided design (CAD) and simulation through an integrated systems design tool
CN107146095B (en) Method and device for processing display information of mail and mail system
CN112528013A (en) Text abstract extraction method and device, electronic equipment and storage medium
Gopsill et al. Investigating the effect of scale and scheduling strategies on the productivity of 3D managed print services
CN101354723B (en) Method and apparatus for implementing combined field
Shahnaghi et al. A robust modelling and optimisation framework for a batch processing flow shop production system in the presence of uncertainties
CN110837356A (en) Data processing method and device
CN106326522A (en) 3D fonts for automation of design for manufacturing
CN101719157A (en) Data filtering method, system and data processing device used for system
CN106202047A (en) A kind of character personality depicting method based on microblogging text
CN104133680A (en) Fast building method of ERP form module
CN112214602B (en) Humor-based text classification method and device, electronic equipment and storage medium
CN112115694B (en) Simulation report generation method and device based on multi-element data structure
CN104780148B (en) Server, terminal, the system and method for document on-line operation
CN114138675A (en) Interface test case generation method and device, electronic equipment and storage medium
CN114611477A (en) Design recommendation method and device for data table, electronic equipment and medium
Rosenblum The pros and cons of the'PACM'proposal: counterpoint
JP5168099B2 (en) Renovation work range division program, refurbishment work range division device, and refurbishment work range division method
CN110990256A (en) Open source code detection method, device and computer readable storage medium
Wisniewski et al. Critical Path Analysis and Linear Programming
CN117057910A (en) Visualized credit system management platform and control method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150805

Termination date: 20180419