CN102750289A - Tag group classifying method and equipment as well as data mixing method and equipment - Google Patents

Tag group classifying method and equipment as well as data mixing method and equipment

Info

Publication number
CN102750289A
CN102750289A CN2011101015142A CN201110101514A CN102750289A CN 102750289 A CN102750289 A CN 102750289A CN 2011101015142 A CN2011101015142 A CN 2011101015142A CN 201110101514 A CN201110101514 A CN 201110101514A CN 102750289 A CN102750289 A CN 102750289A
Authority
CN
China
Prior art keywords
tags
type
synonym
similarity
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101015142A
Other languages
Chinese (zh)
Other versions
CN102750289B (en)
Inventor
张军
钟朝亮
王主龙
大木宪二
田中昌弘
粂照宣
松尾昭彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201110101514.2A (CN102750289B)
Priority to JP2012079208A (JP5928091B2)
Publication of CN102750289A
Application granted
Publication of CN102750289B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and equipment for classifying tag groups, and a method and equipment for mixing data, wherein a tag group comprises at least one tag and the corresponding data defined by the at least one tag. The classifying method comprises the following steps: determining, from a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs; generating a feature vector corresponding to the tag group, wherein the elements of the feature vector correspond respectively to different synonymous tag sets among the plurality of synonymous tag sets, and the value of each element equals the number of tags in the tag group that belong to the synonymous tag set corresponding to that element; calculating the similarity between the feature vector and the core feature vector of each class among at least one class, wherein the value of each element of the core feature vector of a class equals the sum of the values of the corresponding elements in the feature vectors of all tag groups classified into that class; and classifying the tag group, according to the calculated similarity, into a close class among the at least one class.

Description

Tag group classifying method and equipment, and data mixing method and equipment
Technical field
The present invention relates to data processing, and more specifically to a method and equipment for classifying tag groups and to a method and equipment for mixing data.
Background technology
At present, there are various data format standards for describing data, for example XML (eXtensible Markup Language), JSON (JavaScript Object Notation) and CSV (Comma Separated Values). Each data format standard defines tags that describe the meaning of the data content. For example, for list-type data such as a news list containing several news items, a group of tags can be defined to describe the news content: title, pubdate (publication time), author, and so on. As another example, for a schedule list containing several schedule entries, a group of tags can be defined to describe the schedule content: starttime (start time), endtime (end time), attendees, location, and so on. With such a group of tags, data content can be published or accessed easily.
However, for data content with the same or similar meaning, different data format standards may use different tags. For example, for the data content "the person who created the data", different data format standards may use different tags such as "author", "writer" or "creator". There is therefore a need to identify data content of the same or similar meaning that is described with different tags, and to describe such content with a unified tag, thereby mixing data content of the same or similar meaning.
In the prior art, whether multiple pieces of data content are identical or similar is judged by directly comparing the data content itself. Because the volume of the data content is large, direct comparison often incurs a heavy computational load, and the accuracy of the judgment is poor.
The prior art also includes techniques that judge whether the data content described by two tags is identical or similar by comparing whether the two tags themselves are identical or similar. In practice, however, there are many data format standards, and the tags they use vary widely. If only one tag is compared with another, it is difficult to take the features of the various tags into account together, so the accuracy of the judgment is again poor.
Moreover, as described above, for a news list containing several news items, a group of tags (hereinafter called a "tag group") can be defined to describe a news item: title, pubdate (publication time), author, and so on. A piece of data content is thus generally defined by a tag group containing several tags that describe it. Therefore, to judge whether multiple pieces of data content have the same or similar meaning, the tag groups describing them should be compared as a whole. If only individual tags are compared, it is difficult to judge whether data content described by tag groups containing several tags has the same or similar meaning.
Summary of the invention
In view of the above problems, the applicant recognized that data content with the same or similar meaning should be identified by comparing whether multiple tag groups are identical or similar. The core idea of the invention is that, to compare whether multiple tag groups are identical or similar, identical or similar tag groups can first be grouped into the same class, and a newly found tag group can then be compared with the classes of tag groups already formed. Because all tag groups in a class are identical or similar to each other, a class of tag groups reflects the combined features of its tag groups. Comparing a tag group with classes of tag groups therefore allows identity or similarity between tag groups to be judged more accurately.
According to one embodiment of the present invention, a method of classifying a tag group is provided, wherein the tag group comprises at least one tag and the corresponding data defined by the at least one tag. The method comprises: determining, among a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs; generating a feature vector corresponding to the tag group, wherein each element of the feature vector corresponds to a different synonymous tag set among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element; calculating the similarity between the feature vector and the core feature vector of each class among at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class; and classifying the tag group, according to the calculated similarity, into a close class among the at least one class.
The classifying step comprises: determining, according to whether the calculated similarity between the tag group and each class among the at least one class exceeds a predetermined threshold, whether that class is a close class; and, if there is no close class among the at least one class, classifying the tag group into a new class.
In the classifying step, if there are multiple close classes, the tag group is classified into the class corresponding to the maximum calculated similarity.
The similarity comprises cosine similarity.
According to another embodiment of the present invention, equipment for classifying a tag group is provided, wherein the tag group comprises at least one tag and the corresponding data defined by the at least one tag. The equipment comprises: a synonymous tag set determining unit for determining, among a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs; a feature vector generating unit for generating a feature vector corresponding to the tag group, wherein each element of the feature vector corresponds to a different synonymous tag set among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element; a similarity calculating unit for calculating the similarity between the feature vector and the core feature vector of each class among at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class; and a tag group classifying unit for classifying the tag group, according to the calculated similarity, into a close class among the at least one class.
The tag group classifying unit comprises a class determining unit for determining, according to whether the calculated similarity between the tag group and each class among the at least one class exceeds a predetermined threshold, whether that class is a close class, and for classifying the tag group into a new class if there is no close class among the at least one class.
The class determining unit is further used for classifying the tag group, if there are multiple close classes, into the class corresponding to the maximum calculated similarity.
The similarity comprises cosine similarity.
According to another embodiment of the present invention, a method of mixing data based on tag groups is provided. The method comprises: classifying tag groups into at least one class using the above method of classifying a tag group; and replacing each tag of each tag group in the same class with a designated tag of the synonymous tag set to which that tag belongs.
According to another embodiment of the present invention, equipment for mixing data based on tag groups is provided. The equipment comprises: a classifying unit for classifying tag groups into at least one class using the above equipment for classifying a tag group; and a replacing unit for replacing each tag of each tag group in the same class with a designated tag of the synonymous tag set to which that tag belongs.
By comparing the feature vector of a tag group with the core feature vector of a class of tag groups, the present invention can judge identity or similarity between tag groups more accurately and more efficiently, and can therefore mix identical or similar data more accurately and more efficiently.
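The tag replacement in the data mixing method can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `TAG_TO_SET`, `DESIGNATED_TAG` and `unify_tags` are hypothetical, the choice of designated tag for each synonymous tag set is an assumption, and leaving unknown tags unchanged is also an assumption that the text does not state.

```python
# Hypothetical index from tag to synonymous tag set, and an assumed
# designated tag for each synonymous tag set.
TAG_TO_SET = {
    "author": "S1", "creator": "S1", "writer": "S1",
    "url": "S3", "link": "S3",
    "summary": "S4", "description": "S4",
}
DESIGNATED_TAG = {"S1": "author", "S3": "url", "S4": "summary"}


def unify_tags(record, tag_to_set=TAG_TO_SET, designated=DESIGNATED_TAG):
    """Return a copy of record with each tag replaced by its set's designated tag."""
    unified = {}
    for tag, value in record.items():
        set_id = tag_to_set.get(tag.lower())
        # Tags outside every synonymous tag set are kept unchanged (an assumption).
        unified[designated.get(set_id, tag)] = value
    return unified


# Two items whose tag groups would fall into the same class.
item_a = {"author": "Ann", "summary": "Launch notes"}
item_b = {"writer": "Bob", "description": "Release recap", "link": "http://example.com/1"}
mixed = [unify_tags(item_a), unify_tags(item_b)]
```

After replacement, both items use the same tags for the same meaning, so they can be merged into one list.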
Description of drawings
The above and other objects, features and advantages of the present invention will be understood more easily with reference to the following description of embodiments of the invention in conjunction with the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals.
Fig. 1 is a flowchart showing a method of classifying a tag group according to an embodiment of the present invention;
Fig. 2 is a flowchart showing the detailed flow of the classifying step in the method of classifying a tag group according to an embodiment of the present invention;
Fig. 3 is a block diagram showing equipment for classifying a tag group according to another embodiment of the present invention;
Fig. 4 is a flowchart showing a method of mixing data based on tag groups according to another embodiment of the present invention;
Fig. 5 is a block diagram showing equipment for mixing data based on tag groups according to another embodiment of the present invention;
Fig. 6 is a block diagram showing an exemplary structure of a computer in which the present invention is implemented.
Embodiment
The terminology used herein is only for describing particular embodiments and is not intended to limit the invention. The singular forms "a", "an" and "the" used herein are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the word "comprising", when used in this specification, specifies the presence of the stated features, integers, steps, operations, units and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, units and/or components, and/or combinations thereof.
Embodiments of the invention are described below with reference to the accompanying drawings. Note that, for clarity, representations and descriptions of components and processes that are unrelated to the invention or known to those of ordinary skill in the art are omitted from the drawings and the description. Each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the computer or other programmable data processing apparatus, create means for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.
These computer program instructions can also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to work in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means that implement the functions/operations specified in the blocks of the flowcharts and/or block diagrams.
The computer program instructions can also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable data processing apparatus, producing a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide processes for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.
It should be understood that the flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment or portion of code that comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
A method of classifying a tag group according to an embodiment of the present invention is described below with reference to Fig. 1. Fig. 1 is a flowchart showing the method of classifying a tag group according to an embodiment of the present invention.
As shown in Fig. 1, the method starts at step 100. Then, in step 102, the synonymous tag set to which each tag of the tag group belongs is determined among a plurality of synonymous tag sets.
A synonymous tag set (S) is a set consisting of a group of tags that have the same or similar meaning (that is, synonyms). As an example, the following synonymous tag sets may exist:
S1: author, creator, writer
S2: pubdate (announcement time), publishdate (publication time)
S3: URL (uniform resource locator), link
S4: summary, description
S5: event, title, what
S6: starttime (start time), when
S7: where, location
……
Sn: who, attendees
where n is an integer greater than or equal to 1.
The synonymous tag sets above are only examples; other synonymous tag sets may exist as needed. Which tags express the same or similar meaning can be determined in advance from practical experience. In addition, tags newly found during use to have the same or similar meaning can be added continually to the synonymous tag sets, so that the sets are updated dynamically. The synonymous tag sets can be provided, for example, in the form of a synonym dictionary. Those skilled in the art will understand that they can also be provided in other ways, for example as a database.
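A synonym dictionary of the kind described above might be represented as follows. This is only a sketch under stated assumptions: the names `SYNONYM_SETS`, `build_tag_index` and `add_synonym` are hypothetical, the set written as Sn in the example is given the identifier "S8" purely for illustration, and lower-casing tags before lookup is an assumption.

```python
# Hypothetical synonym dictionary: set identifier -> tags with the same or
# similar meaning. "S8" stands in for the set written as Sn in the text.
SYNONYM_SETS = {
    "S1": {"author", "creator", "writer"},
    "S2": {"pubdate", "publishdate"},
    "S3": {"url", "link"},
    "S4": {"summary", "description"},
    "S5": {"event", "title", "what"},
    "S6": {"starttime", "when"},
    "S7": {"where", "location"},
    "S8": {"who", "attendees"},
}


def build_tag_index(synonym_sets):
    """Invert the dictionary so that each tag maps to the identifier of its set."""
    index = {}
    for set_id, tags in synonym_sets.items():
        for tag in tags:
            index[tag.lower()] = set_id
    return index


def add_synonym(synonym_sets, set_id, tag):
    """Add a newly found tag to an existing set, or start a new set for it."""
    synonym_sets.setdefault(set_id, set()).add(tag.lower())


tag_index = build_tag_index(SYNONYM_SETS)

# Dynamic update: a tag found later to mean "start time" is added to S6,
# and the index is rebuilt.
add_synonym(SYNONYM_SETS, "S6", "begin")
updated_index = build_tag_index(SYNONYM_SETS)
```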
A tag group (T) is a set consisting of a group of tags, each of which defines the corresponding data in one data entry. As an example, the following tag groups may exist:
T1: title, author, pubdate (announcement time), summary
T2: title, publishdate (publication time), creator, description, URL (uniform resource locator)
T3: title, link, writer, description
T4: title, link, writer, description
T5: event, starttime (start time), endtime (end time), location, attendees
T6: title, starttime (start time), duration, where, attendees
……
Tp: what, where, who, when
where p is an integer greater than or equal to 1.
The tag groups above are only examples; other tag groups may exist in practice. For example, different data format standards (such as XML, JSON or CSV) may define different tag groups, and data publishers may also define their own tag groups according to their needs.
For a new tag group, the synonymous tag set to which each tag in the new tag group belongs can be determined from the synonymous tag sets above. For example, for tag group T1, the determination can follow the order of the tags in T1: the tag "title" belongs to synonymous tag set S5 (that is, the number of tags in T1 belonging to S5 is 1), the tag "author" belongs to S1 (the number of tags in T1 belonging to S1 is 1), the tag "pubdate" belongs to S2 (the number of tags in T1 belonging to S2 is 1), and the tag "summary" belongs to S4 (the number of tags in T1 belonging to S4 is 1). Alternatively, the determination for T1 can follow the order of the synonymous tag sets S1 to Sn: the number of tags in T1 belonging to S1 is 1, to S2 is 1, to S3 is 0, to S4 is 1, to S5 is 1, to S6 is 0, and to each of S7 through Sn is 0. In the same way, it can be determined to which of the synonymous tag sets S1 to Sn each tag of each of the tag groups T2 to Tp belongs.
The method then proceeds to step 104. In step 104, a feature vector corresponding to the tag group is generated. In the generated feature vector, each element corresponds to a different synonymous tag set among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element.
The feature vector corresponding to the tag group can be generated from the result determined in step 102. For example, for tag group T1, the result determined in the order of the tags in T1 yields the feature vector A: (S5: 1, S1: 1, S2: 1, S4: 1), where the part of each element before the colon denotes the synonymous tag set corresponding to that element, and the part after the colon denotes the number of tags in tag group T1 that belong to that synonymous tag set. For example, in the first element "S5: 1" of feature vector A, "S5" indicates that the element corresponds to synonymous tag set S5, and "1" indicates that the number of tags in T1 belonging to S5 is 1. Alternatively, for tag group T1, the result determined in the order of the synonymous tag sets S1 to Sn yields the feature vector A': (S1: 1, S2: 1, S3: 0, S4: 1, S5: 1, S6: 0, S7: 0, ..., Sn: 0), where the parts of each element have the same meaning as in feature vector A and are not described again. In the same way, a feature vector can be generated for each of the tag groups T1 to Tp.
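Steps 102 and 104 can be sketched together as follows, producing the dense form A' of the feature vector. This is an illustrative sketch, not the patented implementation: `SET_ORDER`, `TAG_TO_SET` and `feature_vector` are hypothetical names, "S8" stands in for Sn, and ignoring tags that belong to no synonymous tag set (such as endtime in the example sets) is an assumption.

```python
from collections import Counter

# Assumed fixed order of the synonymous tag sets; "S8" stands in for Sn.
SET_ORDER = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]

# Hypothetical index from tag to synonymous tag set, following the example sets.
TAG_TO_SET = {
    "author": "S1", "creator": "S1", "writer": "S1",
    "pubdate": "S2", "publishdate": "S2",
    "url": "S3", "link": "S3",
    "summary": "S4", "description": "S4",
    "event": "S5", "title": "S5", "what": "S5",
    "starttime": "S6", "when": "S6",
    "where": "S7", "location": "S7",
    "who": "S8", "attendees": "S8",
}


def feature_vector(tag_group):
    """Count, for each synonymous tag set in SET_ORDER, the tags of the group in it."""
    counts = Counter(
        TAG_TO_SET[tag.lower()]
        for tag in tag_group
        if tag.lower() in TAG_TO_SET  # tags in no set are skipped (an assumption)
    )
    return [counts[set_id] for set_id in SET_ORDER]


t1 = ["title", "author", "pubdate", "summary"]
t5 = ["event", "starttime", "endtime", "location", "attendees"]
vector_t1 = feature_vector(t1)  # [1, 1, 0, 1, 1, 0, 0, 0], matching A' in the text
vector_t5 = feature_vector(t5)  # [0, 0, 0, 0, 1, 1, 1, 1]
```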
The method then proceeds to step 106. In step 106, the similarity between the feature vector and the core feature vector of each class among at least one class is calculated, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of the tag groups classified into that class.
A class is a set consisting of a group of tag groups that are identical or similar to each other; that is, the tag groups belonging to the same class are identical or similar to one another. Whether tag groups are identical or similar can be judged, for example, from the cosine similarity between them. The calculation of the cosine similarity between tag groups is described below.
Suppose that feature vector A corresponding to tag group T1 and feature vector B corresponding to tag group T2 have been generated in step 104. Feature vector A can be expressed as (S1: f_A1, S2: f_A2, ..., Sn: f_An), abbreviated as (f_A1, f_A2, ..., f_An); feature vector B can be expressed as (S1: f_B1, S2: f_B2, ..., Sn: f_Bn), abbreviated as (f_B1, f_B2, ..., f_Bn). Here Sn denotes the synonymous tag set corresponding to the n-th element of feature vector A or B, f_An denotes the number of tags in tag group T1 that belong to synonymous tag set Sn, and f_Bn denotes the number of tags in tag group T2 that belong to synonymous tag set Sn. The cosine similarity between feature vector A of tag group T1 and feature vector B of tag group T2 can be calculated with the following formula (1):
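$$\mathrm{Similarity}(A, B) = \frac{\sum_{k=1}^{n} f_{Ak} \, f_{Bk}}{\sqrt{\left(\sum_{k=1}^{n} f_{Ak} \, f_{Ak}\right)\left(\sum_{k=1}^{n} f_{Bk} \, f_{Bk}\right)}} \qquad \text{formula (1)}$$
where 1≤k≤n and n is an integer greater than or equal to 1.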
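Formula (1) can be sketched in code as follows. The function name `cosine_similarity` is hypothetical, and returning 0.0 when either vector is all zeros is an assumption, since the text does not cover that case. The example vectors are those of tag groups T1 and T2 under the example synonymous tag sets, with eight sets assumed.

```python
import math


def cosine_similarity(a, b):
    """Formula (1): sum of products over the square root of the product of squared sums."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # A zero vector has no direction; treating the similarity as 0.0 is an assumption.
    return dot / norm if norm else 0.0


vector_a = [1, 1, 0, 1, 1, 0, 0, 0]  # T1: title, author, pubdate, summary
vector_b = [1, 1, 1, 1, 1, 0, 0, 0]  # T2: title, publishdate, creator, description, URL
similarity_ab = cosine_similarity(vector_a, vector_b)  # 4 / sqrt(4 * 5)
```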
For a class consisting of a group of tag groups, the core feature vector of the class can be obtained, for example, by summing the corresponding elements of the feature vectors of all tag groups in the class. For example, suppose that tag groups T1 to Tm (m is an integer greater than or equal to 1) have been classified into class C, and that the feature vectors corresponding to T1 to Tm are A_1 to A_m respectively. The core feature vector A_C of class C can then be expressed with the following formula (2):
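$$A_C = \left(\sum_{j=1}^{m} f_{A_j 1}, \; \sum_{j=1}^{m} f_{A_j 2}, \; \ldots, \; \sum_{j=1}^{m} f_{A_j n}\right) \qquad \text{formula (2)}$$
where 1≤j≤m and m is an integer greater than or equal to 1.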
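Formula (2) amounts to an element-wise sum over the feature vectors of the class members, which can be sketched as follows. The function name `core_feature_vector` is hypothetical, and the two member vectors are those of the example tag groups T1 and T2 with eight synonymous tag sets assumed.

```python
def core_feature_vector(member_vectors):
    """Formula (2): sum the corresponding elements of all feature vectors in a class."""
    # zip(*...) groups the k-th elements of every member vector together.
    return [sum(column) for column in zip(*member_vectors)]


members_of_c = [
    [1, 1, 0, 1, 1, 0, 0, 0],  # feature vector of T1
    [1, 1, 1, 1, 1, 0, 0, 0],  # feature vector of T2
]
core_c = core_feature_vector(members_of_c)
```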
After the core feature vector A_C of class C has been calculated with formula (2), formula (1) can be used to calculate the similarity between the feature vector A_NE of a new tag group T_NE and the core feature vector A_C of class C. If there are multiple classes, the similarity between the feature vector A_NE of the new tag group T_NE and the core feature vector of each of the classes is calculated separately.
The method then proceeds to step 108. In step 108, the tag group is classified, according to the calculated similarity, into a close class among the at least one class.
The value of the cosine similarity, calculated with formula (1), between the feature vector of a tag group and the core feature vector of a class expresses the degree of similarity between the tag group and the class: the larger the cosine similarity, the more similar the tag group and the class. Whether a tag group is similar to a class can therefore be judged from the calculated similarity, and the tag group can be classified into a close (that is, similar) class.
Finally, the method proceeds to step 110, where it ends.
The overall flow of the method of classifying a tag group according to an embodiment of the present invention has been described above. The detailed flow of the classifying step in that method is described below with reference to Fig. 2. Fig. 2 is a flowchart showing the detailed flow of the classifying step in the method of classifying a tag group according to an embodiment of the present invention.
As shown in Fig. 2, after the similarity between the feature vector of the tag group and the core feature vector of each of the classes has been calculated in step 106, the method proceeds to step 200. In step 200, the calculated similarity between the tag group and each class among the at least one class is compared with a predetermined threshold. The predetermined threshold can be set in advance as needed and adjusted as needed during use. Adjusting the threshold controls the precision with which tag groups are classified.
Suppose that three classes of tag groups currently exist, denoted C1, C2 and C3, with core feature vectors A_1, A_2 and A_3 respectively. When a new tag group T_NE is found, its feature vector A_NE is determined, and the similarity between A_NE and each of the core feature vectors A_1, A_2 and A_3 is calculated. For example, when cosine similarity is used, the calculated similarities might be 0.92, 0.85 and 0.79 respectively. These values are then each compared with the predetermined threshold.
The method then proceeds to step 202. In step 202, it is judged whether the calculated similarity between the tag group and any class among the at least one class exceeds the predetermined threshold. If the result of step 202 is "No", that is, the tag group is dissimilar to all of the classes, the method proceeds to step 206. In step 206, the tag group is classified into a new class, so that the new class contains this tag group.
In the example above, suppose the predetermined threshold is 0.93. Because none of the three calculated similarities 0.92, 0.85 and 0.79 exceeds the threshold 0.93, the new tag group T_NE is dissimilar to all of the current classes C1, C2 and C3. In this case, a new class C4 can be created and the new tag group T_NE classified into it, so that C4 contains T_NE.
If the judged result of step 202 is " being ", then proceed to step 204.In step 204, judge whether to have a plurality of greater than the pairing class of the similarity of predetermined threshold, judge that promptly whether similarity between set of tags and a plurality of classes is all greater than predetermined threshold.If the judged result of step 204 is " denying ", the expression set of tags only and the similarity between some type greater than predetermined threshold, promptly the number greater than the similarity of predetermined threshold is 1, then proceeds to step 210.In step 210, set of tags is categorized in pairing that type of unique similarity above predetermined threshold that is calculated.
In the above example, suppose that predetermined threshold is 0.90.Owing in the value 0.92,0.85 and 0.79 of above-mentioned 3 similarities that calculated, only have the value 0.92 of similarity to surpass predetermined threshold 0.90, therefore with new set of tags T NEBe categorized into 0.92 pairing type of C of value of above-mentioned similarity 1In.
If the judged result of step 204 is " being ", the similarity between expression set of tags and a plurality of class is greater than predetermined threshold, and promptly the number greater than the similarity of predetermined threshold is a plurality of, then proceeds in the step 208.In step 208, select greater than similarity maximum in a plurality of similarities of predetermined threshold, and set of tags is categorized in pairing that type of selected maximum similarity.
In the above example, suppose the predetermined threshold is 0.80. Among the three calculated similarity values 0.92, 0.85, and 0.79, the values 0.92 and 0.85 both exceed the predetermined threshold of 0.80. The largest of these, 0.92, is therefore selected, and the new tag group T_NE is classified into class C_1, which corresponds to this maximum similarity value.
After steps 206, 208, and 210, the method proceeds to step 212. In step 212, the method ends.
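The threshold decision described above can be sketched as follows. This is only an illustrative sketch: the function name `choose_class` and the use of `None` to signal "create a new class" are assumptions, not part of the described method, and the mapping of the no-match branch to step 206 is inferred from the step list. The similarity values are the ones from the example.

```python
from typing import Dict, Optional


def choose_class(similarities: Dict[str, float], threshold: float) -> Optional[str]:
    """Pick the class for a tag group, or return None if a new class is needed.

    `similarities` maps a class name to the similarity between the tag group
    and that class (hypothetical representation for this sketch).
    """
    # Keep only the classes whose similarity exceeds the threshold.
    candidates = {name: s for name, s in similarities.items() if s > threshold}
    if not candidates:
        # No sufficiently similar class: the caller creates a new class.
        return None
    # One candidate: max() returns it. Several candidates: max() returns
    # the one with the largest similarity.
    return max(candidates, key=lambda name: candidates[name])


# Similarities of the new tag group to C_1, C_2, C_3 from the example.
sims = {"C_1": 0.92, "C_2": 0.85, "C_3": 0.79}
print(choose_class(sims, 0.93))  # no class exceeds 0.93
print(choose_class(sims, 0.90))  # only C_1 exceeds 0.90
print(choose_class(sims, 0.80))  # C_1 and C_2 exceed 0.80; C_1 is larger
```

With the three thresholds of the example, the sketch yields a new class, C_1, and C_1 respectively, matching the outcomes described in the text.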
In the above, cosine similarity is used to calculate the similarity between tag groups and the similarity between a tag group and a class composed of tag groups. However, those skilled in the art will understand that other similarity calculation methods may also be used, as long as they can calculate the similarity between tag groups or between a tag group and a class composed of tag groups.
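As a hedged illustration of how the similarity measure might be made interchangeable, the sketch below defines cosine similarity alongside one possible alternative. The alternative shown (a weighted Jaccard measure) is chosen here purely as an example and is not named in the text. The function names and the sample vectors are assumptions.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_jaccard(a: Vector, b: Vector) -> float:
    """Sum of element-wise minima divided by sum of element-wise maxima.

    Example alternative only; it assumes non-negative counts.
    """
    den = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / den if den else 0.0


def most_similar(vec: Vector, cores: Dict[str, Vector],
                 similarity: Callable[[Vector, Vector], float] = cosine_similarity) -> str:
    """Name of the class whose core feature vector is most similar to `vec`."""
    return max(cores, key=lambda name: similarity(vec, cores[name]))


# Hypothetical feature vector and core feature vectors.
vec = [1, 1, 0]
cores = {"C_1": [3, 2, 0], "C_2": [0, 1, 4]}
print(most_similar(vec, cores))                    # using cosine similarity
print(most_similar(vec, cores, weighted_jaccard))  # using the alternative
```

Passing the measure as a parameter keeps the rest of the classification logic unchanged when a different similarity is substituted.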
In the above, the number of tag groups included in a class increases dynamically. After a tag group is classified into a class by the above method for classifying tag groups, the number of tag groups included in that class increases by one. Preferably, after a new tag group is classified into a class, the core feature vector of that class is recalculated with the above formula (2) from the new tag group and all the tag groups already included in the class, and the recalculated vector is used as the new core feature vector of the class. Subsequently, when another tag group is to be classified, its similarity is compared against the new core feature vector of the class. The method of this embodiment can therefore take the various features of different tag groups into account comprehensively, so that whether tag groups are identical or similar can be determined more accurately and more effectively.
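The update of the core feature vector can be sketched as below, assuming that formula (2) is the element-wise sum of the feature vectors of all tag groups in the class, as stated in the claims. Under that assumption, recomputing from all members and adding only the new vector to the existing core vector give the same result. The function names and sample vectors are hypothetical.

```python
from typing import List


def core_vector(member_vectors: List[List[int]]) -> List[int]:
    """Recompute the core feature vector as the element-wise sum of all members."""
    return [sum(column) for column in zip(*member_vectors)]


def add_member(core: List[int], new_vector: List[int]) -> List[int]:
    """Incremental update: add the new tag group's feature vector to the core."""
    return [c + v for c, v in zip(core, new_vector)]


# Hypothetical class with two tag groups already classified into it.
members = [[1, 0, 2], [0, 1, 1]]
core = core_vector(members)

# A new tag group is classified into the class.
new_vector = [1, 1, 0]
members.append(new_vector)

recomputed = core_vector(members)
incremental = add_member(core, new_vector)
print(recomputed, incremental)
```

The incremental form avoids revisiting every member when a class grows, which may matter when classes contain many tag groups.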
A device for classifying tag groups according to another embodiment of the present invention is described below with reference to Fig. 3. Fig. 3 is a block diagram showing the device for classifying tag groups according to another embodiment of the present invention.
As shown in Fig. 3, the device 312 for classifying tag groups mainly comprises a synonymous tag set determination unit 300, a feature vector generation unit 302, a similarity calculation unit 304, and a tag group classification unit 306. The synonymous tag set determination unit 300 determines, from the plurality of synonymous tag sets stored in the synonymous tag set database 308, the synonymous tag set to which each tag of the input tag group belongs. The feature vector generation unit 302 generates a feature vector corresponding to the input tag group. In the generated feature vector, the elements respectively correspond to different synonymous tag sets among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element. The similarity calculation unit 304 calculates the similarity between the feature vector and the core feature vector of each class of the at least one class stored in the class set database 310, where the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of all tag groups classified into that class. The tag group classification unit 306 classifies the input tag group, according to the calculated similarity, into a similar class among the at least one class stored in the class set database 310.
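The work of the synonymous tag set determination unit and the feature vector generation unit can be sketched as follows. The synonymous tag sets and tag names used here are invented for illustration, and ignoring a tag that belongs to no synonymous tag set is an assumption of this sketch rather than something the text specifies.

```python
from typing import List, Sequence, Set


def feature_vector(tags: Sequence[str], synonym_sets: Sequence[Set[str]]) -> List[int]:
    """Count, for each synonymous tag set, how many tags of the group belong to it."""
    vector = [0] * len(synonym_sets)
    for tag in tags:
        for index, synonyms in enumerate(synonym_sets):
            if tag in synonyms:
                vector[index] += 1
                break  # a tag is counted in the first set that contains it
    return vector


# Hypothetical synonymous tag sets; element i of the vector corresponds to set i.
synonym_sets = [
    {"title", "name"},
    {"author", "creator"},
    {"price", "cost"},
]

# A hypothetical tag group that uses two tags from the first set.
vector = feature_vector(["title", "name", "creator"], synonym_sets)
print(vector)
```

Because every tag group is mapped onto the same list of synonymous tag sets, the resulting vectors all have the same length and can be compared directly.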
The tag group classification unit 306 includes a class determination unit 3062. The class determination unit 3062 determines whether each class of the at least one class is a similar class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold. If there is no similar class among the at least one class, the class determination unit 3062 classifies the tag group into a new class. If there are a plurality of similar classes, the class determination unit 3062 classifies the tag group into the class corresponding to the maximum calculated similarity.
Those skilled in the art will understand that the plurality of synonymous tag sets may also be provided in other ways, such as a synonymous tag set dictionary, and that the classes may likewise be provided in other ways. The synonymous tag set database 308 and the class set database 310 are stored in a storage unit 314. The storage unit 314 is, for example, a magnetic disk, flash memory, removable memory, or the like. The storage unit 314 may be included in the device 312 for classifying tag groups, or may be located outside the device 312 and attached to it in a wired or wireless manner.
Cosine similarity may be used to calculate the similarity between tag groups and the similarity between a tag group and a class composed of tag groups. However, those skilled in the art will understand that other similarity calculation methods may also be used, as long as they can calculate the similarity between tag groups or between a tag group and a class composed of tag groups.
The device 312 for classifying tag groups corresponds to the method for classifying tag groups described above. Its detailed description is therefore omitted here.
A method for mixing data based on tag groups is described below with reference to Fig. 4. Fig. 4 is a flowchart showing the method for mixing data based on tag groups.
As shown in Fig. 4, the method starts at step 400 and then proceeds to step 402. In step 402, the tag groups are classified into at least one class using the above method for classifying tag groups. In this way, tag groups that conform to different data format standards, different user-defined tag groups, and the like can be dynamically divided into different classes according to their similarity to one another, and the tag groups within each class are similar to one another.
The method then proceeds to step 404. In step 404, each tag of each tag group in the same class is replaced with a designated tag in the synonymous tag set to which it belongs. After the tag groups have been divided into different classes in step 402, each tag of each tag group in the same class can be replaced with a unified tag, so that the similar tag groups in the same class are unified into an identical tag group. The data previously described by the individual similar tag groups is then re-described with the resulting identical tag group, thereby mixing data that has similar content meaning.
The replacement of each tag of each tag group in the same class can be carried out in various ways. For example, each tag of each tag group in the same class can be replaced with a designated tag in the synonymous tag set to which it belongs, where the designated tag may be, for example, the first tag or the last tag in that synonymous tag set. Alternatively, for all tag groups in the same class, the frequency of use of each synonymous tag in the synonymous tag set to which each tag belongs can be counted, and the most frequently used synonymous tag taken as the designated tag. Those skilled in the art will understand that other methods may also be used to carry out the replacement, as long as the designated tags after replacement define the corresponding data uniformly.
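The two replacement strategies mentioned above (first tag of the synonymous tag set, or most frequently used tag) can be sketched as follows. All names and sample data are hypothetical. Synonymous tag sets are represented as ordered lists so that "first tag" is well defined, and breaking frequency ties in favor of the earlier tag is an assumption of this sketch.

```python
from collections import Counter
from typing import Callable, List


def first_tag(synonyms: List[str]) -> str:
    """Designated tag = first tag of the synonymous tag set."""
    return synonyms[0]


def most_frequent_tag(synonyms: List[str], tag_groups: List[List[str]]) -> str:
    """Designated tag = synonym used most often in the class's tag groups."""
    counts = Counter(tag for group in tag_groups for tag in group)
    # max() keeps the earliest tag of the set when frequencies tie.
    return max(synonyms, key=lambda tag: counts[tag])


def unify(tag_groups: List[List[str]], synonym_sets: List[List[str]],
          pick: Callable[[List[str]], str]) -> List[List[str]]:
    """Replace every tag with the designated tag of its synonymous tag set."""
    mapping = {tag: pick(synonyms) for synonyms in synonym_sets for tag in synonyms}
    # Tags outside every synonymous tag set are left unchanged (assumption).
    return [[mapping.get(tag, tag) for tag in group] for group in tag_groups]


# Hypothetical class of three similar tag groups.
synonym_sets = [["title", "name"], ["author", "creator"]]
groups = [["name", "creator"], ["name", "author"], ["title", "creator"]]

by_first = unify(groups, synonym_sets, first_tag)
by_frequency = unify(groups, synonym_sets,
                     lambda synonyms: most_frequent_tag(synonyms, groups))
print(by_first)
print(by_frequency)
```

With either strategy every tag group in the class ends up with the same tags, so the data they describe can be merged under one set of names.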
The method then proceeds to its final step, in which the method ends.
A device for mixing data based on tag groups is described below with reference to Fig. 5. Fig. 5 is a block diagram showing the device for mixing data based on tag groups.
As shown in Fig. 5, the device 501 for mixing data based on tag groups mainly comprises a classification unit 503 and a replacement unit 505. The classification unit 503 uses the above device for classifying tag groups to classify the tag groups in the input data into at least one class. The replacement unit 505 replaces each tag of each tag group in the same class with a designated tag in the synonymous tag set to which it belongs, so that the similar tag groups in the same class are unified into an identical tag group, and the input data is re-described with the resulting identical tag group, thereby mixing data that has similar content meaning.
The device 501 for mixing data based on tag groups corresponds to the method for mixing data based on tag groups described above. Its detailed description is therefore omitted here.
Fig. 6 is a block diagram showing an exemplary structure of a computer in which the device and method of the present invention are implemented.
In Fig. 6, a central processing unit (CPU) 601 performs various processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, the data required when the CPU 601 performs the various processing.
The CPU 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are connected to the input/output interface 605: an input section 606, including a keyboard, a mouse, and the like; an output section 607, including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; the storage section 608, including a hard disk and the like; and a communication section 609, including a network interface card such as a LAN card, a modem, and the like. The communication section 609 performs communication processing via a network such as the Internet.
A drive 610 is also connected to the input/output interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 610 as needed, so that a computer program read from it is installed into the storage section 608 as needed.
When the above steps and processing are implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 611 include magnetic disks, optical disks (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical disks (including MiniDisc (MD)), and semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, which stores the program and is distributed to the user together with the device that contains it.
The present invention has been described above with reference to specific embodiments. However, those of ordinary skill in the art will understand that various modifications and changes can be made without departing from the scope of the invention as defined by the claims.

Claims (10)

1. A method for classifying a tag group, wherein the tag group comprises at least one tag and corresponding data defined by the at least one tag, the method comprising:
determining, among a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs;
generating a feature vector corresponding to the tag group, wherein the elements of the generated feature vector respectively correspond to different synonymous tag sets among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element;
calculating the similarity between the feature vector and the core feature vector of each class of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of all tag groups classified into that class; and
classifying the tag group, according to the calculated similarity, into a similar class among the at least one class.
2. The method according to claim 1, wherein the classifying step comprises:
determining whether each class of the at least one class is the similar class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold; and
if there is no similar class among the at least one class, classifying the tag group into a new class.
3. The method according to claim 2, wherein, if there are a plurality of similar classes, the tag group is classified into the class corresponding to the maximum calculated similarity.
4. The method according to any one of claims 1 to 3, wherein the similarity comprises cosine similarity.
5. A device for classifying a tag group, wherein the tag group comprises at least one tag and corresponding data defined by the at least one tag, the device comprising:
a synonymous tag set determination unit for determining, among a plurality of synonymous tag sets, the synonymous tag set to which each tag of the tag group belongs;
a feature vector generation unit for generating a feature vector corresponding to the tag group, wherein the elements of the generated feature vector respectively correspond to different synonymous tag sets among the plurality of synonymous tag sets, and the value of each element is the number of tags in the tag group that belong to the synonymous tag set corresponding to that element;
a similarity calculation unit for calculating the similarity between the feature vector and the core feature vector of each class of at least one class, wherein the value of each element of the core feature vector of a class is the sum of the values of the corresponding elements in the feature vectors of all tag groups classified into that class; and
a tag group classification unit for classifying the tag group, according to the calculated similarity, into a similar class among the at least one class.
6. The device according to claim 5, wherein the tag group classification unit comprises:
a class determination unit for determining whether each class of the at least one class is the similar class according to whether the calculated similarity between the tag group and that class exceeds a predetermined threshold, and, if there is no similar class among the at least one class, classifying the tag group into a new class.
7. The device according to claim 6, wherein the class determination unit is further configured to: if there are a plurality of similar classes, classify the tag group into the class corresponding to the maximum calculated similarity.
8. The device according to any one of claims 5 to 7, wherein the similarity comprises cosine similarity.
9. A method for mixing data based on tag groups, the method comprising:
classifying the tag groups into at least one class using the method for classifying a tag group according to any one of claims 1 to 4; and
replacing each tag of each tag group in the same class with a designated tag in the synonymous tag set to which it belongs.
10. A device for mixing data based on tag groups, the device comprising:
a classification unit for classifying the tag groups into at least one class using the device for classifying a tag group according to any one of claims 5 to 8; and
a replacement unit for replacing each tag of each tag group in the same class with a designated tag in the synonymous tag set to which it belongs.
CN201110101514.2A 2011-04-19 2011-04-19 Method and device for mixing data based on tag groups Expired - Fee Related CN102750289B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201110101514.2A CN102750289B (en) 2011-04-19 2011-04-19 Method and device for mixing data based on tag groups
JP2012079208A JP5928091B2 (en) 2011-04-19 2012-03-30 Tag group classification method, apparatus, and data mashup method, apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110101514.2A CN102750289B (en) 2011-04-19 2011-04-19 Method and device for mixing data based on tag groups

Publications (2)

Publication Number Publication Date
CN102750289A true CN102750289A (en) 2012-10-24
CN102750289B CN102750289B (en) 2015-08-05

Family

ID=47030481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110101514.2A Expired - Fee Related CN102750289B (en) 2011-04-19 2011-04-19 Method and device for mixing data based on tag groups

Country Status (2)

Country Link
JP (1) JP5928091B2 (en)
CN (1) CN102750289B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160063576A1 (en) * 2014-08-27 2016-03-03 Sgk Media generation system and methods of performing the same related applications
CN107229615A (en) * 2017-07-01 2017-10-03 王亚迪 A kind of network individual or colony value see automatic discriminating conduct
WO2019008961A1 (en) * 2017-07-07 2019-01-10 日本電気株式会社 Information processing device, information processing method, and program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101055585A (en) * 2006-04-13 2007-10-17 Lg电子株式会社 System and method for clustering documents
CN101114295A (en) * 2007-08-11 2008-01-30 腾讯科技(深圳)有限公司 Method for searching on-line advertisement resource and device thereof
CN101984437A (en) * 2010-11-23 2011-03-09 亿览在线网络技术(北京)有限公司 Music resource individual recommendation method and system thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008084192A (en) * 2006-09-28 2008-04-10 Toshiba Corp Structured document retrieval device, structured document retrieval method and structured document retrieval program
JP4745419B2 (en) * 2009-05-15 2011-08-10 株式会社東芝 Document classification apparatus and program

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202090B (en) * 2015-05-04 2020-02-07 阿里巴巴集团控股有限公司 Information processing method, information searching method, information processing device, information searching device and server
CN106202090A (en) * 2015-05-04 2016-12-07 阿里巴巴集团控股有限公司 A kind of information processing, searching method and device, server
CN108700872A (en) * 2016-02-29 2018-10-23 三菱电机株式会社 Machine sort device
CN110309294A (en) * 2018-03-01 2019-10-08 优酷网络技术(北京)有限公司 The label of properties collection determines method and device
CN111143346B (en) * 2018-11-02 2023-08-25 北京字节跳动网络技术有限公司 Tag group variability determination method and device, electronic equipment and readable medium
CN111143346A (en) * 2018-11-02 2020-05-12 北京字节跳动网络技术有限公司 Method and device for determining difference of tag group, electronic equipment and readable medium
CN110245265A (en) * 2019-06-24 2019-09-17 北京奇艺世纪科技有限公司 A kind of object classification method, device, storage medium and computer equipment
CN112434722A (en) * 2020-10-23 2021-03-02 浙江智慧视频安防创新中心有限公司 Label smooth calculation method and device based on category similarity, electronic equipment and medium
CN112434722B (en) * 2020-10-23 2024-03-19 浙江智慧视频安防创新中心有限公司 Label smooth calculation method and device based on category similarity, electronic equipment and medium
CN113010737A (en) * 2021-03-25 2021-06-22 腾讯科技(深圳)有限公司 Video tag classification method and device and storage medium
CN113010737B (en) * 2021-03-25 2024-04-30 腾讯科技(深圳)有限公司 Video tag classification method, device and storage medium
CN114529772A (en) * 2022-04-19 2022-05-24 广东唯仁医疗科技有限公司 OCT three-dimensional image classification method, system, computer device and storage medium
CN114529772B (en) * 2022-04-19 2022-07-15 广东唯仁医疗科技有限公司 OCT three-dimensional image classification method, system, computer device and storage medium

Also Published As

Publication number Publication date
CN102750289B (en) 2015-08-05
JP5928091B2 (en) 2016-06-01
JP2012226740A (en) 2012-11-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150805

Termination date: 20180419