WO2017071370A1 - Label processing method and device - Google Patents

Label processing method and device

Publication number
WO2017071370A1
Authority
WO
WIPO (PCT)
Prior art keywords
similar
processed
tag
word
words
Prior art date
Application number
PCT/CN2016/094417
Other languages
French (fr)
Chinese (zh)
Inventor
张传武
梅峰
李晓明
邢加和
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2017071370A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools

Definitions

  • The present application relates to the field of electronic information, and in particular to a label processing method and apparatus.
  • A tag (label) marks the classification or content of a target. It is a way of organizing content and a special kind of metadata: a summary of the labeling party's subjective impression of a resource, used by users to describe and classify resources so that they are easier to search and share.
  • The labels currently output by a label system may be similar words; that is, the label system may use different but similar labels to represent the same category or content.
  • For example, the label that a label system outputs for a garment is sometimes “fashion” and sometimes “fashionable”.
  • The present application provides a label processing method and apparatus, with the aim of solving the problem of how to use a uniform label to express the same category or content.
  • A first aspect of the present application provides a label processing method, including:
  • acquiring a tag to be processed;
  • selecting, from a similar vocabulary, the similar word set to which the to-be-processed tag belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words;
  • using the representative word of the similar word set to which the to-be-processed tag belongs as the substitute tag for the to-be-processed tag.
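  • The specific process by which the representative word is selected from the member words includes:
  • counting the number of times each member word has historically been a tag to be processed;
  • computing the inverse document frequency (IDF) value of each member word;
  • for each member word, calculating the product of that count and its IDF value;
  • using the member word with the largest product as the representative word.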
  • Selecting the similar word set to which the to-be-processed tag belongs from the similar vocabulary includes:
  • searching the similar vocabulary for the to-be-processed tag;
  • if the similar vocabulary includes the to-be-processed tag, using the similar word set that includes the tag as the similar word set to which the tag belongs;
  • if the similar vocabulary does not include the to-be-processed tag, searching the similar vocabulary for similar words of the tag, and using the similar word set that includes those similar words as the similar word set to which the tag belongs.
  • Searching the similar vocabulary for the similar words of the to-be-processed tag includes:
  • finding first-type related words in the corpus, the first-type related words being words whose frequency of appearing together with the to-be-processed tag in one corpus document is greater than a first threshold;
  • finding second-type related words in the corpus, the second-type related words being words whose frequency of appearing together with the first-type related words in one corpus document is greater than a second threshold;
  • counting the frequency with which each second-type related word appears together with the to-be-processed tag in one corpus document, and using the second-type related words whose counted frequency is less than a third threshold as the similar words of the to-be-processed tag.
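  • After the similar words of the to-be-processed tag are searched for in the similar vocabulary, the method further includes:
  • adding the to-be-processed tag to the similar word set that includes the similar words of the tag.
  • In a fifth implementation of the first aspect, the method further includes:
  • if no similar word set to which the to-be-processed tag belongs exists in the similar vocabulary, adding the tag to the similar vocabulary, where the tag becomes the member word and the representative word of a new similar word set in the similar vocabulary.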
  • A second aspect of the present application provides a label processing apparatus, including:
  • an obtaining module, configured to obtain a tag to be processed;
  • a processing module, configured to select, from a similar vocabulary, the similar word set to which the to-be-processed tag belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words;
  • a substitution module, configured to use the representative word of the similar word set to which the to-be-processed tag belongs as the substitute tag for the to-be-processed tag.
  • The apparatus further includes:
  • a determining module, configured to: count the number of times each member word has historically been a tag to be processed; compute the inverse document frequency (IDF) value of each member word; for each member word, calculate the product of that count and its IDF value; and use the member word with the largest product as the representative word.
  • The processing module being configured to select, from a similar vocabulary, the similar word set to which the to-be-processed tag belongs includes:
  • the processing module is specifically configured to: search the similar vocabulary for the to-be-processed tag; if the similar vocabulary includes the tag, use the similar word set that includes the tag as the similar word set to which the tag belongs; if the similar vocabulary does not include the tag, search the similar vocabulary for similar words of the tag, and use the similar word set that includes those similar words as the similar word set to which the tag belongs.
  • The processing module being configured to search the similar vocabulary for the similar words of the to-be-processed tag includes:
  • the processing module is specifically configured to: find first-type related words in the corpus, the first-type related words being words whose frequency of appearing together with the to-be-processed tag in one corpus document is greater than a first threshold; find second-type related words in the corpus, the second-type related words being words whose frequency of appearing together with the first-type related words in one corpus document is greater than a second threshold; and count the frequency with which each second-type related word appears together with the to-be-processed tag in one corpus document, using the second-type related words whose counted frequency is less than a third threshold as the similar words of the to-be-processed tag.
  • The processing module is further configured to:
  • add the to-be-processed tag to the similar word set that includes the similar words of the tag.
  • The processing module is further configured to:
  • add the to-be-processed tag to the similar vocabulary, where the tag becomes the member word and the representative word of a new similar word set in the similar vocabulary.
  • The label processing method and device of the present application acquire a tag to be processed, select the similar word set to which the tag belongs from a similar vocabulary that includes similar word sets, and use the representative word of that set as the substitute tag for the tag to be processed. Because every similar word set in the similar vocabulary includes member words and a representative word selected from those member words, replacing the tag with the representative word of its similar word set means that different but similar tags end up with the same substitute tag (that is, the representative word of their similar word set). Configuring and maintaining the similar word sets in the similar vocabulary therefore achieves the purpose of expressing the same classification or content with a uniform label.
  • FIG. 1 is a flowchart of a label processing method according to an embodiment of the present application;
  • FIG. 2 is a flowchart of another label processing method according to an embodiment of the present application;
  • FIG. 3 is a flowchart of the specific process of searching a similar vocabulary for similar words of a tag to be processed, according to an embodiment of the present application;
  • FIG. 4 is a flowchart of the specific process of selecting a representative word from the member words, according to an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a label processing apparatus according to an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a label processing device according to an embodiment of the present application.
  • The embodiments of the present application can be applied to scenarios in which a tag is set or output.
  • For example, a server outputs a label for an Internet site, or a user of an operator is marked with a behavior-preference tag.
  • The label processing method described in the embodiments may be executed by a server, for example a server that sets or outputs labels.
  • A label processing method disclosed in an embodiment of the present application, as shown in FIG. 1, includes the following steps:
  • S101: Acquire a tag to be processed. The tag to be processed can be a tag set by a user or a tag generated by a tag system.
  • S102: Determine whether a similar word set to which the tag to be processed belongs can be selected from a similar vocabulary that includes similar word sets. If yes, execute S103; if no, execute S104.
  • The similar word set to which the tag to be processed belongs is a similar word set that includes either the tag itself or similar words of the tag.
  • Every similar word set in the similar vocabulary includes member words and a representative word, and the representative word of a set is selected from the member words of that set.
  • For example, the following is part of a similar vocabulary that includes four similar word sets numbered 0001 to 0004, each of which includes member words and a representative word. In the similar word set numbered 0001, “Model”, “Madou” and “model” are all member words, and “Model” is the representative word.
  • Group number | Member word 1 | Member word 2 | Member word 3 | ... | Representative word
    0001 | Model | Madou | Model | ... | Model
    0002 | joke | cold joke | Paragraph | ... | joke
    0003 | fashion | fashionable | Avantgarde | ... | fashion
    0004 | Sweet potato | sweet potato | Sweet potato | ... | sweet potato
    ... | ... | ... | ... | ... | ...
  • Any one of the member words of a set may be chosen arbitrarily as the representative word of that set, or the representative word may be selected from the member words of the set according to other rules. Specific rules are described in the following embodiments.
  • Similar words are words with similar meanings, that is, synonyms; for their precise definition, refer to the prior art.
  • S103: Use the representative word of the similar word set to which the tag to be processed belongs as the substitute tag for the tag to be processed.
  • For example, if the label to be processed is “Madou”, the similar word set numbered 0001 includes “Madou”, so the representative word “Model” of set 0001 is used as the substitute label.
  • S104: Add the tag to be processed to the similar vocabulary as the member word and representative word of a new similar word set.
  • If neither the label to be processed nor any of its similar words is found in the similar vocabulary, the vocabulary contains no word with the same or a similar meaning. In this case, the tag to be processed is added to the similar vocabulary as a new word and becomes a member word of a new similar word set. Because the new set contains no other words, the tag to be processed also serves as the representative word of that set.
  • For example, after the tag “art” is added, the similar vocabulary becomes:
  • Group number | Member word 1 | Member word 2 | Member word 3 | ... | Representative word
    0001 | Model | Madou | Model | ... | Model
    0002 | joke | cold joke | Paragraph | ... | joke
    0003 | fashion | fashionable | Avantgarde | ... | fashion
    0004 | Sweet potato | sweet potato | Sweet potato | ... | sweet potato
    0005 | art | ... | ... | ... | art
  • The purpose of S104 is to continuously enrich the words in the similar vocabulary, thus laying the foundation for the unification of labels.
  • The label processing method of this embodiment finds a home set for the label to be processed and replaces the label with the representative word of that set. The same feature of the same category or content can therefore always be described with the same tag, which avoids the problem of multiple synonyms describing the same feature.
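  • The flow of S101 to S104 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the dictionary layout, the name `process_label`, and the four-digit numbering of new sets are hypothetical, the sample entries are simplified from the example table, and only an exact membership lookup is shown (the search for similar words is left out).

```python
from typing import Dict, List, TypedDict


class SimilarWordSet(TypedDict):
    members: List[str]
    representative: str  # always one of `members`


def process_label(label: str, vocabulary: Dict[str, SimilarWordSet]) -> str:
    """Return the substitute label for `label`, extending `vocabulary` if needed."""
    # S102: look for a similar word set that contains the label.
    for word_set in vocabulary.values():
        if label in word_set["members"]:
            # S103: the representative word replaces the label.
            return word_set["representative"]
    # S104: no set found, so the label starts a new set and represents it.
    new_id = f"{len(vocabulary) + 1:04d}"  # hypothetical numbering scheme
    vocabulary[new_id] = {"members": [label], "representative": label}
    return label


# Toy similar vocabulary (illustrative entries only).
vocabulary: Dict[str, SimilarWordSet] = {
    "0001": {"members": ["Model", "Madou"], "representative": "Model"},
    "0002": {"members": ["joke", "cold joke"], "representative": "joke"},
    "0003": {"members": ["fashion", "fashionable", "Avantgarde"],
             "representative": "fashion"},
    "0004": {"members": ["sweet potato"], "representative": "sweet potato"},
}

print(process_label("Madou", vocabulary))  # label already belongs to set 0001
print(process_label("art", vocabulary))    # unknown label, becomes a new set
```

  • Keeping the representative word inside each set means a lookup returns the substitute label directly, and an unknown label can be registered without touching existing sets.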
  • Another label processing method disclosed in an embodiment of the present application differs from the above embodiment in that it focuses on how to determine the similar words of a label to be processed in a similar vocabulary. As shown in FIG. 2, the label processing method of this embodiment includes the following steps:
  • S201: Acquire a label to be processed.
  • S202: Search the similar vocabulary for the label to be processed. If the similar vocabulary includes the label, perform S203 and S206; otherwise, perform S204.
  • S203: Use the similar word set that includes the label to be processed as the similar word set to which the label belongs.
  • S204: Search the similar vocabulary for similar words of the label to be processed. The specific process is shown in FIG. 3 and includes the following steps:
  • S301: Find in the corpus the words whose frequency of appearing together with the label to be processed in one corpus document is greater than the first threshold, and record them as first-type related words.
  • The corpus can be a text corpus obtained by crawling Internet websites with a crawler.
  • The first-type related words may be determined from the corpus with an existing association algorithm; a first-type related word is a word related in meaning to the label to be processed.
  • S302: Find in the corpus the words whose frequency of appearing together with the first-type related words in one corpus document is greater than the second threshold, and record them as second-type related words.
  • The frequency with which related word A appears together with related word B in one corpus document = the number of times both occur together / the total number of occurrences of related word A.
  • The frequency with which related word B appears together with related word A in one corpus document = the number of times both occur together / the total number of occurrences of related word B.
  • S303: Count the frequency with which each second-type related word appears together with the label to be processed in one corpus document.
  • S304: Use the second-type related words whose counted frequency is less than the third threshold as the similar words of the label to be processed.
  • In this embodiment, the words related in meaning to the label to be processed, that is, the first-type related words, are obtained first. Because the first-type related words are related to the label to be processed, the similar words of the label may also be related to the first-type related words.
  • For example, if the label to be processed is “model” and a first-type related word is “catwalk”, then “catwalk” is also a related word of “Madou”, which is similar to “model”.
  • The related words of the first-type related words may therefore include similar words of the label to be processed. A second pass of corpus analysis finds the related words of the first-type related words, that is, the second-type related words, with the aim of finding the similar words of the label to be processed.
  • For example, corpus word segmentation yields the first-type related words {“catwalk”, “fashion”, “girl”, “magazine”, “model”}.
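  • The steps S301 to S304 can be sketched in Python as follows. This is a minimal sketch, not the implementation of the application. It assumes that each corpus document is already segmented into a set of words, that occurrences are counted per document, and that the frequency of A with B is (documents containing both) / (documents containing A). The function names, the thresholds, and the toy corpus are hypothetical.

```python
from typing import List, Set


def cooccurrence_frequency(word_a: str, word_b: str,
                           documents: List[Set[str]]) -> float:
    """(documents containing both words) / (documents containing word_a)."""
    with_a = [doc for doc in documents if word_a in doc]
    if not with_a:
        return 0.0
    return sum(1 for doc in with_a if word_b in doc) / len(with_a)


def find_similar_words(label: str, documents: List[Set[str]],
                       first_threshold: float, second_threshold: float,
                       third_threshold: float) -> Set[str]:
    candidates = set().union(*documents) - {label}
    # S301: words that frequently co-occur with the label.
    first_type = {w for w in candidates
                  if cooccurrence_frequency(label, w, documents) > first_threshold}
    # S302: words that frequently co-occur with some first-type related word.
    second_type = {
        w for w in candidates
        if any(r != w and cooccurrence_frequency(r, w, documents) > second_threshold
               for r in first_type)
    }
    # S303/S304: keep second-type words that rarely co-occur with the label.
    return {w for w in second_type
            if cooccurrence_frequency(w, label, documents) < third_threshold}


# Toy corpus: "model" and "Madou" share context but never appear together.
DOCUMENTS = [
    {"model", "catwalk", "fashion"},
    {"model", "catwalk", "magazine"},
    {"Madou", "catwalk", "fashion"},
    {"Madou", "catwalk", "magazine"},
    {"model", "catwalk"},
    {"weather", "rain"},
]

print(find_similar_words("model", DOCUMENTS, 0.5, 0.3, 0.1))
```

  • The third threshold encodes the intuition that synonyms share contexts but are seldom used side by side in the same text.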
  • S205: Use the similar word set that includes the similar words of the label to be processed as the similar word set to which the label belongs.
  • S206: Use the representative word of the similar word set to which the label to be processed belongs as the substitute label for the label to be processed.
  • S207: Add the label to be processed to the similar word set that includes its similar words. S207 may also be executed before S205 or S206.
  • The label processing method described in this embodiment uses related words as an intermediary to determine the similar words of the label to be processed, thereby unifying labels.
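  • The overall flow of FIG. 2 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the implementation of the application: the similar-word search of S204 is passed in as a hypothetical callable `find_similar`, the dictionary layout and function names are illustrative, and the fallback that creates a new set follows the earlier embodiment.

```python
from typing import Callable, Dict, Set


def process_label_with_similar_words(
        label: str,
        vocabulary: Dict[str, dict],
        find_similar: Callable[[str], Set[str]]) -> str:
    # S202/S203: the label itself is already a member of some set.
    for word_set in vocabulary.values():
        if label in word_set["members"]:
            return word_set["representative"]          # S206
    # S204: search for similar words of the label (via the corpus, for example).
    similar = find_similar(label)
    # S205: pick the set that contains one of the similar words.
    for word_set in vocabulary.values():
        if similar.intersection(word_set["members"]):
            word_set["members"].append(label)          # S207
            return word_set["representative"]          # S206
    # No home set: the label becomes a new set and represents it.
    new_id = f"{len(vocabulary) + 1:04d}"              # hypothetical numbering
    vocabulary[new_id] = {"members": [label], "representative": label}
    return label


def toy_find_similar(label: str) -> Set[str]:
    """Stand-in for the corpus-based search; returns hard-coded results."""
    return {"Model"} if label == "Madou" else set()


toy_vocabulary = {"0001": {"members": ["Model"], "representative": "Model"}}
print(process_label_with_similar_words("Madou", toy_vocabulary, toy_find_similar))
print(toy_vocabulary["0001"]["members"])
```

  • Adding the label to the matched set (S207) means the next occurrence of the same label is resolved by the direct lookup, without another corpus search.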
  • The representative word can be obtained through an improvement on the existing TF-IDF (term frequency - inverse document frequency) algorithm; that is, the representative word of any similar word set is determined by the method shown in FIG. 4:
  • S401: Count the number of times each member word has historically been a tag to be processed. Each time the tag to be processed is a member word, the count of that member word is incremented by one.
  • S402: Compute the IDF value of each member word.
  • S403: For each member word, calculate the product of the number of times it has been a tag to be processed and its IDF value.
  • S404: Use the member word with the largest product as the representative word.
  • The process of determining the representative word may be performed before the label processing flow, or within the label processing flow before a representative word is used (for example, S103 or S205), which is not limited herein.
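  • The selection of a representative word by count and IDF can be sketched in Python as follows. This is a minimal sketch, not the implementation of the application: the text does not give the IDF formula, so a smoothed variant, ln((1 + N) / (1 + df)) + 1, is assumed here, and the function names, the history counts, and the toy corpus are hypothetical.

```python
import math
from typing import Dict, List, Set


def inverse_document_frequency(word: str, documents: List[Set[str]]) -> float:
    """Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log((1 + len(documents)) / (1 + containing)) + 1.0


def select_representative(history_counts: Dict[str, int],
                          documents: List[Set[str]]) -> str:
    """Pick the member word maximising (times seen as a label) x IDF.

    `history_counts` maps each member word of one similar word set to the
    number of times it has been a tag to be processed.
    """
    return max(
        history_counts,
        key=lambda w: history_counts[w] * inverse_document_frequency(w, documents),
    )


CORPUS = [
    {"model", "catwalk"},
    {"model", "fashion"},
    {"model"},
    {"Madou", "catwalk"},
]

# "model" is used as a label far more often, which outweighs its lower IDF.
print(select_representative({"model": 10, "Madou": 4}, CORPUS))
# With equal usage, the rarer word "Madou" wins through its higher IDF.
print(select_representative({"model": 4, "Madou": 4}, CORPUS))
```

  • Multiplying the usage count by IDF favours words that users actually choose as labels while discounting words that are common across the whole corpus.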
  • An embodiment of the present application further discloses a label processing apparatus.
  • The apparatus includes: an obtaining module 501, a processing module 502, and a substitution module 503.
  • The obtaining module 501 is configured to acquire a label to be processed.
  • The processing module 502 is configured to select, from a similar vocabulary, the similar word set to which the to-be-processed tag belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words.
  • The substitution module 503 is configured to use the representative word of the similar word set to which the to-be-processed tag belongs as the substitute tag for the to-be-processed tag.
  • The apparatus in this embodiment may further include a determining module 504, configured to: count the number of times each member word has historically been a label to be processed; compute the inverse document frequency (IDF) value of each member word; for each member word, calculate the product of that count and its IDF value; and use the member word with the largest product as the representative word.
  • The processing module 502 is further configured to: after searching the similar vocabulary for the similar words of the to-be-processed tag, add the tag to the similar word set that includes its similar words.
  • The processing module 502 is further configured to: if no similar word set to which the to-be-processed tag belongs exists in the similar vocabulary, add the tag to the similar vocabulary, where the tag becomes the member word and the representative word of a new similar word set in the similar vocabulary.
  • Specifically, the processing module selects the similar word set to which the to-be-processed tag belongs from the similar vocabulary as follows: it searches the similar vocabulary for the tag; if the similar vocabulary includes the tag, it uses the similar word set that includes the tag as the similar word set to which the tag belongs; if the similar vocabulary does not include the tag, it searches the similar vocabulary for similar words of the tag and uses the similar word set that includes those similar words as the similar word set to which the tag belongs.
  • Specifically, the processing module may search the similar vocabulary for the similar words of the to-be-processed tag as follows: it finds first-type related words in the corpus, the first-type related words being words whose frequency of appearing together with the tag in one corpus document is greater than a first threshold; it finds second-type related words in the corpus, the second-type related words being words whose frequency of appearing together with the first-type related words in one corpus document is greater than a second threshold; and it counts the frequency with which each second-type related word appears together with the tag in one corpus document, using the second-type related words whose counted frequency is less than a third threshold as the similar words of the tag.
  • The tag processing apparatus may be deployed on a server used for data processing, such as a web server, and helps convert different tags that express the same category or content into a unified tag.
  • the embodiment of the present application further discloses a label processing device, as shown in FIG. 6, comprising: a processor 601, a memory 602, a communication interface 603, and a bus 604.
  • the processor 601, the memory 602, and the communication interface 603 communicate via the bus 604.
  • the communication interface 603 is used to implement communication between the tag processing device and other devices.
  • the memory 602 can be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 602 can store program code of the operating system and other applications as well as application data.
  • The program code stored in the memory 602 is executed by the processor 601.
  • The processor 601 can be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for executing related programs.
  • Bus 604 can include a path for communicating information between various components, such as processor 601, memory 602, and communication interface 603.
  • The processor 601 executes the program code stored in the memory 602 to implement the following functions: acquiring a label to be processed; selecting, from a similar vocabulary, the similar word set to which the to-be-processed label belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words; and using the representative word of the similar word set to which the to-be-processed label belongs as the substitute label for the to-be-processed label.
  • the label processing apparatus described in this embodiment facilitates converting different labels that express the same category or content into a unified label.
  • The functions described in the methods of the embodiments of the present application may be stored in a storage medium readable by a computing device. Based on this understanding, the part of the embodiments of the present application that contributes to the prior art, or part of the technical solution, can be embodied as a software product that is stored in a storage medium and includes a number of instructions that cause a computing device (which may be a personal computer, a server, a mobile computing device, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The foregoing storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Abstract

A label processing method and device are provided. The method comprises: acquiring a to-be-processed label (S101); selecting, from a similar-word database that includes similar-word sets, the similar-word set to which the to-be-processed label belongs (S102); and using the representative word of the similar-word set to which the to-be-processed label belongs as the substitute label for the to-be-processed label (S103). Since each similar-word set in the similar-word database includes member words and a representative word selected from those member words, the method and device replace the to-be-processed label with the representative word of its similar-word set, so that different but similar labels end up with the same substitute label. Therefore, by configuring and maintaining the similar-word sets in the similar-word database, the same category or content can be represented with a uniform label.

Description

Label processing method and device
The present application claims priority to Chinese Patent Application No. 201510727878.X, filed with the Chinese Patent Office on October 30, 2015 and entitled “Label processing method and device”, which is incorporated herein by reference in its entirety.
Technical field
The present application relates to the field of electronic information, and in particular to a label processing method and apparatus.
Background
A tag (label) marks the classification or content of a target. It is a way of organizing content and a special kind of metadata: a summary of the labeling party's subjective impression of a resource, used by users to describe and classify resources so that they are easier to search and share.
However, even for the same category or content there are multiple expressions; for example, “fashion” and “fashionable” express the same meaning. The labels output by existing label systems (whether generated automatically by the system or defined by users) may therefore be similar words. In other words, a label system may use different but similar labels to represent the same category or content; for example, the label that a label system outputs for a garment is sometimes “fashion” and sometimes “fashionable”.
It can be seen that how to use a uniform label to express the same category or content has become a problem that urgently needs to be solved.
Summary of the invention
The present application provides a label processing method and apparatus, with the aim of solving the problem of how to use a uniform label to express the same category or content.
To achieve the above objective, the present application provides the following technical solutions:
A first aspect of the present application provides a label processing method, including:
acquiring a tag to be processed;
selecting, from a similar vocabulary, the similar word set to which the to-be-processed tag belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words;
using the representative word of the similar word set to which the to-be-processed tag belongs as the substitute tag for the to-be-processed tag.
Based on the first aspect, in a first implementation of the first aspect, the specific process by which the representative word is selected from the member words includes:
counting the number of times each member word has historically been a tag to be processed;
computing the inverse document frequency (IDF) value of each member word;
for each member word, calculating the product of the number of times it has historically been a tag to be processed and its IDF value;
using the member word with the largest product as the representative word.
Based on the first aspect or the first implementation of the first aspect, in a second implementation of the first aspect, selecting the similar word set to which the to-be-processed tag belongs from the similar vocabulary includes:
searching the similar vocabulary for the to-be-processed tag;
if the similar vocabulary includes the to-be-processed tag, using the similar word set that includes the tag as the similar word set to which the tag belongs;
if the similar vocabulary does not include the to-be-processed tag, searching the similar vocabulary for similar words of the tag, and using the similar word set that includes those similar words as the similar word set to which the tag belongs.
Based on the second implementation of the first aspect, in a third implementation of the first aspect, searching the similar vocabulary for the similar words of the to-be-processed tag includes:
finding first-type related words in the corpus, the first-type related words being words whose frequency of appearing together with the to-be-processed tag in one corpus document is greater than a first threshold;
finding second-type related words in the corpus, the second-type related words being words whose frequency of appearing together with the first-type related words in one corpus document is greater than a second threshold;
counting the frequency with which each second-type related word appears together with the to-be-processed tag in one corpus document, and using the second-type related words whose counted frequency is less than a third threshold as the similar words of the to-be-processed tag.
Based on the first aspect or the first implementation of the first aspect, in a fourth implementation of the first aspect, after the similar words of the to-be-processed tag are searched for in the similar vocabulary, the method further includes:
adding the to-be-processed tag to the similar word set that includes the similar words of the tag.
Based on the first aspect or the first implementation of the first aspect, in a fifth implementation of the first aspect, the method further includes:
if no similar word set to which the to-be-processed tag belongs exists in the similar vocabulary, adding the tag to the similar vocabulary, where the tag becomes the member word and the representative word of a new similar word set in the similar vocabulary.
A second aspect of the present application provides a tag processing apparatus, including:
an obtaining module, configured to obtain a tag to be processed;
a processing module, configured to select, from a similar-word library, the similar word set to which the tag to be processed belongs, where the similar-word library includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words; and
a substitution module, configured to use the representative word of the similar word set to which the tag to be processed belongs as the substitute tag for the tag to be processed.
Based on the second aspect, in a first implementation of the second aspect, the apparatus further includes:
a determining module, configured to: count the number of times each member word has historically served as a tag to be processed; compute the inverse document frequency (IDF) value of each member word; for each member word, compute the product of that count and the IDF value; and take the member word with the largest product as the representative word.
Based on the second aspect or the first implementation of the second aspect, in a second implementation of the second aspect, the processing module being configured to select, from the similar-word library, the similar word set to which the tag to be processed belongs includes:
the processing module is specifically configured to: search the similar-word library for the tag to be processed; if the similar-word library includes the tag to be processed, take the similar word set that includes the tag as the similar word set to which the tag belongs; and if the similar-word library does not include the tag to be processed, search the similar-word library for a similar word of the tag and take the similar word set that includes that similar word as the similar word set to which the tag belongs.
Based on the second implementation of the second aspect, in a third implementation of the second aspect, the processing module being configured to search the similar-word library for a similar word of the tag to be processed includes:
the processing module is specifically configured to: search a corpus for first-type associated words, a first-type associated word being a word whose frequency of appearing together with the tag to be processed in one piece of corpus text is greater than a first threshold; search the corpus for second-type associated words, a second-type associated word being a word whose frequency of appearing together with a first-type associated word in one piece of corpus text is greater than a second threshold; and count the frequency with which each second-type associated word appears together with the tag to be processed in one piece of corpus text, and take each second-type associated word whose counted frequency is less than a third threshold as a similar word of the tag to be processed.
Based on the second aspect or the first implementation of the second aspect, in a fourth implementation of the second aspect, the processing module is further configured to:
after searching the similar-word library for a similar word of the tag to be processed, add the tag to be processed to the similar word set that includes the similar word of the tag to be processed.
Based on the second aspect or the first implementation of the second aspect, in a fifth implementation of the second aspect, the processing module is further configured to:
if the similar-word library contains no similar word set to which the tag to be processed belongs, add the tag to be processed to the similar-word library, where the tag to be processed is both the member word and the representative word of a new similar word set in the similar-word library.
The tag processing method and apparatus of the present application obtain a tag to be processed, select from a similar-word library containing similar word sets the similar word set to which the tag belongs, and use the representative word of that set as the substitute tag for the tag. Because every similar word set in the library includes member words and a representative word selected from those member words, the method and apparatus replace the tag to be processed with the representative word of a similar word set. Different but similar tags therefore receive the same substitute tag (the representative word of their similar word set), so configuring and maintaining the similar word sets in the similar-word library achieves the goal of using a uniform tag to express the same category or content.
DRAWINGS
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a tag processing method disclosed in an embodiment of the present application;
FIG. 2 is a flowchart of another tag processing method disclosed in an embodiment of the present application;
FIG. 3 is a flowchart of the specific process, disclosed in an embodiment of the present application, of searching a similar-word library for similar words of a tag to be processed;
FIG. 4 is a flowchart of the specific process, disclosed in an embodiment of the present application, of selecting a representative word from member words;
FIG. 5 is a schematic structural diagram of a tag processing apparatus disclosed in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a tag processing device disclosed in an embodiment of the present application.
DETAILED DESCRIPTION
The embodiments of the present application can be applied to scenarios in which tags are set or output, for example, where a server outputs a tag for an Internet site or assigns behavior-preference tags to an operator's users. The method of the present application may be executed by a server.
The tag processing method described in this embodiment may be executed by a server that sets or outputs tags.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
A tag processing method disclosed in an embodiment of the present application, as shown in FIG. 1, includes the following steps:
S101: Obtain a tag to be processed.
The tag to be processed may be a tag set by a user or a tag generated by a tag system.
S102: Determine whether the similar word set to which the tag to be processed belongs can be selected from a similar-word library that includes similar word sets. If yes, perform S103; if no, perform S104.
In this embodiment, the similar word set to which the tag to be processed belongs is a similar word set that includes either the tag to be processed or a similar word of the tag to be processed.
Every similar word set in the similar-word library includes member words and a representative word, and the representative word of a set is selected from the member words of that set.
For example, Table 1 shows part of a similar-word library containing four similar word sets numbered 0001 to 0004. Each similar word set includes member words and a representative word; in set 0001, “模特” (model), “麻豆” (a colloquial word for a model), and “model” are member words, and “模特” is the representative word.
Table 1
Group number | Member word 1 | Member word 2 | Member word 3 | ... | Representative word
0001 | 模特 (model) | 麻豆 (model, colloquial) | model | ... | 模特 (model)
0002 | 笑话 (joke) | 冷笑话 (cold joke) | | ... | 笑话 (joke)
0003 | 时尚 (fashion) | 时髦 (fashionable) | 前卫 (avant-garde) | ... | 时尚 (fashion)
0004 | 地瓜 (sweet potato) | 红薯 (sweet potato) | 番薯 (sweet potato) | ... | 红薯 (sweet potato)
... | ... | ... | ... | ... |
Specifically, any member word of a set may be chosen as the set's representative word, or the representative word may be selected from the set's member words according to other rules, which are described in the embodiments below.
In this embodiment, similar words are words with similar meanings, that is, synonyms; for a precise definition, refer to the prior art.
S103: Use the representative word of the similar word set to which the tag to be processed belongs as the substitute tag for the tag to be processed.
For example, if the tag to be processed is “麻豆”, then because similar word set 0001 includes “麻豆”, the representative word of set 0001, “模特”, is used as the substitute tag.
S104: If the similar-word library contains no similar word set to which the tag to be processed belongs, add the tag to be processed to the similar-word library, where the tag to be processed is both the member word and the representative word of a new similar word set in the similar-word library.
In other words, if neither the tag to be processed nor any of its similar words is found in the similar-word library, the library contains no word that matches the tag or resembles it in meaning. In this case, the tag to be processed is added to the similar-word library as a new word and becomes a member word of a new similar word set. Because the new set contains no other words, the tag to be processed is also the representative word of that set.
For example, if the tag to be processed is “艺术” (art) and Table 1 contains no similar word of “艺术”, then “艺术” becomes the member word and representative word of a new similar word set 0005. The updated similar-word library is shown in Table 2.
Table 2
Group number | Member word 1 | Member word 2 | Member word 3 | ... | Representative word
0001 | 模特 (model) | 麻豆 (model, colloquial) | model | ... | 模特 (model)
0002 | 笑话 (joke) | 冷笑话 (cold joke) | 段子 (comic anecdote) | ... | 笑话 (joke)
0003 | 时尚 (fashion) | 时髦 (fashionable) | 前卫 (avant-garde) | ... | 时尚 (fashion)
0004 | 地瓜 (sweet potato) | 红薯 (sweet potato) | 番薯 (sweet potato) | ... | 红薯 (sweet potato)
0005 | 艺术 (art) | ... | ... | ... | 艺术 (art)
The purpose of S104 is to keep enriching the words in the similar-word library, which lays the foundation for unifying tags.
As the steps above show, the tag processing method of this embodiment finds the set to which the tag to be processed belongs and replaces the tag with the representative word of that set. The same feature of the same category or content can therefore always be described with the same tag, which avoids having several synonyms describe that one feature.
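The flow of S101 to S104 can be illustrated with a minimal Python sketch. It is not the implementation disclosed in the application: the class and method names (`SimilarWordSet`, `SimilarWordLibrary`, `find_set`, `normalize`) are hypothetical, and the sketch assumes that a tag belongs to a set only when it is itself a member word; the corpus-based search for similar words is left out here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SimilarWordSet:
    """One row of the similar-word library: member words plus one representative."""
    members: set[str]
    representative: str


@dataclass
class SimilarWordLibrary:
    """Hypothetical in-memory similar-word library."""
    sets: list[SimilarWordSet] = field(default_factory=list)

    def find_set(self, tag: str) -> SimilarWordSet | None:
        """S102 (simplified): return the set whose member words include the tag."""
        for word_set in self.sets:
            if tag in word_set.members:
                return word_set
        return None

    def normalize(self, tag: str) -> str:
        """Return the substitute tag (S103), creating a new set if none exists (S104)."""
        word_set = self.find_set(tag)
        if word_set is None:
            # S104: the tag is the only member and the representative of a new set.
            word_set = SimilarWordSet(members={tag}, representative=tag)
            self.sets.append(word_set)
        return word_set.representative


library = SimilarWordLibrary([
    SimilarWordSet({"模特", "麻豆", "model"}, "模特"),
    SimilarWordSet({"时尚", "时髦", "前卫"}, "时尚"),
])
print(library.normalize("麻豆"))  # 模特
print(library.normalize("艺术"))  # 艺术 (a new set is created for it)
```

Keeping the representative word on the set itself means that changing the representative later changes the substitute tag for every member word at once.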
Another tag processing method disclosed in an embodiment of the present application differs from the embodiment above in that it focuses on how to determine, in the similar-word library, the similar words of the tag to be processed. As shown in FIG. 2, the tag processing method of this embodiment includes the following steps:
S201: Obtain a tag to be processed.
S202: Search the similar-word library for the tag to be processed. If the similar-word library includes the tag, perform S203 and S206; otherwise, perform S204.
S203: Take the similar word set that includes the tag to be processed as the similar word set to which the tag belongs.
S204: Search the similar-word library for a similar word of the tag to be processed. If a similar word is found, perform S205; otherwise, perform S208.
Specifically, the process of searching the similar-word library for similar words of the tag to be processed is shown in FIG. 3 and includes the following steps:
S301: Search the corpus for words whose frequency of appearing together with the tag to be processed in one piece of corpus text is greater than a first threshold, and record them as first-type associated words.
In this embodiment, associated words are words whose meanings are related. A collection of texts is called a corpus; corpus text generally refers to material used for text analysis and can typically be text obtained by crawling Internet websites with a crawler. An existing association algorithm may be used to determine the first-type associated words from the corpus; the first-type associated words are words related in meaning to the tag to be processed.
S302: Search the corpus for words whose frequency of appearing together with a first-type associated word in one piece of corpus text is greater than a second threshold, and record them as second-type associated words.
In this embodiment, the frequency with which associated word A appears together with B in one piece of corpus text = (number of times the two appear together) / (total number of occurrences of A). The frequency with which associated word B appears together with A in one piece of corpus text = (number of times the two appear together) / (total number of occurrences of B).
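The two frequencies can be written compactly as follows. The symbols are assumed notation that does not appear in the application: n(A, B) is the number of times A and B appear together, and n(A) and n(B) are the total numbers of occurrences of A and B.

```latex
\[
f(A \mid B\ \text{co-occurring}) = \frac{n(A, B)}{n(A)},
\qquad
f(B \mid A\ \text{co-occurring}) = \frac{n(A, B)}{n(B)}
\]
```

The measure is not symmetric. With n(A, B) = 3, n(A) = 6, and n(B) = 4, for instance, the first frequency is 3/6 = 0.5 and the second is 3/4 = 0.75.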
S303: Count the frequency with which each second-type associated word appears together with the tag to be processed in one piece of corpus text.
S304: Take each second-type associated word whose counted frequency is less than a third threshold as a similar word of the tag to be processed.
The steps shown in FIG. 3 perform three passes of corpus analysis:
The first pass obtains words related in meaning to the tag to be processed, namely the first-type associated words. Because the first-type associated words are related to the tag, similar words of the tag may also be related to them. For example, if the tag to be processed is “模特”, one of its first-type associated words is “T台” (catwalk), and “T台” is also an associated word of “麻豆”, a similar word of “模特”.
The associated words of the first-type associated words may therefore include similar words of the tag to be processed. The second pass, which finds the associated words of the first-type associated words (the second-type associated words), thus aims to find the similar words of the tag to be processed.
Similar words are unlikely to appear together in one piece of corpus text: a single text is unlikely to use both “模特” and “麻豆” and more likely to use one of them consistently to express the same meaning. The third pass therefore counts the frequency with which each second-type associated word appears together with the tag to be processed in one piece of corpus text, and takes the words whose frequency is below a certain threshold as similar words of the tag.
Specifically, the process of finding the similar words of “麻豆” with the above method is as follows. Word segmentation of the corpus yields the candidate first-type associated words {“T台”, “时尚”, “女孩”, “杂志”, “模特”} (catwalk, fashion, girl, magazine, model). Assume that the frequencies with which these words appear together with “麻豆” in the same piece of corpus text are: T台 0.50, 时尚 0.43, 女孩 0.10, 杂志 0.17, 模特 0.09. Let the candidates whose itemset frequency is greater than 0.2 form the associated word group C, giving C = {“T台”, “时尚”}. Association analysis is then performed on “T台” and “时尚” separately, and the resulting words are collected into one set E = {“模特”, “model”, “女孩”}. The frequency with which each word in E appears together with “麻豆” in the same piece of corpus text is then computed; assume the values are “模特” 0.11, “model” 0.60, and “女孩” 0.70. Threshold filtering is applied again (with the threshold assumed to be 0.5): “model” and “女孩” are filtered out, leaving {“模特”}, so “模特” is the similar word of “麻豆”.
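The three passes of S301 to S304 can be sketched in Python as below. This is an illustration under stated assumptions rather than the disclosed implementation: each piece of corpus text is modeled as a set of already segmented words, frequencies are computed over texts rather than over individual occurrences, the function names are hypothetical, and the toy corpus and thresholds are invented so that the result is easy to follow (they are not the figures from the worked example above).

```python
def cooccurrence_frequency(word, other, corpus):
    """Share of the texts containing `word` that also contain `other`."""
    texts_with_word = [text for text in corpus if word in text]
    if not texts_with_word:
        return 0.0
    both = sum(1 for text in texts_with_word if other in text)
    return both / len(texts_with_word)


def find_similar_words(tag, corpus, first_threshold, second_threshold, third_threshold):
    """Return candidate similar words of `tag` using three corpus passes."""
    vocabulary = set().union(*corpus) - {tag}

    # Pass 1 (S301): words that frequently appear together with the tag.
    first_type = {w for w in vocabulary
                  if cooccurrence_frequency(w, tag, corpus) > first_threshold}

    # Pass 2 (S302): words that frequently appear together with a first-type word.
    second_type = {w for w in vocabulary
                   if any(w != f and cooccurrence_frequency(w, f, corpus) > second_threshold
                          for f in first_type)}

    # Pass 3 (S303/S304): keep second-type words that rarely appear with the tag.
    return {w for w in second_type
            if cooccurrence_frequency(w, tag, corpus) < third_threshold}


# Toy corpus: each text is a set of segmented words (invented data).
corpus = [
    {"麻豆", "T台", "时尚"},
    {"麻豆", "T台"},
    {"模特", "T台", "时尚"},
    {"模特", "T台"},
    {"模特", "时尚"},
    {"女孩", "杂志"},
]
print(sorted(find_similar_words("麻豆", corpus, 0.3, 0.6, 0.2)))  # ['模特']
```

In this toy run, “T台” and “时尚” pass the first threshold, “时尚” and “模特” pass the second, and only “模特” stays below the third, because it never shares a text with “麻豆”.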
S205: Take the similar word set that includes the similar word of the tag to be processed as the similar word set to which the tag belongs.
S206: Use the representative word of the similar word set to which the tag to be processed belongs as the substitute tag for the tag to be processed.
S207: Add the tag to be processed to the similar word set that includes the similar word of the tag to be processed.
It should be noted that S207 may also be performed before S205 or S206.
S208: Add the tag to be processed to the similar-word library, where the tag to be processed is both the member word and the representative word of a new similar word set in the similar-word library.
The tag processing method described in this embodiment uses associated words as intermediaries to determine the similar words of the tag to be processed, thereby unifying tags.
It should be noted that this embodiment first checks whether the similar-word library includes the tag to be processed and, only if it does not, checks whether the library includes a similar word of the tag. This order improves search efficiency: the simple lookup runs first, and the complex search runs only when the simple lookup fails. Other search orders may also be used, which is not limited here.
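The lookup order of S202 to S208 can be sketched as follows. The sketch rests on assumptions: the library is stored as two dictionaries (member word to group number, and group number to representative word), the expensive corpus-based search is passed in as a caller-supplied function, and new group numbers are generated by counting existing groups. None of these details come from the application, and all names are hypothetical.

```python
def normalize_tag(tag, index, representatives, find_similar_words):
    """Return the substitute tag for `tag`, updating the library in place.

    index:            member word -> group number
    representatives:  group number -> representative word
    find_similar_words: callable returning candidate similar words of a tag
    """
    group = index.get(tag)                      # S202/S203: cheap dictionary lookup
    if group is None:
        for word in find_similar_words(tag):    # S204: expensive search, only on a miss
            if word in index:
                group = index[word]             # S205: adopt the similar word's set
                index[tag] = group              # S207: the tag joins that set
                break
    if group is None:                           # S208: no set found, create a new one
        group = f"{len(representatives) + 1:04d}"
        index[tag] = group
        representatives[group] = tag
    return representatives[group]               # S206: representative as substitute tag


index = {"模特": "0001", "model": "0001", "时尚": "0002"}
representatives = {"0001": "模特", "0002": "时尚"}


def stub_find_similar_words(tag):
    """Stand-in for the corpus analysis of FIG. 3 (invented result)."""
    return ["模特"] if tag == "麻豆" else []


print(normalize_tag("麻豆", index, representatives, stub_find_similar_words))  # 模特
print(normalize_tag("艺术", index, representatives, stub_find_similar_words))  # 艺术
```

Because S207 records “麻豆” in the index, a later request for the same tag is answered by the dictionary lookup alone and skips the corpus analysis.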
It should be noted that, for the similar-word library shown in the embodiments above, the representative word can be obtained with an improved version of the existing TF-IDF (term frequency-inverse document frequency) algorithm; that is, the representative word of any similar word set is determined according to the method shown in FIG. 4:
S401: Count the number of times each member word has historically served as a tag to be processed.
That is, each time the tag to be processed is a given member word, the count of that member word is increased by 1.
S402: Compute the inverse document frequency (IDF) value of each member word.
S403: For each member word, compute the product of the number of times it has historically served as a tag to be processed and its IDF value.
S404: Take the member word with the largest product as the representative word.
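Steps S401 to S404 can be sketched in Python as below. The application does not state which IDF formula is used, so the smoothed form ln((1 + N) / (1 + df)) + 1 in this sketch is an assumption, as are the function names, the tag counts, and the toy corpus.

```python
import math


def inverse_document_frequency(word, corpus):
    """Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1."""
    containing = sum(1 for text in corpus if word in text)
    return math.log((1 + len(corpus)) / (1 + containing)) + 1


def choose_representative(members, tag_counts, corpus):
    """S403/S404: pick the member word with the largest count x IDF product."""
    return max(
        members,
        key=lambda word: tag_counts.get(word, 0) * inverse_document_frequency(word, corpus),
    )


# S401: how often each member word has served as a tag to be processed (invented counts).
tag_counts = {"模特": 50, "麻豆": 20, "model": 5}

# Toy corpus for S402: each text is a set of segmented words (invented data).
corpus = [
    {"模特", "T台"},
    {"模特", "时尚"},
    {"模特", "麻豆"},
    {"model", "杂志"},
]

print(choose_representative(["模特", "麻豆", "model"], tag_counts, corpus))  # 模特
```

Here “模特” has the lowest IDF of the three member words because it appears in most texts, but its much higher tag count still gives it the largest product.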
The above process of determining the representative word may be performed before the tag processing procedure, or during the tag processing procedure before the representative word is used (for example, before S103 or S205); this is not limited here.
Corresponding to the method embodiments above, an embodiment of the present application further discloses a tag processing apparatus which, as shown in FIG. 5, includes an obtaining module 501, a processing module 502, and a substitution module 503.
The obtaining module 501 is configured to obtain a tag to be processed.
The processing module 502 is configured to select, from a similar-word library, the similar word set to which the tag to be processed belongs, where the similar-word library includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words.
The substitution module 503 is configured to use the representative word of the similar word set to which the tag to be processed belongs as the substitute tag for the tag to be processed.
Optionally, the apparatus of this embodiment may further include a determining module 504, configured to: count the number of times each member word has historically served as a tag to be processed; compute the inverse document frequency (IDF) value of each member word; for each member word, compute the product of that count and the IDF value; and take the member word with the largest product as the representative word.
Optionally, the processing module 502 may be further configured to, after searching the similar-word library for a similar word of the tag to be processed, add the tag to be processed to the similar word set that includes the similar word of the tag to be processed.
Optionally, the processing module 502 may be further configured to, if the similar-word library contains no similar word set to which the tag to be processed belongs, add the tag to be processed to the similar-word library, where the tag to be processed is both the member word and the representative word of a new similar word set in the similar-word library.
Specifically, the processing module selects, from the similar-word library, the similar word set to which the tag to be processed belongs as follows: it searches the similar-word library for the tag to be processed; if the similar-word library includes the tag, it takes the similar word set that includes the tag as the similar word set to which the tag belongs; if the similar-word library does not include the tag, it searches the similar-word library for a similar word of the tag and takes the similar word set that includes that similar word as the similar word set to which the tag belongs.
Further, the processing module may search the similar-word library for a similar word of the tag to be processed as follows: it searches a corpus for first-type associated words, a first-type associated word being a word whose frequency of appearing together with the tag to be processed in one piece of corpus text is greater than a first threshold; it searches the corpus for second-type associated words, a second-type associated word being a word whose frequency of appearing together with a first-type associated word in one piece of corpus text is greater than a second threshold; and it counts the frequency with which each second-type associated word appears together with the tag to be processed in one piece of corpus text and takes each second-type associated word whose counted frequency is less than a third threshold as a similar word of the tag to be processed.
The tag processing apparatus described in this embodiment may be deployed on a server used for data processing, such as a web server, and helps convert different tags that express the same category or content into a unified tag.
An embodiment of the present application further discloses a tag processing device which, as shown in FIG. 6, includes a processor 601, a memory 602, a communication interface 603, and a bus 604.
The processor 601, the memory 602, and the communication interface 603 communicate with one another through the bus 604. The communication interface 603 enables the tag processing device to communicate with other devices.
The memory 602 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 602 can store program code and application data of the operating system and other applications. The program code stored in the memory 602 is executed by the processor 601.
The processor 601 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs.
The bus 604 may include a path that transfers information between the components (for example, the processor 601, the memory 602, and the communication interface 603).
The processor 601 executes the program code stored in the memory 602 to implement the following functions: obtaining a tag to be processed; selecting, from a similar-word library, the similar word set to which the tag to be processed belongs, where the similar-word library includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words; and using the representative word of the similar word set to which the tag to be processed belongs as the substitute tag for the tag to be processed.
For the specific way in which the processor 601 implements these functions, refer to the steps shown in FIG. 1, FIG. 2, FIG. 3, and FIG. 4.
The tag processing device described in this embodiment helps convert different tags that express the same category or content into a unified tag.
If the functions described in the methods of the embodiments of the present application are implemented as software functional units and sold or used as independent products, they may be stored in a storage medium readable by a computing device. Based on this understanding, the part of the embodiments of the present application that contributes to the prior art, or part of the technical solution, may be embodied as a software product. The software product is stored in a storage medium and includes several instructions that cause a computing device (which may be a personal computer, a server, a mobile computing device, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be read with reference to one another.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to a person skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. The present application is therefore not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

  1. A label processing method, comprising:
    acquiring a label to be processed;
    selecting, from a similar-word lexicon, the similar-word set to which the label to be processed belongs, wherein the similar-word lexicon includes similar-word sets, each similar-word set includes member words and a representative word, and the representative word is selected from the member words; and
    using the representative word of the similar-word set to which the label to be processed belongs as a substitute label for the label to be processed.
  2. The method according to claim 1, wherein selecting the representative word from the member words comprises:
    counting the number of times each member word has historically been a label to be processed;
    determining the inverse document frequency (IDF) value of each member word;
    for each member word, calculating the product of the number of times it has historically been a label to be processed and its IDF value; and
    using the member word with the largest product as the representative word.
  3. The method according to claim 1 or 2, wherein selecting, from the similar-word lexicon, the similar-word set to which the label to be processed belongs comprises:
    searching the similar-word lexicon for the label to be processed;
    if the similar-word lexicon includes the label to be processed, using the similar-word set that includes the label to be processed as the similar-word set to which the label to be processed belongs; and
    if the similar-word lexicon does not include the label to be processed, searching the similar-word lexicon for a similar word of the label to be processed, and using the similar-word set that includes that similar word as the similar-word set to which the label to be processed belongs.
  4. The method according to claim 3, wherein searching the similar-word lexicon for a similar word of the label to be processed comprises:
    searching a corpus for first-class associated words, a first-class associated word being a word whose frequency of co-occurring with the label to be processed in a single corpus document is greater than a first threshold;
    searching the corpus for second-class associated words, a second-class associated word being a word whose frequency of co-occurring with a first-class associated word in a single corpus document is greater than a second threshold; and
    counting the frequency with which each second-class associated word co-occurs with the label to be processed in a single corpus document, and using the second-class associated words whose counted frequency is less than a third threshold as similar words of the label to be processed.
  5. The method according to claim 3, further comprising, after searching the similar-word lexicon for a similar word of the label to be processed:
    adding the label to be processed to the similar-word set that includes the similar word of the label to be processed.
  6. The method according to claim 1 or 2, further comprising:
    if the similar-word lexicon contains no similar-word set to which the label to be processed belongs, adding the label to be processed to the similar-word lexicon, the label to be processed being both the member word and the representative word of a new similar-word set in the similar-word lexicon.
  7. A label processing apparatus, comprising:
    an acquiring module, configured to acquire a label to be processed;
    a processing module, configured to select, from a similar-word lexicon, the similar-word set to which the label to be processed belongs, wherein the similar-word lexicon includes similar-word sets, each similar-word set includes member words and a representative word, and the representative word is selected from the member words; and
    a substitution module, configured to use the representative word of the similar-word set to which the label to be processed belongs as a substitute label for the label to be processed.
  8. The apparatus according to claim 7, further comprising:
    a determining module, configured to: count the number of times each member word has historically been a label to be processed; determine the inverse document frequency (IDF) value of each member word; for each member word, calculate the product of the number of times it has historically been a label to be processed and its IDF value; and use the member word with the largest product as the representative word.
  9. The apparatus according to claim 7 or 8, wherein the processing module being configured to select, from the similar-word lexicon, the similar-word set to which the label to be processed belongs comprises:
    the processing module being specifically configured to: search the similar-word lexicon for the label to be processed; if the similar-word lexicon includes the label to be processed, use the similar-word set that includes the label to be processed as the similar-word set to which the label to be processed belongs; and if the similar-word lexicon does not include the label to be processed, search the similar-word lexicon for a similar word of the label to be processed, and use the similar-word set that includes that similar word as the similar-word set to which the label to be processed belongs.
  10. The apparatus according to claim 9, wherein the processing module being configured to search the similar-word lexicon for a similar word of the label to be processed comprises:
    the processing module being specifically configured to: search a corpus for first-class associated words, a first-class associated word being a word whose frequency of co-occurring with the label to be processed in a single corpus document is greater than a first threshold; search the corpus for second-class associated words, a second-class associated word being a word whose frequency of co-occurring with a first-class associated word in a single corpus document is greater than a second threshold; and count the frequency with which each second-class associated word co-occurs with the label to be processed in a single corpus document, and use the second-class associated words whose counted frequency is less than a third threshold as similar words of the label to be processed.
  11. The apparatus according to claim 9, wherein the processing module is further configured to:
    after searching the similar-word lexicon for a similar word of the label to be processed, add the label to be processed to the similar-word set that includes the similar word of the label to be processed.
  12. The apparatus according to claim 7 or 8, wherein the processing module is further configured to:
    if the similar-word lexicon contains no similar-word set to which the label to be processed belongs, add the label to be processed to the similar-word lexicon, the label to be processed being both the member word and the representative word of a new similar-word set in the similar-word lexicon.
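
As an informal illustration only, and not part of the claims, the representative-word selection of claims 2 and 8 and the similar-word search of claims 4 and 10 might be sketched in Python as follows. Several points are assumptions rather than anything stated in the source: each corpus document is modeled as a set of words, "frequency" is read as the fraction of documents in which two words co-occur, the IDF uses the smoothed form log((1 + N) / (1 + df)) + 1, a second-class word needs to co-occur with at least one first-class word, and all function and parameter names are hypothetical.

```python
import math


def idf(word, documents):
    """Smoothed inverse document frequency (the exact formula is an assumption)."""
    df = sum(1 for doc in documents if word in doc)
    return math.log((1 + len(documents)) / (1 + df)) + 1.0


def choose_representative(members, label_counts, documents):
    """Pick the member word maximizing (historical label count) x IDF."""
    # sorted() makes tie-breaking deterministic for this sketch.
    return max(sorted(members),
               key=lambda w: label_counts.get(w, 0) * idf(w, documents))


def cooccurrence(a, b, documents):
    """Fraction of documents that contain both words."""
    return sum(1 for doc in documents if a in doc and b in doc) / len(documents)


def find_similar_words(label, documents, t1, t2, t3):
    """Second-order co-occurrence search with thresholds t1, t2, t3."""
    vocabulary = set().union(*documents) - {label}
    # First class: words that often appear together with the label.
    first = {w for w in vocabulary if cooccurrence(w, label, documents) > t1}
    # Second class: words that often appear together with some first-class word.
    second = {w for w in vocabulary
              if any(f != w and cooccurrence(w, f, documents) > t2 for f in first)}
    # Similar words share context with the label but rarely appear beside it.
    return {w for w in second if cooccurrence(w, label, documents) < t3}


corpus = [
    {"fashion", "dress", "runway"},
    {"fashion", "dress"},
    {"fashionable", "dress", "runway"},
    {"fashionable", "dress"},
    {"car", "engine"},
]
similar = find_similar_words("fashion", corpus, t1=0.3, t2=0.3, t3=0.1)
best = choose_representative({"fashion", "fashionable"},
                             {"fashion": 10, "fashionable": 3}, corpus)
```

In this toy corpus, "fashionable" shares the context word "dress" with "fashion" but never appears in the same document, so it passes the third-threshold filter; the threshold values are arbitrary choices for the example.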
PCT/CN2016/094417 2015-10-30 2016-08-10 Label processing method and device WO2017071370A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510727878.X 2015-10-30
CN201510727878 2015-10-30

Publications (1)

Publication Number Publication Date
WO2017071370A1 true WO2017071370A1 (en) 2017-05-04

Family

ID=58629832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/094417 WO2017071370A1 (en) 2015-10-30 2016-08-10 Label processing method and device

Country Status (1)

Country Link
WO (1) WO2017071370A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353071A (en) * 2018-12-05 2020-06-30 阿里巴巴集团控股有限公司 Label generation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760142A (en) * 2011-04-29 2012-10-31 北京百度网讯科技有限公司 Method and device for extracting subject label in search result aiming at searching query
CN103886034A (en) * 2014-03-05 2014-06-25 北京百度网讯科技有限公司 Method and equipment for building indexes and matching inquiry input information of user
CN104268292A (en) * 2014-10-23 2015-01-07 广州索答信息科技有限公司 Label word library update method of portrait system
CN104715049A (en) * 2015-03-26 2015-06-17 无锡中科泛在信息技术研发中心有限公司 Commodity review property word extracting method based on noumenon lexicon

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 16858807

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 16858807

Country of ref document: EP

Kind code of ref document: A1