CN103955453B

CN103955453B - Method and device for automatically discovering new words from a document set

Info

Publication number: CN103955453B
Application number: CN201410220317.6A
Authority: CN
Inventors: 黄民烈; 朱小燕
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2014-05-23
Filing date: 2014-05-23
Publication date: 2017-09-29
Anticipated expiration: 2034-05-23
Also published as: CN103955453A

Abstract

The invention discloses a method and device for automatically discovering new words from a document set. A template acquisition unit acquires one or more templates; a word extraction unit extracts from the document set the words matched by each template; a candidate template set adding unit selects at least some of the one or more templates and adds them to a candidate template set; a candidate word set adding unit selects at least some of the words matched by each template and adds them to a candidate word set; and a new word set adding unit ranks the candidate words in the candidate word set based on the templates in the candidate template set and, based on that ranking, adds a certain number of candidate words to a new word set. Compared with the prior art, the method and device provided by the invention can discover new words effectively.

Description

A method and device for automatically discovering new words from a document set

Technical Field

The invention relates to natural language processing technology, and in particular to a method and device for automatically discovering new words from a document set.

Background

On social networks, users like to express their views on politics, society, culture and so on in their own personalized language. In general, the more widely a personalized expression spreads, the more likely it is to become a new Internet buzzword (a "new word" for short). New words currently have important applications in automatic summarization, text clustering and classification, information retrieval and other areas. According to statistics, more than 1,000 new Chinese words appear on the Internet every year. Most of them are time-sensitive technical terms from particular fields, and because most of them are absent from dictionaries, existing word segmentation algorithms have difficulty identifying them in a document set. Take the sentiment-bearing new word "给力" (an adjective, roughly "awesome") and the document "表演非常给力" ("the performance was really awesome") as an example. Existing segmentation algorithms typically segment it as 表演/noun 非常/adverb 给/verb 力/noun, so the new word "给力" is not kept as one complete word, which in turn hampers recognition of the new word.

Summary of the Invention

One of the technical problems solved by the present invention is improving the accuracy of new word recognition.

According to an embodiment of one aspect of the present invention, a method for automatically discovering new words from a document set is provided, comprising:

acquiring one or more templates;

extracting from the document set the words that match each of the one or more templates;

selecting at least some of the one or more templates and adding them to a candidate template set;

selecting at least some of the extracted words that match each of the one or more templates and adding them to a candidate word set;

ranking the candidate words in the candidate word set based on the templates in the candidate template set and, based on that ranking, adding a certain number of candidate words to a new word set.

According to an embodiment of the present invention, the one or more templates are acquired in either of the following ways:

predefining the one or more templates; or

after the document set is acquired, segmenting the document set into words and extracting from the segmented document set the one or more templates that match a specific regular expression.

According to an embodiment of the present invention, the step of selecting at least some of the one or more templates and adding them to the candidate template set comprises either of the following:

adding all of the one or more templates to the candidate template set;

adding some of the templates to the candidate template set based on the number of times each of the one or more templates occurs in the document set.

According to an embodiment of the present invention, the step of adding some of the templates to the candidate template set based on the number of times each of the one or more templates occurs in the document set comprises:

adding to the candidate template set the templates ranked in the top f by number of occurrences in the document set, where f is a positive integer; or

adding to the candidate template set the templates whose number of occurrences in the document set exceeds a specific threshold.

According to an embodiment of the present invention, the step of selecting at least some of the extracted words that match each of the one or more templates and adding them to the candidate word set comprises either of the following:

adding all of the matched words to the candidate word set;

adding some of the words to the candidate word set based on the number of times the matched words match each template.

According to an embodiment of the present invention, the step of adding some of the words to the candidate word set based on the number of times the matched words match each template comprises:

adding to the candidate word set the matched words ranked in the top g by number of matches with each template, where g is a positive integer; or

adding to the candidate word set the matched words whose number of matches with each template exceeds a specific threshold.

According to an embodiment of the present invention, the method further comprises: before ranking the candidate words in the candidate word set based on the templates in the candidate template set, ranking the templates in the candidate template set using a predefined new word set, and filtering the candidate template set based on that ranking.

According to an embodiment of the present invention, the method further comprises: ranking the templates in the candidate template set using the obtained new word set and filtering the candidate template set based on that ranking; then ranking the candidate words in the candidate word set again using the filtered candidate template set and, based on this new ranking, again adding a certain number of candidate words to the new word set.
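The alternation described in the two paragraphs above (rank and filter templates with the current new word set, then rank candidate words with the surviving templates) can be sketched as a small loop. This is a minimal illustration only: it scores by raw match counts instead of the weights defined later in the text, and the function names, the parameters `keep_templates` and `add_words`, the template "在++的", the word "北京" and all counts are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# (word, template) -> number of matches; all values below are made up.
Matches = Dict[Tuple[str, str], int]


def _rank(items: Iterable[str], score: Callable[[str], float]) -> List[str]:
    """Sort items by descending score, breaking ties by name."""
    return sorted(items, key=lambda x: (-score(x), x))


def bootstrap(matches: Matches,
              candidate_words: List[str],
              candidate_templates: List[str],
              seed_words: Set[str],
              keep_templates: int,
              add_words: int,
              rounds: int = 1) -> Tuple[Set[str], List[str]]:
    """Alternate template filtering and word selection for `rounds` iterations."""
    new_words = set(seed_words)          # starts as the predefined new word set
    templates = list(candidate_templates)
    for _ in range(rounds):
        # Rank templates by how often they match the current new words,
        # then keep only the best ones (simplified count-based stand-in).
        templates = _rank(
            templates,
            lambda t: sum(matches.get((w, t), 0) for w in new_words),
        )[:keep_templates]
        # Rank the remaining candidate words against the surviving templates.
        remaining = [w for w in candidate_words if w not in new_words]
        ranked = _rank(
            remaining,
            lambda w: sum(matches.get((w, t), 0) for t in templates),
        )
        new_words.update(ranked[:add_words])
    return new_words, templates


matches = {
    ("给力", "太+/*+了"): 5, ("给力", "好++啊"): 4,
    ("坑爹", "太+/*+了"): 3, ("坑爹", "好++啊"): 2,
    ("漂亮", "好++啊"): 1, ("北京", "在++的"): 6,
}
words, kept = bootstrap(
    matches,
    candidate_words=["给力", "坑爹", "漂亮", "北京"],
    candidate_templates=["太+/*+了", "好++啊", "在++的"],
    seed_words={"给力"},
    keep_templates=2,
    add_words=1,
)
print(words, kept)
```

With the seed word "给力", the unrelated template "在++的" is filtered out first, so "北京" cannot enter the new word set even though it has the largest raw count.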

According to an embodiment of the present invention, the templates in the candidate template set are ranked by computing a weight for each template in the candidate template set based on the following formula and ranking the templates by the computed weights:

n_1i = k_1i + k_3i, n_2i = k_2i + k_4i,

where W denotes the new word set, P denotes the candidate template set, w_i denotes a word in the new word set W, and p_j denotes a template in the candidate template set P. Among the matches found in the document set for the one or more templates, k_1i is the number of matches that contain both w_i and p_j, k_2i is the number that contain w_i but not p_j, k_3i is the number that contain p_j but not w_i, and k_4i is the number that contain neither p_j nor w_i.
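Only the auxiliary totals n_1i and n_2i are written out above. The four counts form a 2×2 contingency table, and the totals are consistent with the widely used binomial log-likelihood ratio (Dunning's LLR). The sketch below assumes that statistic for a single word/template pair; it is an illustration rather than the formula of the invention, the function names are hypothetical, and how the per-word values are combined into one template weight is deliberately left out.

```python
import math


def _log_l(p: float, k: int, n: int) -> float:
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), taking 0*log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def llr(k1: int, k2: int, k3: int, k4: int) -> float:
    """Log-likelihood ratio for one (word, template) pair.

    k1: matches containing both the word and the template
    k2: matches containing the word but not the template
    k3: matches containing the template but not the word
    k4: matches containing neither
    """
    n1 = k1 + k3  # all matches that contain the template
    n2 = k2 + k4  # all matches that do not contain the template
    p1 = k1 / n1 if n1 else 0.0
    p2 = k2 / n2 if n2 else 0.0
    p = (k1 + k2) / (n1 + n2) if (n1 + n2) else 0.0
    return 2.0 * (_log_l(p1, k1, n1) + _log_l(p2, k2, n2)
                  - _log_l(p, k1, n1) - _log_l(p, k2, n2))


# Independent word and template: the statistic is zero.
print(llr(10, 10, 10, 10))
# Word and template always occur together: the statistic is large.
print(llr(10, 0, 0, 10))
```

A value near zero means the word and the template co-occur about as often as chance would predict; larger values indicate a stronger statistical association.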

According to an embodiment of the present invention, ranking the candidate words in the candidate word set based on the templates in the candidate template set and, based on that ranking, adding a certain number of candidate words to the new word set comprises:

adding to the new word set the candidate words ranked in the top m by number of matches with the templates in the candidate template set, where m is a positive integer; or

adding to the new word set the candidate words whose number of matches with the templates in the candidate template set exceeds a specific threshold.

According to an embodiment of the present invention, in the step of ranking the candidate words in the candidate word set based on the templates in the candidate template set,

the weight of each candidate word in the candidate word set is computed as one of LLR(w_i), E(w_i), P(w_i), EMI(w_i) and 1/NMED(w_i), or as the product of any number of them, and the candidate words in the candidate word set are ranked by the computed weights;

where w_i denotes a candidate word in the candidate word set W, LLR(w_i) measures how closely the candidate word w_i is statistically associated with the templates in the candidate template set, E(w_i) denotes the left information entropy of w_i, P(w_i) denotes the probability that the characters of w_i combine to form a word, and EMI(w_i) and NMED(w_i) denote two different measures of the semantic compositionality of w_i;

where, in the step of ranking the candidate words in the candidate word set based on the templates in the candidate template set, LLR(w_i), E(w_i), P(w_i), EMI(w_i) and 1/NMED(w_i) are respectively computed as follows:

n_1j = k_1j + k_3j, n_2j = k_2j + k_4j,

where W denotes the candidate word set, P denotes the candidate template set, w_i denotes a candidate word in W, and p_j denotes a template in the candidate template set P. Among the matches found in the document set for the one or more templates, k_1j is the number of matches that contain both w_i and p_j, k_2j is the number that contain w_i but not p_j, k_3j is the number that contain p_j but not w_i, and k_4j is the number that contain neither p_j nor w_i;

where L denotes the set of left-hand words l_o that occur in the document set immediately to the left of the candidate word w_i in a context matching any template in the candidate template set, c(l_o) denotes the number of times the left-hand word l_o occurs to the left of w_i in such a context, and N denotes the total number of times w_i occurs together with the templates in the candidate template set;

where t_h denotes the h-th character of the candidate word w_i in the candidate word set, and n denotes the number of single characters in w_i;

all(t_h) denotes the number of times the h-th character of w_i occurs in the document set, and s(t_h) denotes the number of times the h-th character of w_i, together with any character, occurs in the document set as a separate word;

where S is the total number of segments in the document set M, n denotes the number of characters in w_i, F denotes the number of segments in the document set that contain w_i, and F_h denotes the number of segments in the document set that contain the h-th character of w_i;

where S is the total number of segments in the document set M, μ(g) denotes the number of segments in M that contain all the characters of w_i, and (g) denotes the number of segments in M in which all the characters of w_i occur strictly consecutively.
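Of the measures defined above, the left information entropy and the product rule are the easiest to illustrate. The sketch below assumes the usual Shannon entropy over the left-neighbour distribution c(l_o)/N, and assumes that N equals the sum of the counts c(l_o); neither assumption is confirmed by the text above, and the function names and example values are hypothetical.

```python
import math
from typing import Dict, Iterable


def left_entropy(left_counts: Dict[str, int]) -> float:
    """Shannon entropy (natural log) of the left-neighbour distribution.

    left_counts maps each left-hand word l_o to its count c(l_o).
    """
    n = sum(left_counts.values())  # assumed to play the role of N
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n)
                for c in left_counts.values() if c > 0)


def combined_weight(scores: Dict[str, float], measures: Iterable[str]) -> float:
    """Product of the chosen measures; NMED enters as its reciprocal 1/NMED."""
    weight = 1.0
    for name in measures:
        value = scores[name]
        weight *= 1.0 / value if name == "NMED" else value
    return weight


# A word preceded equally often by two different words has entropy ln 2.
print(left_entropy({"太": 1, "好": 1}))
# A word always preceded by the same word has entropy 0.
print(left_entropy({"太": 3}))
# Made-up scores, combined as LLR * E * (1 / NMED).
print(combined_weight({"LLR": 4.0, "E": 0.5, "NMED": 2.0}, ["LLR", "E", "NMED"]))
```

A higher left entropy means the candidate appears after many different words, which suggests that its left boundary is a genuine word boundary.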

According to an embodiment of the present invention, the step of ranking the templates in the candidate template set using the obtained new word set and filtering the candidate template set based on that ranking comprises:

filtering out of the candidate template set the candidate words ranked in the top r by number of matches with the templates in the candidate template set, to obtain the filtered candidate template set, where r is a positive integer; or

filtering out of the candidate template set the candidate words whose number of matches with the templates in the candidate template set is higher than a specific threshold, to obtain the filtered candidate template set.

According to an embodiment of another aspect of the present invention, a device for automatically discovering new words from a document set is also provided, comprising:

a template acquisition unit configured to acquire one or more templates;

a word extraction unit configured to extract from the document set the words that match each of the one or more templates;

a candidate template set adding unit configured to select at least some of the one or more templates and add them to a candidate template set;

a candidate word set adding unit configured to select at least some of the extracted words that match each of the one or more templates and add them to a candidate word set;

a new word set adding unit configured to rank the candidate words in the candidate word set based on the templates in the candidate template set and, based on that ranking, add a certain number of candidate words to a new word set.

According to an embodiment of the present invention, the template acquisition unit acquires the one or more templates in either of the following ways: by predefining the one or more templates, or, after the document set is acquired, by segmenting the document set into words and extracting from the segmented document set the one or more templates that match a specific regular expression.

According to an embodiment of the present invention, the device further comprises:

a candidate template set filtering unit configured to, before the candidate words in the candidate word set are ranked based on the templates in the candidate template set, rank the templates in the candidate template set using a predefined new word set and filter the candidate template set based on that ranking.

According to an embodiment of the present invention, the new word set adding unit is configured to rank the candidate words in the candidate word set again using the filtered candidate template set and, based on this ranking, again add a certain number of candidate words to the new word set.

Compared with the prior art, the method for automatically discovering new words from a document set provided by an embodiment of the present invention can discover new words effectively, more accurately, and without supervision.

Moreover, in the method provided by an embodiment of the present invention, the document set is segmented into words after it is acquired, and the one or more templates matching a specific regular expression are extracted from the segmented document set. This makes the acquired templates more regular, so that new words are discovered more accurately.

Moreover, in the method provided by an embodiment of the present invention, some of the templates are added to the candidate template set based on the number of times each of the one or more templates occurs in the document set. Screening the templates in this way improves the computational efficiency of the invention.

Moreover, in the method provided by an embodiment of the present invention, the matched words ranked in the top g by number of matches with each template (g being a positive integer), or the matched words whose number of matches with each template exceeds a specific threshold, are added to the candidate word set. Screening out specific, better candidate words in this way allows new words to be discovered more accurately.

Moreover, in the method provided by an embodiment of the present invention, before the candidate words in the candidate word set are ranked based on the templates in the candidate template set, the templates in the candidate template set are ranked using a predefined new word set and the candidate template set is filtered based on that ranking. Screening the templates in this way yields templates that better match the new words to be discovered, so that new words are discovered more accurately.

Those of ordinary skill in the art will appreciate that, although the following detailed description refers to illustrated embodiments and the accompanying drawings, the present invention is not limited to these embodiments. Rather, the scope of the invention is broad and is intended to be limited only by the appended claims.

Brief Description of the Drawings

Other features, objects and advantages of the present invention will become more apparent on reading the detailed description of non-limiting embodiments given with reference to the following drawings:

FIG. 1 shows a flowchart of a method for automatically discovering new words from a document set according to an embodiment of the present invention;

FIG. 2 shows a schematic block diagram of a device for automatically discovering new words from a document set according to another embodiment of the present invention;

The same or similar reference numerals in the drawings denote the same or similar components.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings.

FIG. 1 shows a flowchart of a method for automatically discovering new words from a document set according to an embodiment of the present invention. According to an embodiment of the present invention, the method comprises:

Step S101: acquiring one or more templates;

Step S102: extracting from the document set the words that match each of the one or more templates;

Step S103: selecting at least some of the one or more templates and adding them to a candidate template set;

Step S104: selecting at least some of the extracted words that match each of the one or more templates and adding them to a candidate word set;

Step S105: ranking the candidate words in the candidate word set based on the templates in the candidate template set and, based on that ranking, adding a certain number of candidate words to a new word set.

Here the document set may be a single document or a collection of multiple documents. The document set is of course only an example; other corpus resources, such as dictionaries or microblog databases, are equally applicable to the present invention.

Specifically, in step S101, the one or more templates may be acquired in either of the following ways:

Way 1: predefine the one or more templates. For example, templates may be defined by parts of speech such as adverbs and particles, thereby enumerating one or more templates; alternatively, the one or more templates may be preset or given. Or,

Way 2: after the document set is acquired, segment the document set into words and extract from the segmented document set the one or more templates that match a specific matching rule. The word segmentation method is not specifically limited here; methods based on string matching, on understanding, or on statistics, among others, are all applicable to the present invention and are incorporated herein by reference. The specific matching rule includes, for example, matching based on regular expressions; it is not limited here and may include other matching rules. A regular expression here mainly refers to an expression formed by combining one or more specific parts of speech. For example, "noun + adjective" and "adverb + space/* + interjection" are two different regular expressions, where the space or * stands for any other content and part of speech. Preferably, in this embodiment, if Way 2 is used to acquire the one or more templates, different regular expressions are defined according to the part of speech of the new words to be discovered. For example, if the aim of this embodiment is to discover sentiment words, the main regular expressions defined are "adverb + space/* + interjection", "adverb + space/*" and so on.

Here a template mainly refers to a combination of words and spaces or custom symbols. For example, "太+/*+了" and "好++啊" are two different templates, where the space or custom symbol stands for a combination of any number of words. As an example of enumerating a template, the pattern "adverb + space + interjection" defines the following template: "真++啊".
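As an illustration of Way 2, the sketch below scans part-of-speech-tagged sentences for the pattern "adverb + slot + interjection" and emits one template per match. It is a simplified sketch under stated assumptions: the tag names "adv", "adj" and "interj" are hypothetical, the input is assumed to be already segmented and tagged, and the slot is limited to exactly one token although the text above allows any number of words.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); tag names are hypothetical


def extract_templates(sentences: List[List[Token]],
                      left_pos: str = "adv",
                      right_pos: str = "interj") -> Set[str]:
    """Collect 'left++right' templates from every window of three tokens
    whose first token has tag left_pos and whose third has tag right_pos."""
    templates = set()
    for tokens in sentences:
        for (lw, lp), _slot, (rw, rp) in zip(tokens, tokens[1:], tokens[2:]):
            if lp == left_pos and rp == right_pos:
                templates.add(f"{lw}++{rw}")
    return templates


tagged = [
    [("真", "adv"), ("漂亮", "adj"), ("啊", "interj")],
    [("好", "adv"), ("漂亮", "adj"), ("啊", "interj")],
]
found_templates = extract_templates(tagged)
print(found_templates)
```

The middle token is ignored on purpose: it is the slot that later matching fills with candidate words.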

In step S102, the words that match each of the one or more templates are extracted from the document set. Specifically, take the document set "太给力了，太坑爹了，好给力啊，好坑爹啊，好漂亮啊，非常坑爹啊" and the templates "太+/*+了" and "好++啊" as an example. The words "给力" and "坑爹" in the document set match the template "太+/*+了", and the words "给力", "坑爹" and "漂亮" in the document set match the template "好++啊". The words extracted from this document set are therefore "给力", "坑爹" and "漂亮". This embodiment of course does not limit how words are extracted from the document set; such methods are incorporated herein by reference.
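The example above can be reproduced with ordinary regular expressions. The sketch below is one possible reading, not the extraction method of the invention: it assumes each template can be represented as a hypothetical (prefix, suffix) pair, splits the document into clauses at punctuation, and requires the whole clause to match the template.

```python
import re
from typing import Dict, List, Tuple


def extract_words(document: str,
                  templates: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return, for each template, the slot fillers found in the document."""
    # Split into clauses at Chinese or ASCII punctuation and whitespace.
    clauses = [c for c in re.split(r"[，。！？,.!?\s]+", document) if c]
    found: Dict[str, List[str]] = {name: [] for name in templates}
    for name, (prefix, suffix) in templates.items():
        pattern = re.compile(re.escape(prefix) + "(.+)" + re.escape(suffix))
        for clause in clauses:
            m = pattern.fullmatch(clause)  # the whole clause must fit the template
            if m:
                found[name].append(m.group(1))
    return found


doc = "太给力了，太坑爹了，好给力啊，好坑爹啊，好漂亮啊，非常坑爹啊"
templates = {"太+/*+了": ("太", "了"), "好++啊": ("好", "啊")}
matches = extract_words(doc, templates)
for name, words in matches.items():
    print(name, words)
```

The clause "非常坑爹啊" matches neither template, because it starts with neither "太" nor "好".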

In step S103, at least some of the one or more templates are selected and added to the candidate template set. Specifically, some of the one or more templates may be selected and added to the candidate template set, or all of the one or more templates may be added to the candidate template set.

Optionally, some of the templates are added to the candidate template set based on the number of times each of the one or more templates occurs in the document set. Specifically, the number of occurrences of each template in the document set is counted first, the one or more templates are then screened according to a certain rule, and the templates that pass the screening are added to the candidate template set. Again take the document set "太给力了，太坑爹了，好给力啊，好坑爹啊，好漂亮啊，非常坑爹啊" and the templates "太+/*+了" and "好++啊" as an example. The template "太+/*+了" matches "太给力了" and "太坑爹了" in the document set, so its number of occurrences in the document set is counted as 2. Likewise, the number of occurrences of the template "好++啊" is counted as 3. The one or more templates are then screened according to a certain rule, and the templates that pass the screening are added to the candidate template set.

Optionally, the step of adding some of the templates to the candidate template set based on the number of times each of the one or more templates occurs in the document set comprises: adding to the candidate template set the templates ranked in the top f by number of occurrences in the document set, where f is a positive integer; or adding to the candidate template set the templates whose number of occurrences in the document set exceeds a specific threshold.

For example, again with the document set "太给力了，太坑爹了，好给力啊，好坑爹啊，好漂亮啊，非常坑爹啊" and the templates "太+/*+了" and "好++啊", once the template "太+/*+了" has been counted as occurring 2 times and the template "好++啊" as occurring 3 times: if the template ranked first by number of occurrences is added to the candidate template set, the template "太+/*+了" is filtered out and the template "好++啊" is added to the candidate template set; if the templates occurring more than once are added to the candidate template set, both "太+/*+了" and "好++啊" are added to the candidate template set.
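Both screening rules of step S103 can be sketched on the occurrence counts from the example. The function name and its parameters `top_f` and `threshold` are hypothetical; the counts 2 and 3 come from the example above.

```python
from typing import Dict, List, Optional


def select_templates(occurrences: Dict[str, int],
                     top_f: Optional[int] = None,
                     threshold: Optional[int] = None) -> List[str]:
    """Keep the top_f most frequent templates, or those occurring more than
    `threshold` times. With neither rule given, all templates are kept."""
    ranked = sorted(occurrences, key=lambda t: -occurrences[t])
    if top_f is not None:
        return ranked[:top_f]
    if threshold is not None:
        return [t for t in ranked if occurrences[t] > threshold]
    return ranked


occurrences = {"太+/*+了": 2, "好++啊": 3}
top1 = select_templates(occurrences, top_f=1)
frequent = select_templates(occurrences, threshold=1)
print(top1)
print(frequent)
```

The two rules differ in behavior: top f always returns at most f templates, while the threshold rule can keep any number of templates, including none.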

在步骤S104中，从提取出的与所述一个或多个模板中的各模板相匹配的词语中至少选取一部分词语加入到候选词集合。具体地，可以提取出的与所述一个或多个模板中的各模板相匹配的词语中选取一部分词语加入到候选词集合，也可以将所提取出的与所述一个或多个模板中的各模板相匹配的词语全部加入到候选词集合。In step S104, at least a part of the extracted words matching each template in the one or more templates is selected and added to the candidate word set. Specifically, a part of the extracted words that match each template in the one or more templates can be selected and added to the candidate word set, or the extracted words that match the one or more templates All the words matched by each template are added to the candidate word set.

Optionally, a subset of the words is added to the candidate word set based on the number of times each matched word matches each template. Specifically, the number of matches between each matched word and each template is counted first, the matched words are then screened according to a given rule, and the words that pass are added to the candidate word set. Take again the document set “太给力了，太坑爹了，好给力啊，好坑爹啊，好漂亮啊，非常坑爹啊” and the templates “太+/*+了” and “好++啊”. The word “给力” matches “太+/*+了” and “好++啊” once each, the word “坑爹” likewise matches each template once, and the word “漂亮” matches “太+/*+了” 0 times and “好++啊” once. The matched words are then screened according to the rule, and the union of the words that pass is added to the candidate word set.

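Optionally, the step of adding a subset of the words to the candidate word set based on the number of times each matched word matches each template includes: adding, for each template, the matched words ranked in the top g by number of matches with that template to the candidate word set, where g is a positive integer; or adding the matched words whose number of matches with a template exceeds a specific threshold to the candidate word set.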

For example, take again the document set “太给力了，太坑爹了，好给力啊，好坑爹啊，好漂亮啊，非常坑爹啊” and the templates “太+/*+了” and “好++啊”. Once the match counts between each word and each template are known, if the words ranked first by number of matches with “太+/*+了” or with “好++啊” are added to the candidate word set, then “给力”, “坑爹” and “漂亮” are all added, because the words tie for first place under each template. If instead the words whose number of matches with “太+/*+了” or “好++啊” is at least 1 are added, then the union of “给力”, “坑爹” and “漂亮” is added to the candidate word set.
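The per-template word counts and the two screening rules can be sketched as follows. The sketch rests on the same assumption as before: a template is encoded as the fixed text to the left and right of its slot (the `TEMPLATES` mapping is hypothetical), and the slot filler is taken as the matched word. Ties share a rank, and the threshold rule is written as "at least", which is what reproduces the example's result.

```python
import re
from collections import Counter, defaultdict

SEGMENTS = "太给力了，太坑爹了，好给力啊，好坑爹啊，好漂亮啊，非常坑爹啊".split("，")
# Hypothetical encoding: template name -> (text left of slot, text right of slot).
TEMPLATES = {"太+/*+了": ("太", "了"), "好++啊": ("好", "啊")}

def count_words(segments, templates):
    """For each template, count how often each word fills its slot."""
    counts = defaultdict(Counter)
    for seg in segments:
        for name, (left, right) in templates.items():
            m = re.fullmatch(re.escape(left) + "(.+)" + re.escape(right), seg)
            if m:
                counts[name][m.group(1)] += 1
    return counts

def top_g(counts, g):
    """Union over templates of the words in the top g count ranks (ties share a rank)."""
    chosen = set()
    for per_template in counts.values():
        best = sorted(set(per_template.values()), reverse=True)[:g]
        chosen |= {w for w, c in per_template.items() if c in best}
    return chosen

def at_least(counts, threshold):
    """Union over templates of the words matched at least `threshold` times."""
    return {w for per in counts.values() for w, c in per.items() if c >= threshold}

counts = count_words(SEGMENTS, TEMPLATES)
print(top_g(counts, 1) == {"给力", "坑爹", "漂亮"})     # True
print(at_least(counts, 1) == {"给力", "坑爹", "漂亮"})  # True
```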

In step S105, the candidate words in the candidate word set are ranked based on the templates in the candidate template set, and a certain number of candidate words are added to the new word set according to that ranking.

Specifically, take the candidate template set {“太+/*+了”, “好++啊”} and the candidate word set {“给力”, “坑爹”}. The candidate words can be ranked by the number of times each of them matches the templates in the candidate template set. For example, suppose “给力” matches “太+/*+了” 5 times and “好++啊” 4 times, while “坑爹” matches “太+/*+了” 3 times and “好++啊” 5 times. The candidate words can then be ranked by their total number of matches across all templates in the candidate template set (9 for “给力” and 8 for “坑爹”), or ranked separately for each template by the number of matches with that template. Other ranking methods or algorithms may also be used; no limitation is imposed here.
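The two count-based rankings in this example can be written out directly. The sketch below only restates the arithmetic of the example; the dictionary layout is an assumption made for illustration.

```python
# Match counts from the example: word -> {template -> number of matches}.
match_counts = {
    "给力": {"太+/*+了": 5, "好++啊": 4},
    "坑爹": {"太+/*+了": 3, "好++啊": 5},
}

# Option 1: rank by total matches across all candidate templates.
totals = {word: sum(per.values()) for word, per in match_counts.items()}
by_total = sorted(totals, key=totals.get, reverse=True)

# Option 2: rank separately for each template.
templates = ["太+/*+了", "好++啊"]
per_template = {
    t: sorted(match_counts, key=lambda w: match_counts[w][t], reverse=True)
    for t in templates
}

print(totals)        # {'给力': 9, '坑爹': 8}
print(by_total)      # ['给力', '坑爹']
print(per_template)  # {'太+/*+了': ['给力', '坑爹'], '好++啊': ['坑爹', '给力']}
```

Note that the two options disagree for “好++啊”, which is why the choice of ranking rule matters.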

Optionally, ranking the candidate words in the candidate word set based on the templates in the candidate template set, and adding a certain number of candidate words to the new word set according to that ranking, includes:

Optionally, in the step of ranking the candidate words in the candidate word set based on the templates in the candidate template set,

the weight weight(w_i) of each candidate word in the candidate word set is computed as one of LLR(w_i), E(w_i), P(w_i), EMI(w_i) and 1/NMED(w_i), or as the product of any several of them, for example LLR(w_i), LLR(w_i)*E(w_i), LLR(w_i)*E(w_i)*P(w_i), LLR(w_i)*E(w_i)*EMI(w_i) or LLR(w_i)*E(w_i)/NMED(w_i), and the candidate words in the candidate word set are ranked by the computed weight. The computation of weight(w_i) is not limited to the forms above: any computation that evaluates how closely a candidate word is associated with the templates in the candidate template set and/or measures the internal cohesion of the candidate word is applicable to the present invention and is incorporated herein by reference.

Here w_i denotes a candidate word in the candidate word set W, LLR(w_i) denotes how closely the candidate word w_i is associated with the templates in the candidate template set, E(w_i) denotes the left information entropy of w_i, P(w_i) denotes the probability that the characters of w_i combine to form a word, and EMI(w_i) and NMED(w_i) denote two different measures of the semantic compositionality of w_i.
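Combining the measures into a weight is a plain product, with NMED entering as its reciprocal. The sketch below illustrates that combination only; the function name `weight` and all numeric values are hypothetical and are not taken from the patent.

```python
from math import prod

def weight(measures, use=("LLR", "E")):
    """Multiply the selected measures; NMED contributes as 1/NMED."""
    factors = []
    for name in use:
        value = measures[name]
        factors.append(1.0 / value if name == "NMED" else value)
    return prod(factors)

# Hypothetical measure values, for illustration only.
candidates = {
    "给力": {"LLR": 4.0, "E": 1.0, "P": 0.5, "EMI": 3.0, "NMED": 0.5},
    "坑爹": {"LLR": 3.0, "E": 1.2, "P": 0.4, "EMI": 2.5, "NMED": 0.8},
}

combo = ("LLR", "E", "NMED")  # corresponds to LLR(w_i)*E(w_i)/NMED(w_i)
ranked = sorted(candidates, key=lambda w: weight(candidates[w], combo), reverse=True)
print(weight(candidates["给力"], combo))  # 8.0
print(ranked)                             # ['给力', '坑爹']
```

Passing a different `use` tuple selects any of the other products listed above without changing the ranking code.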

In the step of ranking the candidate words in the candidate word set based on the templates in the candidate template set, LLR(w_i), E(w_i), P(w_i), EMI(w_i) and 1/NMED(w_i) can each be computed as follows:

$$n_{1j} = k_{1j} + k_{3j}, \qquad n_{2j} = k_{2j} + k_{4j},$$

where W denotes the candidate word set, P the candidate template set, w_i a candidate word in W, and p_j a template in P. Among the matches found in the document set for the one or more templates, k_1j is the number of matches that contain both w_i and p_j, k_2j is the number that contain w_i but not p_j, k_3j is the number that contain p_j but not w_i, and k_4j is the number that contain neither p_j nor w_i. For example, with the document set “太给力了，太坑爹了，好给力啊，好坑爹啊，非常坑爹啊”, the candidate word “给力”, and the template p_1 “太+/*+了” in the candidate template set P, the counts are k_11 = 1, k_21 = 1, k_31 = 1 and k_41 = 2;
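The four counts form a 2x2 contingency table over (template, word) matches. The sketch below computes them for the example and then evaluates a log-likelihood ratio. The LLR formula itself does not survive in this text, so the function `llr` assumes Dunning's log-likelihood ratio for binomial proportions, which uses exactly n_1 = k_1 + k_3 and n_2 = k_2 + k_4; this is an assumption, not a confirmed reproduction of the patent's formula. The template label given to the fifth match is also a placeholder.

```python
from math import log

# (template, word) matches for the example document set. The source counts
# "非常坑爹啊" as a match containing neither p_1 nor "给力"; its template label
# here is a placeholder.
MATCHES = [
    ("太+/*+了", "给力"), ("太+/*+了", "坑爹"),
    ("好++啊", "给力"), ("好++啊", "坑爹"),
    ("非常++啊", "坑爹"),
]

def contingency(matches, word, template):
    """Return (k1, k2, k3, k4) as defined in the text."""
    k1 = sum(1 for t, w in matches if t == template and w == word)
    k2 = sum(1 for t, w in matches if t != template and w == word)
    k3 = sum(1 for t, w in matches if t == template and w != word)
    k4 = sum(1 for t, w in matches if t != template and w != word)
    return k1, k2, k3, k4

def _log_l(p, k, n):
    """k*log(p) + (n-k)*log(1-p), treating 0*log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * log(p)
    if n - k > 0:
        total += (n - k) * log(1 - p)
    return total

def llr(k1, k2, k3, k4):
    """Dunning-style log-likelihood ratio (assumed form)."""
    n1, n2 = k1 + k3, k2 + k4
    p1 = k1 / n1 if n1 else 0.0
    p2 = k2 / n2 if n2 else 0.0
    p = (k1 + k2) / (n1 + n2)
    return 2 * (_log_l(p1, k1, n1) + _log_l(p2, k2, n2)
                - _log_l(p, k1, n1) - _log_l(p, k2, n2))

k = contingency(MATCHES, "给力", "太+/*+了")
print(k)  # (1, 1, 1, 2)
score = llr(*k)
```

How the per-template values are aggregated into a single LLR(w_i) over all templates is not shown in this text, so the sketch stops at one (word, template) pair.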

where E(w_i) is computed as follows:

$$E(w_i) = -\sum_{l_o \in L} \frac{c(l_o)}{N} \log \frac{c(l_o)}{N}$$

where L denotes the set of left-hand words l_o that occur in the document set immediately to the left of the candidate word w_i in a match of any template in the candidate template set, c(l_o) is the number of times the left-hand word l_o occurs to the left of w_i in such a match, and N is the total number of times w_i occurs together with templates in the candidate template set. In general, a larger left information entropy E(w_i) indicates more diverse left-hand collocations between the candidate word and the candidate template set, and hence a higher likelihood that the candidate word is a new word. For example, with the document set {“好给力啊”, “太给力啦”, “太给力了”, “好给力呀”, “真给力呀”}, the candidate word “给力”, and the candidate template set {“好++啊”, “太++啦”, “太++了”, “好++呀”, “真++呀”}, the set of left-hand words l_o is {好, 太, 真}, with c(l_1) = 2, c(l_2) = 2 and c(l_3) = 1, and E(w_i) is computed as follows:

$$E(w_i) = -\tfrac{2}{5}\log\tfrac{2}{5} - \tfrac{2}{5}\log\tfrac{2}{5} - \tfrac{1}{5}\log\tfrac{1}{5};$$
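The left-entropy example can be checked numerically. The sketch below assumes every segment in the list has already matched a candidate template, takes the single character immediately left of the candidate word as the left-hand word, and uses the natural logarithm; the text does not state the base of the logarithm, so the base is an assumption.

```python
from collections import Counter
from math import log

SEGMENTS = ["好给力啊", "太给力啦", "太给力了", "好给力呀", "真给力呀"]

def left_entropy(segments, word):
    """Entropy of the distribution of characters found just left of `word`."""
    left = Counter()
    for seg in segments:
        i = seg.find(word)
        if i > 0:                      # there is a character to the left
            left[seg[i - 1]] += 1
    n = sum(left.values())
    return -sum(c / n * log(c / n) for c in left.values())

e = left_entropy(SEGMENTS, "给力")
print(round(e, 4))  # 1.0549
```

A word that always follows the same left-hand character would score 0, which is the low-diversity case the text contrasts against.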

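where, for a candidate word w_i in the candidate word set, P(w_i) is computed as follows:

$$P(w_i) = \prod_{h=1}^{n} \frac{P(t_h)}{1 - P(t_h)}, \qquad P(t_h) = \frac{all(t_h) - s(t_h)}{all(t_h)}$$

where t_h denotes the h-th character of w_i and n denotes the number of characters in w_i;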

all(t_h) is the number of times the h-th character of the candidate word w_i occurs in the document set, and s(t_h) is the number of times the h-th character of w_i, combined with any other character, occurs as a separate word in the document set. Take the word “爱说” as an example: “爱” is its first character and “说” is its second. Suppose counting shows that “爱” occurs 100 times in the document set and “说” occurs 200 times, that “爱” combines with other characters in the document set into separate words 50 times, and that “说” does so 150 times. The word-formation probability of the first character “爱” is then P(爱) = (100-50)/100, that of the second character “说” is P(说) = (200-150)/200, and the probability that the candidate word “爱说” forms a word on its own is P(爱说) = (P(爱)/(1-P(爱))) * (P(说)/(1-P(说)));
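The “爱说” example works out to 1/3, as the short sketch below confirms. The function names are hypothetical, and the sketch does not guard the case s(t_h) = 0, where P(t_h) = 1 and the ratio P/(1-P) is undefined.

```python
def char_prob(all_count, s_count):
    """P(t_h) = (all(t_h) - s(t_h)) / all(t_h)."""
    return (all_count - s_count) / all_count

def word_prob(char_stats):
    """Product over characters of P(t_h) / (1 - P(t_h))."""
    result = 1.0
    for all_count, s_count in char_stats:
        p = char_prob(all_count, s_count)
        result *= p / (1 - p)          # undefined when p == 1 (s_count == 0)
    return result

# (all(t_h), s(t_h)) for "爱" and "说" from the example.
p_ai = char_prob(100, 50)    # 0.5
p_shuo = char_prob(200, 150) # 0.25
p_word = word_prob([(100, 50), (200, 150)])
print(p_ai, p_shuo, round(p_word, 4))  # 0.5 0.25 0.3333
```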

where, for a candidate word w_i in the candidate word set, the semantic compositionality measure EMI is computed as follows:

where S is the total number of segments in the document set M, n is the number of characters in the candidate word w_i, F is the number of segments in the document set that contain w_i, and F_h is the number of segments in the document set that contain the h-th character of w_i. A segment here may be defined as a natural passage of each document in M, or as each whole document in M; no limitation is imposed. For example, if the document set M consists of microblog posts, each post can be treated as one segment. Taking the candidate word “给力” as an example, suppose the total number of segments in M is 200, the number of segments containing “给力” is 10, the number of segments containing the character “给” is 20, and the number of segments containing the character “力” is 30. Then S = 200, n = 2, F = 10, F_1 = 20 and F_2 = 30, from which EMI(w_i) is computed.
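The EMI formula and the numeric result of this example do not survive in this text. As a hedged illustration only, the sketch below assumes the enhanced mutual information form log2((F/S) / Π_h((F_h - F)/S)), which is one published definition built from exactly the quantities S, n, F and F_h; the patent's own formula may differ.

```python
from math import log2

def emi(S, F, F_parts):
    """Assumed form: log2( (F/S) / prod_h((F_h - F)/S) )."""
    denom = 1.0
    for F_h in F_parts:
        denom *= (F_h - F) / S   # segments with the character but not the word
    return log2((F / S) / denom)

# Values from the example: S = 200, F = 10, F_1 = 20, F_2 = 30.
value = emi(200, 10, [20, 30])
print(round(value, 2))  # 3.32
```

Under this assumed form the value grows when the characters rarely appear outside the word, which is the cohesion signal the measure is meant to capture. The sketch does not handle F_h = F, where the denominator becomes zero.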

where, for a candidate word w_i in the candidate word set, the semantic compositionality measure NMED is computed as follows:

where S is the total number of segments in the document set M, μ(g) is the number of segments in M that contain all the characters of the candidate word w_i, and the second count (g) is the number of segments in M in which all the characters of w_i appear strictly consecutively. In general, the smaller the value of NMED(w_i), the greater the probability that the characters of w_i form words with characters outside w_i. As an example of the characters of a candidate word appearing scattered within one segment, take the candidate word “给力” and the segment “忙了一天什么力气都没有，真是给跪了。”; as an example of the characters appearing strictly consecutively within one segment, take “他们的表演实在是太给力了！”.
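The two segment counts can be computed with simple containment tests. The NMED formula is missing from this text, so the sketch assumes the normalized-distance form (log μ - log φ) / (log S - log φ), where φ stands for the strictly consecutive count; both the formula and the name `phi` are assumptions. The first two segments come from the example above, and the last two are hypothetical filler.

```python
from math import log

SEGMENTS = [
    "忙了一天什么力气都没有，真是给跪了。",  # characters scattered
    "他们的表演实在是太给力了！",            # characters consecutive
    "这个功能好给力啊",                      # hypothetical, consecutive
    "今天天气不错",                          # hypothetical, unrelated
]

def segment_counts(segments, word):
    """Return (S, mu, phi): total, all-characters-present, consecutive."""
    S = len(segments)
    mu = sum(1 for s in segments if all(ch in s for ch in word))
    phi = sum(1 for s in segments if word in s)
    return S, mu, phi

def nmed(segments, word):
    """Assumed form: (log mu - log phi) / (log S - log phi)."""
    S, mu, phi = segment_counts(segments, word)
    return (log(mu) - log(phi)) / (log(S) - log(phi))

print(segment_counts(SEGMENTS, "给力"))  # (4, 3, 2)
score = nmed(SEGMENTS, "给力")
```

Under this assumed form the value is 0 when the characters never appear apart, so 1/NMED would need a guard for that case, and the logarithms require phi to be at least 1 and smaller than S.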

Optionally, before the candidate words in the candidate word set are ranked based on the templates in the candidate template set, the templates in the candidate template set are ranked using a predefined new word set, and the candidate template set is filtered based on that ranking. The predefined new word set may contain any number of arbitrary words specified in advance. To improve the accuracy of new word discovery, words that users already regard as recent or popular before discovery begins are usually chosen for this set.

Optionally, the step of ranking the templates in the candidate template set using the predefined new word set and filtering the candidate template set based on that ranking includes:

filtering from the candidate template set the candidate templates ranked in the top r by number of matches with the new words in the new word set, to obtain a filtered candidate template set, where r is a positive integer; or

filtering from the candidate template set the candidate templates whose number of matches with the new words in the new word set is above a specific threshold, to obtain a filtered candidate template set.

Optionally, the method further includes: ranking the templates in the candidate template set using the obtained new word set, filtering the candidate template set based on that ranking, ranking the candidate words in the candidate word set again using the filtered candidate template set, and again adding a certain number of candidate words to the new word set based on the new ranking. The word “again” places no limit on the number of repetitions, which can be adjusted according to a set condition. Note that the “obtained new word set” here is the result of processing the “predefined new word set” with the method for automatically discovering new words from a document set provided by the present invention.
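The repeated rank, filter and add cycle can be sketched as a small bootstrapping loop. This is an illustration under simplifying assumptions, not the patent's procedure: raw match counts stand in for the template and word weights, filtering is read as retaining the best-ranked templates, and the function name `bootstrap` and its parameters are hypothetical.

```python
from collections import Counter

# (template, word) matches, as produced by the extraction step.
MATCHES = [
    ("太+/*+了", "给力"), ("太+/*+了", "坑爹"),
    ("好++啊", "给力"), ("好++啊", "坑爹"), ("好++啊", "漂亮"),
]

def bootstrap(matches, seed_words, keep_templates=2, add_per_round=1, rounds=1):
    """Alternate between ranking templates by new words and words by templates."""
    new_words = set(seed_words)
    for _ in range(rounds):
        # Rank templates by how often they match words already in the new word set.
        t_score = Counter(t for t, w in matches if w in new_words)
        kept = {t for t, _ in t_score.most_common(keep_templates)}
        # Rank remaining candidate words by matches with the retained templates.
        w_score = Counter(w for t, w in matches if t in kept and w not in new_words)
        for w, _ in w_score.most_common(add_per_round):
            new_words.add(w)
    return new_words

print(bootstrap(MATCHES, {"给力"}, rounds=1) == {"给力", "坑爹"})          # True
print(bootstrap(MATCHES, {"给力"}, rounds=2) == {"给力", "坑爹", "漂亮"})  # True
```

Each round enlarges the new word set, which in turn changes the template ranking of the next round; a stopping condition such as a fixed number of rounds corresponds to the adjustable repetition count mentioned above.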

Optionally, the step of ranking the templates in the candidate template set using the obtained new word set and filtering the candidate template set based on that ranking includes:

Optionally, the templates in the candidate template set are ranked using the obtained new word set by computing a weight for each template in the candidate template set according to the following formula and ranking the templates by the computed weights:

where W denotes the new word set, P the candidate template set, w_i a word in the new word set W, and p_j a template in the candidate template set P. Among the matches found in the document set for the one or more templates, k_1i is the number of matches that contain both w_i and p_j, k_2i is the number that contain w_i but not p_j, k_3i is the number that contain p_j but not w_i, and k_4i is the number that contain neither p_j nor w_i.

Fig. 2 shows a schematic block diagram of an apparatus for automatically discovering new words from a document set according to another embodiment of the present invention. According to this embodiment, the apparatus includes:

a template acquiring unit 201, configured to acquire one or more templates;

a word extraction unit 202, configured to extract from the document set the words that match each of the one or more templates;

a candidate template set adding unit 203, configured to select at least some of the one or more templates and add them to the candidate template set;

a candidate word set adding unit 204, configured to select at least some of the extracted words that match each of the one or more templates and add them to the candidate word set;

a new word set adding unit 205, configured to rank the candidate words in the candidate word set based on the templates in the candidate template set, and to add a certain number of candidate words to the new word set according to that ranking.

It should be understood that the block diagram shown in Fig. 2 is for illustration only and does not limit the scope of the present invention. In some cases, units or devices may be added or removed as circumstances require.

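Optionally, the template acquiring unit 201 acquires the one or more templates in any of the following ways: predefining the one or more templates; or, after the document set is obtained, performing word segmentation on the document set and extracting from the segmented document set the one or more templates that match a specific regular expression.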

Optionally, the apparatus for automatically discovering new words from a document set further includes:

a candidate template set filtering unit 206, configured to rank the templates in the candidate template set using a predefined new word set before the candidate words in the candidate word set are ranked based on the templates in the candidate template set, and to filter the candidate template set based on that ranking.

Optionally, the new word set adding unit 205 is configured to rank the candidate words in the candidate word set again using the filtered candidate template set, and to again add a certain number of candidate words to the new word set based on that ranking.

Those skilled in the art will appreciate that the present invention can be implemented as a device, an apparatus, a method or a computer program product. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software, or in a combination of hardware and software.

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments above, and that it can be embodied in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the invention. No reference sign in a claim should be construed as limiting that claim.

Claims

1. A method of automatically discovering new words from a document collection, comprising:

acquiring one or more templates (S101), wherein the templates comprise words and spaces or/and custom symbols;

extracting words matched with each template in the one or more templates from the document set (S102), wherein the words matched with each template in the one or more templates extracted from the document set are words except for words included in the template;

selecting at least one part of templates from the one or more templates and adding the selected part of templates into a candidate template set (S103);

at least selecting a part of words from the extracted words matched with each template in the one or more templates and adding the selected words into a candidate word set (S104);

ranking the candidate words in the candidate word set based on the templates in the candidate template set, and adding a certain number of candidate words to the new word set based on that ranking (S105).

2. The method of claim 1, wherein the one or more templates are obtained by any one of:

predefining said one or more templates, or

after the document set is obtained, performing word segmentation on the document set, and extracting from the segmented document set the one or more templates that match a specific regular expression.

3. The method of claim 1, wherein the step of selecting at least a portion of the templates from the one or more templates for adding to the set of candidate templates comprises any one of:

adding all of the one or more templates into a candidate template set;

adding a portion of the templates to a set of candidate templates based on a number of times each of the one or more templates occurs in the set of documents.

4. The method of claim 3, wherein adding a portion of the templates to the set of candidate templates based on a number of times each of the one or more templates occurs in the set of documents comprises:

adding the templates ranked in the top f by number of occurrences in the document set to the candidate template set, wherein f is a positive integer; or

adding the templates that occur in the document set more than a specific threshold number of times to the candidate template set.

5. The method of claim 1, wherein the step of selecting at least a portion of the extracted words that match each of the one or more templates to be added to the set of candidate words comprises any one of:

adding all the matched words into a candidate word set;

adding a portion of the words to the candidate word set based on the number of times the matched words match each template.

6. The method of claim 5, wherein adding a portion of words to a set of candidate words based on the number of matches of the matched words to each template comprises:

adding, for each template, the matched words ranked in the top g by number of matches with that template to the candidate word set, wherein g is a positive integer; or

adding the matched words whose number of matches with a template exceeds a specific threshold to the candidate word set.

7. The method of claim 1, further comprising: before the candidate words in the candidate word set are ranked based on the templates in the candidate template set, ranking the templates in the candidate template set using a predefined new word set, and filtering the candidate template set based on that ranking.

8. The method of claim 1, further comprising: ranking the templates in the candidate template set using the obtained new word set, filtering the candidate template set based on that ranking, ranking the candidate words in the candidate word set again using the filtered candidate template set, and again adding a certain number of candidate words to the new word set based on the new ranking.

9. The method of claim 1, wherein in the step of ordering candidate words in the set of candidate words based on templates in the set of candidate templates,

according to LLR (w)_i)、E(w_i)、P(w_i)、EMI(w_i)、1/NMED(w_i) Calculating weights of the candidate words in the candidate word set, and ordering the candidate words in the candidate word set based on the calculated weights;

wherein, w_iRepresents one candidate word in the set of candidate words, LLR (W)_i) Represents a candidate word w_iCloseness of contact with templates in the candidate template set, E (w)_i) Represents a candidate word w_iLeft entropy of P (w)_i) Represents a candidate word w_iProbability of word-to-word, EMI (w)_i) And 1/NMED (w)_i) Respectively represent the candidate words w_iOf semantic synthesis, wherein LLR (w)_i)、E(w_i)、P(w_i)、EMI(w_i)、1/NMED(w_i) Respectively obtained by the following calculation:

n_1j＝k_1j+k_3j，n_2j＝k_2j+k_4j，

wherein W represents a set of candidate words, P represents a set of candidate templates, W_iRepresenting a candidate word in W, p_jRepresenting one template, k, of the set of candidate templates P_1jIndicating that a match found from the document set to each of the one or more templates contains both w and w_iAnd also contains p_jK is the number of matches_2jIndicating that a match found from the document set with each of the one or more templates contains w_iBut does not contain p_jK is the number of matches_3jIndicating that a match found from the document set with each of the one or more templates contains p_jBut does not contain w_iK is the number of matches_4jIndicating that matches found from the document set to each of the one or more templates contain no p_jNor w_iThe matching number of (2);

where L represents the document set and the candidate word w_iMatching the left side with the left side word l which is already generated and matched with any template in the candidate template set_oSet, c (l)_o) Represents the left-hand word l_oAnd candidate word w_iThe frequency of matching occurring on the left side and matching with any template in the candidate template set, N represents the candidate word w_iA total number of co-occurrences with templates in the set of candidate templates;

wherein t is_hRepresenting a candidate word w in a set of candidate words_iH-th word in (1), n represents a candidate word w_iThe number of individual words contained therein;

all(t_h) represents the number of times the h-th character of the candidate word w_i appears in the document set, and s(t_h) represents the number of times the h-th character of the candidate word w_i appears in the document set as a single-character word;

wherein S is the total number of text segments in the document set M, n represents the number of characters contained in the candidate word w_i, F represents the number of text segments in the document set that contain the candidate word w_i, and F_h represents the number of text segments in the document set that contain the h-th character of the candidate word w_i;

wherein S is the total number of text segments in the document set M, μ(g) represents the number of text segments in the document set M that contain all of the characters of the candidate word w_i, and the other count in the formula represents the number of text segments in the document set M in which all of the characters of the candidate word w_i appear strictly consecutively as a single phrase.
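For illustration only, two of the measures named above can be sketched from the quantities this claim defines: the four match counts k_1j to k_4j and the left-neighbour frequencies c(l_o). The formulas themselves are not reproduced in this text, so the sketch assumes the standard binomial log-likelihood ratio for a 2x2 contingency table (with n_1j = k_1j + k_3j and n_2j = k_2j + k_4j, as stated above) and the standard Shannon entropy with natural logarithm. The function names `llr` and `left_entropy`, the log base, and taking N as the sum of the c(l_o) values are assumptions, not the claimed formulas.

```python
import math


def _xlogy(x, y):
    # Convention 0 * log(0) = 0, so empty cells do not raise errors.
    return 0.0 if x == 0 else x * math.log(y)


def _log_likelihood(p, k, n):
    # log of p^k * (1 - p)^(n - k)
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def llr(k1, k2, k3, k4):
    """Log-likelihood ratio for one (candidate word, template) pair.

    k1: matches containing both word and template
    k2: matches containing the word but not the template
    k3: matches containing the template but not the word
    k4: matches containing neither
    """
    n1 = k1 + k3  # matches that contain the template
    n2 = k2 + k4  # matches that do not contain the template
    p1 = k1 / n1 if n1 else 0.0
    p2 = k2 / n2 if n2 else 0.0
    p = (k1 + k2) / (n1 + n2) if (n1 + n2) else 0.0
    return 2.0 * (
        _log_likelihood(p1, k1, n1)
        + _log_likelihood(p2, k2, n2)
        - _log_likelihood(p, k1, n1)
        - _log_likelihood(p, k2, n2)
    )


def left_entropy(left_counts):
    """Shannon entropy of the left-neighbour distribution.

    left_counts maps each left-side word l_o to its frequency c(l_o).
    Assumption: N is taken to be the sum of all c(l_o).
    """
    total = sum(left_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log(c / total)
        for c in left_counts.values()
        if c > 0
    )
```

Under these assumptions, a word that occurs independently of a template gives a ratio of zero, and a word that always follows the same left-side word has zero left entropy, whereas a word seen after many different left-side words scores higher.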

10. An apparatus for automatically discovering new words from a document collection, comprising:

a template obtaining unit (201) configured to obtain one or more templates, the templates comprising words and spaces and/or custom symbols;

a word extraction unit (202) configured to extract words matching each of the one or more templates from the document set, the words matching each of the one or more templates being words other than the words included in the template;

a candidate template set adding unit (203) configured to select at least a part of the one or more templates and add the selected templates to a candidate template set;

a candidate word set adding unit (204) configured to select at least a part of words from the extracted words matched with each of the one or more templates and add the selected words to a candidate word set;

a new word set adding unit (205) configured to rank the candidate words in the candidate word set based on the templates in the candidate template set, and add a certain number of candidate words to the new word set based on the ranking of the candidate words in the candidate word set by the templates in the candidate template set.
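As a rough illustration of how units (201) to (205) could fit together, the following minimal sketch wires them into one function. It rests on several assumptions that are not taken from the claims: the `*` character is a hypothetical custom symbol marking the slot for the unknown word, each template has exactly one slot with fixed text on both sides, a slot spans at most four characters, and the ranking score is a simple placeholder (the number of distinct candidate templates a word matched) rather than the scoring formula of the method claims.

```python
import re
from collections import defaultdict

SLOT = "*"  # hypothetical custom symbol marking where the unknown word goes


def template_to_regex(template, max_len=4):
    # Assumes exactly one slot, with literal text on both sides of it.
    left, right = template.split(SLOT)
    return re.compile(
        re.escape(left) + "(.{1,%d}?)" % max_len + re.escape(right)
    )


def extract_words(templates, documents):
    """Unit 202: map each template to the words it matched."""
    matches = defaultdict(list)
    for template in templates:
        pattern = template_to_regex(template)
        for doc in documents:
            matches[template].extend(pattern.findall(doc))
    return matches


def discover_new_words(templates, documents, top_n=1):
    # Unit 201 is represented by the `templates` argument.
    matches = extract_words(templates, documents)  # unit 202
    # Unit 203: keep templates that matched at least once.
    candidate_templates = [t for t in templates if matches[t]]
    # Unit 204: keep every word matched by a candidate template.
    candidate_words = {w for t in candidate_templates for w in matches[t]}
    # Unit 205: placeholder score = number of distinct templates matched.
    score = {
        w: sum(w in matches[t] for t in candidate_templates)
        for w in candidate_words
    }
    ranked = sorted(candidate_words, key=lambda w: (-score[w], w))
    return ranked[:top_n]


documents = ["表演非常给力了", "这个太给力了", "今天非常开心了"]
templates = ["非常*了", "太*了"]
new_words = discover_new_words(templates, documents, top_n=1)
```

In this toy run, "给力" is matched by both templates and "开心" by only one, so "给力" ranks first. An actual implementation would replace the placeholder score with the ranking defined in the method claims.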