CN113705227B - Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model - Google Patents


Info

Publication number
CN113705227B
CN113705227B (application CN202010437000.3A)
Authority
CN
China
Prior art keywords
word
word frequency
candidate
constructing
frequency information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010437000.3A
Other languages
Chinese (zh)
Other versions
CN113705227A (en)
Inventor
张一帆
王茂华
顾倩荣
黄永健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Advanced Research Institute of CAS
Original Assignee
Shanghai Advanced Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Advanced Research Institute of CAS filed Critical Shanghai Advanced Research Institute of CAS
Priority to CN202010437000.3A priority Critical patent/CN113705227B/en
Publication of CN113705227A publication Critical patent/CN113705227A/en
Application granted granted Critical
Publication of CN113705227B publication Critical patent/CN113705227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, a system, a medium and a device for constructing a segmentation-free Chinese word embedding model. The construction method comprises the following steps: counting the candidate fragments in a corpus and the word frequency information corresponding to the candidate fragments; determining the association strength of the candidate fragments from the word frequency information, and generating the word-embedding vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set from the vocabulary, and building the word embedding model from the two sampling sets. Aiming at the problem that the vocabularies of existing segmentation-free word embedding models contain too many noise n-grams, the invention takes Chinese corpora as its research object and, building on the negative-sampling skip-gram model, proposes an unsupervised association metric for improving the segmentation-free word embedding model.

Description

Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model
Technical Field
The invention belongs to the technical field of natural language processing and relates to a design method for word embedding models, in particular to a method, a system, a medium and a device for constructing a segmentation-free Chinese word embedding model.
Background
Word embedding is a basic task in the field of natural language processing and plays an important role in downstream tasks such as machine translation and part-of-speech tagging. Because there are no explicit separators between words in Chinese text, existing Chinese word embedding methods generally first perform Chinese word segmentation and take the segmented words as the embedding targets. However, Chinese word segmentation still has many problems, and these problems seriously affect the quality of Chinese word embeddings. Therefore, for languages like Chinese, segmentation-free word embedding models have been proposed to avoid the influence of segmentation errors, and they have been shown to outperform conventional word embedding methods.
Current segmentation-free word embedding models mainly take the n-gram fragments with the Top-K highest word frequencies as the training objects. However, considering word frequency alone leaves a large number of noise n-gram fragments in the word-embedding vocabulary, which degrades the quality of the resulting word embeddings.
Therefore, how to design a segmentation-free word embedding model that reduces the influence of the large number of noise n-gram fragments on the finally generated model, and thereby improves its quality, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above drawbacks of the prior art, the present invention aims to provide a method, a system, a medium and a device for constructing a segmentation-free Chinese word embedding model, so as to solve the problem that the prior art does not reduce the influence of the large number of noise n-gram fragments on the finally generated word embedding model and thus cannot improve its quality.
To achieve the above and other related objects, according to one aspect of the present invention, there is provided a method for constructing a chinese word-segmentation-free word-embedded model, the method comprising: counting candidate segments in a corpus and word frequency information corresponding to the candidate segments; determining the association strength of the candidate segments by combining the word frequency information, and generating a word embedded vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
In an embodiment of the present invention, the candidate fragments are Chinese language model fragments, and the step of counting the candidate fragments in the corpus and their corresponding word frequency information comprises: counting, in the corpus, the Chinese language model fragments corresponding to different fixed length values together with their word frequency information.
In one embodiment of the present invention, the step of determining the association strength of the candidate fragments from the word frequency information and generating the word-embedding vocabulary according to the association strength comprises: determining an unsupervised association metric of each candidate fragment from the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate fragment; and arranging the association strengths in descending order and selecting the top-K candidate fragments as the word-embedding vocabulary.
In one embodiment of the present invention, the step of determining the unsupervised association metric of the candidate segment by combining the word frequency information includes: calculating the mutual information value of the candidate fragments, and determining the corresponding fragment combination when the mutual information value is minimum; determining a first set and a second set according to the fragment combination, and calculating a statistical relationship value between the fragment combination and the first set or the second set; and taking the product of the word frequency information, the mutual information value and the statistical relation value as an unsupervised association measurement index.
In an embodiment of the present invention, the step of determining the first set and the second set according to the fragment combination and calculating the statistical relationship value between the fragment combination and the first or second set comprises: taking as the numerator the maximum of the ratio of the fragment's word frequency to the total word frequency of the first set and the ratio of the fragment's word frequency to the total word frequency of the second set; selecting whichever of the first and second sets has the smaller total word frequency and taking the reciprocal of its number of elements as the denominator; taking the value of the fraction formed by this numerator and denominator as the relative importance of the fragment combination within the first or second set; and determining the statistical relationship value from this relative-importance value.
In one embodiment of the present invention, the association strength is calculated separately for each candidate-fragment length; for each length, the association strengths are arranged in descending order and the top-K candidate fragments are selected for the word-embedding vocabulary, so that different numbers of candidate fragments can be selected for different lengths.
In one embodiment of the present invention, the step of constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set includes: based on a skip-gram model combined with negative sampling, a parameter optimization method is adopted to maximize positive sampling probability and minimize negative sampling probability, and the word embedding model is constructed.
The invention also provides a construction system of the Chinese word-segmentation-free word embedding model, which comprises: the segment statistics module is used for counting candidate segments in the corpus and word frequency information corresponding to the candidate segments; the association measurement module is used for determining association strength of the candidate fragments by combining the word frequency information and generating a word embedded vocabulary according to the association strength; and the model generation module is used for constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
In yet another aspect, the present invention provides a medium having stored thereon a computer program which, when executed by a processor, implements the method for constructing a segmentation-free Chinese word embedding model.
In a final aspect the invention provides an apparatus comprising: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the equipment executes the construction method of the Chinese word-segmentation-free word embedding model.
As described above, the method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model have the following beneficial effects:
A new unsupervised association metric is presented for screening n-gram fragments with strong association. This unsupervised association metric is combined with the word embedding model to construct a new segmentation-free Chinese word embedding model for Chinese corpora. The word embedding model obtained by this method shows better performance in downstream tasks.
Drawings
FIG. 1 is a schematic flow chart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention.
FIG. 2 is a flowchart showing the related metrics of the method for constructing the word-free Chinese word-embedding model according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the correlation strength calculation in one embodiment of the method for constructing a Chinese word-segmentation-free word embedding model according to the present invention.
Fig. 4 shows an effect diagram of the method for constructing the Chinese word-segmentation-free word embedding model according to the present invention against the basic dictionary.
Fig. 5 shows an effect diagram of the method for constructing the chinese word-segmentation-free word embedding model according to the present invention against the rich dictionary.
FIG. 6 is a schematic diagram of a system for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the present invention.
FIG. 7 is a schematic diagram showing the structural connection of the apparatus for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the present invention.
Description of element reference numerals
6. System for constructing Chinese word-segmentation-free word embedding model
61. Fragment statistics module
62. Correlation measurement module
63. Model generation module
7. Apparatus
71. Processor
72. Memory
73. Communication interface
74. System bus
S11 to S13 steps
S121 to S122 steps
S121A to S121C steps
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments may be combined with each other.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
The method for constructing a segmentation-free Chinese word embedding model according to the present invention addresses the problem that the vocabularies of current segmentation-free word embedding models contain too many noise n-grams: taking Chinese corpora as the research object and using the negative-sampling skip-gram model, it provides a way to improve the segmentation-free word embedding model with an unsupervised association metric.
The following explains in detail the principles and implementation of the method, system, medium and device for constructing a segmentation-free Chinese word embedding model according to this embodiment with reference to Figs. 1 to 7, so that those skilled in the art can understand them without creative effort.
Referring to fig. 1, a schematic flow chart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in FIG. 1, the method for constructing the Chinese word-segmentation-free word embedding model specifically comprises the following steps:
s11, counting candidate segments in the corpus and word frequency information corresponding to the candidate segments.
In this embodiment, the candidate segment is a chinese language model segment, for example, the candidate segment is an n-gram segment, and n-gram segments and word frequency information thereof corresponding to different fixed length values are counted in the corpus.
Specifically, a simple segmenter is implemented with an n-gram model to obtain the n-gram fragments. The model is based on the assumption that the occurrence of the n-th word depends only on the preceding n − 1 words and on no other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words. In general only the dependence between a word and its immediate neighbour is computed, i.e. n is taken as 2; n = 3 models the text better, while at n = 4 the amount of computation becomes large.
Specifically, the corpus is preprocessed, and all possible n-gram fragments under each fixed length are counted together with their word frequency information. The n-gram fragments of different lengths and their corresponding word frequencies are organized into lists, forming Table 1. For example, the fragment "一" ("one"), one Chinese character in length, has a word frequency of 529285 in Table 1.
Table 1: Table of candidate fragments
(Table 1 is rendered as an image in the original.)
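As a sketch of this counting step, the following minimal Python snippet enumerates every candidate n-gram fragment of each fixed length and tallies its word frequency. The corpus sentences and `max_len` here are illustrative assumptions, not data from the patent:

```python
from collections import Counter

def count_ngram_fragments(corpus, max_len):
    """Count all candidate n-gram fragments of length 1..max_len.

    `corpus` is a list of sentences (strings of Chinese characters).
    Returns {length: Counter(fragment -> word frequency)}.
    """
    tables = {n: Counter() for n in range(1, max_len + 1)}
    for sentence in corpus:
        for n in range(1, max_len + 1):
            # slide a window of width n over the sentence
            for i in range(len(sentence) - n + 1):
                tables[n][sentence[i:i + n]] += 1
    return tables

# Two toy sentences stand in for the corpus.
tables = count_ngram_fragments(["上海研究院", "上海高等研究院"], max_len=2)
```

Each `tables[n]` then plays the role of one length-row of Table 1.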
And S12, determining the association strength of the candidate segments by combining the word frequency information, and generating a word embedded vocabulary according to the association strength.
In this embodiment, the correlation strength is calculated for each length of the candidate segment.
Further, for candidate fragments of different lengths, the association strengths under each length are arranged in descending order and the top-K candidate fragments are selected for the word-embedding vocabulary, so that different numbers of candidate fragments are selected for different lengths.
Referring to fig. 2, a flowchart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in fig. 2, S12 includes:
S121, determining the unsupervised association metric of each candidate fragment from the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate fragment. PATI (Pointwise Association with Times Information), the unsupervised association metric of the present invention, can discover more strongly associated n-gram fragments by taking more statistical information into account.
Referring to fig. 3, a flowchart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in fig. 3, S121 includes:
S121A, calculating the mutual information value of the candidate fragments, and determining the corresponding fragment combination when the mutual information value is minimum.
Specifically, the mutual information value is the MP value. Each n-gram fragment of length s is written g = w_i w_{i+1} … w_{i+s} (0 ≤ i ≤ N − s), and the left and right parts of g are a = w_i … w_{k−1} and b = w_k … w_{i+s} (i < k < i + s) respectively, i.e. g = concat(a, b). f_a, f_b and f_g denote the word frequencies in the corpus of the strings a and b and of the n-gram fragment g.
For an n-gram fragment g = concat(a, b), its corresponding MP is defined by Equation (1) (rendered as an image in the original).
For an n-gram fragment g of fixed length s there is always a specific left/right combination (a_m, b_m) that minimizes MP. The subsequent calculation of AT is also based on this specific combination (a_m, b_m).
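The split enumeration behind this minimization can be sketched as follows. Since the exact MP formula appears only as an image in the original, the snippet substitutes a plain PMI-style score log(f_g / (f_a · f_b)) as a stand-in; only the structure of minimizing over every split point to find (a_m, b_m) follows the text, and the frequencies are illustrative:

```python
import math
from collections import Counter

def min_split(freq, g):
    """Enumerate every split g = concat(a, b) and return the split (a_m, b_m)
    that minimizes a PMI-style score.  NOTE: the patent's exact MP formula is
    rendered as an image, so log(f_g / (f_a * f_b)) is used here purely as a
    stand-in; only the minimization over split points matches the text."""
    best = None
    for k in range(1, len(g)):
        a, b = g[:k], g[k:]
        score = math.log(freq[g] / (freq[a] * freq[b]))
        if best is None or score < best[0]:
            best = (score, a, b)
    return best[1], best[2]

# Illustrative frequencies for the fragment "研究院" and its substrings.
freq = Counter({"研究院": 3, "研": 10, "究": 12, "究院": 4, "研究": 9, "院": 15})
a_m, b_m = min_split(freq, "研究院")
```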
S121B, determining a first set and a second set according to the fragment combination, and calculating the statistical relation value between the fragment combination and the first set or the second set.
In the present embodiment, S121B includes:
(1) Take as the numerator the maximum of the ratio of the fragment's word frequency to the total word frequency of the first set and the ratio of the fragment's word frequency to the total word frequency of the second set; select whichever of the first and second sets has the smaller total word frequency, and take the reciprocal of its number of elements as the denominator.
For the specific combination (a_m, b_m) of the n-gram fragment g there exist multiple n-gram fragments of the forms (a_m, b_h) and (a_j, b_m). The first set {a_m, *} and the second set {*, b_m} are then defined as follows:

{a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)}    Equation (2)

{*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}    Equation (3)

Let F_{a_m} and F_{b_m} denote the sums of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m} respectively:

F_{a_m} = Σ_{t=1..h} f_{(a_m, b_t)}    Equation (4)

F_{b_m} = Σ_{t=1..j} f_{(a_t, b_m)}    Equation (5)
For the n-gram fragment g and its specific combination (a_m, b_m), the variable rate is the maximum of the ratio of f_g to F_{a_m} and the ratio of f_g to F_{b_m}:

rate = max(f_g / F_{a_m}, f_g / F_{b_m})    Equation (6)
For the two sets {a_m, *} and {*, b_m} with their corresponding totals F_{a_m} and F_{b_m}, let sizeof(·) denote the number of n-gram elements in a set. AC is then the reciprocal of the size of whichever set has the smaller total word frequency:

AC = 1 / sizeof(S_min), where S_min is the set among {a_m, *} and {*, b_m} whose total word frequency is smaller    Equation (7)
(2) The value of the fraction formed by this numerator and denominator is taken as the relative importance of the fragment combination within the first or second set.
Specifically, given the variables rate and AC, the times value of the n-gram fragment g is defined as:

times = rate / AC    Equation (8)
(3) And determining the statistical relation value according to the relative importance degree calculated value.
Specifically, for the specific combination (a_m, b_m) of an n-gram fragment g of length s there is a unique variable times, and AT is calculated as:

AT = 1 + |log(times)|    Equation (9)
And S121C, taking the product of the word frequency information, the mutual information value and the statistical relation value as an unsupervised association measurement index.
Specifically, the formula of PATI (Pointwise Association with Times Information, the unsupervised association metric) is:

PATI = f × MP × AT    Equation (10)

where f = f_g is the word frequency information, MP is the mutual information value, and AT is the statistical relationship value.
MP is an improved version of pointwise mutual information (PMI): when calculating the association strength it additionally takes into account the marginal variables of each n-gram fragment g = concat(a, b), i.e. the statistics of the left part a and the right part b, which makes MP more sensitive to the local information of the n-gram.
AT uses the statistical information of the specific combination (a_m, b_m) within the sets {a_m, *} and {*, b_m} to further measure the association strength of the n-gram. The variable times takes into account F_{a_m} and F_{b_m}, the word frequency of (a_m, b_m), and the number of fragments adjacent to (a_m, b_m) on either side. The higher the relative importance expressed by the times value within the sets, the more reasonable (a_m, b_m) is as a whole. In general, the times values of reasonable n-grams with higher association strength are much larger than those of unreasonable n-gram fragments.
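Equations (6) through (10) can be sketched together as one small function. The MP value is passed in as a given number, since its defining equation is rendered as an image in the original, and the example sets and frequencies are illustrative assumptions:

```python
import math

def pati(f_g, mp, set_am, set_bm):
    """Compute PATI for one fragment g with minimizing split (a_m, b_m).

    set_am / set_bm map each fragment in {a_m,*} / {*,b_m} to its word
    frequency; `mp` is assumed given, because the MP formula itself appears
    only as an image in the original patent text."""
    F_am, F_bm = sum(set_am.values()), sum(set_bm.values())
    rate = max(f_g / F_am, f_g / F_bm)              # Equation (6)
    smaller = set_am if F_am <= F_bm else set_bm    # set with smaller total
    ac = 1.0 / len(smaller)                         # Equation (7)
    times = rate / ac                               # Equation (8)
    at = 1.0 + abs(math.log(times))                 # Equation (9)
    return f_g * mp * at                            # Equation (10)

# Illustrative sets for g = "研究院" split as ("研究", "院").
score = pati(f_g=3, mp=2.0,
             set_am={"研究院": 3, "研究所": 2},
             set_bm={"研究院": 3})
```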
S122, arranging the association strengths in descending order, and selecting the top-K candidate fragments as the word-embedding vocabulary.
Specifically, the proposed unsupervised association metric is used to calculate the association strength of the candidate n-gram fragments for each fragment length. The n-gram fragments with the Top-K highest association strengths are then selected as the vocabulary of the word embedding model. (The Top-K problem refers to finding the K largest, or K smallest, items in a large collection.)
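The per-length Top-K selection can be sketched with `heapq.nlargest`; the fragments and strength values below are illustrative:

```python
import heapq

def topk_per_length(strength, k):
    """Select the K fragments with the highest association strength for each
    fragment length, forming the word-embedding vocabulary.

    `strength` maps fragment -> association strength (e.g. its PATI value)."""
    by_len = {}
    for frag, s in strength.items():
        by_len.setdefault(len(frag), []).append((frag, s))
    vocab = []
    for n, items in sorted(by_len.items()):  # shortest fragments first
        vocab += [f for f, _ in heapq.nlargest(k, items, key=lambda x: x[1])]
    return vocab

vocab = topk_per_length({"研究": 9.0, "上海": 7.5, "海研": 0.2, "院": 3.0}, k=2)
```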
S13, constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
In this embodiment, based on the skip-gram model combined with negative sampling, maximum likelihood estimation is adopted to maximize the probability of positive samples and minimize the probability of negative samples, thereby constructing the word embedding model. The invention uses the unsupervised association metric to screen the word-embedding vocabulary and rebuilds the positive and negative samples of the word embedding model, which reduces the influence of noise n-gram fragments on the model and improves the performance of the word embeddings in downstream tasks.
It should be noted that, the maximum likelihood estimation is only one embodiment of the present invention for parameter estimation and optimization, and other methods for implementing parameter estimation and optimization are also included in the scope of the present invention.
Specifically, PFNE learns word embeddings based on the negative-sampling skip-gram model, which reduces the amount of computation in gradient descent and speeds up model training. The positive sample set N_p of the model is the set of center-word/context pairs (w_t, w_c) generated from the vocabulary and the corpus. For the negative sampling set N_n, a sufficiently large unigram word list is constructed, each n-gram fragment is indexed in the list, and negative samples are drawn at random according to the word frequency of the n-grams in the list. The objective function of the PFNE model (rendered as an image in the original) follows the standard skip-gram negative-sampling form:

L = Σ_{(w_t, w_c) ∈ N_p} log σ(z_{w_c} · z_{w_t}) + Σ_{(w_t, w_n) ∈ N_n} log σ(−z_{w_n} · z_{w_t})    Equation (11)

where z_{w_t} and z_{w_c} are the embedding vectors of the center word w_t and its context w_c. The model uses maximum likelihood estimation to predict the context from the center word, maximizing the probability of positive samples and minimizing the probability of negative samples so as to optimize the word embedding model generated by the objective function. The objective function is optimized by stochastic gradient descent over the positive and negative samples.
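A minimal sketch of this objective, assuming the standard skip-gram negative-sampling form (the equation in the original is rendered as an image) and using illustrative two-dimensional embeddings:

```python
import math

def sgns_objective(z, pos_pairs, neg_pairs):
    """Skip-gram negative-sampling objective (to be maximized): sum of
    log-sigmoid scores of positive center/context pairs, plus log-sigmoid of
    the negated scores of negative pairs.  `z` maps a token to its embedding
    vector.  This follows the standard SGNS form, assumed here because the
    exact equation in the patent appears only as an image."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    total = sum(math.log(sig(dot(z[c], z[t]))) for t, c in pos_pairs)
    total += sum(math.log(sig(-dot(z[n], z[t]))) for t, n in neg_pairs)
    return total

# Illustrative embeddings: "国" is a true context of "中", "噪" a negative sample.
z = {"中": [1.0, 0.0], "国": [1.0, 0.0], "噪": [-1.0, 0.0]}
obj = sgns_objective(z, pos_pairs=[("中", "国")], neg_pairs=[("中", "噪")])
```

In training, the embedding vectors in `z` would be updated by stochastic gradient ascent on this objective.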
Referring to fig. 4 and fig. 5, an effect diagram of the method for constructing a chinese word-free embedding model and the basic dictionary of the present invention and an effect diagram of the method for constructing a chinese word-free embedding model and the rich dictionary of the present invention are shown respectively. In fig. 4 and 5, PFNE represents the result of comparing the n-gram fragment screened using the PATI algorithm with a dictionary; sembei is the result of comparing the n-gram fragments screened out by using frequency (word frequency) with a dictionary; SGNS-PMI is the result of comparing the n-gram fragment screened by PMI (Pointwise Mutual Information, mutual information) with a dictionary. The vertical axis is precision and the horizontal axis is recall. The expressions of the precision and recall are as follows:
precision = |{screened n-grams} ∩ {dictionary}| / |{screened n-grams}|

recall = |{screened n-grams} ∩ {dictionary}| / |{dictionary}|
further, the higher the curve, the longer, indicating that more reasonable n-gram fragments are screened. As can be seen in fig. 4 and 5, the curve of the solid line PFNE using the PATI algorithm is highest and longest among the three curves, so that it is illustrated that the method for constructing the word-segmentation-free Chinese word embedding model according to the present invention can screen more reasonable n-gram segments compared with the word embedding model (basic dictionary or rich dictionary) of the prior art.
The protection scope of the method for constructing the Chinese word-segmentation-free word embedding model is not limited to the execution sequence of the steps listed in the embodiment, and all the schemes realized by the steps of increasing and decreasing and step replacement in the prior art according to the principles of the invention are included in the protection scope of the invention.
The present embodiment provides a computer storage medium having a computer program stored thereon which, when executed by a processor, implements the method for constructing a segmentation-free Chinese word embedding model.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by computer program related hardware. The aforementioned computer program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned computer-readable storage medium includes: various computer storage media such as ROM, RAM, magnetic or optical disks may store program code.
The following describes in detail, with reference to the drawings, the system for constructing a segmentation-free Chinese word embedding model provided in this embodiment. It should be understood that the division into the following modules is merely a division of logical functions; in actual implementation the modules may be wholly or partly integrated into one physical entity or kept physically separate. The modules may be implemented entirely in software invoked by a processing element, entirely in hardware, or partly in software and partly in hardware. For example, a module may be a separately established processing element, or it may be integrated into a chip of the system described below; alternatively, a module may be stored in the memory of the system in the form of program code, and one of the system's processing elements may call and execute its function. The implementation of the other modules is similar. All or some of the modules can be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal-processing capability. In implementation, each step of the above method, or each module below, may be completed by an integrated logic circuit in hardware in a processor element or by instructions in software form.
The following modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), one or more digital signal processors (Digital Signal Processor, DSP for short), one or more field programmable gate arrays (Field Programmable Gate Array, FPGA for short), and the like. When a module is implemented in the form of a processing element calling program code, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or other processor that may call program code. These modules may be integrated together and implemented in the form of a System-on-a-chip (SOC) for short.
Referring to fig. 6, a schematic diagram of a system for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the present invention is shown. As shown in fig. 6, the system 6 for constructing the chinese word-segmentation-free word embedding model includes: a segment statistics module 61, an association metrics module 62, and a model generation module 63.
The segment statistics module 61 is configured to count the candidate segments in the corpus and the word frequency information corresponding to the candidate segments.
In this embodiment, the candidate segments are Chinese language model segments, and the segment statistics module 61 is specifically configured to count, in the corpus, the Chinese language model segments corresponding to different fixed length values together with their word frequency information.
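The counting step can be sketched as follows. This is an illustrative sketch only — the function name, fragment lengths, and the two-line corpus are invented for the example, not taken from the patent:

```python
# Hypothetical sketch: count fixed-length n-gram candidate segments and
# their word frequencies in a corpus, one Counter per fragment length.
from collections import Counter

def count_ngrams(corpus_lines, lengths=(2, 3)):
    counts = {n: Counter() for n in lengths}
    for line in corpus_lines:
        line = line.strip()
        for n in lengths:
            for i in range(len(line) - n + 1):
                counts[n][line[i:i + n]] += 1
    return counts

corpus = ["上海研究院", "研究院研究员"]
counts = count_ngrams(corpus)
# counts[3]["研究院"] == 2: the trigram occurs once in each line
```

Each fixed length gets its own frequency table, matching the per-length statistics described above.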
The association metric module 62 is configured to determine an association strength of the candidate segments in combination with the word frequency information, and generate a word embedded vocabulary according to the association strength.
In this embodiment, the association metric module 62 is specifically configured to determine an unsupervised association metric of the candidate segments in combination with the word frequency information, where the unsupervised association metric characterizes the association strength of the candidate segments; the association strengths are then arranged in descending order, and the top K candidate fragments are selected as the word-embedding vocabulary.
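The descending sort and top-K cut can be sketched as below; `scores` is a hypothetical mapping from candidate fragment to its association strength, with values invented for illustration:

```python
# Hypothetical sketch: rank candidate fragments by association strength
# (descending) and keep the top K as the word-embedding vocabulary.
def top_k_vocab(scores, k):
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [fragment for fragment, _ in ranked[:k]]

scores = {"研究院": 9.1, "海研": 0.4, "研究": 7.5}
vocab = top_k_vocab(scores, k=2)  # → ["研究院", "研究"]
```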
The model generating module 63 is configured to construct a positive sampling set and a negative sampling set according to the vocabulary, and construct a word embedding model by combining the positive sampling set and the negative sampling set.
In this embodiment, the model generating module 63 is specifically configured to construct the word embedding model based on a skip-gram model combined with negative sampling, using parameter optimization to maximize the positive sampling probability and minimize the negative sampling probability.
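As a rough sketch of the objective this module optimizes — the standard skip-gram-with-negative-sampling loss, not the patent's exact implementation — the quantity below decreases as the positive context score is pushed up and the sampled negative scores are pushed down:

```python
# Standard SGNS negative log-likelihood for one (word, context) pair plus
# sampled negatives; vectors are plain Python lists for simplicity.
import math

def sgns_loss(v_word, u_pos, u_negs):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # maximize log sigma(u_pos . v) for the positive pair ...
    loss = -math.log(sigmoid(dot(u_pos, v_word)))
    # ... and log sigma(-u_neg . v) for each sampled negative
    for u_neg in u_negs:
        loss -= math.log(sigmoid(-dot(u_neg, v_word)))
    return loss
```

Parameter optimization (e.g. SGD over the vectors) then minimizes this loss over all positive and negative samples.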
The system for constructing the Chinese word-segmentation-free word embedding model can implement the method for constructing the Chinese word-segmentation-free word embedding model, but the devices capable of implementing that method are not limited to the system structure listed in this embodiment; all structural variations and substitutions of the prior art made in accordance with the principles of the invention fall within the protection scope of the invention.
Referring to fig. 7, a schematic diagram of the structural connections of a device for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the invention is shown. As shown in fig. 7, the present embodiment provides a device 7, the device 7 comprising: a processor 71, a memory 72, a communication interface 73, and/or a system bus 74. The memory 72 and the communication interface 73 are connected to the processor 71 via the system bus 74 and communicate with each other; the memory 72 is used for storing a computer program, the communication interface 73 is used for communicating with other devices, and the processor 71 is used for running the computer program so that the device 7 executes the steps of the method for constructing the Chinese word-segmentation-free word embedding model.
The system bus 74 mentioned above may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The system bus may be classified into an address bus, a data bus, a control bus, and the like. The communication interface 73 is used to enable communication between the database access apparatus and other devices such as clients, read-write libraries and read-only libraries. The memory 72 may include a random access memory (Random Access Memory, simply referred to as RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 71 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In summary, the method, system, medium and device for constructing a Chinese word-segmentation-free word embedding model provide a new unsupervised association metric for screening n-gram fragments with strong association. Combining this unsupervised association metric with a word embedding model yields a new word-segmentation-free Chinese word embedding model for Chinese corpora. The word embedding model obtained by the method shows better performance in downstream tasks. The invention thus effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles of the present invention and its effects, and are not intended to limit the invention. Anyone skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (7)

1. A method for constructing a Chinese word-segmentation-free word embedding model, characterized by comprising the following steps:
counting candidate segments in a corpus and word frequency information corresponding to the candidate segments;
determining the association strength of the candidate segments by combining the word frequency information, and generating a word embedded vocabulary according to the association strength;
constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set;
wherein determining the association strength of the candidate segments in combination with the word frequency information and generating the word-embedding vocabulary according to the association strength comprises: determining an unsupervised association metric of the candidate segments in combination with the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate segments; arranging the association strengths in descending order, and selecting the top K candidate fragments as the word-embedding vocabulary;
wherein determining the unsupervised association metric of the candidate segment in combination with the word frequency information comprises:
a, calculating the mutual information value of the candidate fragments, and determining the fragment combination for which the mutual information value is minimum; the mutual information value is the MP value, where, over the binary splits (a, b) of the n-gram fragment g, MP is defined as:

MP = min_(a,b) [ f_g / (f_a × f_b) ]

wherein f_a, f_b and f_g respectively denote the word frequencies of the character strings a and b and of the n-gram fragment g in the corpus;
b, determining a first set and a second set according to the fragment combination, and calculating a statistical relation value between the fragment combination and the first set or the second set, comprising:

(1) taking as the numerator the maximum of the ratio of the word frequency information to the total word frequency of the first set and the ratio of the word frequency information to the total word frequency of the second set; and selecting whichever of the first set and the second set has the smaller total word frequency, taking as the denominator the reciprocal of the number of elements in that set;

for the specific combination (a_m, b_m) of the n-gram fragment g, and a batch of n-gram fragments (a_m, b_h) and (a_j, b_m), the first set {a_m, *} and the second set {*, b_m} are defined as: {a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)} and {*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}; let f_{a_m,*} and f_{*,b_m} respectively denote the sums of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m}, defined respectively as:

f_{a_m,*} = Σ_{i=1..h} f_{(a_m, b_i)} and f_{*,b_m} = Σ_{i=1..j} f_{(a_i, b_m)};

for the n-gram fragment g and its specific combination (a_m, b_m), the variable rate denotes the maximum of the ratio of f_g to f_{a_m,*} and the ratio of f_g to f_{*,b_m}; rate is defined as:

rate = max( f_g / f_{a_m,*} , f_g / f_{*,b_m} );

for the two sets {a_m, *} and {*, b_m} and their corresponding total word frequencies f_{a_m,*} and f_{*,b_m}, let sizeof denote the number of n-gram elements in a set; AC is then defined as:

AC = 1 / sizeof({a_m, *}) if f_{a_m,*} ≤ f_{*,b_m}, and AC = 1 / sizeof({*, b_m}) otherwise;

(2) combining the numerator and the denominator into a calculated value of the relative importance of the fragment combination in the first set or the second set; with the variable rate as the numerator and the variable AC as the denominator, the relative importance value times of the n-gram fragment g is defined as:

times = rate / AC;

(3) determining the statistical relation value AT from the relative importance value times; the calculation formula of AT is: AT = 1 + |log(times)|;

c, letting the word frequency information F equal f_g, and taking the product of the word frequency information F, the mutual information value MP and the statistical relation value AT as the unsupervised association metric; the formula of the unsupervised association metric PATI is: PATI = F × MP × AT.
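Read as an algorithm, steps a-c of claim 1 can be sketched as follows. Note this is a hedged reconstruction from the claim text — in particular the exact form of MP is an assumption, and the function signature and input layout are invented for illustration, not the patented implementation:

```python
# Hedged sketch of the PATI metric from claim 1. Inputs are word
# frequencies; `splits` lists (f_a, f_b) for each binary split of g,
# and set_a_freqs / set_b_freqs hold the frequencies of the n-grams
# in {a_m, *} and {*, b_m} respectively.
import math

def pati(f_g, splits, set_a_freqs, set_b_freqs):
    # step a: MP taken as the minimum mutual-information value over splits
    mp = min(f_g / (f_a * f_b) for f_a, f_b in splits)
    # step b(1): rate (numerator) and AC (denominator)
    f_set_a, f_set_b = sum(set_a_freqs), sum(set_b_freqs)
    rate = max(f_g / f_set_a, f_g / f_set_b)
    smaller = set_a_freqs if f_set_a <= f_set_b else set_b_freqs
    ac = 1.0 / len(smaller)
    # steps b(2)-(3): times and AT
    times = rate / ac
    at = 1.0 + abs(math.log(times))
    # step c: F = f_g, PATI = F * MP * AT
    return f_g * mp * at
```

Candidates would then be ranked by this score and the top K kept as the vocabulary.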
2. The method for constructing a word-segmentation-free Chinese word embedding model according to claim 1, wherein the candidate segments are Chinese language model segments, and the step of counting the candidate segments in the corpus and word frequency information corresponding to the candidate segments comprises:
and counting the Chinese language model fragments and word frequency information thereof corresponding to different fixed length values in the corpus.
3. The method for constructing a Chinese word-segmentation-free word embedding model according to claim 1, wherein generating the word-embedding vocabulary according to the association strength comprises:
calculating the association strength of the candidate fragments of each length;
for candidate fragments of different lengths, arranging the association strengths in descending order within each length and selecting the top K candidate fragments as the word-embedding vocabulary, so that different numbers of candidate fragments are selected as the word-embedding vocabulary for different lengths.
4. The method for constructing a Chinese word-segmentation-free word embedding model according to claim 1, wherein the step of constructing a positive sampling set and a negative sampling set according to the vocabulary and constructing the word embedding model in combination with the positive sampling set and the negative sampling set comprises:
based on a skip-gram model combined with negative sampling, a parameter optimization method is adopted to maximize positive sampling probability and minimize negative sampling probability, and the word embedding model is constructed.
5. A system for constructing a Chinese word-segmentation-free word embedding model, characterized by comprising:
the segment statistics module is used for counting candidate segments in the corpus and word frequency information corresponding to the candidate segments;
the association measurement module is used for determining association strength of the candidate fragments by combining the word frequency information and generating a word embedded vocabulary according to the association strength;
the model generation module is used for constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set;
wherein determining the association strength of the candidate segments in combination with the word frequency information and generating the word-embedding vocabulary according to the association strength comprises: determining an unsupervised association metric of the candidate segments in combination with the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate segments; arranging the association strengths in descending order, and selecting the top K candidate fragments as the word-embedding vocabulary;
wherein determining the unsupervised association metric of the candidate segment in combination with the word frequency information comprises:
a, calculating the mutual information value of the candidate fragments, and determining the fragment combination for which the mutual information value is minimum; the mutual information value is the MP value, where, over the binary splits (a, b) of the n-gram fragment g, MP is defined as:

MP = min_(a,b) [ f_g / (f_a × f_b) ]

wherein f_a, f_b and f_g respectively denote the word frequencies of the character strings a and b and of the n-gram fragment g in the corpus;
b, determining a first set and a second set according to the fragment combination, and calculating a statistical relation value between the fragment combination and the first set or the second set, comprising:

(1) taking as the numerator the maximum of the ratio of the word frequency information to the total word frequency of the first set and the ratio of the word frequency information to the total word frequency of the second set; and selecting whichever of the first set and the second set has the smaller total word frequency, taking as the denominator the reciprocal of the number of elements in that set;

for the specific combination (a_m, b_m) of the n-gram fragment g, and a batch of n-gram fragments (a_m, b_h) and (a_j, b_m), the first set {a_m, *} and the second set {*, b_m} are defined as: {a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)} and {*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}; let f_{a_m,*} and f_{*,b_m} respectively denote the sums of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m}, defined respectively as:

f_{a_m,*} = Σ_{i=1..h} f_{(a_m, b_i)} and f_{*,b_m} = Σ_{i=1..j} f_{(a_i, b_m)};

for the n-gram fragment g and its specific combination (a_m, b_m), the variable rate denotes the maximum of the ratio of f_g to f_{a_m,*} and the ratio of f_g to f_{*,b_m}; rate is defined as:

rate = max( f_g / f_{a_m,*} , f_g / f_{*,b_m} );

for the two sets {a_m, *} and {*, b_m} and their corresponding total word frequencies f_{a_m,*} and f_{*,b_m}, let sizeof denote the number of n-gram elements in a set; AC is then defined as:

AC = 1 / sizeof({a_m, *}) if f_{a_m,*} ≤ f_{*,b_m}, and AC = 1 / sizeof({*, b_m}) otherwise;

(2) combining the numerator and the denominator into a calculated value of the relative importance of the fragment combination in the first set or the second set; with the variable rate as the numerator and the variable AC as the denominator, the relative importance value times of the n-gram fragment g is defined as:

times = rate / AC;

(3) determining the statistical relation value AT from the relative importance value times; the calculation formula of AT is: AT = 1 + |log(times)|;

c, letting the word frequency information F equal f_g, and taking the product of the word frequency information F, the mutual information value MP and the statistical relation value AT as the unsupervised association metric; the formula of the unsupervised association metric PATI is: PATI = F × MP × AT.
6. A medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for constructing a Chinese word-segmentation-free word embedding model according to any one of claims 1 to 4.
7. An apparatus, comprising: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the apparatus executes the method for constructing a Chinese word-segmentation-free word embedding model according to any one of claims 1 to 4.
CN202010437000.3A 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model Active CN113705227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010437000.3A CN113705227B (en) 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model


Publications (2)

Publication Number Publication Date
CN113705227A CN113705227A (en) 2021-11-26
CN113705227B true CN113705227B (en) 2023-04-25


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN107491444A (en) * 2017-08-18 2017-12-19 南京大学 Parallelization word alignment method based on bilingual word embedded technology
CN108959431A (en) * 2018-06-11 2018-12-07 中国科学院上海高等研究院 Label automatic generation method, system, computer readable storage medium and equipment
CN110390018A (en) * 2019-07-25 2019-10-29 哈尔滨工业大学 A kind of social networks comment generation method based on LSTM


Non-Patent Citations (2)

Geewook Kim et al., "Segmentation-free compositional n-gram embedding," arXiv, 2018, pp. 1-9. *
Xiaobin Wang et al., "Unsupervised Learning Helps Supervised Neural Word Segmentation," AAAI-19, 2019, pp. 7200-7207. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant