WO2020063071A1 - Sentence vector calculation method based on chi-square test, text classification method and system


Info

Publication number
WO2020063071A1
Authority
WO
WIPO (PCT)
Prior art keywords: word, vector, weight, feature, chi
Prior art date
Application number
PCT/CN2019/097187
Other languages
English (en)
French (fr)
Inventor
黄友福
肖龙源
蔡振华
李稀敏
刘晓葳
谭玉坤
Original Assignee
厦门快商通信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 厦门快商通信息技术有限公司
Publication of WO2020063071A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities

Definitions

  • The invention relates to the technical field of automatic natural language processing by computer, and in particular to a sentence vector calculation method based on the chi-square test, and a text classification method and system applying the method.
  • Text categorization is an important step in natural language processing: the process of automatically determining the category of a text from its content under a given classification scheme. Before the 1990s, the dominant approach was classification based on knowledge engineering, i.e. manual classification by professionals, which is very time-consuming and inefficient. Since the 1990s, many statistical and machine learning methods have been applied to automatic text classification, and the study of text classification techniques has attracted great interest from researchers. Research on Chinese text classification has also begun in China and has found preliminary applications in many fields such as information retrieval, automatic classification of web documents, digital libraries, automatic summarization, newsgroup classification, text filtering, word sense disambiguation, and document organization and management.
  • An early and commonly used text classification technique builds a sufficiently long vector using one-hot encoding, with each dimension of the vector representing a word or phrase.
  • When the word assigned to a particular dimension appears in the sentence, the value of the vector in that dimension is 1; otherwise it is 0.
  • One-hot encoding can convert sentences into fixed-length vectors, but it suffers from problems such as an indeterminate vector space and exploding vector dimensionality, resulting in low model training efficiency.
  • The currently common approach is to first split long text into words using Chinese word segmentation, then convert the words into vectors of a given dimensionality using word2vec (word vector technology), and use the arithmetic mean of the word vectors in a sentence as the sentence vector.
  • With this sentence vector algorithm, however, when a sentence contains many related words the features of the sentence vector may not be distinctive enough, leading to lower text classification accuracy.
  • To solve the above problems, the present invention provides a sentence vector calculation method based on the chi-square test, and a text classification method and system.
  • The method strengthens key features in the text, reduces mutual interference between word vectors in the text information, and increases the weight of the sentence vector in the feature dimensions, thereby improving the accuracy of text classification.
  • A sentence vector calculation method based on the chi-square test includes the following steps:
    a. performing word segmentation on the current text and removing stop words to obtain a segmentation result;
    b. calculating a word vector for each word in the segmentation result;
    c. calculating a chi-square value between each word vector and a preset category, and dividing the word vectors into feature words and non-feature words according to the chi-square values;
    d. calculating the frequency of use of the feature words in the preset category, and according to the frequency of use assigning a first weight to each feature word and a second weight to each non-feature word, the first weight being greater than the second weight;
    e. calculating the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and taking it as the sentence vector of the current text.
  • In step a, the method further includes performing context expansion on the current text to obtain an expanded text, and then performing word segmentation on the expanded text.
  • In step b, the word vectors are calculated by applying a trained word vector model to the segmentation result; the word vector model is trained by performing word segmentation on a training corpus and removing stop words to obtain a segmentation result, which is then fed into the word vector model for training, yielding a word vector for each word of the training corpus.
  • In step c, a chi-square value between each word vector and a preset category is calculated; the preset category is obtained either by recognizing the category of each word vector with a preset classification algorithm or by labelling each word vector with a category, giving the category to which each word vector belongs.
  • In step c, dividing the word vectors into feature words and non-feature words according to the chi-square values means taking word vectors whose chi-square value is less than or equal to a preset value as feature words and word vectors whose chi-square value is greater than the preset value as non-feature words; or sorting the word vectors in ascending order of chi-square value, taking a preset number of top-ranked word vectors as feature words and the remaining word vectors as non-feature words.
  • In step d, calculating the frequency of use of the feature words in the preset category means classifying the corpus according to the preset categories to obtain text sets of different categories, and then calculating the proportion of each feature word in the text set of each category.
  • In step d, assigning a first weight to each feature word and a second weight to each non-feature word according to the frequency of use means taking the maximum of the proportions as the weight of the feature word, which gives the first weight, and taking a preset constant as the weight of the non-feature word, which gives the second weight.
  • The sentence vector is calculated as follows: for each word vector in the current text, if it is a feature word, multiply the word vector of the feature word by the corresponding first weight and accumulate; if it is a non-feature word, multiply the word vector of the non-feature word by the corresponding second weight and accumulate; finally, divide the sum of the word vectors so obtained by the sum of the weights of all word vectors to obtain the sentence vector; that is:
  • Sentence vector = (word vector of feature word 1 * first weight 1 + word vector of feature word 2 * first weight 2 + … + word vector of feature word m * first weight m + word vector of non-feature word 1 * second weight 1 + word vector of non-feature word 2 * second weight 2 + … + word vector of non-feature word n * second weight n) / (first weight 1 + first weight 2 + … + first weight m + second weight 1 + second weight 2 + … + second weight n).
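The sentence-vector formula above can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the function name, the toy 2-dimensional vectors, and the example weights are all invented for the sketch.

```python
def sentence_vector(word_vectors, weights):
    """Weighted average of word vectors: sum(w_i * v_i) / sum(w_i)."""
    dim = len(word_vectors[0])
    total = [0.0] * dim
    for vec, w in zip(word_vectors, weights):
        for k in range(dim):
            total[k] += w * vec[k]          # accumulate weight * word vector
    weight_sum = sum(weights)               # divide by the sum of all weights
    return [t / weight_sum for t in total]

# Two feature words (first weights 0.8 and 0.6) and one non-feature word
# (constant second weight 0.1); the vectors are 2-dimensional toy values.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.8, 0.6, 0.1]
sv = sentence_vector(vectors, weights)
```

With these toy values the result is (0.9, 0.7) / 1.5, so the high-weight feature words dominate the direction of the sentence vector, which is the intended effect of the weighting.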
  • The present invention also provides a text classification method which adopts any of the above chi-square-test-based sentence vector calculation methods and classifies the current text according to the sentence vector; that is, the current text and its corresponding sentence vector are input into a random-forest-based intent recognition classification model for prediction, and the category of the current text is output.
  • The present invention also provides a text classification system, which includes:
  • a text preprocessing module, configured to perform word segmentation on the current text and remove stop words to obtain a segmentation result;
  • a word vector calculation module, configured to calculate a word vector for each word in the segmentation result;
  • a chi-square test module, configured to calculate a chi-square value between each word vector and a preset category, and to divide the word vectors into feature words and non-feature words according to the chi-square values;
  • a weight setting module, which calculates the frequency of use of the feature words in the preset category and, according to the frequency of use, assigns a first weight to each feature word and a second weight to each non-feature word;
  • a sentence vector calculation module, which calculates the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and takes it as the sentence vector of the current text;
  • a text classification module, which classifies the current text according to the sentence vector; that is, inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction, and outputs the category of the current text.
  • The present invention reduces mutual interference between word vectors in the text information by strengthening the key features in the text, and increases the weight of the sentence vector in the feature dimensions, thereby improving the accuracy of text classification.
  • The present invention sets the weight of each word vector according to whether it is a feature word or a non-feature word, combined with the frequency of use of the feature words, thereby increasing the weight of the sentence vector in the feature dimensions, reducing the disturbance from irrelevant words, and improving how well the sentence vector expresses the sentence semantics.
  • The present invention further performs context expansion on the current text to obtain an expanded text and then segments the expanded text, obtaining the vector of the current sentence with increased weight for context words, so that the sentence vector expresses the sentence semantics more accurately.
  • The text classification method of the present invention inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction and outputs the category of the current text, significantly improving the evaluation of the model's prediction results.
  • FIG. 1 is a schematic flowchart of a sentence vector calculation method based on the chi-square test according to the present invention;
  • FIG. 2 is a schematic flowchart of a text classification method according to the present invention;
  • FIG. 3 is a schematic structural diagram of a text classification system according to the present invention.
  • As shown in FIG. 1, a sentence vector calculation method based on the chi-square test of the present invention includes the following steps:
    a. performing word segmentation on the current text and removing stop words to obtain a segmentation result;
    b. calculating a word vector for each word in the segmentation result;
    c. calculating a chi-square value between each word vector and a preset category, and dividing the word vectors into feature words and non-feature words according to the chi-square values;
    d. calculating the frequency of use of the feature words in the preset category, and according to the frequency of use assigning a first weight to each feature word and a second weight to each non-feature word, the first weight being greater than the second weight;
    e. calculating the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and taking it as the sentence vector of the current text.
  • Step a further includes performing context expansion on the current text to obtain an expanded text, and then performing word segmentation on the expanded text.
  • Preferably, the current text is expanded upward and/or downward to three or more sentences. For example, if the current text is a middle sentence of the document, the expanded text includes the current sentence, the previous sentence, and the next sentence; if the current text is the first sentence of the document, the expanded text includes the current sentence and the next two sentences; if the current text is the last sentence of the document, the expanded text includes the current sentence and the previous two sentences.
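The three-sentence expansion rule above can be sketched as follows; the function name and the toy sentence list are invented for the example, and documents with fewer than three sentences are simply returned whole.

```python
def expand_context(sentences, i):
    """Return a three-sentence window around sentence index i."""
    n = len(sentences)
    if n <= 3:
        return list(sentences)          # short document: nothing to trim
    if i == 0:
        return sentences[0:3]           # first sentence: current + next two
    if i == n - 1:
        return sentences[n - 3:n]       # last sentence: previous two + current
    return sentences[i - 1:i + 2]       # middle sentence: previous + current + next

doc = ["s1", "s2", "s3", "s4", "s5"]
window = expand_context(doc, 2)         # middle sentence "s3"
```

For a middle sentence the window is its neighbours plus itself; the two boundary cases shift the window inward so it always contains three sentences.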
  • In step a, stop words are removed by looking them up in a stop-word list: words in the segmentation result that appear in the stop-word list are removed as stop words. The remaining words are then further reduced to their base forms by part of speech (lemmatization).
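A minimal sketch of the stop-word filtering just described, assuming the text has already been segmented into a token list (for example by Jieba); the tiny stop-word set here is an illustrative stand-in for a real stop-word list.

```python
STOP_WORDS = {"的", "了", "是", "在"}   # illustrative subset of a real stop-word list

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words(["我", "是", "学生"])   # "是" is filtered out
```

In practice the stop-word list would be loaded from a file and applied to the segmentation result before word vectors are computed.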
  • In step a, the word segmentation mainly uses Jieba.
  • Jieba splits Chinese sentences at word granularity and supports three segmentation modes: the first is the precise mode, which tries to cut the sentence most accurately and is suitable for text analysis;
  • the second is the full mode, which scans out all the words in the sentence that can form words; it is very fast but cannot resolve ambiguity;
  • the third is the search engine mode, which, on the basis of the precise mode, segments long words again to improve recall and is suitable for search-engine segmentation. Jieba also supports traditional Chinese segmentation and custom dictionaries.
  • Step b refers to calculating word vectors by applying a trained word vector model to the segmentation result; the word vector model is trained by performing word segmentation on a training corpus and removing stop words to obtain a segmentation result, which is then fed into the word vector model for training, yielding a word vector for each word of the training corpus.
  • In this embodiment, the word vector model uses the word2vec model: the segmentation results of the training corpus are input into the model for training to obtain a trained word2vec model.
  • In use, the segmentation result of the current text is input into the trained word2vec model, which converts it into a set of word vectors.
  • word2vec, also called word embeddings (in Chinese, "词向量"), converts the words of natural language into dense vectors that a computer can understand.
  • word2vec has two main modes: CBOW (Continuous Bag of Words) and Skip-Gram.
  • CBOW infers the target word from the original sentence,
  • while Skip-Gram does the opposite, inferring the original sentence from the target word.
  • CBOW is more suitable for small datasets, while Skip-Gram performs better on large corpora; those skilled in the art can choose the required mode according to actual needs.
  • In step c, a chi-square value between each word vector and a preset category is calculated; the preset category is obtained either by recognizing the category of each word vector with a preset classification algorithm or by labelling each word vector with a category, giving the category to which each word vector belongs.
  • The preset classification algorithm may use Naive Bayes (NB), Decision Tree (DT), K-nearest neighbors (KNN), and so on. Alternatively, the category of each word vector can be labelled manually.
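KNN is named above as one possible preset classification algorithm. As a hedged illustration only, a bare 1-nearest-neighbour lookup over labelled vectors might look like this; the data, labels, and function name are invented for the sketch and are not part of the patent.

```python
def nearest_label(query, labelled_vectors):
    """Return the label of the closest vector by squared Euclidean distance."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    best_label, _ = min(
        ((label, dist2(query, vec)) for vec, label in labelled_vectors),
        key=lambda pair: pair[1],       # pick the pair with the smallest distance
    )
    return best_label

data = [([0.0, 0.0], "A"), ([1.0, 1.0], "B")]
label = nearest_label([0.9, 0.8], data)   # closest to the "B" prototype
```

A real KNN would vote over k neighbours; k = 1 keeps the sketch minimal while showing the category-recognition idea.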
  • The chi-square test is a widely used statistical method proposed in 1900 by the British statistician K. Pearson (1857-1936), one of the founders of modern statistics. It can be used for comparing two or more rates, for association analysis of count data, for goodness-of-fit tests, and so on.
  • In this embodiment, for the category-labelled training texts, an observed-value fourfold table A (a 2x2 contingency table) is constructed for each word against every text category contained in the training set.
  • Using the chi-square formula, the chi-square value between each word and each category can be calculated, from which the degree of association between the word and the category is judged: the larger the chi-square value, the greater the divergence between the two and the lower the correlation; conversely, a smaller value indicates a higher correlation.
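The per-word chi-square computation can be sketched from the observed fourfold table, comparing the observed counts with the expected counts derived from the marginals. This is a plain-Python illustration with an invented function name, not the patent's code.

```python
def chi_square(a, b, c, d):
    """Chi-square value of a 2x2 table.
    a = texts in the category containing the word
    b = texts outside the category containing the word
    c = texts in the category without the word
    d = texts outside the category without the word
    """
    n = a + b + c + d
    p = (a + c) / n                               # probability of belonging to the category
    expected = [(a + b) * p, (a + b) * (1 - p),   # alpha, beta
                (c + d) * p, (c + d) * (1 - p)]   # theta, gamma
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

This is algebraically equivalent to the closed form n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)) for a fourfold table.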
  • In step c, dividing the word vectors into feature words and non-feature words according to the chi-square values means taking word vectors whose chi-square value is less than or equal to a preset value as feature words and word vectors whose chi-square value is greater than the preset value as non-feature words; or sorting the word vectors in ascending order of chi-square value, taking a preset number of top-ranked word vectors as feature words and the remaining word vectors as non-feature words.
  • In step d, calculating the frequency of use of the feature words in the preset category means classifying the corpus according to the preset categories to obtain text sets of different categories, and then calculating the proportion of each feature word in the text set of each category.
  • In step d, assigning a first weight to each feature word and a second weight to each non-feature word according to the frequency of use means taking the maximum of the proportions as the weight of the feature word, which gives the first weight, and taking a preset constant as the weight of the non-feature word, which gives the second weight.
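The weight assignment of step d can be sketched as follows: for a feature word, compute the fraction of texts containing it in each category's text set and take the maximum as the first weight; non-feature words all receive a preset constant. The category data, the constant 0.1, and the function name are invented for the example.

```python
SECOND_WEIGHT = 0.1   # preset constant for non-feature words (illustrative value)

def first_weight(word, category_texts):
    """Maximum over categories of the fraction of texts containing the word."""
    ratios = []
    for texts in category_texts.values():
        containing = sum(1 for t in texts if word in t)   # texts containing the word
        ratios.append(containing / len(texts))            # proportion in this category
    return max(ratios)

categories = {
    "greeting": [["你好", "请问"], ["你好", "再见"]],
    "order":    [["下单", "请问"]],
}
w = first_weight("你好", categories)   # appears in 2/2 "greeting" texts, 0/1 "order" texts
```

A word concentrated in one category gets a first weight close to 1, so its vector dominates the weighted sentence-vector average, which is exactly the feature-strengthening effect the method aims at.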
  • In step e, the sentence vector is calculated as follows: for each word vector in the current text, if it is a feature word, multiply the word vector of the feature word by the corresponding first weight and accumulate; if it is a non-feature word, multiply the word vector of the non-feature word by the corresponding second weight and accumulate; finally, divide the sum of the word vectors so obtained by the sum of the weights of all word vectors to obtain the sentence vector; that is:
  • Sentence vector = (word vector of feature word 1 * first weight 1 + word vector of feature word 2 * first weight 2 + … + word vector of feature word m * first weight m + word vector of non-feature word 1 * second weight 1 + word vector of non-feature word 2 * second weight 2 + … + word vector of non-feature word n * second weight n) / (first weight 1 + first weight 2 + … + first weight m + second weight 1 + second weight 2 + … + second weight n).
  • As shown in FIG. 2, the present invention also provides a text classification method which adopts the chi-square-test-based sentence vector calculation method described above and classifies the current text according to the sentence vector.
  • In this embodiment, the text classification method inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction and outputs the category of the current text.
  • As shown in FIG. 3, the present invention also provides a text classification system, which includes:
  • a text preprocessing module, configured to perform word segmentation on the current text and remove stop words to obtain a segmentation result;
  • a word vector calculation module, configured to calculate a word vector for each word in the segmentation result;
  • a chi-square test module, configured to calculate a chi-square value between each word vector and a preset category, and to divide the word vectors into feature words and non-feature words according to the chi-square values;
  • a weight setting module, which calculates the frequency of use of the feature words in the preset category and, according to the frequency of use, assigns a first weight to each feature word and a second weight to each non-feature word;
  • a sentence vector calculation module, which calculates the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and takes it as the sentence vector of the current text;
  • a text classification module, which classifies the current text according to the sentence vector.
  • In this embodiment, the text classification module inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction and outputs the category of the current text.
  • In this document, the terms "comprising", "including", or any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further restriction, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
  • A person of ordinary skill in the art will understand that all or part of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware.
  • The program may be stored in a computer-readable storage medium.
  • The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.


Abstract

The present invention discloses a sentence vector calculation method based on the chi-square test, and a text classification method and system. Word segmentation is performed on the current text and stop words are removed to obtain a segmentation result; a word vector is calculated for each word in the segmentation result; a chi-square value between each word vector and a preset category is calculated, and the word vectors are divided into feature words and non-feature words according to the chi-square values; the frequency of use of the feature words in the preset category is calculated, and according to the frequency of use a first weight is assigned to each feature word and a second weight to each non-feature word, the first weight being greater than the second weight; the weighted average of all word vectors is calculated from the word vectors of the feature words and non-feature words and their corresponding weights, and taken as the sentence vector of the current text. This increases the weight of the sentence vector in the feature dimensions, reduces mutual interference between word vectors in the text information, and greatly improves the accuracy of text classification.

Description

Sentence vector calculation method based on chi-square test, text classification method and system

Technical Field

The present invention relates to the technical field of automatic natural language processing by computer, and in particular to a sentence vector calculation method based on the chi-square test, and a text classification method and system applying the method.
Background Art

Text categorization is an important step in natural language processing. It refers to the process of automatically determining the category of a text from its content under a given classification scheme. Before the 1990s, the dominant approach was classification based on knowledge engineering, i.e. manual classification by professionals, which is very time-consuming and inefficient. Since the 1990s, numerous statistical and machine learning methods have been applied to automatic text classification, and the study of text classification techniques has attracted great interest from researchers. Research on Chinese text classification has also begun in China, with preliminary applications in many fields such as information retrieval, automatic classification of web documents, digital libraries, automatic summarization, newsgroup classification, text filtering, word sense disambiguation, and document organization and management.

An early and commonly used text classification technique builds a sufficiently long vector using one-hot encoding, with each dimension of the vector representing a word or phrase. When the word assigned to a particular dimension appears in the sentence, the value of the vector in that dimension is 1; otherwise it is 0. One-hot encoding can convert sentences into fixed-length vectors, but it suffers from problems such as an indeterminate vector space and exploding vector dimensionality, leading to inefficient model training.

The currently common approach is to first split long text into words using Chinese word segmentation, then convert the words into vectors of a given dimensionality using word2vec (word vector technology), and use the arithmetic mean of the word vectors in a sentence as the sentence vector. However, with this sentence vector algorithm, when a sentence contains many related words the features of the sentence vector may not be distinctive enough, leading to lower text classification accuracy.
Summary of the Invention

To solve the above problems, the present invention provides a sentence vector calculation method based on the chi-square test, and a text classification method and system, which strengthen the key features in the text, reduce mutual interference between word vectors in the text information, and increase the weight of the sentence vector in the feature dimensions, thereby improving the accuracy of text classification.

To achieve the above purpose, the technical solution adopted by the present invention is:

A sentence vector calculation method based on the chi-square test, comprising the following steps:

a. performing word segmentation on the current text and removing stop words to obtain a segmentation result;

b. calculating a word vector for each word in the segmentation result;

c. calculating a chi-square value between each word vector and a preset category, and dividing the word vectors into feature words and non-feature words according to the chi-square values;

d. calculating the frequency of use of the feature words in the preset category, and according to the frequency of use assigning a first weight to each feature word and a second weight to each non-feature word, the first weight being greater than the second weight;

e. calculating the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and taking it as the sentence vector of the current text.

Preferably, step a further includes performing context expansion on the current text to obtain an expanded text, and then performing word segmentation on the expanded text.

Preferably, in step b, the word vectors are calculated by applying a trained word vector model to the segmentation result; the word vector model is trained by performing word segmentation on a training corpus and removing stop words to obtain a segmentation result, which is then fed into the word vector model for training, yielding a word vector for each word of the training corpus.

Preferably, in step c, a chi-square value between each word vector and a preset category is calculated, where the preset category is obtained either by recognizing the category of each word vector with a preset classification algorithm or by labelling each word vector with a category, giving the category to which each word vector belongs.

Preferably, in step c, dividing the word vectors into feature words and non-feature words according to the chi-square values means taking word vectors whose chi-square value is less than or equal to a preset value as feature words and word vectors whose chi-square value is greater than the preset value as non-feature words; or sorting the word vectors in ascending order of chi-square value, taking a preset number of top-ranked word vectors as feature words and the remaining word vectors as non-feature words.

Preferably, in step d, calculating the frequency of use of the feature words in the preset category means classifying the corpus according to the preset categories to obtain text sets of different categories, and then calculating the proportion of each feature word in the text set of each category.

Preferably, in step d, assigning a first weight to each feature word and a second weight to each non-feature word according to the frequency of use means taking the maximum of the proportions as the weight of the feature word, which gives the first weight, and taking a preset constant as the weight of the non-feature word, which gives the second weight.

Preferably, in step e, the sentence vector is calculated as follows: for each word vector in the current text, if it is a feature word, multiply the word vector of the feature word by the corresponding first weight and accumulate; if it is a non-feature word, multiply the word vector of the non-feature word by the corresponding second weight and accumulate; finally, divide the sum of the word vectors so obtained by the sum of the weights of all word vectors to obtain the sentence vector; that is:

Sentence vector = (word vector of feature word 1 * first weight 1 + word vector of feature word 2 * first weight 2 + … + word vector of feature word m * first weight m + word vector of non-feature word 1 * second weight 1 + word vector of non-feature word 2 * second weight 2 + … + word vector of non-feature word n * second weight n) / (first weight 1 + first weight 2 + … + first weight m + second weight 1 + second weight 2 + … + second weight n).
Further, the present invention also provides a text classification method which adopts any of the above chi-square-test-based sentence vector calculation methods and classifies the current text according to the sentence vector; that is, the current text and its corresponding sentence vector are input into a random-forest-based intent recognition classification model for prediction, and the category of the current text is output.

Correspondingly, the present invention also provides a text classification system, comprising:

a text preprocessing module, configured to perform word segmentation on the current text and remove stop words to obtain a segmentation result;

a word vector calculation module, configured to calculate a word vector for each word in the segmentation result;

a chi-square test module, configured to calculate a chi-square value between each word vector and a preset category, and to divide the word vectors into feature words and non-feature words according to the chi-square values;

a weight setting module, which calculates the frequency of use of the feature words in the preset category and, according to the frequency of use, assigns a first weight to each feature word and a second weight to each non-feature word;

a sentence vector calculation module, which calculates the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and takes it as the sentence vector of the current text;

a text classification module, which classifies the current text according to the sentence vector; that is, inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction, and outputs the category of the current text.
The beneficial effects of the present invention are:

(1) By strengthening the key features in the text, the present invention reduces mutual interference between word vectors in the text information and increases the weight of the sentence vector in the feature dimensions, thereby improving the accuracy of text classification.

(2) The present invention sets the weight of each word vector according to whether it is a feature word or a non-feature word, combined with the frequency of use of the feature words, thereby increasing the weight of the sentence vector in the feature dimensions, reducing the disturbance from irrelevant words, and improving how well the sentence vector expresses the sentence semantics.

(3) The present invention further performs context expansion on the current text to obtain an expanded text and then segments the expanded text, obtaining the vector of the current sentence with increased weight for context words, so that the sentence vector expresses the sentence semantics more accurately.

(4) The text classification method of the present invention inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction and outputs the category of the current text, significantly improving the evaluation of the model's prediction results.
Brief Description of the Drawings

The drawings described here are provided for further understanding of the present invention and form a part of it; the illustrative embodiments of the present invention and their description are used to explain the present invention and do not constitute an undue limitation of it. In the drawings:

FIG. 1 is a schematic flowchart of a sentence vector calculation method based on the chi-square test according to the present invention;

FIG. 2 is a schematic flowchart of a text classification method according to the present invention;

FIG. 3 is a schematic structural diagram of a text classification system according to the present invention.

Detailed Description of the Embodiments

To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it.
As shown in FIG. 1, a sentence vector calculation method based on the chi-square test of the present invention includes the following steps:

a. performing word segmentation on the current text and removing stop words to obtain a segmentation result;

b. calculating a word vector for each word in the segmentation result;

c. calculating a chi-square value between each word vector and a preset category, and dividing the word vectors into feature words and non-feature words according to the chi-square values;

d. calculating the frequency of use of the feature words in the preset category, and according to the frequency of use assigning a first weight to each feature word and a second weight to each non-feature word, the first weight being greater than the second weight;

e. calculating the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and taking it as the sentence vector of the current text.

Step a further includes performing context expansion on the current text to obtain an expanded text, and then performing word segmentation on the expanded text. Preferably, the current text is expanded upward and/or downward to three or more sentences. For example, if the current text is a middle sentence of the document, the expanded text includes the current sentence, the previous sentence, and the next sentence; if the current text is the first sentence of the document, the expanded text includes the current sentence and the next two sentences; if the current text is the last sentence of the document, the expanded text includes the current sentence and the previous two sentences. In step a, stop words are removed by looking them up in a stop-word list: words in the segmentation result that appear in the stop-word list are removed as stop words; the remaining words are then further reduced to their base forms by part of speech. In step a, the word segmentation mainly uses Jieba, which splits Chinese sentences at word granularity and supports three segmentation modes: the precise mode tries to cut the sentence most accurately and is suitable for text analysis; the full mode scans out all the words in the sentence that can form words, which is very fast but cannot resolve ambiguity; the search engine mode, on the basis of the precise mode, segments long words again to improve recall and is suitable for search-engine segmentation. Jieba also supports traditional Chinese segmentation and custom dictionaries.
In step b, word vectors are calculated by applying a trained word vector model to the segmentation result; the word vector model is trained by performing word segmentation on a training corpus and removing stop words to obtain a segmentation result, which is then fed into the word vector model for training, yielding a word vector for each word of the training corpus. In this embodiment, the word vector model uses the word2vec model: the segmentation results of the training corpus are input into the model for training to obtain a trained word2vec model; in use, the segmentation result of the current text is input into the trained word2vec model, which converts it into a set of word vectors. word2vec, also called word embeddings (in Chinese, "词向量"), converts the words of natural language into dense vectors that a computer can understand. word2vec has two main modes, CBOW (Continuous Bag of Words) and Skip-Gram: CBOW infers the target word from the original sentence, while Skip-Gram does the opposite, inferring the original sentence from the target word. CBOW is more suitable for small datasets, while Skip-Gram performs better on large corpora; those skilled in the art can choose the required mode according to actual needs.

In step c, a chi-square value between each word vector and a preset category is calculated; the preset category is obtained either by recognizing the category of each word vector with a preset classification algorithm or by labelling each word vector with a category, giving the category to which each word vector belongs. The preset classification algorithm may use Naive Bayes (NB), Decision Tree (DT), K-nearest neighbors (KNN), and so on; the category of each word vector can also be labelled manually.
The chi-square test is a widely used statistical method proposed in 1900 by the British statistician K. Pearson (1857-1936), one of the founders of modern statistics. It can be used for comparing two or more rates, for association analysis of count data, for goodness-of-fit tests, and so on. In this embodiment, for the category-labelled training texts, an observed-value fourfold table A (a 2x2 contingency table) is constructed for each word against every text category contained in the training set.

Table A is as follows:

Group | Belongs to category Ci | Does not belong to category Ci | Total
Contains word Wi | a | b | a+b
Does not contain word Wi | c | d | c+d
Total | a+c | b+d | n=a+b+c+d

Therefore, the probability of belonging to category Ci is P=(a+c)/n, and from it the theoretical-value fourfold table T can be calculated.

Table T is as follows:

Group | Belongs to category Ci | Does not belong to category Ci
Contains word Wi | α=(a+b)*P | β=(a+b)*(1-P)
Does not contain word Wi | θ=(c+d)*P | γ=(c+d)*(1-P)

The chi-square value of the fourfold table is calculated as:

χ² = (a-α)²/α + (b-β)²/β + (c-θ)²/θ + (d-γ)²/γ

Using this chi-square formula, the chi-square value between each word and each category can be calculated, from which the degree of association between the word and the category is judged: the larger the chi-square value, the greater the divergence between the two and the lower the correlation; conversely, a smaller value indicates a higher correlation.
In step c, dividing the word vectors into feature words and non-feature words according to the chi-square values means taking word vectors whose chi-square value is less than or equal to a preset value as feature words and word vectors whose chi-square value is greater than the preset value as non-feature words; or sorting the word vectors in ascending order of chi-square value, taking a preset number of top-ranked word vectors as feature words and the remaining word vectors as non-feature words.

In step d, calculating the frequency of use of the feature words in the preset category means classifying the corpus according to the preset categories to obtain text sets of different categories, and then calculating the proportion of each feature word in the text set of each category.

In step d, assigning a first weight to each feature word and a second weight to each non-feature word according to the frequency of use means taking the maximum of the proportions as the weight of the feature word, which gives the first weight, and taking a preset constant as the weight of the non-feature word, which gives the second weight.

In step e, the sentence vector is calculated as follows: for each word vector in the current text, if it is a feature word, multiply the word vector of the feature word by the corresponding first weight and accumulate; if it is a non-feature word, multiply the word vector of the non-feature word by the corresponding second weight and accumulate; finally, divide the sum of the word vectors so obtained by the sum of the weights of all word vectors to obtain the sentence vector; that is:

Sentence vector = (word vector of feature word 1 * first weight 1 + word vector of feature word 2 * first weight 2 + … + word vector of feature word m * first weight m + word vector of non-feature word 1 * second weight 1 + word vector of non-feature word 2 * second weight 2 + … + word vector of non-feature word n * second weight n) / (first weight 1 + first weight 2 + … + first weight m + second weight 1 + second weight 2 + … + second weight n).
As shown in FIG. 2, the present invention also provides a text classification method which adopts any of the above chi-square-test-based sentence vector calculation methods and classifies the current text according to the sentence vector; in this embodiment, the text classification method inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction and outputs the category of the current text.

As shown in FIG. 3, the present invention also provides a text classification system, comprising:

a text preprocessing module, configured to perform word segmentation on the current text and remove stop words to obtain a segmentation result;

a word vector calculation module, configured to calculate a word vector for each word in the segmentation result;

a chi-square test module, configured to calculate a chi-square value between each word vector and a preset category, and to divide the word vectors into feature words and non-feature words according to the chi-square values;

a weight setting module, which calculates the frequency of use of the feature words in the preset category and, according to the frequency of use, assigns a first weight to each feature word and a second weight to each non-feature word;

a sentence vector calculation module, which calculates the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and takes it as the sentence vector of the current text;

a text classification module, which classifies the current text according to the sentence vector.

In this embodiment, the text classification module inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction and outputs the category of the current text.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar between embodiments, reference may be made to each other. Since the system embodiment is basically similar to the method embodiment, its description is relatively brief, and reference may be made to the relevant parts of the description of the method embodiment.

In this document, the terms "comprising", "including", or any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further restriction, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or device that includes the element. In addition, a person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above description shows and describes the preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed here; these should not be regarded as excluding other embodiments, and the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described here through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art without departing from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims of the present invention.

Claims (10)

  1. A sentence vector calculation method based on the chi-square test, characterized by comprising the following steps:
    a. performing word segmentation on the current text and removing stop words to obtain a segmentation result;
    b. calculating a word vector for each word in the segmentation result;
    c. calculating a chi-square value between each word vector and a preset category, and dividing the word vectors into feature words and non-feature words according to the chi-square values;
    d. calculating the frequency of use of the feature words in the preset category, and according to the frequency of use assigning a first weight to each feature word and a second weight to each non-feature word, the first weight being greater than the second weight;
    e. calculating the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and taking it as the sentence vector of the current text.
  2. The sentence vector calculation method based on the chi-square test according to claim 1, characterized in that step a further includes performing context expansion on the current text to obtain an expanded text, and then performing word segmentation on the expanded text.
  3. The sentence vector calculation method based on the chi-square test according to claim 1, characterized in that in step b, the word vectors are calculated by applying a trained word vector model to the segmentation result; the word vector model is trained by performing word segmentation on a training corpus and removing stop words to obtain a segmentation result, which is then fed into the word vector model for training, yielding a word vector for each word of the training corpus.
  4. The sentence vector calculation method based on the chi-square test according to claim 1, characterized in that in step c, a chi-square value between each word vector and a preset category is calculated, where the preset category is obtained either by recognizing the category of each word vector with a preset classification algorithm or by labelling each word vector with a category, giving the category to which each word vector belongs.
  5. The sentence vector calculation method based on the chi-square test according to claim 1 or 4, characterized in that in step c, dividing the word vectors into feature words and non-feature words according to the chi-square values means taking word vectors whose chi-square value is less than or equal to a preset value as feature words and word vectors whose chi-square value is greater than the preset value as non-feature words; or sorting the word vectors in ascending order of chi-square value, taking a preset number of top-ranked word vectors as feature words and the remaining word vectors as non-feature words.
  6. The sentence vector calculation method based on the chi-square test according to claim 1, characterized in that in step d, calculating the frequency of use of the feature words in the preset category means classifying the corpus according to the preset categories to obtain text sets of different categories, and then calculating the proportion of each feature word in the text set of each category.
  7. The sentence vector calculation method based on the chi-square test according to claim 6, characterized in that in step d, assigning a first weight to each feature word and a second weight to each non-feature word according to the frequency of use means taking the maximum of the proportions as the weight of the feature word, which gives the first weight, and taking a preset constant as the weight of the non-feature word, which gives the second weight.
  8. The sentence vector calculation method based on the chi-square test according to claim 1, characterized in that in step e, the sentence vector is calculated as follows: for each word vector in the current text, if it is a feature word, multiply the word vector of the feature word by the corresponding first weight and accumulate; if it is a non-feature word, multiply the word vector of the non-feature word by the corresponding second weight and accumulate; finally, divide the sum of the word vectors so obtained by the sum of the weights of all word vectors to obtain the sentence vector; that is:
    Sentence vector = (word vector of feature word 1 * first weight 1 + word vector of feature word 2 * first weight 2 + … + word vector of feature word m * first weight m + word vector of non-feature word 1 * second weight 1 + word vector of non-feature word 2 * second weight 2 + … + word vector of non-feature word n * second weight n) / (first weight 1 + first weight 2 + … + first weight m + second weight 1 + second weight 2 + … + second weight n).
  9. A text classification method, characterized by adopting the chi-square-test-based sentence vector calculation method according to any one of claims 1 to 7, and classifying the current text according to the sentence vector; that is, inputting the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction, and outputting the category of the current text.
  10. A text classification system, characterized by comprising:
    a text preprocessing module, configured to perform word segmentation on the current text and remove stop words to obtain a segmentation result;
    a word vector calculation module, configured to calculate a word vector for each word in the segmentation result;
    a chi-square test module, configured to calculate a chi-square value between each word vector and a preset category, and to divide the word vectors into feature words and non-feature words according to the chi-square values;
    a weight setting module, which calculates the frequency of use of the feature words in the preset category and, according to the frequency of use, assigns a first weight to each feature word and a second weight to each non-feature word, the first weight being greater than the second weight;
    a sentence vector calculation module, which calculates the weighted average of all word vectors from the word vectors of the feature words and non-feature words and their corresponding weights, and takes it as the sentence vector of the current text;
    a text classification module, which classifies the current text according to the sentence vector; that is, inputs the current text and its corresponding sentence vector into a random-forest-based intent recognition classification model for prediction, and outputs the category of the current text.
PCT/CN2019/097187 2018-09-27 2019-07-23 Sentence vector calculation method based on chi-square test, text classification method and system WO2020063071A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811130081.1 2018-09-27
CN201811130081.1A CN109522544A (zh) 2018-09-27 2018-09-27 Sentence vector calculation method based on chi-square test, text classification method and system

Publications (1)

Publication Number Publication Date
WO2020063071A1 (zh)

Family

ID=65769917

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/097187 WO2020063071A1 (zh) 2018-09-27 2019-07-23 Sentence vector calculation method based on chi-square test, text classification method and system

Country Status (2)

Country Link
CN (1) CN109522544A (zh)
WO (1) WO2020063071A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522544A (zh) * 2018-09-27 2019-03-26 厦门快商通信息技术有限公司 Sentence vector calculation method based on chi-square test, text classification method and system
CN110210030B (zh) * 2019-05-31 2021-02-09 腾讯科技(深圳)有限公司 Method and device for sentence analysis
CN110264318A (zh) * 2019-06-26 2019-09-20 拉扎斯网络科技(上海)有限公司 Data processing method and apparatus, electronic device, and storage medium
CN113626587B (zh) * 2020-05-08 2024-03-29 武汉金山办公软件有限公司 Text category recognition method and apparatus, electronic device, and medium
CN111523301B (zh) * 2020-06-05 2023-05-05 泰康保险集团股份有限公司 Contract document compliance checking method and device
CN112035637A (zh) * 2020-08-28 2020-12-04 康键信息技术(深圳)有限公司 Intent recognition method, apparatus, device, and storage medium in the medical field

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014002775A1 (ja) * 2012-06-25 2014-01-03 日本電気株式会社 Synonym extraction system, method, and recording medium
CN107229610A (zh) * 2017-03-17 2017-10-03 咪咕数字传媒有限公司 Method and device for analyzing sentiment data
CN107622333A (zh) * 2017-11-02 2018-01-23 北京百分点信息科技有限公司 Event prediction method, device, and system
CN107818077A (zh) * 2016-09-13 2018-03-20 北京金山云网络技术有限公司 Sensitive content recognition method and device
CN109522544A (zh) * 2018-09-27 2019-03-26 厦门快商通信息技术有限公司 Sentence vector calculation method based on chi-square test, text classification method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033964B (zh) * 2011-01-13 2012-05-09 北京邮电大学 Text classification method based on block division and position weighting
WO2014141452A1 (ja) * 2013-03-14 2014-09-18 株式会社 東芝 Document analysis device and document analysis program
CN104156452A (zh) * 2014-08-18 2014-11-19 中国人民解放军国防科学技术大学 Web page text summary generation method and apparatus
CN105183831A (zh) * 2015-08-31 2015-12-23 上海德唐数据科技有限公司 Method for classifying question texts of different subjects
CN105224695B (zh) * 2015-11-12 2018-04-20 中南大学 Information-entropy-based text feature quantification method and apparatus, and text classification method and apparatus
CN106897428B (zh) * 2017-02-27 2022-08-09 腾讯科技(深圳)有限公司 Text classification feature extraction method, text classification method, and apparatus


Also Published As

Publication number Publication date
CN109522544A (zh) 2019-03-26


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19867163

Country of ref document: EP

Kind code of ref document: A1