CN108804512B - Apparatus, method and computer-readable storage medium for generating text classification model

Info

Publication number: CN108804512B
Application number: CN201810361702.0A
Authority: CN
Inventors: 王健宗; 吴天博; 黄章成; 肖京
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-04-20
Filing date: 2018-04-20
Publication date: 2020-11-24
Anticipated expiration: 2038-04-20
Also published as: CN108804512A; WO2019200806A1

Abstract

The invention discloses an apparatus for generating a text classification model, comprising a memory and a processor. The memory stores a model generation program executable on the processor. When executed by the processor, the program implements the following steps: obtaining a word segmentation dictionary for the financial field and a text corpus for the financial field; selecting candidate new words from the text corpus and adding them to the word segmentation dictionary; obtaining a sample set and labeling the training samples in the sample set; segmenting the training samples in the sample set with a preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added, and extracting word vectors; and, based on the AdaBoost algorithm, inputting the word vectors and the labeled category information into multiple weak classifiers for training to obtain a text classification model. The invention also provides a method for generating a text classification model and a computer-readable storage medium. The invention addresses the problem that the prior art cannot classify the sentiment orientation of texts in the financial field.

Description

Apparatus, method and computer-readable storage medium for generating text classification model

Technical Field

The present invention relates to the technical field of text classification, and in particular to an apparatus, a method, and a computer-readable storage medium for generating a text classification model.

Background

With the development of the Internet and information technology, more and more institutions and individuals express their views, attitudes, and positions on all kinds of matters online, for example through news comments, forums, and social networking sites. This mass of information has commercial value for e-commerce, market forecasting, and other areas. The financial industry in particular is the industry in which online information is growing fastest and which is most affected by it. Analyzing the sentiment orientation of financial text to support deeper research has therefore gradually become an important topic.

Text sentiment orientation analysis is a part of text sentiment analysis. Through it, the positive or negative orientation of a text can be determined. In the financial field, news and public opinion are important indicators of how well markets and industries are doing and of investors' enthusiasm for trading, so analyzing the sentiment orientation of financial texts is of great importance to the study of financial markets. However, the prior art lacks a scheme for classifying the sentiment orientation of texts in the financial field, so such classification cannot be achieved.

Summary of the Invention

The present invention provides an apparatus, a method, and a computer-readable storage medium for generating a text classification model. Its main purpose is to propose an apparatus for generating a text classification model that can be used to classify the sentiment orientation of texts in the financial field, thereby solving the problem that the prior art cannot perform such classification.

To achieve the above purpose, the present invention provides an apparatus for generating a text classification model. The apparatus includes a memory and a processor, the memory stores a model generation program executable on the processor, and the model generation program, when executed by the processor, implements the following steps:

obtaining a financial-field word segmentation dictionary constructed from collected financial-field vocabulary, and a preset financial-field text corpus;

selecting candidate new words from the text corpus according to a preset algorithm, and adding them to the word segmentation dictionary;

obtaining a sample set, and labeling the training samples in the sample set with categories according to a preset sentiment orientation classification scheme;

segmenting the training samples in the sample set with a preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added;

extracting word vectors from the segmentation results and, based on the AdaBoost algorithm, inputting the word vectors corresponding to the training samples and the labeled category information into multiple preset weak classifiers for training, and combining the trained weak classifiers into a text classification model for the financial field.

Optionally, the step of selecting candidate new words from the text corpus according to a preset algorithm and adding them to the word segmentation dictionary includes:

segmenting the text corpus with the word segmentation algorithm, based on the word segmentation dictionary, and obtaining a candidate word set from the segmentation results;

calculating the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and adding the first candidate new words to the word segmentation dictionary;

segmenting the text corpus with the word segmentation algorithm, based on the word segmentation dictionary to which the first candidate new words have been added, and training a word vector model on the segmented text corpus;

using the trained word vector model to calculate the semantic similarity between the words in the segmentation results and the first candidate new words;

taking the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and adding the second candidate new words to the word segmentation dictionary.

Optionally, the processor may further execute the model generation program so that, after the step of taking the words whose semantic similarity is greater than the second preset threshold as second candidate new words and adding them to the word segmentation dictionary, the following step is also implemented:

calculating the word frequency of each second candidate new word in the text corpus, and using the calculated word frequency as the weight of that second candidate new word in the word segmentation dictionary.

Optionally, the step of obtaining a sample set and labeling the training samples in the sample set with categories according to a preset sentiment orientation classification scheme includes:

obtaining a sample set, obtaining the multiple labels that multiple annotators assign to each training sample in the sample set according to the preset sentiment orientation classification scheme, and selecting, from those labels, the label that occurs most often as the labeling result of the corresponding training sample.

Optionally, the weak classifiers include a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm.

In addition, to achieve the above purpose, the present invention also provides a method for generating a text classification model, the method comprising:

Optionally, after the step of taking the words whose semantic similarity is greater than the second preset threshold as second candidate new words and adding them to the word segmentation dictionary, the method further includes the step of:

In addition, to achieve the above purpose, the present invention also provides a computer-readable storage medium on which a model generation program is stored. The model generation program can be executed by one or more processors to implement the steps of the method for generating a text classification model described above.

The apparatus, method, and computer-readable storage medium for generating a text classification model proposed by the present invention construct a financial-field word segmentation dictionary from collected financial-field vocabulary, obtain a preset financial-field text corpus, derive a candidate word set from the text corpus, and select candidate new words from that set to add to the word segmentation dictionary. A sample set is obtained, and the training samples in it are labeled with categories according to a preset sentiment orientation classification scheme. Based on the word segmentation dictionary to which the candidate new words have been added, the training samples are segmented with a preset word segmentation algorithm and word vectors are extracted from the segmentation results. Based on the AdaBoost algorithm, the word vectors corresponding to the training samples and the labeled category information are input into multiple preset weak classifiers for training, and the trained weak classifiers are combined into a text classification model for the financial field. In this scheme, candidate new words are mined from the financial-field text corpus and added to the word segmentation dictionary, the samples in the sample set are segmented with the updated dictionary and labeled according to the preset sentiment orientation classification scheme, and a text classification model is finally obtained by training. The model can be applied to sentiment orientation classification problems in the financial field.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of a preferred embodiment of the apparatus for generating a text classification model according to the present invention;

FIG. 2 is a schematic diagram of the program modules of the model generation program in an embodiment of the apparatus for generating a text classification model according to the present invention;

FIG. 3 is a flowchart of a preferred embodiment of the method for generating a text classification model according to the present invention.

The realization of the purpose, the functional features, and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.

The present invention provides an apparatus for generating a text classification model. FIG. 1 is a schematic diagram of a preferred embodiment of this apparatus.

In this embodiment, the apparatus for generating a text classification model may be a PC (personal computer), or a terminal device such as a smartphone, a tablet computer, or a portable computer. The apparatus 1 for generating a text classification model includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.

The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disc. In some embodiments, the memory 11 may be an internal storage unit of the apparatus 1, for example its hard disk. In other embodiments, the memory 11 may be an external storage device of the apparatus 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the apparatus 1. The memory 11 may also include both an internal storage unit and an external storage device of the apparatus 1. The memory 11 can be used not only to store application software installed on the apparatus 1 and various kinds of data, such as the code of the model generation program 01, but also to temporarily store data that has been output or is to be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. It runs the program code stored in the memory 11 or processes data, for example by executing the model generation program 01.

The communication bus 13 provides the connections and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is typically used to establish a communication connection between the apparatus 1 and other electronic devices.

FIG. 1 shows only the apparatus 1 for generating a text classification model with components 11-14 and the model generation program 01. It should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead.

Optionally, the apparatus 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or display unit, and is used to display the information processed in the apparatus 1 and a visual user interface.

In the embodiment of the apparatus 1 shown in FIG. 1, the model generation program 01 is stored in the memory 11. When the processor 12 executes the model generation program 01 stored in the memory 11, the following steps are implemented:

A1. Obtain a financial-field word segmentation dictionary constructed from collected financial-field vocabulary, and a preset financial-field text corpus.

First, a general-purpose (all-domain) word segmentation dictionary is obtained, and the collected financial-field vocabulary is added to it to form the financial-field word segmentation dictionary. The financial-field vocabulary comes mainly from three sources: professional financial terms, such as "Williams indicator" (威廉指标), "moving average" (移动平均线), and "convertible bond" (可转债); financial forum expressions, such as the language users of stock trading forums use when commenting on stocks; and Internet slang and special symbols used in the financial field, such as "junk stock" (垃圾股).
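The construction in step A1 amounts to a set union. The following is a minimal Python sketch, assuming the dictionaries are plain word lists; the function name `build_financial_dictionary`, the sample general-purpose words, and the forum term are hypothetical and not taken from the patent.

```python
def build_financial_dictionary(general_words, *financial_sources):
    """Merge a general-purpose word list with collected financial vocabulary.

    Each source is an iterable of words; blank entries are ignored.
    """
    dictionary = set(general_words)
    for source in financial_sources:
        dictionary.update(word.strip() for word in source if word.strip())
    return dictionary


# Illustrative inputs (assumptions, apart from the terms quoted in the text).
general_dictionary = {"市场", "指标", "股票"}
professional_terms = ["威廉指标", "移动平均线", "可转债"]
forum_terms = ["抄底"]          # hypothetical forum expression
internet_terms = ["垃圾股"]

financial_dictionary = build_financial_dictionary(
    general_dictionary, professional_terms, forum_terms, internet_terms
)
```

A set is used here only because membership tests dominate during dictionary-based segmentation; a real segmenter may require its own dictionary file format.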

A2. Select candidate new words from the text corpus according to a preset algorithm, and add them to the word segmentation dictionary.

Starting from the word segmentation dictionary above, candidate new words are selected from a new text corpus and added to the current dictionary. Specifically, step A2 includes:

A21. Based on the word segmentation dictionary, segment the text corpus with the word segmentation algorithm, and obtain a candidate word set from the segmentation results;

A22. Calculate the information gain of each candidate word in the candidate word set, select the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and add the first candidate new words to the word segmentation dictionary;

A23. Based on the word segmentation dictionary to which the first candidate new words have been added, segment the text corpus with the word segmentation algorithm, and train a word vector model on the segmented text corpus;

A24. Use the trained word vector model to calculate the semantic similarity between the words in the segmentation results and the first candidate new words;

A25. Take the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and add the second candidate new words to the word segmentation dictionary.

The text corpus used to expand the word segmentation dictionary is obtained first. Specifically, a web crawler collects a large amount of financial news text related to the financial topic to be analyzed from financial websites to form the text corpus. The crawled data is preprocessed to remove useless information such as garbled characters and web escape sequences, and the remaining text data is kept as the text corpus. Next, the large volume of text data in the corpus is classified by sentiment orientation through manual annotation, that is, category labels are added to the text data.
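The preprocessing described above can be sketched as follows, using only the Python standard library. The patent does not define which characters count as garbled, so the whitelist in `NOISE_RE` (CJK ideographs, ASCII letters and digits, whitespace, and common punctuation) is an assumption, as are the names `clean_text`, `TAG_RE`, and `NOISE_RE`.

```python
import html
import re

# Assumption: HTML tags are residue of crawling and carry no text content.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: anything outside this whitelist is treated as garbled.
NOISE_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s，。！？、；：,.!?%;:()（）\-]")


def clean_text(raw):
    """Decode web escape sequences, drop tags and garbled characters."""
    text = html.unescape(raw)        # e.g. "&nbsp;" becomes a no-break space
    text = TAG_RE.sub(" ", text)     # remove leftover markup
    text = NOISE_RE.sub("", text)    # remove characters outside the whitelist
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("<p>股市&nbsp;大涨\ufffd 5%</p>")
```

In this sketch the replacement character U+FFFD, a common sign of decoding errors, falls outside the whitelist and is removed along with other unexpected symbols.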

With the current word segmentation dictionary serving as the dictionary of the preset word segmentation algorithm, the text corpus is segmented. The stop words in the segmentation results are then filtered out using a preset stop word list, which removes irrelevant words, and the remaining segmentation results form the candidate word set. Each segmentation result carries the same category label as the text it came from.
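The patent does not name the preset word segmentation algorithm. As an illustration only, the sketch below uses forward maximum matching, a simple dictionary-based method, as a stand-in, and then filters stop words while keeping each text's label. The function names and the sample data are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy dictionary segmentation: take the longest dictionary word
    starting at each position, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_candidate_set(labeled_texts, dictionary, stop_words):
    """Segment each (text, label) pair and drop stop words.

    Returns a list of (tokens, label); tokens inherit the label of their text.
    """
    candidates = []
    for text, label in labeled_texts:
        tokens = [t for t in forward_max_match(text, dictionary)
                  if t not in stop_words]
        candidates.append((tokens, label))
    return candidates


dictionary = {"垃圾股", "风险"}
stop_words = {"的", "很"}
candidates = build_candidate_set([("垃圾股的风险很大", "negative")],
                                 dictionary, stop_words)
```

Because "垃圾股" is in the dictionary it is kept as one token; without that entry the same text would be split into single characters, which is why dictionary coverage matters for the later steps.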

Next, candidate new words are selected from the candidate word set according to information gain. Information gain is an entropy-based evaluation measure. When used for feature selection, it measures how much information the presence or absence of a word provides for deciding whether a text belongs to a given class. It is defined as the difference between the amount of information before and after a feature is observed in the documents. In its standard form, which is consistent with the terms defined below, the calculation formula is:

$$IG(t_i) = -\sum_{j=1}^{|c|} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{|c|} P(C_j \mid t_i)\log P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{|c|} P(C_j \mid \bar{t}_i)\log P(C_j \mid \bar{t}_i)$$

In the above formula, $P(C_j)$ is the probability that category $C_j$ appears in the data set, $P(t_i)$ is the probability that feature item $t_i$ appears in the data set, $P(C_j \mid t_i)$ is the probability that a document in which $t_i$ appears is judged to belong to category $C_j$, $P(\bar{t}_i)$ is the probability that feature item $t_i$ does not appear, $P(C_j \mid \bar{t}_i)$ is the probability that a document in which $t_i$ does not appear belongs to category $C_j$, and $|c|$ is the total number of categories. Here a category is a sentiment orientation class and a feature item is a candidate word. All of these probabilities can be computed from statistics of the candidate words in the text corpus.

The usefulness of a candidate word is judged from its calculated information gain: the larger the information gain, the more useful the word is for classification. The candidate words in the candidate word set whose information gain is greater than the first preset threshold are taken as first candidate new words and added to the current word segmentation dictionary, which expands the dictionary.
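The information-gain selection can be sketched in Python as follows. This is an illustration under stated assumptions: probabilities are estimated as document-level relative frequencies, the logarithm is base 2, and the function names, sample documents, and threshold value are hypothetical.

```python
import math


def _sum_p_log_p(labels, classes):
    """Return sum over classes of p * log2(p), with 0 * log(0) taken as 0."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for c in classes:
        p = labels.count(c) / total
        if p > 0:
            result += p * math.log2(p)
    return result


def information_gain(word, docs, classes):
    """docs is a list of (set_of_tokens, label) pairs."""
    n = len(docs)
    all_labels = [label for _, label in docs]
    with_word = [label for tokens, label in docs if word in tokens]
    without_word = [label for tokens, label in docs if word not in tokens]
    entropy = -_sum_p_log_p(all_labels, classes)
    return (entropy
            + len(with_word) / n * _sum_p_log_p(with_word, classes)
            + len(without_word) / n * _sum_p_log_p(without_word, classes))


def select_first_candidates(words, docs, classes, threshold):
    """Keep the words whose information gain exceeds the first threshold."""
    return {w for w in words if information_gain(w, docs, classes) > threshold}


classes = ["positive", "negative"]
docs = [
    ({"涨停", "股票"}, "positive"),
    ({"涨停", "股票"}, "positive"),
    ({"跌停", "股票"}, "negative"),
    ({"股票"}, "negative"),
]
first_candidates = select_first_candidates(
    {"涨停", "跌停", "股票"}, docs, classes, threshold=0.5
)
```

In this toy corpus "涨停" separates the two classes perfectly and receives the maximum gain of 1 bit, whereas "股票" appears in every document and receives a gain of 0, so only the former passes the threshold.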

Based on the expanded word segmentation dictionary, the same word segmentation algorithm is applied to the same text corpus to obtain segmentation results. The segmented corpus is used to train a word vector model, and the trained model is used to compute a word vector for each word in the segmentation results. From these vectors, the semantic similarity between each segmented word and the first candidate new words is calculated. A word whose semantic similarity is greater than the second preset threshold is taken as a second candidate new word, and the second candidate new words selected from the segmentation results are added to the word segmentation dictionary, expanding it once more.
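The similarity-based selection can be sketched as below. The patent does not state which similarity measure is used, so cosine similarity is an assumption, as is skipping words that are already in the dictionary. The vectors are hand-written toy values standing in for the output of a trained word vector model, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_second_candidates(vectors, first_candidates, dictionary, threshold):
    """Return words whose similarity to any first candidate exceeds threshold.

    vectors maps each segmented word to its word vector. Words already in
    the dictionary (including the first candidates) are skipped.
    """
    selected = set()
    for word, vec in vectors.items():
        if word in dictionary:
            continue
        if any(cosine(vec, vectors[c]) > threshold
               for c in first_candidates if c in vectors):
            selected.add(word)
    return selected


# Toy vectors; a real model would produce them from the segmented corpus.
vectors = {
    "涨停": [1.0, 0.0],
    "封板": [0.9, 0.1],
    "下跌": [-1.0, 0.2],
}
second_candidates = select_second_candidates(
    vectors, first_candidates={"涨停"}, dictionary={"涨停"}, threshold=0.8
)
```

Only "封板" points in nearly the same direction as the first candidate "涨停", so it is the only word selected.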

It will be understood that, after the text corpus has been segmented with the word segmentation algorithm, the stop words in the segmentation results are removed using the stop word list. Stop words are noisy and carry no meaning for text classification, so removing them improves classification accuracy and reduces the computation needed to select candidate new words.

Through the above steps, the word segmentation dictionary is in effect expanded three times. The first expansion is a preliminary one that adds manually collected financial-field vocabulary. The second selects new words by calculating information gain. The third selects further new words by calculating semantic similarity from word vectors. The second and third expansions each build on the dictionary produced by the preceding expansion. This dynamic expansion screens as many new financial-field words as possible from the corpus. The word segmentation algorithm, with its expanded dictionary, is used to segment the training samples of the classification model. The richer the financial-field vocabulary in the dictionary, the more accurate the segmentation of financial-field text, and the higher the classification accuracy of the trained model.

Optionally, as one implementation, after the second candidate new words have been selected and added to the word segmentation dictionary, the word frequency of each second candidate new word in the text corpus is calculated and used as its weight in the word segmentation dictionary. The word frequency of each first candidate new word can be calculated in the same way and used as its weight in the dictionary.
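The frequency weighting can be sketched with `collections.Counter`. The sketch assumes that "word frequency" means the raw occurrence count in the segmented corpus; a relative frequency would work equally well. The function name and sample corpus are hypothetical.

```python
from collections import Counter


def frequency_weights(segmented_corpus, new_words):
    """Count how often each new word occurs in the segmented corpus.

    segmented_corpus is a list of token lists; the count is used as the
    word's weight in the segmentation dictionary.
    """
    counts = Counter(token for tokens in segmented_corpus for token in tokens)
    return {word: counts[word] for word in new_words}


segmented_corpus = [["涨停", "股票"], ["涨停", "封板"]]
weights = frequency_weights(segmented_corpus, {"涨停", "封板", "可转债"})
```

A `Counter` returns 0 for unseen keys, so a new word that never occurs in the corpus simply gets a weight of 0.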

A3. Obtain a sample set, and label the training samples in the sample set with categories according to a preset sentiment orientation classification scheme.

A sample set for training the text classification model is obtained. For each piece of training data in the sample set, the labels assigned by multiple annotators according to the preset sentiment orientation classification scheme are collected, and the label that occurs most often among them is selected as the labeling result for that piece of training data. The user can set the sentiment orientation classification scheme according to the financial question to be analyzed. For example, texts from a stock forum may be divided into hold, sell, and buy; stock discussion texts from Weibo or forums may be divided into positive, negative, and neutral; and financial news texts may be divided into positive, negative, and neutral.
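The majority-vote labeling can be sketched as follows. The patent does not say how ties are resolved. In this sketch the label that was seen first among the tied ones wins, which is the behavior of `Counter.most_common` in Python 3.7 and later. The function names and sample data are hypothetical.

```python
from collections import Counter


def majority_label(annotations):
    """Return the label that occurs most often among the annotators' labels."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    return Counter(annotations).most_common(1)[0][0]


def label_samples(annotated_samples):
    """Map each (text, [labels from annotators]) pair to (text, final label)."""
    return [(text, majority_label(labels)) for text, labels in annotated_samples]


labeled = label_samples([
    ("今天大盘涨停", ["buy", "buy", "hold"]),
    ("这只垃圾股风险很大", ["sell", "hold", "sell"]),
])
```

Using an odd number of annotators reduces the chance of ties when there are only two categories, though ties remain possible with three or more categories.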

A4. Based on the word segmentation dictionary to which the candidate new words have been added, segment the training samples in the sample set with the preset word segmentation algorithm.

A5. Extract word vectors from the segmentation results and, based on the AdaBoost algorithm, input the word vectors corresponding to the training samples and the labeled category information into multiple preset weak classifiers for training, and combine the trained weak classifiers into a text classification model for the financial field.

After the training samples have been labeled, they are segmented with the preset word segmentation algorithm, based on the word segmentation dictionary that has been expanded several times, and the trained word vector model is used to extract word vectors from the segmentation results. Note that the same word segmentation algorithm is used throughout this embodiment.
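Turning a segmented sample into classifier input can be sketched as a lookup in an embedding table. The patent does not describe the input layout, so the fixed length, the zero-padding, and the zero vector for out-of-vocabulary tokens are assumptions, and the embedding values are toy numbers rather than the output of a trained model.

```python
def tokens_to_matrix(tokens, embedding, dim, max_len):
    """Look up a vector for each token and pad or truncate to max_len rows.

    Tokens missing from the embedding table map to a zero vector.
    """
    zero = [0.0] * dim
    rows = [list(embedding.get(token, zero)) for token in tokens[:max_len]]
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows


embedding = {"垃圾股": [0.1, 0.2], "风险": [0.3, 0.4]}   # toy 2-dimensional vectors
matrix = tokens_to_matrix(["垃圾股", "未知", "风险"], embedding, dim=2, max_len=4)
```

A fixed number of rows is convenient because neural classifiers such as convolutional networks usually expect inputs of uniform shape.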

In this embodiment, to improve the accuracy of the text classification model, the word2vec model and the GloVe (Global Vectors for Word Representation) model are each used to extract word vectors, so that two kinds of word vectors are obtained for each word segmentation result. In addition, a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm are used as weak classifiers. Feeding each of the two kinds of word vectors into each weak classifier yields, in practice, six weak classification models. Based on the adaboost algorithm, each weak classifier is trained using the samples in the sample set. During training, if a sample has been classified correctly, its weight is reduced when the next sample set is constructed; if a sample has not been classified correctly, its weight is increased when the next sample set is constructed. The sample set with updated weights is used to train the next classifier, and the training process proceeds iteratively in this way. In addition, after the training of each weak classifier is finished, the weight of a weak classifier with a small classification error rate is increased so that it plays a larger role in the final classification function, and the weight of a weak classifier with a large classification error rate is reduced so that it plays a smaller role in the final classification function. Each weak classifier is trained iteratively according to the above process. The weak classifiers obtained from each round of training are fused to form the final text classification model. This text classification model can be used to classify the sentiment tendency of texts in the financial field, for example to judge whether a stock discussion text in a forum is negative, positive, or neutral.
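The reweighting described above can be sketched as follows. This is a minimal illustration, assuming the multi-class SAMME variant of adaboost (the text only names "adaboost" and gives no formulas); the function name `samme_round` and the toy labels are hypothetical, and the weak classifier is represented only by its predictions.

```python
import math

def samme_round(weights, y_true, y_pred, n_classes):
    """One boosting round: returns (classifier_weight, new_sample_weights).

    weights -- current sample weights, summing to 1
    y_true  -- gold labels of the training samples
    y_pred  -- predictions of the weak classifier trained in this round
    """
    # Weighted error rate of this weak classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    # A smaller error rate gives the classifier a larger weight (SAMME formula).
    alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
    # Scale up misclassified samples, leave correct ones unchanged, then
    # renormalize, which lowers the relative weight of correct samples.
    raw = [w * math.exp(alpha) if t != p else w
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

labels = ["positive", "negative", "neutral", "positive"]
preds = ["positive", "negative", "positive", "positive"]  # third sample is wrong
alpha, new_weights = samme_round([0.25] * 4, labels, preds, n_classes=3)
```

After this round the misclassified third sample carries more weight than the other three, so the next weak classifier is pushed to get it right.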

The apparatus for generating a text classification model proposed in this embodiment mines the text corpus of the financial field, screens out as many new financial-field words as possible from the corpus, and adds them to the word segmentation dictionary, thereby expanding the financial-field word segmentation dictionary. The dictionary expanded with financial vocabulary is then used to segment the training samples in the sample set, the sample data in the sample set is labeled by category according to the preset sentiment tendency classification mode, and a text classification model is finally obtained by training. This model can be applied to sentiment tendency classification problems in the financial field.

Optionally, in other embodiments, the model generation program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. A module, as referred to in the present invention, is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution of the model generation program in the apparatus for generating a text classification model.

For example, FIG. 2 is a schematic diagram of the program modules of the model generation program in an embodiment of the apparatus for generating a text classification model of the present invention. In this embodiment, the model generation program may be divided into a data acquisition module 10, a new word selection module 20, a sample labeling module 30, a sample word segmentation module 40, and a model training module 50. Exemplarily:

The data acquisition module 10 is configured to: acquire a word segmentation dictionary of the financial field constructed from collected financial-field vocabulary, and a preset text corpus of the financial field;

The new word selection module 20 is configured to: select candidate new words from the text corpus according to a preset algorithm, and add them to the word segmentation dictionary;

The sample labeling module 30 is configured to: acquire a sample set, and label the training samples in the sample set by category according to a preset sentiment tendency classification mode;

The sample word segmentation module 40 is configured to: perform word segmentation on the training samples in the sample set using a preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added;

The model training module 50 is configured to: extract word vectors from the word segmentation results and, based on the adaboost algorithm, input the word vectors corresponding to the training samples together with the labeled category information into a plurality of preset weak classifiers for training, and combine the trained weak classifiers into a text classification model for the financial field.

The functions or operation steps implemented when the above program modules (the data acquisition module 10, the new word selection module 20, the sample labeling module 30, the sample word segmentation module 40, and the model training module 50) are executed are substantially the same as those of the above embodiment, and are not repeated here.
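The division into five modules can be pictured as a small pipeline. The sketch below is only an illustration of how the modules hand data to each other; the class name `ModelGenerationProgram`, the field names, and the type signatures are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple

@dataclass
class ModelGenerationProgram:
    # Each field stands in for one program module (reference numerals 10 to 50).
    acquire_data: Callable[[], Tuple[Set[str], List[str]]]            # module 10
    select_new_words: Callable[[Set[str], List[str]], Set[str]]       # module 20
    label_samples: Callable[[], List[Tuple[str, str]]]                # module 30
    segment_samples: Callable[[Set[str], Sequence[str]], List[List[str]]]  # module 40
    train_model: Callable[[List[List[str]], List[str]], object]       # module 50

    def run(self):
        dictionary, corpus = self.acquire_data()
        # Expand the dictionary with the candidate new words.
        dictionary = dictionary | self.select_new_words(dictionary, corpus)
        samples = self.label_samples()
        texts = [text for text, _ in samples]
        labels = [label for _, label in samples]
        # Segment the samples with the expanded dictionary, then train.
        tokens = self.segment_samples(dictionary, texts)
        return self.train_model(tokens, labels)
```

Keeping each module behind a plain callable makes it easy to swap, for example, the new word selection algorithm without touching the training step.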

In addition, the present invention also provides a method for generating a text classification model. FIG. 3 is a flowchart of a preferred embodiment of the method for generating a text classification model of the present invention. The method may be performed by an apparatus, and the apparatus may be implemented in software and/or hardware.

In this embodiment, the method for generating a text classification model includes:

Step S10: acquire a word segmentation dictionary of the financial field constructed from collected financial-field vocabulary, and a preset text corpus of the financial field.

Step S20: select candidate new words from the text corpus according to a preset algorithm, and add them to the word segmentation dictionary.

On the basis of the above word segmentation dictionary, candidate new words are selected from the new text corpus and added to the current word segmentation dictionary. Specifically, step S20 includes: performing word segmentation on the text corpus using the word segmentation algorithm, based on the word segmentation dictionary, and obtaining a candidate word set from the segmentation result; calculating the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and adding the first candidate new words to the word segmentation dictionary; performing word segmentation on the text corpus again using the word segmentation algorithm, based on the word segmentation dictionary to which the first candidate new words have been added, and training a word vector model on the segmented text corpus; using the trained word vector model to calculate the semantic similarity between the words in the segmentation result and the first candidate new words; and taking the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and adding the second candidate new words to the word segmentation dictionary.
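The second expansion stage of step S20 can be sketched as follows. This is an illustration under assumptions: cosine similarity is used as the semantic similarity measure (the text does not name one), words already in the dictionary are skipped, and the toy vectors are made-up values standing in for a trained word vector model. The names `cosine` and `second_candidates` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def second_candidates(vectors, first_candidates, dictionary, threshold):
    """Words not yet in the dictionary whose similarity to any first
    candidate new word exceeds the second preset threshold."""
    found = set()
    for word, vec in vectors.items():
        if word in dictionary or word in first_candidates:
            continue
        if any(cosine(vec, vectors[c]) > threshold
               for c in first_candidates if c in vectors):
            found.add(word)
    return found

# Toy vectors standing in for a trained word vector model (made-up values).
vectors = {
    "涨停": [0.9, 0.1, 0.0],
    "跌停": [0.8, 0.3, 0.1],
    "天气": [0.0, 0.2, 0.9],
}
new_words = second_candidates(vectors, {"涨停"}, dictionary=set(), threshold=0.8)
```

With these toy values only "跌停" lies close enough to the first candidate "涨停" to be added as a second candidate new word.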

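The information gain of each candidate word is calculated by the following formula:

$$IG(t_i) = -\sum_{j=1}^{|c|} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{|c|} P(C_j \mid t_i)\log P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{|c|} P(C_j \mid \bar{t}_i)\log P(C_j \mid \bar{t}_i)$$

In the above formula, $P(C_j)$ represents the probability that class $C_j$ occurs in the data set, $P(t_i)$ represents the probability that the feature item $t_i$ occurs in the data set, $P(C_j \mid t_i)$ represents the probability that the feature item $t_i$ appears in a document of class $C_j$, $P(\bar{t}_i)$ represents the probability that the feature item $t_i$ does not appear, $P(C_j \mid \bar{t}_i)$ represents the probability that the feature item $t_i$ appears in a category other than $C_j$, and $|c|$ is the total number of categories. The categories are the sentiment tendency classes, the feature items are the candidate words, and the probability values can be obtained from the statistics of the candidate words in the text corpus.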
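The information gain used in step S20 to pick first candidate new words can be sketched in code. This is an illustration only, assuming the standard information gain definition from text feature selection, a base-2 logarithm (the text does not state a base), and a corpus whose documents already carry a sentiment category; the helper names and the four toy documents are hypothetical.

```python
import math
from collections import Counter

def _plogp(p):
    """p * log2(p), with the usual convention that 0 * log(0) = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def information_gain(docs, term):
    """docs: list of (set_of_words, category). Returns the gain of `term` in bits."""
    n = len(docs)
    class_counts = Counter(category for _, category in docs)
    with_t = [category for words, category in docs if term in words]
    without_t = [category for words, category in docs if term not in words]
    # Entropy of the class distribution: -sum P(C_j) log P(C_j).
    gain = -sum(_plogp(k / n) for k in class_counts.values())
    # Add the two conditional terms, one for documents containing the term
    # and one for documents that do not contain it.
    for subset in (with_t, without_t):
        if subset:
            p_subset = len(subset) / n
            counts = Counter(subset)
            gain += p_subset * sum(_plogp(k / len(subset)) for k in counts.values())
    return gain

docs = [
    ({"利好", "买入"}, "positive"),
    ({"利好", "上涨"}, "positive"),
    ({"利空", "下跌"}, "negative"),
    ({"利空", "卖出"}, "negative"),
]
ig_good = information_gain(docs, "利好")    # separates the classes perfectly
ig_absent = information_gain(docs, "天气")  # never occurs, carries no information
```

A candidate word would then be kept as a first candidate new word when its gain exceeds the first preset threshold.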

Step S30: acquire a sample set, and label the training samples in the sample set by category according to a preset sentiment tendency classification mode.

A sample set for training the text classification model is acquired. For each piece of training data in the sample set, the labels assigned to it by multiple annotators according to the preset sentiment tendency classification mode are collected, and the label that occurs most often among them is taken as the labeling result for that piece of training data. The user can set the sentiment tendency classification mode according to the financial problem to be analyzed, for example: dividing stock forum texts into hold, sell, and buy; dividing Weibo stock discussion texts into positive, negative, and neutral; or dividing financial news texts into positive, negative, and neutral.
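The majority-vote labeling described above can be sketched as follows. The function name `majority_label` and the sample identifiers are hypothetical. The text does not say how ties are resolved; in this sketch the label seen first wins, which is how `Counter.most_common` orders equal counts.

```python
from collections import Counter

def majority_label(annotations):
    """Return the label that occurs most often among the annotators' labels."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    return Counter(annotations).most_common(1)[0][0]

# Labels from three annotators for two stock forum texts (toy data).
sample_annotations = {
    "text_1": ["buy", "hold", "buy"],
    "text_2": ["sell", "sell", "hold"],
}
labels = {text_id: majority_label(votes)
          for text_id, votes in sample_annotations.items()}
```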

Step S40: perform word segmentation on the training samples in the sample set using a preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added.
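To illustrate why the expanded dictionary matters in step S40, the sketch below uses forward maximum matching as a stand-in for the preset word segmentation algorithm. That choice is an assumption: the text does not say which algorithm is preset. The function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

base = {"股票", "今天"}
expanded = base | {"涨停板"}  # a candidate new word added to the dictionary

before = forward_max_match("今天股票涨停板", base)
after = forward_max_match("今天股票涨停板", expanded)
```

With the base dictionary the financial term is broken into single characters; with the expanded dictionary it is kept as one token, which is what the later word vector extraction relies on.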

Step S50: extract word vectors from the word segmentation results and, based on the adaboost algorithm, input the word vectors corresponding to the training samples together with the labeled category information into a plurality of preset weak classifiers for training, and combine the trained weak classifiers into a text classification model for the financial field.

In this embodiment, to improve the accuracy of the text classification model, the word2vec model and the GloVe model are each used to extract word vectors, so that two kinds of word vectors are obtained for each word segmentation result. In addition, a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm are used as weak classifiers. Feeding each of the two kinds of word vectors into each weak classifier yields, in practice, six weak classification models. Based on the adaboost algorithm, each weak classifier is trained using the samples in the sample set. During training, if a sample has been classified correctly, its weight is reduced when the next sample set is constructed; if a sample has not been classified correctly, its weight is increased when the next sample set is constructed. The sample set with updated weights is used to train the next classifier, and the training process proceeds iteratively in this way. In addition, after the training of each weak classifier is finished, the weight of a weak classifier with a small classification error rate is increased so that it plays a larger role in the final classification function, and the weight of a weak classifier with a large classification error rate is reduced so that it plays a smaller role in the final classification function. Each weak classifier is trained iteratively according to the above process. The weak classifiers obtained from each round of training are fused to form the final text classification model. This text classification model can be used to classify the sentiment tendency of texts in the financial field, for example to judge whether a stock discussion text in a forum is negative, positive, or neutral.
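The fusion of the six weak classification models can be sketched as a weighted vote. This is an illustration under assumptions: the final classification function is taken to be the usual adaboost weighted vote, and the classifier weights and predictions below are made-up numbers rather than results from the patent. The names `WEAK_MODELS` and `weighted_vote` are hypothetical.

```python
from collections import defaultdict
from itertools import product

EMBEDDINGS = ["word2vec", "glove"]
ARCHITECTURES = ["cnn", "rnn", "lstm"]
# Three weak classifier types times two word vector types gives six models.
WEAK_MODELS = list(product(ARCHITECTURES, EMBEDDINGS))

def weighted_vote(predictions, alphas):
    """Sum the classifier weights per predicted label and return the label
    with the largest total."""
    scores = defaultdict(float)
    for key, label in predictions.items():
        scores[label] += alphas[key]
    return max(scores, key=scores.get)

# Made-up classifier weights and predictions for one forum post.
alphas = dict(zip(WEAK_MODELS, [1.2, 0.4, 0.3, 0.2, 0.9, 0.5]))
predictions = dict(zip(WEAK_MODELS, ["positive", "negative", "negative",
                                     "negative", "positive", "neutral"]))
final_label = weighted_vote(predictions, alphas)
```

Here three models vote "negative" and only two vote "positive", yet the two "positive" models have the lowest error rates and therefore the largest weights, so the fused model outputs "positive".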

The method for generating a text classification model proposed in this embodiment mines the text corpus of the financial field, screens out as many new financial-field words as possible from the corpus, and adds them to the word segmentation dictionary, thereby expanding the financial-field word segmentation dictionary. The dictionary expanded with financial vocabulary is then used to segment the training samples in the sample set, the sample data in the sample set is labeled by category according to the preset sentiment tendency classification mode, and a text classification model is finally obtained by training. This model can be applied to sentiment tendency classification problems in the financial field.

In addition, an embodiment of the present invention also provides a computer-readable storage medium on which a model generation program is stored, the model generation program being executable by one or more processors to implement the following operations:

The specific implementations of the computer-readable storage medium of the present invention are substantially the same as the embodiments of the apparatus and method for generating a text classification model described above, and are not repeated here.

It should be noted that the serial numbers of the above embodiments of the present invention are for description only and do not indicate that one embodiment is better or worse than another. The terms "comprise", "include", or any other variant thereof as used herein are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that includes the element.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

1. An apparatus for generating a text classification model, the apparatus comprising a memory and a processor, the memory having stored thereon a model generation program executable on the processor, the model generation program, when executed by the processor, implementing the steps of:

acquiring a word segmentation dictionary of the financial field constructed based on the collected financial field vocabularies and a preset text corpus of the financial field;

selecting candidate new words from the text corpus according to a preset algorithm and adding the candidate new words to the word segmentation dictionary, wherein the selecting and adding comprise the following steps:

based on the word segmentation dictionary, performing word segmentation processing on the text corpus by using the word segmentation algorithm, and acquiring a candidate word set according to the word segmentation result;

calculating information gain of each candidate word in the candidate word set, selecting a candidate word with the information gain larger than a first preset threshold value as a first candidate new word, and adding the first candidate new word into the word segmentation dictionary;

based on the word segmentation dictionary added with the first candidate new word, performing word segmentation on the text corpus by using the word segmentation algorithm, and training a word vector model by using the text corpus after word segmentation processing;

calculating semantic similarity between the words in the word segmentation result and the first candidate new words by using a word vector model obtained by training;

taking words with semantic similarity larger than a second preset threshold as second candidate new words, and adding the second candidate new words into the word segmentation dictionary; the information gain of each candidate word is calculated by the following formula:

$$IG(t_i) = -\sum_{j=1}^{|c|} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{|c|} P(C_j \mid t_i)\log P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{|c|} P(C_j \mid \bar{t}_i)\log P(C_j \mid \bar{t}_i)$$

in the above formula, $P(C_j)$ represents the probability that class $C_j$ occurs in the data set, $P(t_i)$ represents the probability that the feature item $t_i$ occurs in the data set, $P(C_j \mid t_i)$ represents the probability that the feature item $t_i$ appears in a document of class $C_j$, $P(\bar{t}_i)$ represents the probability that the feature item $t_i$ does not appear, $P(C_j \mid \bar{t}_i)$ represents the probability that the feature item $t_i$ appears in a category other than $C_j$, and $|c|$ is the total number of categories, wherein the categories refer to the classification of emotional tendency, the feature items are the candidate words, and the probability values can be obtained from the statistics of the candidate words in the text corpus;

acquiring a sample set, and carrying out class marking on training samples in the sample set according to a preset emotional tendency classification mode;

based on the word segmentation dictionary added with the candidate new words, performing word segmentation processing on training samples in the sample set by using a preset word segmentation algorithm;

extracting word vectors according to word segmentation results, inputting the word vectors corresponding to training samples and labeled class information into a plurality of preset weak classifiers for training based on an adaboost algorithm, and combining the plurality of weak classifiers obtained by training into a text classification model in the financial field.

2. The apparatus for generating a text classification model according to claim 1, wherein the processor is further configured to execute the model generation program to implement, after the step of taking the words with semantic similarity larger than the second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary, the following step:

calculating the word frequency of the second candidate new word in the text corpus, and taking the calculated word frequency as the weight of the second candidate new word in the word segmentation dictionary.

3. The apparatus for generating a text classification model according to claim 1 or 2, wherein the step of acquiring a sample set and labeling the training samples in the sample set by category according to a preset emotional tendency classification mode comprises:

acquiring a sample set, acquiring a plurality of pieces of labeling information obtained by a plurality of annotators labeling the training samples in the sample set according to the preset emotional tendency classification mode, and selecting the labeling information that occurs most frequently among the plurality of pieces of labeling information as the labeling result of the corresponding training sample.

4. The apparatus for generating a text classification model according to claim 1 or 2, wherein the weak classifiers include a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm.

5. A method for generating a text classification model, the method comprising:

6. The method for generating a text classification model according to claim 5, wherein after the step of taking the words with semantic similarity larger than the second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary, the method further comprises the steps of:

7. The method for generating a text classification model according to claim 5 or 6, wherein the step of acquiring the sample set and labeling the training samples in the sample set by category according to the preset emotional tendency classification mode comprises the following steps:

8. A computer-readable storage medium having stored thereon a model generation program executable by one or more processors to perform the steps of the method of generating a text classification model according to any one of claims 5 to 7.