CN108804512B - Apparatus, method and computer-readable storage medium for generating text classification model - Google Patents

Info

Publication number
CN108804512B
CN108804512B CN201810361702.0A CN201810361702A CN108804512B CN 108804512 B CN108804512 B CN 108804512B CN 201810361702 A CN201810361702 A CN 201810361702A CN 108804512 B CN108804512 B CN 108804512B
Authority
CN
China
Prior art keywords
word
word segmentation
candidate
text
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810361702.0A
Other languages
Chinese (zh)
Other versions
CN108804512A (en
Inventor
王健宗
吴天博
黄章成
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810361702.0A
Priority to PCT/CN2018/102400 (WO2019200806A1)
Publication of CN108804512A
Application granted
Publication of CN108804512B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an apparatus for generating a text classification model. The apparatus comprises a memory and a processor, and the memory stores a model generation program executable on the processor. When executed by the processor, the program implements the following steps: obtaining a word segmentation dictionary for the financial field and a text corpus from the financial field; selecting candidate new words from the text corpus and adding them to the word segmentation dictionary; obtaining a sample set and labeling the training samples in it by category; segmenting the training samples with a preset word segmentation algorithm, based on the dictionary to which the candidate new words have been added, and extracting word vectors; and, based on the AdaBoost algorithm, inputting the word vectors and the labeled category information into multiple weak classifiers for training to obtain a text classification model. The invention also provides a method for generating a text classification model and a computer-readable storage medium. The invention addresses the inability of the prior art to classify financial-field texts by sentiment orientation.

Figure 201810361702

Description

Apparatus, method and computer-readable storage medium for generating text classification model

Technical Field

The present invention relates to the technical field of text classification, and in particular to an apparatus, a method and a computer-readable storage medium for generating a text classification model.

Background

With the development of the Internet and information technology, more and more institutions and individuals publish their views, attitudes and positions on all kinds of matters online, for example in news comments, forums and social networking sites. This mass of information has commercial value for e-commerce, market forecasting and other areas. The financial industry in particular is the industry whose Internet information is growing fastest and which is most affected by it. Sentiment orientation analysis of financial text, as a basis for deeper research, has therefore gradually become an important topic.

Text sentiment orientation analysis is part of text sentiment analysis. It reveals whether a text is positive or negative in tone. In the financial field, news and public opinion are important indicators of how well the market and individual industries are doing and of how eager investors are to trade, so analyzing the sentiment orientation of financial texts is of pivotal importance to the study of financial markets. The prior art, however, lacks a scheme for classifying financial-field texts by sentiment orientation, so such classification cannot be carried out.

Summary of the Invention

The present invention provides an apparatus, a method and a computer-readable storage medium for generating a text classification model. Its main purpose is to provide an apparatus that generates a text classification model usable for sentiment orientation classification of financial-field texts, so as to solve the problem that the prior art cannot classify financial-field texts by sentiment orientation.

To achieve the above object, the present invention provides an apparatus for generating a text classification model. The apparatus includes a memory and a processor; the memory stores a model generation program executable on the processor, and the model generation program, when executed by the processor, implements the following steps:

Obtaining a financial-field word segmentation dictionary built from collected financial-field vocabulary, and a preset financial-field text corpus;

Selecting candidate new words from the text corpus according to a preset algorithm, and adding them to the word segmentation dictionary;

Obtaining a sample set, and labeling the training samples in the sample set by category according to a preset sentiment orientation classification scheme;

Segmenting the training samples in the sample set with a preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added;

Extracting word vectors from the segmentation results and, based on the AdaBoost algorithm, inputting the word vectors corresponding to the training samples together with the labeled category information into multiple preset weak classifiers for training, then combining the trained weak classifiers into a text classification model for the financial field.

Optionally, the step of selecting candidate new words from the text corpus according to a preset algorithm and adding them to the word segmentation dictionary includes:

Segmenting the text corpus with the word segmentation algorithm, based on the word segmentation dictionary, and obtaining a candidate word set from the segmentation results;

Calculating the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and adding the first candidate new words to the word segmentation dictionary;

Segmenting the text corpus with the word segmentation algorithm, based on the word segmentation dictionary to which the first candidate new words have been added, and training a word vector model on the segmented text corpus;

Using the trained word vector model to calculate the semantic similarity between the words in the segmentation results and the first candidate new words;

Taking the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and adding the second candidate new words to the word segmentation dictionary.

Optionally, the processor may further execute the model generation program so that, after the step of taking the words whose semantic similarity is greater than the second preset threshold as second candidate new words and adding them to the word segmentation dictionary, the following step is also implemented:

Calculating the word frequency of each second candidate new word in the text corpus, and using the calculated word frequency as the weight of that second candidate new word in the word segmentation dictionary.
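
The weighting step can be illustrated with a minimal Python sketch. The source does not say whether "word frequency" means a raw count or a relative frequency; the sketch assumes relative frequency over all tokens of the segmented corpus. The function name `term_frequency_weights` and the toy corpus are hypothetical.

```python
from collections import Counter


def term_frequency_weights(segmented_corpus, new_words):
    """Map each new word to its relative frequency in the segmented corpus.

    Assumption: "word frequency" is taken as count / total number of tokens.
    """
    counts = Counter(token for doc in segmented_corpus for token in doc)
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in new_words}


# Toy segmented corpus (hypothetical data): one token list per document.
corpus = [
    ["可转债", "上涨"],
    ["可转债", "垃圾股", "下跌"],
    ["垃圾股", "可转债"],
]
weights = term_frequency_weights(corpus, ["可转债", "垃圾股"])
```

A dictionary-based segmenter can then use these weights to prefer frequent new words when several segmentations are possible.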

Optionally, the step of obtaining a sample set and labeling the training samples in the sample set by category according to a preset sentiment orientation classification scheme includes:

Obtaining a sample set, and obtaining the multiple labels produced when multiple annotators label each training sample in the sample set according to the preset sentiment orientation classification scheme; for each training sample, selecting the label that occurs most often among those labels as the labeling result of that training sample.

Optionally, the weak classifiers include a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm.

In addition, to achieve the above object, the present invention further provides a method for generating a text classification model, the method including:

Obtaining a financial-field word segmentation dictionary built from collected financial-field vocabulary, and a preset financial-field text corpus;

Selecting candidate new words from the text corpus according to a preset algorithm, and adding them to the word segmentation dictionary;

Obtaining a sample set, and labeling the training samples in the sample set by category according to a preset sentiment orientation classification scheme;

Segmenting the training samples in the sample set with a preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added;

Extracting word vectors from the segmentation results and, based on the AdaBoost algorithm, inputting the word vectors corresponding to the training samples together with the labeled category information into multiple preset weak classifiers for training, then combining the trained weak classifiers into a text classification model for the financial field.
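
How AdaBoost combines weak classifiers into one model can be sketched as follows. This is a minimal illustration of standard discrete (binary) AdaBoost, not the patent's exact procedure: one-feature threshold "stumps" stand in for the neural-network weak classifiers, labels are restricted to +1 and -1, and the names `train_stump`, `adaboost_fit` and `adaboost_predict` as well as the toy data are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):
                preds = [sign if x[f] >= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def stump_predict(stump, x):
    _, f, thr, sign = stump
    return sign if x[f] >= thr else -sign


def adaboost_fit(X, y, rounds=3):
    """Train `rounds` weak classifiers, re-weighting samples after each one."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)       # weight of this classifier
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy one-dimensional data that no single stump can separate.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, x) for x in X]
```

In the setting described here, each stump would be replaced by a trained neural classifier, and a multi-class variant of AdaBoost would be needed if the sentiment scheme has more than two categories.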

Optionally, the step of selecting candidate new words from the text corpus according to a preset algorithm and adding them to the word segmentation dictionary includes:

Segmenting the text corpus with the word segmentation algorithm, based on the word segmentation dictionary, and obtaining a candidate word set from the segmentation results;

Calculating the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and adding the first candidate new words to the word segmentation dictionary;

Segmenting the text corpus with the word segmentation algorithm, based on the word segmentation dictionary to which the first candidate new words have been added, and training a word vector model on the segmented text corpus;

Using the trained word vector model to calculate the semantic similarity between the words in the segmentation results and the first candidate new words;

Taking the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and adding the second candidate new words to the word segmentation dictionary.

Optionally, after the step of taking the words whose semantic similarity is greater than the second preset threshold as second candidate new words and adding them to the word segmentation dictionary, the method further includes the step of:

Calculating the word frequency of each second candidate new word in the text corpus, and using the calculated word frequency as the weight of that second candidate new word in the word segmentation dictionary.

Optionally, the step of obtaining a sample set and labeling the training samples in the sample set by category according to a preset sentiment orientation classification scheme includes:

Obtaining a sample set, and obtaining the multiple labels produced when multiple annotators label each training sample in the sample set according to the preset sentiment orientation classification scheme; for each training sample, selecting the label that occurs most often among those labels as the labeling result of that training sample.
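
The majority-vote labeling rule can be sketched in a few lines of Python. The function name `majority_label`, the sample identifiers and the three label names are hypothetical. The source does not specify how ties are broken; in this sketch a tie goes to the label that was seen first.

```python
from collections import Counter


def majority_label(annotations):
    """Return the label chosen by the most annotators for one sample."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    # most_common(1) yields the (label, count) pair with the highest count;
    # among equal counts, the label encountered first wins.
    return Counter(annotations).most_common(1)[0][0]


# Hypothetical annotations from three annotators per sample.
labels_per_sample = {
    "sample-1": ["positive", "positive", "neutral"],
    "sample-2": ["negative", "neutral", "negative"],
}
final_labels = {sid: majority_label(labels)
                for sid, labels in labels_per_sample.items()}
```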

In addition, to achieve the above object, the present invention further provides a computer-readable storage medium on which a model generation program is stored. The model generation program can be executed by one or more processors to implement the steps of the method for generating a text classification model described above.

The apparatus, method and computer-readable storage medium for generating a text classification model proposed by the present invention build a financial-field word segmentation dictionary from collected financial-field vocabulary, obtain a preset financial-field text corpus, derive a candidate word set from the text corpus, and select candidate new words from the candidate word set to add to the word segmentation dictionary. A sample set is obtained, and the training samples in it are labeled by category according to a preset sentiment orientation classification scheme. Based on the word segmentation dictionary to which the candidate new words have been added, the training samples in the sample set are segmented with a preset word segmentation algorithm, and word vectors are extracted from the segmentation results. Based on the AdaBoost algorithm, the word vectors corresponding to the training samples and the labeled category information are input into multiple preset weak classifiers for training, and the trained weak classifiers are combined into a text classification model for the financial field. In this scheme, candidate new words are mined from the financial-field text corpus and added to the word segmentation dictionary, the samples in the sample set are segmented with the updated dictionary and labeled by category according to the preset sentiment orientation classification scheme, and a text classification model is finally obtained by training. The model can be applied to sentiment orientation classification in the financial field.

Description of Drawings

FIG. 1 is a schematic diagram of a preferred embodiment of the apparatus for generating a text classification model according to the present invention;

FIG. 2 is a schematic diagram of the program modules of the model generation program in an embodiment of the apparatus for generating a text classification model according to the present invention;

FIG. 3 is a flowchart of a preferred embodiment of the method for generating a text classification model according to the present invention.

The realization of the objects, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings and in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.

The present invention provides an apparatus for generating a text classification model. FIG. 1 is a schematic diagram of a preferred embodiment of the apparatus.

In this embodiment, the apparatus for generating a text classification model may be a PC (Personal Computer), or a terminal device such as a smartphone, a tablet computer or a portable computer. The apparatus 1 includes at least a memory 11, a processor 12, a communication bus 13 and a network interface 14.

The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk or an optical disc. In some embodiments the memory 11 may be an internal storage unit of the apparatus 1, for example its hard disk. In other embodiments the memory 11 may be an external storage device of the apparatus 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the apparatus 1. The memory 11 may also include both an internal storage unit and an external storage device of the apparatus 1. The memory 11 can be used not only to store the application software installed on the apparatus 1 and various kinds of data, such as the code of the model generation program 01, but also to temporarily store data that has been output or is about to be output.

In some embodiments the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used to run program code stored in the memory 11 or to process data, for example to execute the model generation program 01.

The communication bus 13 provides the connections and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the apparatus 1 and other electronic devices.

FIG. 1 shows only the apparatus 1 with components 11-14 and the model generation program 01. It should be understood that not all of the components shown are required; more or fewer components may be implemented instead.

Optionally, the apparatus 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. In some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or display unit; it shows the information processed in the apparatus 1 and presents a visual user interface.

In the embodiment of the apparatus 1 shown in FIG. 1, the memory 11 stores the model generation program 01, and the processor 12 implements the following steps when it executes the model generation program 01 stored in the memory 11:

A1. Obtain a financial-field word segmentation dictionary built from collected financial-field vocabulary, and a preset financial-field text corpus.

First, a general-purpose word segmentation dictionary is obtained, and the collected financial-field vocabulary is added to it to form the financial-field word segmentation dictionary. The financial-field vocabulary comes mainly from three sources: professional financial terms, such as "Williams indicator", "moving average" and "convertible bond"; financial forum language, such as the expressions users of stock trading forums use when commenting on stocks; and Internet slang and special symbols used in the financial field, such as "junk stock".

A2. Select candidate new words from the text corpus according to a preset algorithm, and add them to the word segmentation dictionary.

Starting from the word segmentation dictionary above, candidate new words are selected from a new text corpus and added to the current dictionary. Specifically, step A2 includes:

A21. Segment the text corpus with the word segmentation algorithm, based on the word segmentation dictionary, and obtain a candidate word set from the segmentation results;

A22. Calculate the information gain of each candidate word in the candidate word set, select the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and add the first candidate new words to the word segmentation dictionary;

A23. Segment the text corpus with the word segmentation algorithm, based on the word segmentation dictionary to which the first candidate new words have been added, and train a word vector model on the segmented text corpus;

A24. Use the trained word vector model to calculate the semantic similarity between the words in the segmentation results and the first candidate new words;

A25. Take the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and add the second candidate new words to the word segmentation dictionary.

A text corpus for expanding the word segmentation dictionary is obtained. Specifically, a web crawler collects from financial websites a large volume of financial news text related to the financial topic to be analyzed, forming the text corpus. The crawled data is preprocessed: useless content such as garbled characters and web escape sequences is removed, and the text data is retained as the text corpus. The text data in the corpus is then classified by sentiment orientation through manual annotation, that is, category labels are added to the text data.
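
The preprocessing of crawled pages might look like the following Python sketch. The source names only the goal (removing garbled characters and web escape sequences), so the concrete rules here are assumptions: HTML entities are decoded, leftover tags are dropped, and the Unicode replacement character and control characters are treated as "garbled". The function name `clean_crawled_text` is hypothetical.

```python
import html
import re


def clean_crawled_text(raw):
    """Remove web escapes, leftover tags and garbled characters from raw text."""
    text = html.unescape(raw)                       # "&amp;" -> "&", etc.
    text = re.sub(r"<[^>]+>", " ", text)            # drop leftover HTML tags
    text = text.replace("\ufffd", "")               # replacement char from bad decoding
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)  # control chars
    text = re.sub(r"\s+", " ", text)                # collapse whitespace
    return text.strip()


cleaned = clean_crawled_text("可转债&amp;股票<br/>大涨\ufffd\n")
```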

With the current word segmentation dictionary serving as the dictionary of the preset word segmentation algorithm, the text corpus is segmented. Stop words are then filtered out of the segmentation results using a preset stop word list, which removes irrelevant vocabulary, and the remaining segmentation results form the candidate word set. Each segmented word carries the same category label as the text data it came from.
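
Dictionary-based segmentation followed by stop word filtering can be sketched as below. The source leaves the "preset word segmentation algorithm" unspecified; forward maximum matching is used here purely as an assumed stand-in, and the dictionary, stop word list and `max_len` value are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy segmentation: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def candidate_words(text, dictionary, stop_words):
    """Segment the text and drop stop words to obtain candidate words."""
    return [t for t in forward_max_match(text, dictionary) if t not in stop_words]


dictionary = {"可转债", "移动平均线", "上涨"}
stop_words = {"的"}
tokens = forward_max_match("可转债的移动平均线上涨", dictionary)
candidates = candidate_words("可转债的移动平均线上涨", dictionary, stop_words)
```

Adding a new word to `dictionary` changes how later text is split, which is why the dictionary is expanded before the training samples are segmented.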

Next, candidate new words are selected from the candidate word set according to information gain. Information gain is an entropy-based evaluation measure. When used for feature selection, it measures how much information the presence or absence of a word contributes to deciding whether a text belongs to a given category. It is defined as the difference in information content before and after a feature value appears in the document, and is calculated as:

[Formula image BDA0001636138810000071, not reproduced in this text: information gain of feature item $t_i$]

In the above formula, $P(C_j)$ denotes the probability that category $C_j$ appears in the data set, $P(t_i)$ denotes the probability that feature item $t_i$ appears in the data set, and $P(C_j|t_i)$ denotes the probability that feature item $t_i$ appears in documents judged to be of category $C_j$. The symbol shown in image BDA0001636138810000072 denotes the probability that feature item $t_i$ does not appear, the symbol shown in image BDA0001636138810000073 denotes the probability that feature item $t_i$ appears in documents that do not belong to category $C_j$, and $|c|$ is the total number of categories. Here, the categories are the sentiment tendency classes and the feature items are the candidate words. All of these probabilities can be computed from statistics on the candidate words in the text corpus.
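The formula itself survives only as an image reference in this text. For orientation, the information gain commonly used for feature selection in text classification has the form below. It is a reconstruction under stated assumptions, not a transcription of the patent's image: it assumes the two image-only symbols are $P(\bar{t_i})$ and $P(C_j \mid \bar{t_i})$, where $P(C_j \mid \bar{t_i})$ is read in the standard way as the probability of category $C_j$ given that $t_i$ is absent. The base of the logarithm is also an assumption.

```latex
IG(t_i) = -\sum_{j=1}^{|c|} P(C_j)\,\log P(C_j)
        + P(t_i)\sum_{j=1}^{|c|} P(C_j \mid t_i)\,\log P(C_j \mid t_i)
        + P(\bar{t_i})\sum_{j=1}^{|c|} P(C_j \mid \bar{t_i})\,\log P(C_j \mid \bar{t_i})
```

The first term is the entropy of the category distribution; the other two subtract the conditional entropy that remains once the presence or absence of $t_i$ is known.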

The usefulness of a candidate word is judged by its computed information gain: the larger the information gain, the more useful the word is for classification. Candidate words whose information gain exceeds the first preset threshold are taken as first candidate new words and added to the current word segmentation dictionary, thereby expanding it.
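A minimal sketch of threshold-based selection by information gain follows. It assumes the standard document-level definition of information gain with base-2 logarithms; the function names, the toy documents, and the threshold value 0.5 are illustrative and do not come from the patent.

```python
import math
from collections import Counter


def _plogp(p):
    """p * log2(p), with the convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def information_gain(term, docs):
    """docs is a list of (set_of_tokens, label) pairs."""
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    with_term = [label for tokens, label in docs if term in tokens]
    without_term = [label for tokens, label in docs if term not in tokens]
    # Entropy of the category distribution.
    gain = -sum(_plogp(c / n) for c in class_counts.values())
    # Subtract the conditional entropy given presence and absence of the term.
    for subset in (with_term, without_term):
        if subset:
            counts = Counter(subset)
            gain += (len(subset) / n) * sum(_plogp(c / len(subset)) for c in counts.values())
    return gain


def select_first_candidates(candidates, docs, threshold):
    """Keep candidate words whose information gain exceeds the threshold."""
    return [w for w in candidates if information_gain(w, docs) > threshold]


# Toy labelled documents, purely illustrative.
docs = [
    ({"涨停", "股票"}, "positive"),
    ({"涨停", "大盘"}, "positive"),
    ({"跌停", "股票"}, "negative"),
    ({"跌停", "大盘"}, "negative"),
]
print(select_first_candidates(["涨停", "跌停", "股票", "大盘"], docs, threshold=0.5))
```

In this toy data, "涨停" and "跌停" separate the two classes perfectly and pass the threshold, while "股票" and "大盘" occur equally in both classes and are rejected.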

Based on the expanded word segmentation dictionary, the same word segmentation algorithm is applied to the same text corpus to obtain a new segmentation result. A word vector model is trained on the segmented corpus, and the trained model is used to compute the word vector of each word in the segmentation result. The semantic similarity between each of these words and the first candidate new words is then computed from the word vectors. A word whose semantic similarity exceeds the second preset threshold is taken as a second candidate new word, and the second candidate new words selected from the segmentation result are added to the word segmentation dictionary, expanding it again.
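The similarity-based selection might look like the sketch below. Several points are assumptions rather than details from the patent: cosine similarity is used as the semantic similarity measure, a word qualifies if it is close to any one first candidate new word, and the word vectors are a small hand-written table standing in for the output of a trained word vector model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_second_candidates(vectors, first_candidates, dictionary, threshold):
    """Return words not yet in the dictionary whose similarity to at least one
    first candidate new word exceeds the threshold (the 'any' rule is assumed)."""
    selected = []
    for word, vec in vectors.items():
        if word in dictionary or word in first_candidates:
            continue
        if any(cosine(vec, vectors[f]) > threshold for f in first_candidates if f in vectors):
            selected.append(word)
    return selected


# Hypothetical 2-dimensional vectors; a real model would produce much larger ones.
vectors = {"涨停": [1.0, 0.0], "封板": [0.9, 0.1], "天气": [0.0, 1.0]}
print(select_second_candidates(vectors, {"涨停"}, {"涨停"}, threshold=0.8))
```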

It should be understood that, after the text corpus is segmented, stop words are removed from the segmentation result using the stop word list. Stop words are noisy and carry no meaning for text classification, so removing them improves classification accuracy and reduces the computation needed when selecting candidate new words.

Through the above steps, the word segmentation dictionary is in effect expanded three times. The first expansion is a preliminary one that adds manually collected financial vocabulary; the second selects new words by computing information gain; the third selects further new words by computing semantic similarity from word vectors. The second and third expansions each build on the dictionary produced by the previous expansion. This dynamic expansion screens as many new financial terms as possible from the corpus. The word segmentation algorithm, using the expanded dictionary, is then applied to the training samples of the classification model. The richer the financial vocabulary in the dictionary, the more accurate the segmentation of financial text, and the higher the classification accuracy of the trained model.

Optionally, in one embodiment, after the second candidate new words are selected and added to the word segmentation dictionary, the word frequency of each second candidate new word in the text corpus is computed and used as that word's weight in the dictionary. The word frequency of each first candidate new word can be computed in the same way and used as its weight in the dictionary.
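Using word frequency as the dictionary weight can be sketched in a few lines. The function name and the toy segmented corpus are illustrative, and raw counts are assumed as the frequency measure since the patent does not say whether the frequency is normalized.

```python
from collections import Counter


def frequency_weights(new_words, segmented_corpus):
    """Map each new word to its raw occurrence count in the segmented corpus."""
    counts = Counter(token for doc in segmented_corpus for token in doc)
    return {word: counts[word] for word in new_words}


segmented_corpus = [["封板", "涨停"], ["封板", "大盘"]]
print(frequency_weights(["封板", "涨停"], segmented_corpus))
```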

A3. Acquire a sample set, and label the training samples in the sample set by category according to a preset sentiment tendency classification scheme.

Acquire the sample set used to train the text classification model. For each piece of training data in the sample set, obtain the annotations given by multiple annotators according to the preset sentiment tendency classification scheme, and take the annotation that occurs most often as the label of that training data. The user can set the classification scheme to suit the financial question being analyzed, for example: classifying stock forum text as hold, sell, or buy; classifying stock discussion text on Weibo or forums as positive, negative, or neutral; or classifying financial news text as positive, negative, or neutral.
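The majority-vote labeling rule can be sketched as below. The function name is illustrative, and tie handling is an assumption: the patent does not say what happens when two labels occur equally often, and this sketch simply keeps the label seen first.

```python
from collections import Counter


def majority_label(annotations):
    """Return the most frequent annotation; on a tie, the one seen first wins."""
    return Counter(annotations).most_common(1)[0][0]


print(majority_label(["buy", "hold", "buy"]))  # buy
```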

A4. Based on the word segmentation dictionary to which the candidate new words have been added, perform word segmentation on the training samples in the sample set using the preset word segmentation algorithm.

A5. Extract word vectors from the segmentation results and, based on the AdaBoost algorithm, input the word vectors of the training samples together with their category labels into multiple preset weak classifiers for training; combine the trained weak classifiers into a text classification model for the financial field.

After the training samples have been labeled, they are segmented with the preset word segmentation algorithm using the repeatedly expanded dictionary, and the trained word vector model is used to extract word vectors from the segmentation results. Note that the same word segmentation algorithm is used throughout this embodiment.

In this embodiment, to improve the accuracy of the text classification model, word vectors are extracted with both the word2vec model and the GloVe (Global Vectors for Word Representation) model, so each segmentation result yields two kinds of word vectors. In addition, a classifier based on a convolutional neural network, a classifier based on a recurrent neural network, and a classifier based on a long short-term memory network are used as weak classifiers. Feeding each of the two kinds of word vectors into each weak classifier gives six weak classification models in practice. Each weak classifier is trained on the samples in the sample set based on the AdaBoost algorithm. During training, if a sample has been classified correctly, its weight is decreased when the next sample set is constructed; if it has been misclassified, its weight is increased. The reweighted sample set is used to train the next classifier, and training proceeds iteratively in this way. After each weak classifier has been trained, weak classifiers with a small classification error rate are given larger weights so that they have more influence on the final classification function, and weak classifiers with a large error rate are given smaller weights so that they have less influence. The weak classifiers obtained in each round are then combined into the final text classification model. The model can be used to classify the sentiment tendency of financial text, for example to judge whether stock discussion text in a forum is negative, positive, or neutral.
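One round of the sample reweighting described above can be sketched as follows. The patent does not give its update formulas, so this assumes the multi-class SAMME variant of AdaBoost, which suits a three-class sentiment scheme; the function name and the toy inputs are illustrative.

```python
import math


def adaboost_round(weights, correct, n_classes):
    """One SAMME-style round (an assumed variant).

    weights: current sample weights; correct: whether each sample was
    classified correctly by the weak classifier just trained.
    Returns the classifier weight alpha and the renormalized sample weights.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
    # Misclassified samples are scaled up; normalization then scales the rest down.
    updated = [w if ok else w * math.exp(alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False], n_classes=3)
print(alpha, new_weights)
```

With one of four equally weighted samples misclassified, the misclassified sample's share of the total weight rises and the other three fall, and a lower error rate yields a larger alpha, matching the behavior the paragraph describes.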

The apparatus for generating a text classification model proposed in this embodiment mines the financial text corpus to screen as many new financial terms as possible and adds them to the word segmentation dictionary, thereby expanding the financial word segmentation dictionary. It segments the training samples in the sample set using the expanded dictionary, labels the sample data by category according to the preset sentiment tendency classification scheme, and finally trains a text classification model that can be applied to sentiment tendency classification in the financial field.

Optionally, in other embodiments, the model generation program may be divided into one or more modules that are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. A module, as the term is used in the present invention, is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution of the model generation program in the apparatus for generating a text classification model.

For example, FIG. 2 is a schematic diagram of the program modules of the model generation program in an embodiment of the apparatus for generating a text classification model of the present invention. In this embodiment, the model generation program may be divided into a data acquisition module 10, a new word selection module 20, a sample labeling module 30, a sample word segmentation module 40, and a model training module 50. By way of example:

The data acquisition module 10 is configured to: acquire a word segmentation dictionary for the financial field built from collected financial vocabulary, and a preset text corpus for the financial field;

The new word selection module 20 is configured to: select candidate new words from the text corpus according to a preset algorithm, and add them to the word segmentation dictionary;

The sample labeling module 30 is configured to: acquire a sample set, and label the training samples in the sample set by category according to a preset sentiment tendency classification scheme;

The sample word segmentation module 40 is configured to: perform word segmentation on the training samples in the sample set using a preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added;

The model training module 50 is configured to: extract word vectors from the segmentation results and, based on the AdaBoost algorithm, input the word vectors of the training samples together with their category labels into multiple preset weak classifiers for training, and combine the trained weak classifiers into a text classification model for the financial field.

The functions or operation steps implemented when the data acquisition module 10, the new word selection module 20, the sample labeling module 30, the sample word segmentation module 40, and the model training module 50 are executed are substantially the same as those of the above embodiment and are not repeated here.

In addition, the present invention provides a method for generating a text classification model. FIG. 3 is a flowchart of a preferred embodiment of this method. The method may be performed by an apparatus, which may be implemented in software and/or hardware.

In this embodiment, the method for generating a text classification model includes:

Step S10: acquire a word segmentation dictionary for the financial field built from collected financial vocabulary, and a preset text corpus for the financial field.

First, a general-domain word segmentation dictionary is obtained, and the collected financial vocabulary is added to it to form the financial word segmentation dictionary. The financial vocabulary comes mainly from three sources: professional financial terms, such as "Williams indicator", "moving average", and "convertible bond"; financial forum language, such as the expressions users of stock trading forums use when commenting on stocks; and internet slang and specific symbols used in the financial field, such as "junk stock".
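As a rough illustration, building the initial financial dictionary amounts to merging the general-domain vocabulary with the three collected sources. The function name and all sample entries are hypothetical, and the forum term "抄底" is an invented example that does not appear in the patent.

```python
def build_finance_dictionary(general_words, *finance_sources):
    """Union of the general-domain vocabulary and any number of finance word lists."""
    merged = set(general_words)
    for source in finance_sources:
        merged.update(source)
    return merged


general = {"股票", "上涨"}
professional_terms = {"威廉指标", "移动平均线", "可转债"}
forum_terms = {"抄底"}   # hypothetical forum expression
slang_terms = {"垃圾股"}
finance_dictionary = build_finance_dictionary(general, professional_terms, forum_terms, slang_terms)
print(len(finance_dictionary))  # 7
```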

Step S20: select candidate new words from the text corpus according to a preset algorithm, and add them to the word segmentation dictionary.

Building on the above word segmentation dictionary, candidate new words are selected from the new text corpus and added to the current dictionary. Specifically, step S20 includes: performing word segmentation on the text corpus using the word segmentation algorithm based on the word segmentation dictionary, and obtaining a candidate word set from the segmentation result; computing the information gain of each candidate word in the set, selecting candidate words whose information gain exceeds a first preset threshold as first candidate new words, and adding them to the dictionary; segmenting the text corpus again using the word segmentation algorithm based on the dictionary containing the first candidate new words, and training a word vector model on the segmented corpus; using the trained word vector model to compute the semantic similarity between the words in the segmentation result and the first candidate new words; and taking words whose semantic similarity exceeds a second preset threshold as second candidate new words and adding them to the dictionary.

Acquire the text corpus used to expand the word segmentation dictionary. Specifically, a web crawler captures a large volume of financial news text related to the financial topic to be analyzed from financial websites, and this text forms the corpus. The crawled data is preprocessed: useless information such as garbled characters and web escape symbols is removed, and the remaining text data is retained as the text corpus. Next, the text data in the corpus is manually annotated by sentiment tendency, that is, category annotation information is added to the text data.

With the current word segmentation dictionary serving as the dictionary of the preset word segmentation algorithm, word segmentation is performed on the text corpus. Stop words are then filtered out of the segmentation result according to a preset stop word list to remove irrelevant words, and the remaining segmentation results form the candidate word set. Each segmentation result carries the same category annotation information as the text data it came from.

Next, candidate new words are selected from the candidate word set according to information gain. Information gain is an entropy-based evaluation method. When used for feature selection, it measures how much information the presence or absence of a word provides for judging whether a text belongs to a given category. It is defined as the difference in the amount of information before and after a feature appears in a document, and is calculated as:

[Formula image BDA0001636138810000121, not reproduced in this text: information gain of feature item $t_i$]

In the above formula, $P(C_j)$ denotes the probability that category $C_j$ appears in the data set, $P(t_i)$ denotes the probability that feature item $t_i$ appears in the data set, and $P(C_j|t_i)$ denotes the probability that feature item $t_i$ appears in documents judged to be of category $C_j$. The symbol shown in image BDA0001636138810000122 denotes the probability that feature item $t_i$ does not appear, the symbol shown in image BDA0001636138810000123 denotes the probability that feature item $t_i$ appears in documents that do not belong to category $C_j$, and $|c|$ is the total number of categories. Here, the categories are the sentiment tendency classes and the feature items are the candidate words. All of these probabilities can be computed from statistics on the candidate words in the text corpus.

The usefulness of a candidate word is judged by its computed information gain: the larger the information gain, the more useful the word is for classification. Candidate words whose information gain exceeds the first preset threshold are taken as first candidate new words and added to the current word segmentation dictionary, thereby expanding it.

Based on the expanded word segmentation dictionary, the same word segmentation algorithm is applied to the same text corpus to obtain a new segmentation result. A word vector model is trained on the segmented corpus, and the trained model is used to compute the word vector of each word in the segmentation result. The semantic similarity between each of these words and the first candidate new words is then computed from the word vectors. A word whose semantic similarity exceeds the second preset threshold is taken as a second candidate new word, and the second candidate new words selected from the segmentation result are added to the word segmentation dictionary, expanding it again.

It should be understood that, after the text corpus is segmented, stop words are removed from the segmentation result using the stop word list. Stop words are noisy and carry no meaning for text classification, so removing them improves classification accuracy and reduces the computation needed when selecting candidate new words.

Through the above steps, the word segmentation dictionary is in effect expanded three times. The first expansion is a preliminary one that adds manually collected financial vocabulary; the second selects new words by computing information gain; the third selects further new words by computing semantic similarity from word vectors. The second and third expansions each build on the dictionary produced by the previous expansion. This dynamic expansion screens as many new financial terms as possible from the corpus. The word segmentation algorithm, using the expanded dictionary, is then applied to the training samples of the classification model. The richer the financial vocabulary in the dictionary, the more accurate the segmentation of financial text, and the higher the classification accuracy of the trained model.

Optionally, in one embodiment, after the second candidate new words are selected and added to the word segmentation dictionary, the word frequency of each second candidate new word in the text corpus is computed and used as that word's weight in the dictionary. The word frequency of each first candidate new word can be computed in the same way and used as its weight in the dictionary.

Step S30: acquire a sample set, and label the training samples in the sample set by category according to a preset sentiment tendency classification scheme.

Acquire the sample set used to train the text classification model. For each piece of training data in the sample set, obtain the annotations given by multiple annotators according to the preset sentiment tendency classification scheme, and take the annotation that occurs most often as the label of that training data. The user can set the classification scheme to suit the financial question being analyzed, for example: classifying stock forum text as hold, sell, or buy; classifying Weibo stock discussion text as positive, negative, or neutral; or classifying financial news text as positive, negative, or neutral.

Step S40: based on the word segmentation dictionary to which the candidate new words have been added, perform word segmentation on the training samples in the sample set using the preset word segmentation algorithm.

Step S50: extract word vectors from the segmentation results and, based on the AdaBoost algorithm, input the word vectors of the training samples together with their category labels into multiple preset weak classifiers for training; combine the trained weak classifiers into a text classification model for the financial field.

After the training samples have been labeled, they are segmented with the preset word segmentation algorithm using the repeatedly expanded dictionary, and the trained word vector model is used to extract word vectors from the segmentation results. Note that the same word segmentation algorithm is used throughout this embodiment.

In this embodiment, to improve the accuracy of the text classification model, word vectors are extracted with both the word2vec model and the GloVe model, so each segmentation result yields two kinds of word vectors. In addition, a classifier based on a convolutional neural network, a classifier based on a recurrent neural network, and a classifier based on a long short-term memory network are used as weak classifiers. Feeding each of the two kinds of word vectors into each weak classifier gives six weak classification models in practice. Each weak classifier is trained on the samples in the sample set based on the AdaBoost algorithm. During training, if a sample has been classified correctly, its weight is decreased when the next sample set is constructed; if it has been misclassified, its weight is increased. The reweighted sample set is used to train the next classifier, and training proceeds iteratively in this way. After each weak classifier has been trained, weak classifiers with a small classification error rate are given larger weights so that they have more influence on the final classification function, and weak classifiers with a large error rate are given smaller weights so that they have less influence. The weak classifiers obtained in each round are then combined into the final text classification model. The model can be used to classify the sentiment tendency of financial text, for example to judge whether stock discussion text in a forum is negative, positive, or neutral.
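The six weak models and their combination into a final decision can be sketched as below. The patent does not spell out how the weak classifiers are fused, so a weighted vote is assumed here, in which each model's vote counts in proportion to its classifier weight. The names, the predictions, and the weights are made up for illustration.

```python
from collections import defaultdict
from itertools import product

EMBEDDINGS = ["word2vec", "GloVe"]
ARCHITECTURES = ["CNN", "RNN", "LSTM"]
# Two kinds of word vectors times three network types gives six weak models.
weak_models = list(product(EMBEDDINGS, ARCHITECTURES))


def weighted_vote(predictions, alphas):
    """Sum each model's weight onto the label it predicts; return the top label."""
    scores = defaultdict(float)
    for label, alpha in zip(predictions, alphas):
        scores[label] += alpha
    return max(scores, key=scores.get)


# Hypothetical per-model predictions and classifier weights for one text.
predictions = ["positive", "negative", "negative", "positive", "neutral", "positive"]
alphas = [0.2, 0.9, 0.8, 0.3, 0.1, 0.4]
print(len(weak_models), weighted_vote(predictions, alphas))
```

Here "positive" receives three votes but only about 0.9 total weight, while "negative" receives two votes with about 1.7 total weight, so the more accurate models decide the outcome.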

The method for generating a text classification model proposed in this embodiment mines the financial text corpus to screen as many new financial terms as possible and adds them to the word segmentation dictionary, thereby expanding the financial word segmentation dictionary. It segments the training samples in the sample set using the expanded dictionary, labels the sample data by category according to the preset sentiment tendency classification scheme, and finally trains a text classification model that can be applied to sentiment tendency classification in the financial field.

In addition, an embodiment of the present invention provides a computer-readable storage medium storing a model generation program that can be executed by one or more processors to perform the following operations:

获取基于收集的金融领域词汇构建的金融领域的分词词典,以及预设的金融领域的文本语料;Obtain a word segmentation dictionary in the financial field constructed based on the collected vocabulary in the financial field, and a preset text corpus in the financial field;

根据预设算法从所述文本语料中选择候选新词,添加至所述分词词典;According to a preset algorithm, candidate new words are selected from the text corpus and added to the word segmentation dictionary;

获取样本集,按照预设情感倾向分类模式对所述样本集中的训练样本进行类别标注;Obtain a sample set, and perform category labeling on the training samples in the sample set according to a preset emotional tendency classification mode;

基于添加了候选新词的所述分词词典,使用预设的分词算法对所述样本集中的训练样本进行分词处理;Based on the word segmentation dictionary to which candidate new words are added, using a preset word segmentation algorithm to perform word segmentation processing on the training samples in the sample set;

根据分词结果提取词向量,基于adaboost算法,将训练样本对应的词向量和标注的类别信息输入到预设的多个弱分类器中进行训练,将训练得到的多个弱分类器组合为金融领域的文本分类模型。The word vector is extracted according to the word segmentation result. Based on the adaboost algorithm, the word vector corresponding to the training sample and the labeled category information are input into the preset multiple weak classifiers for training, and the multiple weak classifiers obtained by training are combined into the financial field. text classification model.
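The final training operation can be sketched as a small binary AdaBoost loop in plain Python. This is an illustration under stated assumptions, not the patented implementation: decision stumps stand in for the neural-network weak classifiers, the labels are restricted to two classes (+1 and -1), and a one-dimensional toy feature stands in for the word vectors. All function names and the toy data are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return the (error, feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):
                pred = [sign if x[f] >= thr else -sign for x in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def stump_predict(stump, x):
    _, f, thr, sign = stump
    return sign if x[f] >= thr else -sign


def adaboost(X, y, rounds=3):
    """Train `rounds` stumps, reweighting the samples after each round."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clip the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, x))
             for wi, x, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Combine the weak classifiers by an alpha-weighted vote."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data: no single stump separates these labels, but three combined stumps do.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(X, y, rounds=3)
predictions = [predict(ensemble, x) for x in X]
print(predictions == y)
```

The weighted vote in `predict` corresponds to combining the trained weak classifiers into one model; a multi-class emotional tendency scheme would need a multi-class variant of the same idea.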

The specific implementations of the computer-readable storage medium of the present invention are substantially the same as the embodiments of the apparatus and method for generating a text classification model described above, and are not repeated here.

It should be noted that the serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments. The terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that comprises the element.

From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk, or optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (8)

1. An apparatus for generating a text classification model, the apparatus comprising a memory and a processor, the memory having stored thereon a model generation program executable on the processor, the model generation program, when executed by the processor, implementing the steps of:
acquiring a word segmentation dictionary of the financial field constructed based on collected financial field vocabulary, and a preset text corpus of the financial field;
selecting candidate new words from the text corpus according to a preset algorithm and adding the candidate new words to the word segmentation dictionary, wherein the selecting of candidate new words from the text corpus according to the preset algorithm and the adding of the candidate new words to the word segmentation dictionary comprise:
based on the word segmentation dictionary, performing word segmentation processing on the text corpus by using the word segmentation algorithm, and acquiring a candidate word set according to the word segmentation result;
calculating information gain of each candidate word in the candidate word set, selecting a candidate word with the information gain larger than a first preset threshold value as a first candidate new word, and adding the first candidate new word into the word segmentation dictionary;
based on the word segmentation dictionary added with the first candidate new word, performing word segmentation on the text corpus by using the word segmentation algorithm, and training a word vector model by using the text corpus after word segmentation processing;
calculating semantic similarity between the words in the word segmentation result and the first candidate new words by using a word vector model obtained by training;
taking words with semantic similarity larger than a second preset threshold as second candidate new words, and adding the second candidate new words into the word segmentation dictionary; the information gain calculation formula of each candidate word is as follows:
[Formula image: Figure FDA0002705912400000011]
in the above formula, $P(C_j)$ represents the probability that class $C_j$ occurs in the data set; $P(t_i)$ represents the probability that feature item $t_i$ occurs in the data set; $P(C_j|t_i)$ represents the probability that feature item $t_i$ appears in documents judged to belong to class $C_j$; the symbol shown in Figure FDA0002705912400000012 represents the probability that feature item $t_i$ does not occur; the symbol shown in Figure FDA0002705912400000013 represents the probability that feature item $t_i$ appears in a category other than $C_j$; and $|c|$ is the total number of categories, wherein the categories refer to the emotional tendency classes, the feature items are the candidate words, and each probability value can be obtained from statistics of the candidate words in the text corpus;
acquiring a sample set, and labeling the training samples in the sample set by class according to a preset emotional tendency classification mode;
based on the word segmentation dictionary to which the candidate new words have been added, performing word segmentation processing on the training samples in the sample set by using a preset word segmentation algorithm;
extracting word vectors according to the word segmentation results, inputting the word vectors corresponding to the training samples and the labeled class information into a plurality of preset weak classifiers for training based on the AdaBoost algorithm, and combining the plurality of weak classifiers obtained by training into a text classification model for the financial field.
2. The apparatus for generating a text classification model according to claim 1, wherein the processor is further configured to execute the model generation program so as to implement, after the step of taking words with semantic similarity greater than the second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary, the following step:
calculating the word frequency of each second candidate new word in the text corpus, and taking the calculated word frequency as the weight of that second candidate new word in the word segmentation dictionary.
3. The apparatus for generating a text classification model according to claim 1 or 2, wherein the step of acquiring a sample set and labeling the training samples in the sample set according to a preset emotional tendency classification mode comprises:
acquiring a sample set, acquiring a plurality of pieces of labeling information produced by a plurality of annotators who label the training samples in the sample set according to the preset emotional tendency classification mode, and selecting, for each training sample, the labeling information that occurs most frequently as the labeling result of that training sample.
4. The apparatus for generating a text classification model according to claim 1 or 2, wherein the weak classifiers include a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm.
5. A method for generating a text classification model, the method comprising:
acquiring a word segmentation dictionary of the financial field constructed based on the collected financial field vocabularies and a preset text corpus of the financial field;
selecting candidate new words from the text corpus according to a preset algorithm and adding the candidate new words to the word segmentation dictionary, wherein the selecting of candidate new words from the text corpus according to the preset algorithm and the adding of the candidate new words to the word segmentation dictionary comprise:
based on the word segmentation dictionary, performing word segmentation processing on the text corpus by using the word segmentation algorithm, and acquiring a candidate word set according to the word segmentation result;
calculating information gain of each candidate word in the candidate word set, selecting a candidate word with the information gain larger than a first preset threshold value as a first candidate new word, and adding the first candidate new word into the word segmentation dictionary;
based on the word segmentation dictionary added with the first candidate new word, performing word segmentation on the text corpus by using the word segmentation algorithm, and training a word vector model by using the text corpus after word segmentation processing;
calculating semantic similarity between the words in the word segmentation result and the first candidate new words by using a word vector model obtained by training;
taking words with semantic similarity larger than a second preset threshold as second candidate new words, and adding the second candidate new words into the word segmentation dictionary; the information gain calculation formula of each candidate word is as follows:
[Formula image: Figure FDA0002705912400000031]
in the above formula, $P(C_j)$ represents the probability that class $C_j$ occurs in the data set; $P(t_i)$ represents the probability that feature item $t_i$ occurs in the data set; $P(C_j|t_i)$ represents the probability that feature item $t_i$ appears in documents judged to belong to class $C_j$; the symbol shown in Figure FDA0002705912400000032 represents the probability that feature item $t_i$ does not occur; the symbol shown in Figure FDA0002705912400000033 represents the probability that feature item $t_i$ appears in a category other than $C_j$; and $|c|$ is the total number of categories, wherein the categories refer to the emotional tendency classes, the feature items are the candidate words, and each probability value can be obtained from statistics of the candidate words in the text corpus;
acquiring a sample set, and labeling the training samples in the sample set by class according to a preset emotional tendency classification mode;
based on the word segmentation dictionary to which the candidate new words have been added, performing word segmentation processing on the training samples in the sample set by using a preset word segmentation algorithm;
extracting word vectors according to the word segmentation results, inputting the word vectors corresponding to the training samples and the labeled class information into a plurality of preset weak classifiers for training based on the AdaBoost algorithm, and combining the plurality of weak classifiers obtained by training into a text classification model for the financial field.
6. The method for generating a text classification model according to claim 5, wherein after the step of taking words with semantic similarity greater than the second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary, the method further comprises:
calculating the word frequency of each second candidate new word in the text corpus, and taking the calculated word frequency as the weight of that second candidate new word in the word segmentation dictionary.
7. The method for generating a text classification model according to claim 5 or 6, wherein the step of acquiring a sample set and labeling the training samples in the sample set according to a preset emotional tendency classification mode comprises:
acquiring a sample set, acquiring a plurality of pieces of labeling information produced by a plurality of annotators who label the training samples in the sample set according to the preset emotional tendency classification mode, and selecting, for each training sample, the labeling information that occurs most frequently as the labeling result of that training sample.
8. A computer-readable storage medium having stored thereon a model generation program executable by one or more processors to perform the steps of the method of generating a text classification model according to any one of claims 5 to 7.
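The information gain screening recited in claims 1 and 5 can be sketched in Python. The formula itself appears in the claims only as an image, so the sketch assumes the common textbook form of information gain for text categorization, which uses the same quantities the claims define: $IG(t_i) = -\sum_{j=1}^{|c|} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{|c|} P(C_j|t_i)\log P(C_j|t_i) + P(\bar{t_i})\sum_{j=1}^{|c|} P(C_j|\bar{t_i})\log P(C_j|\bar{t_i})$. The base-2 logarithm, the function names, the toy documents, and the threshold value of 0.5 are all assumptions made for illustration.

```python
import math


def plogp(p):
    """Return p * log2(p), treating 0 * log(0) as 0."""
    return p * math.log2(p) if p > 0 else 0.0


def information_gain(docs, labels, word):
    """docs: list of token lists; labels: one class label per document."""
    n = len(docs)
    has_word = [word in doc for doc in docs]
    n_t = sum(has_word)          # documents that contain the word
    p_t = n_t / n                # P(t_i)
    gain = 0.0
    for c in sorted(set(labels)):
        n_c = sum(1 for label in labels if label == c)
        n_c_t = sum(1 for label, h in zip(labels, has_word) if label == c and h)
        gain -= plogp(n_c / n)                                   # class entropy term
        if n_t:
            gain += p_t * plogp(n_c_t / n_t)                     # word present
        if n - n_t:
            gain += (1 - p_t) * plogp((n_c - n_c_t) / (n - n_t)) # word absent
    return gain


# Hypothetical segmented corpus with emotional tendency labels.
docs = [
    ["stock", "rally", "company"],
    ["rally", "company"],
    ["default", "company"],
    ["loss", "company"],
]
labels = ["positive", "positive", "negative", "negative"]

FIRST_THRESHOLD = 0.5  # hypothetical value for the first preset threshold
vocabulary = sorted({w for doc in docs for w in doc})
first_candidates = [w for w in vocabulary
                    if information_gain(docs, labels, w) > FIRST_THRESHOLD]
print(first_candidates)
```

In this toy corpus a word that occurs only in one class ("rally") reaches the maximum gain of 1 bit, while a word that occurs in every document ("company") has zero gain and is filtered out by the threshold.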
CN201810361702.0A 2018-04-20 2018-04-20 Apparatus, method and computer-readable storage medium for generating text classification model Active CN108804512B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810361702.0A CN108804512B (en) 2018-04-20 2018-04-20 Apparatus, method and computer-readable storage medium for generating text classification model
PCT/CN2018/102400 WO2019200806A1 (en) 2018-04-20 2018-08-27 Device for generating text classification model, method, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810361702.0A CN108804512B (en) 2018-04-20 2018-04-20 Apparatus, method and computer-readable storage medium for generating text classification model

Publications (2)

Publication Number Publication Date
CN108804512A CN108804512A (en) 2018-11-13
CN108804512B true CN108804512B (en) 2020-11-24

Family

ID=64093733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810361702.0A Active CN108804512B (en) 2018-04-20 2018-04-20 Apparatus, method and computer-readable storage medium for generating text classification model

Country Status (2)

Country Link
CN (1) CN108804512B (en)
WO (1) WO2019200806A1 (en)

Families Citing this family (142)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299276B (en) * 2018-11-15 2021-11-19 创新先进技术有限公司 Method and device for converting text into word embedding and text classification
CN109614499B (en) * 2018-11-22 2023-02-17 创新先进技术有限公司 Dictionary generation method, new word discovery method, device and electronic equipment
CN109783800B (en) * 2018-12-13 2024-04-12 北京百度网讯科技有限公司 Method, device, equipment and storage medium for acquiring emotional keywords
CN109684634B (en) * 2018-12-17 2023-07-25 北京百度网讯科技有限公司 Emotion analysis method, device, equipment and storage medium
CN109741190A (en) * 2018-12-27 2019-05-10 清华大学 A method, system and equipment for the classification of individual stock announcements
CN111401030B (en) * 2018-12-28 2024-01-09 北京嘀嘀无限科技发展有限公司 Method and device for identifying service abnormality, server and readable storage medium
CN109685156B (en) * 2018-12-30 2021-11-05 杭州灿八科技有限公司 Method for acquiring classifier for recognizing emotion
CN110008464A (en) * 2019-01-02 2019-07-12 阿里巴巴集团控股有限公司 Construction method, device, server and the readable storage medium storing program for executing of business dictionary
CN109871444A (en) * 2019-01-16 2019-06-11 北京邮电大学 A text classification method and system
CN109871889B (en) * 2019-01-31 2019-12-24 内蒙古工业大学 Methods of Public Psychological Evaluation in Emergency Events
CN109783632B (en) * 2019-02-15 2023-07-18 腾讯科技(深圳)有限公司 Customer service information pushing method and device, computer equipment and storage medium
CN110059187B (en) * 2019-04-10 2022-06-07 华侨大学 A Deep Learning Text Classification Method Integrating Shallow Semantic Prediction Modalities
CN110232914A (en) * 2019-05-20 2019-09-13 平安普惠企业管理有限公司 A kind of method for recognizing semantics, device and relevant device
CN110347821B (en) * 2019-05-29 2023-08-25 华东理工大学 Text category labeling method, electronic equipment and readable storage medium
CN110210028B (en) * 2019-05-30 2023-04-28 杭州远传新业科技股份有限公司 Method, device, equipment and medium for extracting domain feature words aiming at voice translation text
CN110188204B (en) * 2019-06-11 2022-10-04 腾讯科技(深圳)有限公司 Extended corpus mining method and device, server and storage medium
CN110674289A (en) * 2019-07-04 2020-01-10 南瑞集团有限公司 Method, device and storage medium for judging the category to which an article belongs based on word segmentation weight
CN110457475B (en) * 2019-07-25 2023-06-30 创新先进技术有限公司 Method and system for text classification system construction and annotation corpus expansion
CN110489556A (en) * 2019-08-22 2019-11-22 重庆锐云科技有限公司 Quality evaluating method, device, server and storage medium about follow-up record
CN112445907B (en) * 2019-09-02 2024-10-15 顺丰科技有限公司 Text emotion classification method, device, equipment and storage medium
CN110704581B (en) * 2019-09-11 2024-03-08 创新先进技术有限公司 Text emotion analysis method and device executed by computer
CN110597958B (en) * 2019-09-12 2022-03-25 思必驰科技股份有限公司 Text classification model training and using method and device
CN112579768A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Emotion classification model training method, text emotion classification method and text emotion classification device
CN110837732B (en) * 2019-10-31 2024-01-26 北京奇艺世纪科技有限公司 Method and device for identifying intimacy between target persons, electronic equipment and storage medium
CN110879934B (en) * 2019-10-31 2023-05-23 杭州电子科技大学 Text prediction method based on Wide & Deep learning model
CN111104510B (en) * 2019-11-15 2023-05-09 南京中新赛克科技有限责任公司 Text classification training sample expansion method based on word embedding
CN111125323B (en) * 2019-11-21 2024-01-19 腾讯科技(深圳)有限公司 Chat corpus labeling method and device, electronic equipment and storage medium
CN110990567A (en) * 2019-11-25 2020-04-10 国家电网有限公司 A power audit text classification method with enhanced domain features
CN112861533A (en) * 2019-11-26 2021-05-28 阿里巴巴集团控股有限公司 Entity word recognition method and device
CN111046177A (en) * 2019-11-26 2020-04-21 方正璞华软件(武汉)股份有限公司 Automatic arbitration case prejudging method and device
CN110968702B (en) * 2019-11-29 2023-05-09 北京明略软件系统有限公司 Method and device for extracting rational relation
CN110991612A (en) * 2019-11-29 2020-04-10 交通银行股份有限公司 Message analysis method of international routine real-time reasoning model based on word vector
CN111078546B (en) * 2019-12-05 2023-06-16 北京云聚智慧科技有限公司 Page feature expression method and electronic equipment
CN111078883A (en) * 2019-12-13 2020-04-28 北京明略软件系统有限公司 Risk index analysis method and device, electronic equipment and storage medium
CN111177403B (en) * 2019-12-16 2023-06-23 恩亿科(北京)数据科技有限公司 Sample data processing method and device
CN111191119B (en) * 2019-12-16 2023-12-12 绍兴市上虞区理工高等研究院 Neural network-based scientific and technological achievement self-learning method and device
CN112989180A (en) * 2019-12-16 2021-06-18 中国电信股份有限公司 Push message generation method and device and computer readable storage medium
CN112989032A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Entity relationship classification method, apparatus, medium and electronic device
CN111177378B (en) * 2019-12-20 2023-09-26 北京淇瑀信息科技有限公司 Text mining method and device and electronic equipment
CN111309855A (en) * 2019-12-24 2020-06-19 中国银行股份有限公司 Method and system for processing text information
WO2021127987A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Polyphonic character prediction method and disambiguation method, apparatuses, device and computer readable storage medium
CN111144097B (en) * 2019-12-25 2023-08-18 华中科技大学鄂州工业技术研究院 Modeling method and device for emotion tendency classification model of dialogue text
CN113052191A (en) * 2019-12-26 2021-06-29 航天信息股份有限公司 Training method, device, equipment and medium of neural language network model
CN111125317A (en) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 Model training, classification, system, device and medium for conversational text classification
CN111221950B (en) * 2019-12-30 2024-09-20 航天信息股份有限公司 Method and device for analyzing weak emotion of user
CN111159589B (en) * 2019-12-30 2023-10-20 中国银联股份有限公司 Classification dictionary establishment method, merchant data classification method, device and equipment
CN114556328B (en) * 2019-12-31 2024-07-16 深圳市欢太科技有限公司 Data processing method, device, electronic equipment and storage medium
CN111143569B (en) * 2019-12-31 2023-05-02 腾讯科技(深圳)有限公司 Data processing method, device and computer readable storage medium
CN111198948B (en) * 2020-01-08 2024-06-14 深圳前海微众银行股份有限公司 Text classification correction method, apparatus, device and computer readable storage medium
CN111259148B (en) * 2020-01-19 2024-03-26 北京小米松果电子有限公司 Information processing method, device and storage medium
CN113221549A (en) * 2020-01-21 2021-08-06 中国电信股份有限公司 Word type labeling method and device and storage medium
CN111309859B (en) * 2020-01-21 2023-07-07 上饶市中科院云计算中心大数据研究院 A sentiment analysis method and device for network word-of-mouth in scenic spots
CN111310464B (en) * 2020-02-17 2024-02-02 北京明略软件系统有限公司 Word vector acquisition model generation method and device and word vector acquisition method and device
CN111325562B (en) * 2020-02-17 2023-08-01 武汉轻工大学 Grain safety traceability system and method
CN111339268B (en) * 2020-02-19 2023-08-15 北京百度网讯科技有限公司 Entity word recognition method and device
CN111367962B (en) * 2020-02-28 2024-01-30 北京金堤科技有限公司 Database updating method and device, computer readable storage medium and electronic equipment
CN111523308B (en) * 2020-03-18 2024-01-26 大箴(杭州)科技有限公司 Chinese word segmentation method and device and computer equipment
CN111325033B (en) * 2020-03-20 2023-07-11 中国建设银行股份有限公司 Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN113449097B (en) * 2020-03-24 2024-07-12 百度在线网络技术(北京)有限公司 Method and device for generating countermeasure sample, electronic equipment and storage medium
CN111309920B (en) * 2020-03-26 2023-03-24 清华大学深圳国际研究生院 Text classification method, terminal equipment and computer readable storage medium
CN111444326B (en) * 2020-03-30 2023-10-20 腾讯科技(深圳)有限公司 Text data processing method, device, equipment and storage medium
CN111680225B (en) * 2020-04-26 2023-08-18 国家计算机网络与信息安全管理中心 WeChat financial message analysis method and system based on machine learning
CN113111175A (en) * 2020-04-28 2021-07-13 北京明亿科技有限公司 Extreme behavior identification method, device, equipment and medium based on deep learning model
CN111652281B (en) * 2020-04-30 2023-08-18 中国平安财产保险股份有限公司 Information data classification method, device and readable storage medium
CN111680155A (en) * 2020-05-13 2020-09-18 新华网股份有限公司 Text classification method, device, electronic device and computer storage medium
CN111737993B (en) * 2020-05-26 2024-04-02 浙江华云电力工程设计咨询有限公司 Method for extracting equipment health state from fault defect text of power distribution network equipment
CN111709233B (en) * 2020-05-27 2023-04-18 西安交通大学 Intelligent diagnosis guiding method and system based on multi-attention convolutional neural network
CN111368555B (en) * 2020-05-27 2020-08-28 腾讯科技(深圳)有限公司 Data identification method and device, storage medium and electronic equipment
CN111601314B (en) * 2020-05-27 2023-04-28 北京亚鸿世纪科技发展有限公司 Method and device for double judging bad short message by pre-training model and short message address
CN111680803B (en) * 2020-06-02 2023-09-01 中国电力科学研究院有限公司 A transport inspection work ticket generation system
CN111680804B (en) * 2020-06-02 2023-09-01 中国电力科学研究院有限公司 A method, device, and computer-readable medium for generating a work ticket for transportation inspection
CN111832292B (en) * 2020-06-03 2024-02-02 北京百度网讯科技有限公司 Text recognition processing method, device, electronic equipment and storage medium
CN113761180A (en) * 2020-06-04 2021-12-07 海信集团有限公司 A text classification method, device, equipment and medium
CN111782803B (en) * 2020-06-05 2024-06-18 京东科技控股股份有限公司 Work order processing method and device, electronic equipment and storage medium
CN113761882B (en) * 2020-06-08 2024-09-20 北京沃东天骏信息技术有限公司 Dictionary construction method and device
CN111737999B (en) * 2020-06-24 2024-10-11 深圳前海微众银行股份有限公司 Sequence labeling method, device, equipment and readable storage medium
CN111783451B (en) * 2020-06-30 2024-07-02 不亦乐乎有朋(北京)科技有限公司 Method and apparatus for enhancing text samples
CN111753091B (en) * 2020-06-30 2024-09-03 北京小米松果电子有限公司 Classification method, training device, training equipment and training storage medium for classification model
CN114004234B (en) * 2020-07-28 2024-11-22 深圳Tcl数字技术有限公司 A semantic recognition method, storage medium and terminal device
CN111930942B (en) * 2020-08-07 2023-08-15 腾讯云计算(长沙)有限责任公司 Text classification method, language model training method, device and equipment
CN111966944B (en) * 2020-08-17 2024-04-09 中电科大数据研究院有限公司 A model construction method for multi-level user review security audit
CN111914554B (en) * 2020-08-19 2024-08-09 网易(杭州)网络有限公司 Training method of domain new word recognition model, domain new word recognition method and device
CN112015895B (en) * 2020-08-26 2024-11-01 广东电网有限责任公司 Patent text classification method and device
CN112101015B (en) * 2020-09-08 2024-01-26 腾讯科技(深圳)有限公司 Method and device for identifying multi-label object
CN112016319B (en) * 2020-09-08 2023-12-15 平安科技(深圳)有限公司 Pre-training model acquisition and disease entity labeling method, device and storage medium
CN114281928A (en) * 2020-09-28 2022-04-05 中国移动通信集团广西有限公司 Model generation method, device and device based on text data
CN113392209B (en) * 2020-10-26 2023-09-19 腾讯科技(深圳)有限公司 Text clustering method based on artificial intelligence, related equipment and storage medium
CN112347786B (en) * 2020-10-27 2025-09-19 阳光保险集团股份有限公司 Artificial intelligence scoring training method and device
CN112287639A (en) * 2020-10-30 2021-01-29 上海中通吉网络技术有限公司 Intelligent customer service work order classification method
CN112364131B (en) * 2020-11-10 2024-05-17 中国平安人寿保险股份有限公司 Corpus processing method and related device thereof
CN112417860A (en) * 2020-12-08 2021-02-26 携程计算机技术(上海)有限公司 Training sample enhancement method, system, device and storage medium
CN112528022A (en) * 2020-12-09 2021-03-19 广州摩翼信息科技有限公司 Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN112633344B (en) * 2020-12-16 2024-10-22 中国平安财产保险股份有限公司 Quality inspection model training method, device, equipment and readable storage medium
CN112529743B (en) * 2020-12-18 2023-08-08 平安银行股份有限公司 Contract element extraction method, device, electronic equipment and medium
CN112632971B (en) * 2020-12-18 2023-08-25 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112650837B (en) * 2020-12-28 2023-12-12 上海秒针网络科技有限公司 Text quality control method and system combining classification algorithm and unsupervised algorithm
CN112765936B (en) * 2020-12-31 2024-02-23 出门问问(武汉)信息科技有限公司 Training method and device for operation based on language model
CN112784061B (en) * 2021-01-27 2024-08-09 数贸科技(北京)有限公司 Knowledge graph construction method and device, computing equipment and storage medium
CN112926631A (en) * 2021-02-01 2021-06-08 大箴(杭州)科技有限公司 Financial text classification method and device and computer equipment
CN112948573B (en) * 2021-02-05 2024-04-02 北京百度网讯科技有限公司 Text label extraction method, device, equipment and computer storage medium
CN112948583A (en) * 2021-02-26 2021-06-11 中国光大银行股份有限公司 Data classification method and device, storage medium and electronic device
CN113011183B (en) * 2021-03-23 2023-09-05 北京科东电力控制系统有限责任公司 A method and system for processing unstructured text data in the field of power regulation
CN113033198B (en) * 2021-03-25 2022-08-26 平安国际智慧城市科技股份有限公司 Similar text pushing method and device, electronic equipment and computer storage medium
CN113051401A (en) * 2021-04-06 2021-06-29 明品云(北京)数据科技有限公司 Text structured labeling method, system, device and medium
CN113095068A (en) * 2021-04-30 2021-07-09 平安国际智慧城市科技股份有限公司 Emotion analysis method, system and device based on weight dictionary and storage medium
CN113032573B (en) * 2021-04-30 2024-01-23 同方知网数字出版技术股份有限公司 Large-scale text classification method and system combining topic semantics and TF-IDF algorithm
CN113240485B (en) * 2021-05-10 2024-09-20 北京沃东天骏信息技术有限公司 Training method of text generation model, text generation method and device
CN113177109B (en) * 2021-05-27 2024-07-16 中国平安人寿保险股份有限公司 Weak labeling method, device, equipment and storage medium for text
CN113468292B (en) * 2021-06-29 2024-06-25 中国银联股份有限公司 Aspect-level emotion analysis method, device and computer-readable storage medium
CN113377965B (en) * 2021-06-30 2024-02-23 中国农业银行股份有限公司 Method and related device for sensing text keywords
CN113627530B (en) * 2021-08-11 2023-09-15 中国平安人寿保险股份有限公司 Similar problem text generation method, device, equipment and medium
CN113723114A (en) * 2021-08-31 2021-11-30 平安普惠企业管理有限公司 Semantic analysis method, device and equipment based on multi-intent recognition and storage medium
CN113642678B (en) * 2021-10-12 2022-01-07 南京山猫齐动信息技术有限公司 Method, device and storage medium for generating confrontation message sample
US12050873B2 (en) * 2021-10-28 2024-07-30 Sap Se Semantic duplicate normalization and standardization
CN114091469B (en) * 2021-11-23 2022-08-19 杭州萝卜智能技术有限公司 Network public opinion analysis method based on sample expansion
CN114090601B (en) * 2021-11-23 2023-11-03 北京百度网讯科技有限公司 Data screening method, device, equipment and storage medium
CN114117043A (en) * 2021-11-24 2022-03-01 阿里巴巴(中国)有限公司 Model training method and device and computer storage medium
CN114064904A (en) * 2021-11-28 2022-02-18 河南大学 A clustering method, system and device for medical text
CN114398482B (en) * 2021-12-06 2025-05-09 腾讯数码(天津)有限公司 Dictionary construction method, device, electronic device and storage medium
CN114417984A (en) * 2021-12-31 2022-04-29 上海流利说信息技术有限公司 Training sample enhancement method and device, device and storage medium
CN114416983A (en) * 2022-01-12 2022-04-29 赣南科技学院 News text import automatic calculation and classification system
CN114638195B (en) * 2022-01-21 2022-11-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-task learning-based ground detection method
CN114443849B (en) 2022-02-09 2023-10-27 北京百度网讯科技有限公司 Annotated sample selection method, device, electronic equipment and storage medium
CN114491048B (en) * 2022-02-16 2025-08-15 北京微播易科技股份有限公司 Data enhancement method, training method and device for text classification model
CN114580395B (en) * 2022-02-21 2025-06-27 腾讯科技(深圳)有限公司 Text segmentation method, device, computer equipment and computer readable storage medium
CN114662492A (en) * 2022-04-13 2022-06-24 广州欢聚时代信息科技有限公司 Product word processing method and device, equipment, medium and product thereof
CN114707042B (en) * 2022-04-13 2024-10-25 国家计算机网络与信息安全管理中心 Application software classification method and device, electronic equipment and readable storage medium
CN114691835A (en) * 2022-04-21 2022-07-01 广东电网有限责任公司 Audit plan data generation method, device and equipment based on text mining
CN114936282B (en) * 2022-04-28 2024-06-11 北京中科闻歌科技股份有限公司 Financial risk cue determination method, device, equipment and medium
CN115861606B (en) * 2022-05-09 2023-09-08 北京中关村科金技术有限公司 Classification method, device and storage medium for long-tail distributed documents
CN115375133A (en) * 2022-08-19 2022-11-22 广东电网有限责任公司 Power secondary equipment state evaluation method and device and storage medium
CN115374865A (en) * 2022-08-26 2022-11-22 科大讯飞股份有限公司 Training data processing method, device, equipment and readable medium
CN115563965A (en) * 2022-09-29 2023-01-03 携程旅游信息技术(上海)有限公司 Tourist entity noun matching method, system, equipment and storage medium
CN116206318A (en) * 2022-09-30 2023-06-02 携程计算机技术(上海)有限公司 Hotel evaluation feedback method, system, equipment and storage medium
CN116307792B (en) * 2022-10-12 2024-03-12 广州市阿尔法软件信息技术有限公司 Urban physical examination subject scene-oriented evaluation method and device
CN115934938A (en) * 2022-11-30 2023-04-07 中国农业银行股份有限公司 Text type determination method and device and electronic equipment
CN115952290B (en) * 2023-03-09 2023-06-02 太极计算机股份有限公司 Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning
CN116361463B (en) * 2023-03-27 2023-12-08 应急管理部国家减灾中心(应急管理部卫星减灾应用中心) Earthquake disaster information extraction method, device, equipment and medium
CN117093715B (en) * 2023-10-18 2023-12-29 湖南财信数字科技有限公司 Word stock expansion method, system, computer equipment and storage medium
CN117609432A (en) * 2023-12-21 2024-02-27 中国疾病预防控制中心慢性非传染性疾病预防控制中心 A method to realize policy intelligent retrieval through tag extraction strategy
CN119624950A (en) * 2025-02-11 2025-03-14 重庆医科大学绍兴柯桥医学检验技术研究中心 Intelligent fundus image analysis method and system
CN120493919B (en) * 2025-07-18 2025-09-23 天津汇智盈金科技有限公司 A text analysis method and system combining knowledge base and knowledge graph

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740349A (en) * 2016-01-25 2016-07-06 重庆邮电大学 Sentiment classification method capable of combining Doc2vec with convolutional neural network
CN106598940A (en) * 2016-11-01 2017-04-26 四川用联信息技术有限公司 Text similarity solution algorithm based on global optimization of keyword quality
CN107122382A (en) * 2017-02-16 2017-09-01 江苏大学 Patent classification method based on the specification
WO2017202125A1 (en) * 2016-05-25 2017-11-30 华为技术有限公司 Text classification method and apparatus
CN107491531A (en) * 2017-08-18 2017-12-19 华南师范大学 Sentiment Classification Method of Chinese Internet Reviews Based on Ensemble Learning Framework

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023967A (en) * 2010-11-11 2011-04-20 清华大学 Text emotion classifying method in stock field
US8676730B2 (en) * 2011-07-11 2014-03-18 Accenture Global Services Limited Sentiment classifiers based on feature extraction
CN104142913A (en) * 2013-05-07 2014-11-12 株式会社日立制作所 The Discrimination Method and Discrimination System of Word Polarity
CN103559174B (en) * 2013-09-30 2016-03-09 东软集团股份有限公司 Semantic emotion classification characteristic value extraction and system
CN104331506A (en) * 2014-11-20 2015-02-04 北京理工大学 Multiclass emotion analyzing method and system facing bilingual microblog text
WO2016085409A1 (en) * 2014-11-24 2016-06-02 Agency For Science, Technology And Research A method and system for sentiment classification and emotion classification
CN105022725B (en) * 2015-07-10 2018-04-20 河海大学 Text emotion trend analysis method applied to the financial Web field
RU2657173C2 (en) * 2016-07-28 2018-06-08 Общество с ограниченной ответственностью "Аби Продакшн" Sentiment analysis at the level of aspects using methods of machine learning
CN106547738B (en) * 2016-11-02 2019-05-07 北京亿美软通科技有限公司 Intelligent discrimination method for overdue financial short messages based on text mining
CN107729374A (en) * 2017-09-13 2018-02-23 厦门快商通科技股份有限公司 Sentiment dictionary expansion method and text emotion recognition method

Also Published As

Publication number Publication date
CN108804512A (en) 2018-11-13
WO2019200806A1 (en) 2019-10-24

Similar Documents

Publication Publication Date Title
CN108804512B (en) Apparatus, method and computer-readable storage medium for generating text classification model
CN106708966B (en) Spam comment detection method based on similarity calculation
US8688690B2 (en) Method for calculating semantic similarities between messages and conversations based on enhanced entity extraction
WO2020232861A1 (en) Named entity recognition method, electronic device and storage medium
CN110597994A (en) Event element identification method and device
CN109471942B (en) Chinese comment sentiment classification method and device based on evidence inference rules
CN103207914B (en) Method and system for generating preference vectors based on user feedback evaluation
CN112632226B (en) Semantic search method and device based on legal knowledge graph and electronic equipment
WO2019041521A1 (en) Apparatus and method for extracting user keyword, and computer-readable storage medium
WO2017166912A1 (en) Method and device for extracting core words from commodity short text
JP2012118977A (en) Method and system for machine-learning based optimization and customization of document similarity calculation
CN105279495A (en) Video description method based on deep learning and text summarization
CN111046656A (en) Text processing method and device, electronic equipment and readable storage medium
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN112069312A (en) Text classification method based on entity recognition and electronic device
WO2022125096A1 (en) Method and system for resume data extraction
Wei et al. Sentiment classification of Chinese Weibo based on extended sentiment dictionary and organisational structure of comments
CN105787662A (en) Mobile application software performance prediction method based on attributes
CN114637831A (en) Data query method and related equipment based on semantic analysis
CN115114916A (en) User feedback data analysis method and device and computer equipment
CN115269816A (en) Core personnel mining method and device based on information processing method and storage medium
CN111382254A (en) Electronic business card recommendation method, device, equipment and computer readable storage medium
CN113360602A (en) Method, apparatus, device and storage medium for outputting information
CN118820403A (en) Policy text denoising and related matters extraction method and system based on large model
CN112307200B (en) Emotional attribute acquisition method, device, equipment, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant