CN108804512A

CN108804512A - Generating device, method and computer-readable storage medium for a text classification model

Info

Publication number: CN108804512A
Application number: CN201810361702.0A
Authority: CN
Inventors: 王健宗; 吴天博; 黄章成; 肖京
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-04-20
Filing date: 2018-04-20
Publication date: 2018-11-13
Anticipated expiration: 2038-04-20
Also published as: CN108804512B; WO2019200806A1

Abstract

The invention discloses a generating device for a text classification model. The device includes a memory and a processor; the memory stores a model generation program that can run on the processor, and the program, when executed by the processor, implements the following steps: obtaining a word segmentation dictionary for the financial field and a text corpus of the financial field; selecting candidate new words from the text corpus and adding them to the word segmentation dictionary; obtaining a sample set and labeling the training samples in the sample set with categories; based on the word segmentation dictionary to which the candidate new words have been added, segmenting the training samples in the sample set with a preset word segmentation algorithm and extracting word vectors; and, based on the AdaBoost algorithm, inputting the word vectors and the labeled category information into multiple weak classifiers for training to obtain the text classification model. The invention also proposes a method for generating a text classification model and a computer-readable storage medium. The invention addresses the problem that the prior art cannot classify the sentiment orientation of financial-field text.

Description

Generating device, method and computer-readable storage medium for a text classification model

Technical field

The present invention relates to the field of text classification technology, and more particularly to a generating device and method for a text classification model and a computer-readable storage medium.

Background art

With the development of the internet and information technology, more and more institutions and individuals use the internet to express their views, attitudes and positions on all kinds of matters, for example through news commentary, forums and social networking sites. This massive body of information has commercial value for e-commerce, market forecasting and other areas. The financial industry in particular is the industry in which internet information grows fastest and has the greatest impact, so deeper research on sentiment orientation analysis of financial text has become an important topic.

Text sentiment orientation analysis is a part of text sentiment analysis; it makes it possible to determine whether a text expresses approval or disapproval. In the financial field, news and public opinion are important indicators of the prosperity of markets and industries and of investors' enthusiasm for trading, so analyzing the sentiment orientation of financial-field text plays a pivotal role in financial research. However, the prior art lacks a scheme for classifying the sentiment orientation of financial-field text, so such classification cannot be carried out.

Summary of the invention

The present invention provides a generating device and method for a text classification model and a computer-readable storage medium. Its main purpose is to propose a generating device for a text classification model that can be used to classify the sentiment orientation of financial-field text, so as to solve the problem that the prior art cannot classify the sentiment orientation of financial-field text.

To achieve the above object, the present invention provides a generating device for a text classification model. The device includes a memory and a processor; the memory stores a model generation program that can run on the processor, and the model generation program, when executed by the processor, implements the following steps:

obtaining a financial-field word segmentation dictionary built from collected financial-field vocabulary, and a preset financial-field text corpus;

selecting candidate new words from the text corpus according to a preset algorithm and adding them to the word segmentation dictionary;

obtaining a sample set and labeling the training samples in the sample set with categories according to a preset sentiment orientation classification scheme;

based on the word segmentation dictionary to which the candidate new words have been added, performing word segmentation on the training samples in the sample set using a preset word segmentation algorithm;

extracting word vectors from the segmentation results and, based on the AdaBoost algorithm, inputting the word vectors corresponding to the training samples together with the labeled category information into multiple preset weak classifiers for training, and combining the trained weak classifiers into a financial-field text classification model.

Optionally, the step of selecting candidate new words from the text corpus according to a preset algorithm and adding them to the word segmentation dictionary includes:

based on the word segmentation dictionary, performing word segmentation on the text corpus using the word segmentation algorithm, and obtaining a candidate word set from the segmentation results;

calculating the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and adding the first candidate new words to the word segmentation dictionary;

based on the word segmentation dictionary to which the first candidate new words have been added, segmenting the text corpus using the word segmentation algorithm, and training a word vector model on the segmented text corpus;

using the trained word vector model, calculating the semantic similarity between the words in the segmentation results and the first candidate new words;

taking the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and adding the second candidate new words to the word segmentation dictionary.

Optionally, the processor can further execute the model generation program so that, after the step of taking the words whose semantic similarity is greater than the second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary, the following step is also implemented:

calculating the word frequency of the second candidate new words in the text corpus, and using the calculated word frequency as the weight of the second candidate new words in the word segmentation dictionary.

Optionally, the step of obtaining a sample set and labeling the training samples in the sample set with categories according to a preset sentiment orientation classification scheme includes:

obtaining the sample set, obtaining the multiple annotations that multiple annotators have given to each training sample in the sample set according to the preset sentiment orientation classification scheme, and selecting from these annotations the one that occurs most often as the labeling result of the corresponding training sample.

Optionally, the weak classifiers include a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm.

In addition, to achieve the above object, the present invention also provides a method for generating a text classification model, the method including:

Optionally, after the step of taking the words whose semantic similarity is greater than the second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary, the method further includes the step:

In addition, to achieve the above object, the present invention also provides a computer-readable storage medium. A model generation program is stored on the computer-readable storage medium, and the model generation program can be executed by one or more processors to implement the steps of the method for generating a text classification model described above.

With the generating device, method and computer-readable storage medium for a text classification model proposed by the present invention, a financial-field word segmentation dictionary is built from collected financial-field vocabulary, a preset financial-field text corpus is obtained, a candidate word set is obtained from the text corpus, and candidate new words are selected from the candidate word set and added to the word segmentation dictionary. A sample set is obtained, and the training samples in the sample set are labeled with categories according to a preset sentiment orientation classification scheme. Based on the word segmentation dictionary to which the candidate new words have been added, the training samples in the sample set are segmented with a preset word segmentation algorithm, and word vectors are extracted from the segmentation results. Based on the AdaBoost algorithm, the word vectors corresponding to the training samples and the labeled category information are input into multiple preset weak classifiers for training, and the trained weak classifiers are combined into a financial-field text classification model. In the scheme of the present invention, candidate new words are mined from the financial-field text corpus and added to the word segmentation dictionary, the samples in the sample set are segmented with the updated dictionary, the sample data in the sample set are labeled with categories according to the preset sentiment orientation classification scheme, and a text classification model is finally obtained by training. The model can be applied to sentiment orientation classification problems in the financial field.

Brief description of the drawings

Fig. 1 is a schematic diagram of a preferred embodiment of the generating device for a text classification model of the present invention;

Fig. 2 is a schematic diagram of the program modules of the model generation program in an embodiment of the generating device for a text classification model of the present invention;

Fig. 3 is a flow chart of a preferred embodiment of the method for generating a text classification model of the present invention.

The realization of the object, the functional features and the advantages of the present invention will be further described in connection with the embodiments and with reference to the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.

The present invention provides a generating device for a text classification model. Fig. 1 is a schematic diagram of a preferred embodiment of the generating device for a text classification model of the present invention.

In this embodiment, the generating device for a text classification model may be a PC (Personal Computer) or a terminal device such as a smartphone, a tablet computer or a portable computer. The generating device 1 for a text classification model includes at least a memory 11, a processor 12, a communication bus 13 and a network interface 14.

The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk or an optical disc. In some embodiments the memory 11 may be an internal storage unit of the generating device 1 for a text classification model, for example the hard disk of the generating device 1. In other embodiments the memory 11 may be an external storage device of the generating device 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the generating device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the generating device 1. The memory 11 can be used not only to store the application software installed on the generating device 1 and various kinds of data, such as the code of the model generation program 01, but also to temporarily store data that has been output or is about to be output.

In some embodiments the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the model generation program 01.

The communication bus 13 provides the connection and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is typically used to establish a communication connection between the device 1 and other electronic equipment.

Fig. 1 shows only the generating device 1 for a text classification model with the components 11-14 and the model generation program 01. It should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.

Optionally, the device 1 may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device or the like. The display may also be called a display screen or display unit where appropriate, and is used to show the information processed in the generating device 1 for a text classification model and to present a visual user interface.

In the embodiment of the device 1 shown in Fig. 1, the model generation program 01 is stored in the memory 11, and the processor 12 implements the following steps when executing the model generation program 01 stored in the memory 11:

A1. Obtain a financial-field word segmentation dictionary built from collected financial-field vocabulary, and a preset financial-field text corpus.

First, a general-domain word segmentation dictionary is obtained, and the collected financial-field vocabulary is added to it to form the financial-field word segmentation dictionary. The financial-field vocabulary comes mainly from three sources: professional financial terms, such as "Williams index", "moving average" and "convertible bond"; financial forum terms, such as the expressions users of stock-trading forums use when commenting on stocks; and internet slang and special symbols used in the financial field, such as "junk stock".

A2. Select candidate new words from the text corpus according to a preset algorithm and add them to the word segmentation dictionary.

On the basis of the above word segmentation dictionary, candidate new words are selected from a new text corpus and added to the current word segmentation dictionary. Specifically, step A2 includes:

A21. Based on the word segmentation dictionary, perform word segmentation on the text corpus using the word segmentation algorithm, and obtain a candidate word set from the segmentation results;

A22. Calculate the information gain of each candidate word in the candidate word set, select the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and add the first candidate new words to the word segmentation dictionary;

A23. Based on the word segmentation dictionary to which the first candidate new words have been added, segment the text corpus using the word segmentation algorithm, and train a word vector model on the segmented text corpus;

A24. Using the trained word vector model, calculate the semantic similarity between the words in the segmentation results and the first candidate new words;

A25. Take the words whose semantic similarity is greater than a second preset threshold as second candidate new words, and add the second candidate new words to the word segmentation dictionary.

The text corpus used to expand the word segmentation dictionary is obtained as follows. A web crawler collects from financial websites a large amount of financial article text related to the financial topic to be analyzed, which forms the text corpus. The crawled data is preprocessed to remove useless content such as garbled characters and web escape sequences, and the remaining text data is kept as the text corpus. Next, the sentiment orientation of the large volume of text data in the corpus is classified by manual annotation, that is, category label information is added to the text data.
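The preprocessing described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `clean_crawled_text` and the specific cleanup rules (decoding HTML escapes, dropping leftover tags and Unicode replacement characters) are assumptions, since the source does not specify how garbled characters and escape sequences are removed.

```python
import html
import re


def clean_crawled_text(raw: str) -> str:
    """Strip markup residue from one crawled page (hypothetical helper)."""
    text = html.unescape(raw)             # decode escapes such as &amp; and &nbsp;
    text = re.sub(r"<[^>]+>", " ", text)  # drop leftover HTML tags
    text = text.replace("\ufffd", " ")    # drop Unicode replacement characters
    text = re.sub(r"\s+", " ", text)      # collapse whitespace, including no-break spaces
    return text.strip()


if __name__ == "__main__":
    print(clean_crawled_text("<p>Stock&nbsp;A &amp; B</p>\ufffd"))
```

A production crawler would probably also remove navigation text and advertisements, which this sketch does not attempt.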

With the current word segmentation dictionary as the dictionary of the preset word segmentation algorithm, word segmentation is performed on the text corpus. The stop words in the segmentation results are then filtered out according to a preset stop-word list so as to remove irrelevant vocabulary, and the remaining segmentation results form the candidate word set. The category label of each segmentation result is the same as the category label of the text data it came from.
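As an illustration of dictionary-based segmentation followed by stop-word filtering, the sketch below uses forward maximum matching. The source does not name its word segmentation algorithm, so forward maximum matching is a stand-in chosen for brevity, and the function names, the `max_len` parameter and the sample strings are all hypothetical.

```python
def segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def candidate_words(text, dictionary, stop_words):
    """Segment the text and drop stop words and whitespace tokens."""
    return [t for t in segment(text, dictionary)
            if t not in stop_words and not t.isspace()]


if __name__ == "__main__":
    dictionary = {"可转债", "垃圾股"}   # "convertible bond", "junk stock"
    stop_words = {"这", "是"}           # "this", "is"
    print(candidate_words("这是垃圾股", dictionary, stop_words))
```

Because the segmenter can only emit words already in the dictionary (or single characters), enlarging the dictionary directly changes the segmentation, which is why the scheme expands the dictionary before segmenting the training samples.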

Next, candidate new words are selected from the candidate word set according to information gain. Information gain is an entropy-based evaluation method. When used for feature selection, it measures how much information the presence of a word provides for deciding whether a text belongs to a given category; it is defined as the difference in information content before and after a feature appears in the documents, and is calculated as:

$$
IG(t_i) = -\sum_{j=1}^{|C|} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{|C|} P(C_j \mid t_i)\log P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{|C|} P(C_j \mid \bar{t}_i)\log P(C_j \mid \bar{t}_i)
$$

In the above formula, $P(C_j)$ is the probability that category $C_j$ occurs in the data set, $P(t_i)$ is the probability that feature item $t_i$ occurs in the data set, $P(C_j \mid t_i)$ is the probability that a document belongs to category $C_j$ given that feature item $t_i$ appears in it, $P(\bar{t}_i)$ is the probability that feature item $t_i$ does not occur, $P(C_j \mid \bar{t}_i)$ is the probability that a document belongs to category $C_j$ given that feature item $t_i$ does not appear in it, and $|C|$ is the total number of categories. Here a category is a sentiment orientation category and a feature item is a candidate word. The above probabilities can be calculated from statistics of the candidate words in the text corpus.

The usefulness of a candidate word is judged from the calculated information gain: the larger the information gain, the more useful the word is for classification. The candidate words in the candidate word set whose information gain is greater than the first preset threshold are added to the current word segmentation dictionary as first candidate new words, which expands the word segmentation dictionary.
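The information-gain selection can be sketched as follows, under the assumption that each labeled document is represented as a set of tokens plus a sentiment label. The probabilities are estimated from document counts, and the logarithm is taken to base 2; the source specifies neither the estimation method nor the base, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(term, docs):
    """docs is a list of (set_of_tokens, label) pairs.

    Computes H(C) - P(t) H(C | t present) - P(not t) H(C | t absent),
    which is the entropy form of the usual information gain for a term.
    """
    all_labels = [label for _, label in docs]
    present = [label for tokens, label in docs if term in tokens]
    absent = [label for tokens, label in docs if term not in tokens]
    p_t = len(present) / len(docs)
    return entropy(all_labels) - p_t * entropy(present) - (1 - p_t) * entropy(absent)


def select_first_candidates(candidates, docs, threshold):
    """Keep the candidate words whose information gain exceeds the threshold."""
    return {w for w in candidates if information_gain(w, docs) > threshold}


if __name__ == "__main__":
    docs = [({"surge", "the"}, "positive"), ({"surge", "the"}, "positive"),
            ({"plunge", "the"}, "negative"), ({"plunge", "the"}, "negative")]
    print(select_first_candidates({"surge", "plunge", "the"}, docs, threshold=0.5))
```

A word that appears in every document, such as "the" in the sample data, leaves the category distribution unchanged and therefore has zero information gain, so it is not selected.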

Based on the word segmentation dictionary expanded as above, the same word segmentation algorithm is applied to the same text corpus to obtain segmentation results. A word vector model is trained on the segmented text corpus, and the trained model is used to compute the word vector of each segmented word. From the word vectors, the semantic similarity between each segmented word and the first candidate new words is calculated. If the semantic similarity is greater than the second preset threshold, the word is taken as a second candidate new word. The second candidate new words selected from the segmentation results are added to the word segmentation dictionary, which expands the dictionary once more.
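The similarity-based selection can be sketched as follows. Training the word vector model is not shown; the sketch assumes the vectors are already available as a plain mapping from word to list of floats. Cosine similarity is used as the semantic similarity measure, which is a common choice but is not stated in the source, and a word is selected if it is close to at least one first candidate new word, which is likewise an assumption. All names and numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_second_candidates(vectors, first_candidates, threshold):
    """Return words whose similarity to any first candidate exceeds the threshold."""
    anchors = [vectors[c] for c in first_candidates if c in vectors]
    selected = set()
    for word, vec in vectors.items():
        if word in first_candidates:
            continue  # already in the dictionary
        if any(cosine(vec, anchor) > threshold for anchor in anchors):
            selected.add(word)
    return selected


if __name__ == "__main__":
    vectors = {"bull": [1.0, 0.0], "rally": [0.9, 0.1], "the": [0.0, 1.0]}
    print(select_second_candidates(vectors, {"bull"}, threshold=0.8))
```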

It can be understood that, after the text corpus has been segmented with the word segmentation algorithm, the stop words in the segmentation results can be deleted using the stop-word list. Stop words are noisy and carry no meaning for text classification, so deleting them improves classification accuracy and also reduces the amount of computation needed to select candidate new words.

Through the above steps, the word segmentation dictionary is in effect expanded three times: the first, preliminary expansion adds financial-field vocabulary collected manually; the second selects new words by calculating information gain; and the third selects new words by calculating semantic similarity from word vectors. The second and third expansions each build on the dictionary produced by the previous expansion. Through this dynamic expansion, as many new financial-field words as possible are screened from the corpus. The expanded dictionary is used by the word segmentation algorithm to segment the training samples of the classification model; the richer the financial-field vocabulary in the dictionary, the more accurate the segmentation of financial-field text, and the higher the classification accuracy of the trained model.

Optionally, in one implementation, after the second candidate new words have been selected and added to the word segmentation dictionary, the word frequency of each second candidate new word in the text corpus is calculated and used as the weight of that word in the word segmentation dictionary. The word frequency of the first candidate new words can be calculated in the same way and used as their weight in the dictionary.
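A minimal sketch of this weighting step, assuming the dictionary is a plain mapping from word to weight and that "word frequency" means the raw number of occurrences in the segmented corpus. The source does not say whether raw or relative frequency is meant, and the function name is hypothetical.

```python
from collections import Counter


def add_with_frequency_weights(dictionary, new_words, segmented_corpus):
    """Add each new word to the dictionary with its corpus count as weight.

    segmented_corpus is a list of token lists, one per document.
    """
    counts = Counter(token for doc in segmented_corpus for token in doc)
    for word in new_words:
        dictionary[word] = counts[word]  # Counter returns 0 for unseen words
    return dictionary


if __name__ == "__main__":
    dictionary = {"stock": 10}
    corpus = [["junk", "stock"], ["junk", "bond"]]
    print(add_with_frequency_weights(dictionary, {"junk"}, corpus))
```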

A3. Obtain a sample set, and label the training samples in the sample set with categories according to a preset sentiment orientation classification scheme.

The sample set used to train the text classification model is obtained. For each training sample in the sample set, the annotations that multiple annotators have given to the sample according to the preset sentiment orientation classification scheme are obtained, and the annotation that occurs most often among them is selected as the labeling result of that training sample. The user can set the sentiment orientation classification scheme according to the financial matter to be analyzed, for example: dividing text in stock forums into hold, sell and buy; dividing stock discussion text on microblogs or forums into positive, negative and neutral; or dividing financial news text into favorable, unfavorable and neutral.
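The majority-vote labeling can be sketched as follows. The function name is hypothetical, and the tie-breaking behavior (the label seen first among equally frequent labels wins) is a property of this sketch rather than something the source specifies.

```python
from collections import Counter


def majority_label(annotations):
    """Return the annotation that occurs most often for one training sample.

    Ties are resolved in favor of the label that appears first in the list.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    return Counter(annotations).most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_label(["buy", "hold", "buy"]))
```

Using an odd number of annotators reduces, but with three categories does not eliminate, the chance of a tie.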

A4. Based on the word segmentation dictionary to which the candidate new words have been added, perform word segmentation on the training samples in the sample set using a preset word segmentation algorithm.

A5. Extract word vectors from the segmentation results and, based on the AdaBoost algorithm, input the word vectors corresponding to the training samples together with the labeled category information into multiple preset weak classifiers for training, and combine the trained weak classifiers into a financial-field text classification model.

After the training samples have been labeled, they are segmented with the preset word segmentation algorithm on the basis of the repeatedly expanded word segmentation dictionary, and the word vectors of the segmentation results are extracted with the trained word vector model. It should be noted that the same word segmentation algorithm is used throughout the scheme of this embodiment.

In this embodiment, to improve the accuracy of the text classification model, word vectors are extracted with both the word2vec model and the GloVe (Global Vectors for Word Representation) model, so that two kinds of word vectors are obtained for each segmentation result. In addition, a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm and a classifier based on a long short-term memory network algorithm are used as weak classifiers. Each weak classifier takes each of the two kinds of word vectors as input in turn, so six weak classification models are built in practice. Based on the AdaBoost algorithm, each weak classifier is trained on the samples in the sample set. During training, if a sample has been classified correctly, its weight is reduced when the next sample set is constructed; if a sample has not been classified correctly, its weight is increased when the next sample set is constructed. The sample set with updated weights is used to train the next classifier, and the whole training process proceeds iteratively in this way. In addition, after each weak classifier has been trained, the weight of a weak classifier with a small classification error rate is increased so that it plays a larger role in the final classification function, and the weight of a weak classifier with a large classification error rate is reduced so that it plays a smaller role in the final classification function. Each weak classifier is trained iteratively according to this procedure. The trained weak classifiers are then fused into the final text classification model. The text classification model can be used to classify the sentiment orientation of financial-field text, for example to judge whether stock discussion text in a forum is negative, positive or neutral.
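The weight bookkeeping of one boosting round can be sketched as follows. The sketch uses the multi-class SAMME update, which is an assumption: the source says only that AdaBoost is used. The weak classifier itself (CNN, RNN or LSTM in the source) is not shown and is represented only by its predictions; all names are hypothetical.

```python
import math


def adaboost_round(weights, predictions, labels, n_classes):
    """One boosting round with the SAMME multi-class update.

    Returns (classifier_weight, new_sample_weights). Misclassified samples
    are scaled up; after normalization the correctly classified ones shrink.
    """
    missed = [p != y for p, y in zip(predictions, labels)]
    total = sum(weights)
    err = sum(w for w, m in zip(weights, missed) if m) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
    # Lower weighted error gives a larger classifier weight.
    alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
    raised = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
    norm = sum(raised)
    return alpha, [w / norm for w in raised]


if __name__ == "__main__":
    weights = [0.25, 0.25, 0.25, 0.25]
    predictions = ["positive", "negative", "neutral", "positive"]
    labels = ["positive", "negative", "neutral", "negative"]
    alpha, new_weights = adaboost_round(weights, predictions, labels, n_classes=3)
    print(alpha, new_weights)
```

In the sample data only the fourth sample is misclassified, so its weight rises and the other three fall, and the next weak classifier is pushed to concentrate on it. Note that with this update the classifier weight becomes negative once the weighted error exceeds (K-1)/K for K categories, that is, when the classifier is worse than random guessing.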

The generating device for a text classification model proposed in this embodiment mines the financial-field text corpus, screens as many new financial-field words as possible from the corpus and adds them to the word segmentation dictionary, thereby expanding the financial-field word segmentation dictionary. The training samples in the sample set are segmented with the dictionary after the financial vocabulary has been expanded, the sample data in the sample set are labeled with categories according to the preset sentiment orientation classification scheme, and a text classification model is finally obtained by training. The model can be applied to sentiment orientation classification problems in the financial field.

Optionally, in other embodiments, the model generation program can also be divided into one or more modules. The one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. A module in the sense of the present invention is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution of the model generation program in the generating device for a text classification model.

Fig. 2 is a schematic diagram of the program modules of the model generation program in an embodiment of the generating device for a text classification model of the present invention. In this embodiment, the model generation program can be divided into a data acquisition module 10, a new word selection module 20, a sample labeling module 30, a sample segmentation module 40 and a model training module 50. By way of example:

The data acquisition module 10 is used to obtain a financial-field word segmentation dictionary built from collected financial-field vocabulary, and a preset financial-field text corpus;

The new word selection module 20 is used to select candidate new words from the text corpus according to a preset algorithm and add them to the word segmentation dictionary;

The sample labeling module 30 is used to obtain a sample set and label the training samples in the sample set with categories according to a preset sentiment orientation classification scheme;

The sample segmentation module 40 is used to perform word segmentation on the training samples in the sample set using a preset word segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have been added;

The model training module 50 is used to extract word vectors from the segmentation results and, based on the AdaBoost algorithm, input the word vectors corresponding to the training samples together with the labeled category information into multiple preset weak classifiers for training, and combine the trained weak classifiers into a financial-field text classification model.

The functions or operation steps implemented when the data acquisition module 10, new word selection module 20, sample labeling module 30, sample segmentation module 40, model training module 50 and other program modules are executed are substantially the same as in the embodiment above and are not repeated here.

In addition, the present invention also provides a method for generating a text classification model. Fig. 3 is a flow chart of a preferred embodiment of the method for generating a text classification model of the present invention. The method can be executed by a device, and the device can be implemented in software and/or hardware.

In this embodiment, the method for generating a text classification model includes:

Step S10: obtain a financial-field word segmentation dictionary built from collected financial-field vocabulary, and a preset financial-field text corpus.

Step S20: select candidate new words from the text corpus according to a preset algorithm and add them to the word segmentation dictionary.

On the basis of the above word segmentation dictionary, candidate new words are selected from a new text corpus and added to the current word segmentation dictionary. Specifically, step S20 includes: based on the word segmentation dictionary, performing word segmentation on the text corpus using the word segmentation algorithm, and obtaining a candidate word set from the segmentation results; calculating the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain is greater than a first preset threshold as first candidate new words, and adding the first candidate new words to the word segmentation dictionary; based on the word segmentation dictionary to which the first candidate new words have been added, segmenting the text corpus using the word segmentation algorithm, and training a word vector model on the segmented text corpus; using the trained word vector model, calculating the semantic similarity between the words in the segmentation results and the first candidate new words; and taking the words whose semantic similarity is greater than a second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary.

Step S30: obtain a sample set, and label the training samples in the sample set by category according to a preset sentiment-orientation classification scheme.

A sample set for training the text classification model is obtained. For each piece of training data in the sample set, multiple annotations made by multiple annotators according to the preset sentiment-orientation classification scheme are obtained, and the annotation that occurs most often among them is selected as the annotation result for that training data. The user may set the sentiment-orientation classification scheme according to the financial matter to be analyzed, for example: dividing stock forum texts into hold, sell and buy; dividing microblog stock discussion texts into positive, negative and neutral; or dividing financial news texts into positive, negative and neutral.
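The majority-vote labeling described above can be sketched as follows. The function name `majority_label` is illustrative, and the tie-breaking behavior (the label seen first wins) is an assumption, since this specification does not say how ties between annotators are resolved.

```python
from collections import Counter

def majority_label(annotations):
    """Return the label that occurs most often among the annotators' labels."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    # most_common(1) returns the label with the highest count;
    # on a tie, the label encountered first is returned (assumed policy).
    return Counter(annotations).most_common(1)[0][0]
```

For example, three annotators labeling a stock forum post as `"buy"`, `"hold"` and `"buy"` would yield `"buy"` as the annotation result.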

Step S40: based on the word segmentation dictionary to which the candidate new words have been added, perform word segmentation on the training samples in the sample set using a preset segmentation algorithm.

Step S50: extract word vectors from the segmentation result and, based on the AdaBoost algorithm, input the word vectors of the training samples and the labeled category information into multiple preset weak classifiers for training, then combine the trained weak classifiers into the text classification model of the financial field.

In this embodiment, to improve the accuracy of the text classification model, word vectors are extracted with both a word2vec model and a GloVe model, so each segmentation result yields two kinds of word vectors. In addition, a classifier based on a convolutional neural network, a classifier based on a recurrent neural network, and a classifier based on a long short-term memory network are used as the weak classifiers. Each weak classifier takes each of the two kinds of word vectors as input in turn, so six weak classification models can in effect be built. Based on the AdaBoost algorithm, each weak classifier is trained on the samples in the sample set. During training, if a sample is classified correctly, its weight is reduced when the next sample set is constructed; if a sample is not classified correctly, its weight is increased in the next sample set. The sample set with updated weights is used to train the next classifier, and the training process proceeds iteratively in this way. After each weak classifier has been trained, the weight of a weak classifier with a low classification error rate is increased so that it plays a larger role in the final classification function, and the weight of a weak classifier with a high classification error rate is reduced so that it plays a smaller role. The trained weak classifiers are then fused into the final text classification model. This model can be used to classify the sentiment orientation of financial-field texts, for example to judge whether a stock discussion text in a forum is negative, positive or neutral.
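The sample reweighting and classifier weighting described above can be illustrated with a minimal sketch of classic binary AdaBoost. Two simplifications are assumed here and are not taken from this specification: the labels are restricted to +1 and -1, whereas the embodiment uses several sentiment classes, and simple one-feature threshold classifiers (decision stumps) stand in for the CNN, RNN and LSTM weak classifiers. The names `train_stump`, `adaboost` and `predict` are illustrative.

```python
import math

def train_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for s in (1, -1):
                pred = [s if x[j] >= t else -s for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, rounds):
    """Binary AdaBoost; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                            # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # lower error rate -> larger vote
        ensemble.append((alpha, j, t, s))
        # Correctly classified samples lose weight, misclassified samples gain weight.
        w = [wi * math.exp(-alpha * yi * (s if x[j] >= t else -s))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]             # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak classifiers."""
    score = sum(a * (s if x[j] >= t else -s) for a, j, t, s in ensemble)
    return 1 if score >= 0 else -1
```

The coefficient `alpha` plays the role of the weak classifier weight in the final classification function. A multi-class variant of the algorithm would be needed for the hold/sell/buy or positive/negative/neutral schemes mentioned earlier.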

The method for generating a text classification model proposed in this embodiment mines the text corpus of the financial field to screen out as many new financial-field words as possible from the corpus and adds them to the word segmentation dictionary, thereby expanding the financial-field word segmentation dictionary. The training samples in the sample set are segmented using the dictionary expanded with financial vocabulary, and the sample data in the sample set are labeled by category according to the preset sentiment-orientation classification scheme. The text classification model finally obtained by training can be applied to sentiment-orientation classification problems in the financial field.

In addition, an embodiment of the present invention further provides a computer-readable storage medium on which a model generation program is stored. The model generation program can be executed by one or more processors to implement the following operations:

The specific implementations of the computer-readable storage medium of the present invention are substantially the same as the embodiments of the above device and method for generating a text classification model, and are not repeated here.

It should be noted that the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments. The terms "include", "comprise" and any other variants thereof herein are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, device, article or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, device, article or method that includes the element.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A device for generating a text classification model, characterized in that the device comprises a memory and a processor, the memory stores a model generation program executable on the processor, and the model generation program, when executed by the processor, implements the following steps:

obtaining a word segmentation dictionary for the financial field, built from collected financial-field vocabulary, and a preset text corpus of the financial field;

obtaining a sample set, and labeling the training samples in the sample set by category according to a preset sentiment-orientation classification scheme;

based on the word segmentation dictionary to which candidate new words have been added, performing word segmentation on the training samples in the sample set using a preset segmentation algorithm;

extracting word vectors from the segmentation result and, based on the AdaBoost algorithm, inputting the word vectors of the training samples and the labeled category information into multiple preset weak classifiers for training, and combining the trained weak classifiers into the text classification model of the financial field.

2. The device for generating a text classification model according to claim 1, characterized in that the step of selecting candidate new words from the text corpus according to a preset algorithm and adding them to the word segmentation dictionary comprises:

based on the word segmentation dictionary, performing word segmentation on the text corpus using the segmentation algorithm, and obtaining a candidate word set from the segmentation result;

calculating the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain exceeds a first preset threshold as first candidate new words, and adding the first candidate new words to the word segmentation dictionary;

based on the word segmentation dictionary to which the first candidate new words have been added, segmenting the text corpus using the segmentation algorithm, and training a word vector model on the segmented text corpus;

using the trained word vector model to calculate the semantic similarity between the words in the segmentation result and the first candidate new words;

taking the words whose semantic similarity exceeds a second preset threshold as second candidate new words, and adding the second candidate new words to the word segmentation dictionary.

3. The device for generating a text classification model according to claim 2, characterized in that the processor is further configured to execute the model generation program so as to implement, after the step of taking the words whose semantic similarity exceeds the second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary, the following step:

calculating the word frequency of the second candidate new words in the text corpus, and using the calculated word frequency as the weight of the second candidate new words in the word segmentation dictionary.

4. The device for generating a text classification model according to any one of claims 1 to 3, characterized in that the step of obtaining a sample set and labeling the training samples in the sample set by category according to a preset sentiment-orientation classification scheme comprises:

obtaining the sample set, obtaining multiple annotations made by multiple annotators on the training samples in the sample set according to the preset sentiment-orientation classification scheme, and selecting, from the multiple annotations, the annotation that occurs most often as the annotation result of the corresponding training sample.

5. The device for generating a text classification model according to any one of claims 1 to 3, characterized in that the weak classifiers comprise a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm.

6. A method for generating a text classification model, characterized in that the method comprises:

7. The method for generating a text classification model according to claim 6, characterized in that the step of selecting candidate new words from the text corpus according to a preset algorithm and adding them to the word segmentation dictionary comprises:

8. The method for generating a text classification model according to claim 7, characterized in that, after the step of taking the words whose semantic similarity exceeds the second preset threshold as second candidate new words and adding the second candidate new words to the word segmentation dictionary, the method further comprises the step of:

9. The method for generating a text classification model according to any one of claims 6 to 8, characterized in that the step of obtaining a sample set and labeling the training samples in the sample set by category according to a preset sentiment-orientation classification scheme comprises:

10. A computer-readable storage medium, characterized in that a model generation program is stored on the computer-readable storage medium, and the model generation program can be executed by one or more processors to implement the steps of the method for generating a text classification model according to any one of claims 6 to 9.