CN114064888A - Financial text classification method and system based on BERT-CNN - Google Patents

Financial text classification method and system based on BERT-CNN

Info

Publication number
CN114064888A
CN114064888A (application CN202111175876.6A)
Authority
CN
China
Prior art keywords
text
layer
financial
bert
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111175876.6A
Other languages
Chinese (zh)
Inventor
刘冠
贾燕
黄斐然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN202111175876.6A
Publication of CN114064888A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes


Abstract

The invention discloses a BERT-CNN-based financial text classification method and system. The method comprises the following steps: preprocessing the financial text data, where the preprocessing operations include noise removal, text processing, word segmentation, and stop-word removal; feeding the resulting input vector into a BERT layer to obtain an initial feature vector; extracting a high-level feature vector from the initial feature vector with a convolutional neural network; fusing the high-level feature vector with the initial feature vector; and obtaining the financial text category through a linear fully connected layer and a softmax classification layer. By fusing the initial features extracted by BERT with the high-level features extracted by the convolutional neural network layer and mining the information of the financial text from the fused features, the method mitigates the overfitting that arises during model training and effectively improves classification accuracy. At the same time, it avoids composing the features of all BERT layers into a matrix and applying a two-dimensional convolution directly, and thereby avoids the impact on model performance caused by the differences in feature resolution between layers.

Description

Financial text classification method and system based on BERT-CNN
Technical Field
The invention belongs to the technical field of computer natural language processing, and particularly relates to a financial text classification method and system based on BERT-CNN.
Background
In recent years, with the rapid iteration of Internet and computer technologies, a huge amount of data, mainly text, has been generated on the network, and how to classify text quickly and accurately has become a pressing research topic. Classifying and archiving text data is a precondition for storing massive amounts of text information, and is widely applied in fields such as digital libraries and information retrieval. Text classification, which refers to automatically assigning text to predefined categories according to its content, is a fundamental and important task in natural language processing. It has broad applications, such as news classification and emotion recognition, and provides a basis for more complex downstream language processing tasks such as relation extraction and question answering.
Natural language processing is a sub-field of artificial intelligence that aims to let computers understand human language. Performing financial text classification with natural language processing and deep learning avoids the singleness and subjectivity of manually extracting features in feature engineering, handing the task of feature extraction over to the financial text classification model itself. In natural language processing, financial text classification can be viewed as a sequence-to-category task, with typical instances being sentiment analysis, text mining, irony detection, and so on. Deep learning models such as long short-term memory networks, gated neural networks, bidirectional recurrent neural networks, and BERT (Bidirectional Encoder Representations from Transformers) outperform machine learning algorithms across many natural language processing tasks; BERT set new records on 11 natural language processing tasks when it was introduced.
Financial text classification is the branch of text classification specialized to the financial domain. It can help financial practitioners such as stock investors analyze the large volume of investment-related text produced every day, so that they can grasp industry dynamics and future trends and make reasonable investment decisions and portfolio allocations. It is therefore very important to automatically and accurately classify the large amount of financial text generated daily into prescribed categories so that researchers can study it, and natural language processing and deep learning methods make it possible to classify large amounts of financial text quickly and accurately. Although financial text classification is a branch of text classification, it has special domain characteristics compared with general text classification: more categories, finer classification granularity, and domain-specific knowledge. Moreover, the inventors found that directly reusing a pre-trained model on financial-domain text yields unsatisfactory results.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention provides a BERT-CNN-based financial text classification method, which fuses features extracted from different layers to obtain a more accurate classification effect.
The invention also provides a financial text classification system based on the BERT-CNN.
In order to achieve the purpose, the invention adopts the following technical scheme:
a financial text classification method based on BERT-CNN comprises the following steps:
S1 text preprocessing: preprocessing the financial text raw data to obtain financial text input data, wherein the preprocessing operations comprise noise removal, text processing, word segmentation, and stop-word removal;
S2, establishing a BERT-CNN-based financial text category classification model: generating a financial text input vector from the financial text input data, inputting it into the financial text category classification model, passing it sequentially through the embedding layer, BERT layer, and CNN layer of the model, and extracting and outputting the feature vector of each sentence in the text;
the embedding layer converts each word in the text into a corresponding vectorized representation, the BERT layer extracts the initial feature vector, and the CNN layer extracts the high-level feature vector from the initial feature vector. The financial text classification model is obtained through machine learning training on multiple groups of data, each group comprising financial text preprocessed training data and the corresponding financial text category, where the preprocessed training data is the raw financial training text after the text preprocessing of step S1;
the obtained high-level feature vector and the initial feature vector are then fused, and the classification category of the text is obtained through a linear fully connected layer and a softmax classification layer;
S3 feature fusion: performing feature fusion on the initial features and the high-level features to obtain the fused features;
S4 text category output: taking the fused features as input, passing them sequentially through the linear fully connected layer and the softmax classification layer, outputting the probability distribution over financial text categories, and selecting the category corresponding to the maximum probability value as the final financial text category for output.
As a preferred technical solution, the removing noise information specifically includes: processing noise information irrelevant to the classification task, and filtering the text noise information by adopting a regular expression;
processing the noise information of expressions in a code matching mode, specifically, directly filtering in a text code matching Unicode mode, or filtering by matching an allowable character list;
and processing the noise information of the URL and the nonsense character class by independently establishing a URL and a nonsense character library aiming at the training text.
As a preferred technical solution, the text processing specifically includes: based on the specialty of the financial text data, a financial vocabulary replacement table for text processing is established, and the financial vocabulary replacement table is used for replacing the abbreviation and the abbreviation of the financial text professional term with the full name of the noun and replacing the stock code with the full name of the company.
As a preferred technical solution, the word segmentation processing specifically comprises: selecting the word segmentation granularity according to the application scenario, segmenting the denoised text at that granularity with the nltk word segmentation tool, and converting the text data into corresponding word segmentation vectors after segmentation; the word segmentation granularity includes sentence level, phrase level, word level, and character level.
As a preferred technical solution, the stop-word removal specifically comprises: applying the stopwords list in the nltk package to the word segmentation vectors obtained from the word segmentation processing.
In the embedding layer, the word segmentation vectors of the preprocessed text are taken as input units and adjusted to meet BERT's requirements on input word segmentation vectors; the initial word embedding vector corresponding to each word segmentation vector of the text is obtained from the word embedding matrix of the pre-trained BERT; and a position vector representing the word position is added to obtain a word embedding vector that fuses the position information of each token in the text, together with a segment embedding vector set to all zeros, so that each token of the financial text preprocessed in step S1 is converted into a corresponding vectorized representation.
As a preferred technical solution, the BERT layer adopts a structure in which multiple Transformer encoders are stacked;
in a Transformer encoder, each layer comprises a multi-head attention sub-layer and a position-wise fully connected feedforward network. The multi-head attention sub-layer comes first; its keys, values, and query vectors all come from the encoder input. A residual connection is added to the multi-head attention sub-layer, and on output the sum of the multi-head attention output and its input is layer-normalized;
the position-wise fully connected feedforward network comprises a fully connected layer with a ReLU activation function and a fully connected layer without an activation function; the residual sum of the input and output of the feedforward network is layer-normalized once more to obtain the final output of the layer.
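As a rough illustration of the encoder structure described in these claims, the following NumPy sketch implements a single-head (rather than multi-head) self-attention sub-layer and a position-wise feedforward network, each wrapped in a residual connection followed by layer normalization. All shapes and weight matrices are illustrative stand-ins, not the patent's trained parameters, and the learnable projections inside attention are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X):
    # Single-head self-attention: keys, values, and queries all come from
    # the encoder input X of shape (seq_len, d); projections omitted.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)      # row-wise softmax
    return w @ X

def ffn(X, W1, W2):
    # Position-wise feedforward: a ReLU layer followed by a linear layer.
    return np.maximum(X @ W1, 0.0) @ W2

def encoder_layer(X, W1, W2):
    X = layer_norm(X + self_attention(X))  # residual + layer norm (attention)
    return layer_norm(X + ffn(X, W1, W2))  # residual + layer norm (feedforward)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))           # 5 tokens, 16-dim toy features
out = encoder_layer(X, rng.standard_normal((16, 64)), rng.standard_normal((64, 16)))
```

In the patent's model, twelve such layers (with multi-head attention and 768-dimensional features) would be stacked to form the BERT layer.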
As a preferred technical solution, in the CNN layer, the initial feature vector is taken as input, and the feature vector of the text is further extracted through the convolutional neural network layer to obtain the high-level feature vector;
the convolutional neural network layer that extracts these features carries a residual structure; apart from the residual connection, it comprises a one-dimensional convolutional layer, an activation layer, and a fully connected layer, which are combined in sequence to form one structural unit;
the residual connection is taken from the input of the one-dimensional convolution and added after the fully connected layer inside the structural unit. The number of structural units is a hyper-parameter of the convolutional neural network layer, which is built from identical structural units, and the ELU activation function is used in the final fully connected layer.
As a preferred technical solution, in step S3, the specific steps are as follows: and performing feature fusion on the output of the BERT layer and the output of the CNN layer by adopting a feature fusion formula, wherein the feature fusion formula is as follows:
Y = α1X + α2F(X)
where X is the feature tensor output by the BERT layer, F(X) is the output of X after passing through the CNN layer, and α1, α2 ∈ [-2, 2].
A BERT-CNN based financial text classification system comprising: the system comprises a text preprocessing module, a model training module and a text classification module;
the text preprocessing module is used for preprocessing financial text original data to obtain financial text input data, wherein the preprocessing operation comprises noise information removal, text processing, word segmentation processing and stop word removal;
the text classification module is used for inputting the financial text input vector into the financial text category classification model, passing it sequentially through the embedding layer, BERT layer, and convolutional neural network layer of the model to extract and output the feature vector of each sentence in the text, fusing the initial features extracted by the BERT layer with the high-level features extracted by the CNN to obtain the fused features, taking the fused features as input, passing them sequentially through the linear fully connected layer and the softmax classification layer, outputting the probability distribution over financial text categories, and selecting the category corresponding to the maximum probability value as the final financial text category of the text for output;
the model training module is used for obtaining a financial text type classification model by using multiple groups of data through machine learning training, each group of data in the multiple groups of data comprises financial text preprocessing training data and a corresponding financial text type, and the financial text preprocessing training data is data obtained by processing financial text original training data through the text preprocessing module.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The BERT-CNN-based financial text classification method obtains high-level feature vectors by processing the initial features extracted by BERT through a convolutional neural network layer, fuses the initial features extracted by BERT with the high-level features extracted by the convolutional neural network layer, and mines the information of the financial text from the fused features. This mitigates the overfitting that arises during model training and effectively improves classification accuracy. Compared with similar models that use multiple BERT layers to extract features, the method is more computationally efficient; at the same time, it avoids composing the features of all BERT layers into a matrix and applying a two-dimensional convolution directly, and thereby avoids the impact on model performance caused by the differences in feature resolution between layers.
Drawings
Fig. 1 is a schematic view of a processing flow framework of a BERT-CNN-based financial text classification method in embodiment 1 of the present invention;
fig. 2 is a flowchart illustrating the steps of the BERT-CNN-based financial text classification method according to embodiment 1 of the present invention;
fig. 3 is a schematic flow chart illustrating data preprocessing performed by the BERT-CNN-based financial text classification method in embodiment 1 of the present invention;
fig. 4 is a schematic flow chart illustrating feature fusion performed by the BERT-CNN-based financial text classification method in embodiment 1 of the present invention;
FIG. 5 is a schematic view showing the classification of the fully-connected layer and the softmax layer in example 1 of the present invention;
fig. 6 is a block diagram schematically illustrating the structure of the BERT-CNN-based financial text classification system in embodiment 2 of the present invention.
Detailed Description
In the description of the present disclosure, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing and simplifying the present disclosure, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present disclosure.
Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item appearing before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
In the description of the present disclosure, it should be noted that the terms "mounted," "connected," and "coupled" are to be construed broadly unless otherwise explicitly stated or limited. For example, the connection may be fixed, detachable, or integral; it may be mechanical or electrical; the elements may be connected directly or indirectly through an intermediate medium, or there may be internal communication between two elements. The specific meaning of the above terms in the present disclosure can be understood in specific instances by those of ordinary skill in the art. In addition, technical features involved in the different embodiments of the present disclosure described below may be combined with each other as long as they do not conflict with each other.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
Example 1
As shown in fig. 1, the present embodiment provides a method for classifying financial texts based on BERT-CNN, which includes the following steps:
preprocessing financial text data, wherein the preprocessing operation comprises noise information removal, text processing, word segmentation processing and stop word removal;
inputting the obtained input vector into a BERT layer to obtain an initial feature vector;
extracting high-level feature vectors from the initial feature vectors by using a convolutional neural network;
performing feature fusion on the obtained high-level feature vector and the initial feature vector;
the financial text category is obtained through the linear full connection layer and the softmax classification layer.
With reference to fig. 1 and fig. 2, this embodiment further describes a method for classifying financial texts based on BERT-CNN, where the method specifically includes:
s1 text preprocessing: preprocessing the financial text original data to obtain financial text input data, specifically combining with fig. 3, wherein the preprocessing includes removing noise information, text processing, word segmentation processing, and removing stop words;
s1.1 removing noise information
The financial texts to be classified contain noise information irrelevant to the classification task, such as emoticons, URLs, and meaningless characters such as mojibake; this noise is filtered by preprocessing the text with regular expressions. Noise such as emoticons can be handled by code matching: either filtering directly by matching Unicode code points in the text, or filtering against a list of allowed characters. The URL library and the meaningless-character library need to be built independently for the training text;
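A hedged illustration of this kind of noise filtering: the regular expressions below (URL pattern, emoticon code-point ranges, allowed-character whitelist) are assumptions chosen for demonstration, not the patent's actual patterns.

```python
import re

# Illustrative noise filters; the patterns are assumptions for demonstration.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Emoticon/pictograph blocks matched by Unicode code-point ranges.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Whitelist filtering: strip any character outside an allowed set.
ALLOWED_RE = re.compile(r"[^0-9A-Za-z\s.,;:%$()\-']")

def remove_noise(text: str) -> str:
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = ALLOWED_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

cleaned = remove_noise("AAPL up 3% \U0001F680 see https://example.com/news now")
```

In practice the URL and meaningless-character lists would be built from the training corpus, as the text above describes.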
s1.2 text processing
Establishing a financial vocabulary replacement table for text processing based on the specialty of financial text data, wherein the financial vocabulary replacement table is used for replacing the abbreviation and the abbreviation of the financial text professional term with the full name of the noun and replacing the stock code with the full name of the company;
s1.3 word segmentation processing
For the text with noise information removed, the nltk word segmentation tool is used to segment the text, and after segmentation the text data is converted into corresponding word segmentation vectors. Segmentation can be performed at different granularities, such as sentence level, phrase level, word level, and character level. Word segmentation in different languages has its own characteristics: in English data, because of part-of-speech changes, there are special operations such as lemmatization and stemming, while in Chinese data the choice of granularity is one of the main factors; a suitable segmentation granularity must be selected for the financial text scenario to achieve a better model effect;
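The granularity choice can be illustrated with a minimal pure-Python segmenter; a real pipeline would use nltk for English or a dedicated Chinese segmenter, and the regular expressions here are illustrative only.

```python
import re

def segment(text: str, granularity: str = "word"):
    # Minimal illustration of segmentation granularities; not a
    # production tokenizer.
    if granularity == "sentence":
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if granularity == "word":
        # Words/numbers as one token; punctuation as separate tokens.
        return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9]", text)
    if granularity == "char":
        return [c for c in text if not c.isspace()]
    raise ValueError(f"unknown granularity: {granularity}")

tokens = segment("Shares of Apple Inc. rose 3%.")
```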
s1.4 removing stop words
Among the word segmentation vectors obtained after segmentation there are words that contribute nothing to the text classification task, namely stop words. Stop words refer to words such as articles, prepositions, adverbs, and conjunctions that have essentially no influence on the semantics. Stop-word processing can be applied to the word segmentation vectors with the stopwords list in the nltk package, which optimizes the vectors, shortens their length, and improves the inference efficiency of the model;
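A small sketch of the stop-word removal step; the inline stop-word list is a tiny illustrative stand-in for the nltk stopwords corpus.

```python
# Tiny illustrative stop-word list; in practice the nltk stopwords
# corpus for the relevant language would be used.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "are", "on"}

def remove_stopwords(tokens):
    # Drop tokens whose lowercase form is a stop word, shortening the
    # token sequence fed to the model.
    return [t for t in tokens if t.lower() not in STOPWORDS]

filtered = remove_stopwords(["The", "price", "of", "oil", "is", "rising"])
```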
S2, establishing the BERT-CNN-based financial text category classification model: generating a financial text input vector from the financial text input data, inputting it into the financial text category classification model, passing it sequentially through the embedding layer, BERT layer, and convolutional neural network layer of the model, and extracting and outputting the feature vector of each sentence in the text, as shown in FIG. 4;
the financial text type classification model is obtained by using multiple groups of data through machine learning training, each group of data in the multiple groups of data comprises financial text preprocessing training data and a corresponding financial text type, and the financial text preprocessing training data is data of financial text original training data after text preprocessing in step S1.
S2.1 Embedded layer
Taking the word segmentation vectors of the preprocessed text as input units, special tokens ([CLS] at the beginning and [SEP] at the end) are added to each word segmentation vector to meet BERT's requirements on input word segmentation vectors. The initial word embedding vector corresponding to each word segmentation vector of the text is then obtained from the word embedding matrix of the pre-trained BERT; a position vector representing the word position is added to obtain a word embedding vector that fuses the position information of each token in the text, together with a segment embedding vector set to all zeros, so that each token of the financial text preprocessed in step S1 is converted into a corresponding vectorized representation.
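The embedding construction above (word embedding + position embedding + all-zero segment embedding) can be sketched as follows. The table sizes and token ids are illustrative, and the random matrices are stand-ins for BERT's pre-trained embedding tables.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, MAX_LEN = 1000, 768, 128   # illustrative sizes (768 matches BERT-base)

# Random stand-ins for BERT's pre-trained embedding tables.
word_emb = rng.standard_normal((VOCAB, HIDDEN))
pos_emb = rng.standard_normal((MAX_LEN, HIDDEN))
seg_emb = rng.standard_normal((2, HIDDEN))

def embed(token_ids):
    # Sum word, position, and segment embeddings; segment ids are all 0
    # for single-sentence input, as in the text above.
    n = len(token_ids)
    seg_ids = np.zeros(n, dtype=int)
    return word_emb[token_ids] + pos_emb[:n] + seg_emb[seg_ids]

vecs = embed([101, 57, 42, 102])   # hypothetical ids mimicking [CLS] ... [SEP]
```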
S2.2 BERT layer
The base version of BERT is a model proposed by the Google team; the name stands for Bidirectional Encoder Representations from Transformers, and the model adopts a stacked Transformer encoder structure. In this embodiment, BERT employs 12 Transformer encoder layers, a stack of multiple identical layers. Each layer begins with multi-head attention, whose keys, values, and query vectors come from the previous encoder's input; a residual connection is added to the multi-head attention sub-layer, and after summing the output of the multi-head attention with its input, layer normalization is performed. The result then enters a position-wise fully connected feedforward network comprising two fully connected layers, the first with a ReLU activation function and the second without one. Finally, the residual sum of the input and output of the feedforward network is layer-normalized once more to obtain the final output of the layer;
in this embodiment, the initial feature vector is obtained by passing the input vector of the financial text through the BERT layer. Those skilled in the art may adjust the number of Transformer encoders in the BERT layer according to the actual situation; the BERT layer may consist of 12 to 24 Transformer encoders.
S2.3 convolutional neural network layer (CNN layer)
The initial feature vector is taken as input, and the feature vector of the text is further extracted through the convolutional neural network layer to obtain the high-level feature vector. The output of the BERT layer is a 768-dimensional vector. If the convolutional neural network layer uses a one-dimensional convolution, a kernel of size 3 with zero-padding of 1 on each side keeps the output length equal to the input length, which facilitates the feature fusion operation;
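The length-preserving convolution mentioned here (kernel size 3, zero-padding 1 on each side) can be checked with a small NumPy sketch:

```python
import numpy as np

def conv1d_same(x, kernel):
    # 1-D convolution with zero-padding of (k-1)/2 on each side, so that
    # the output length equals the input length ("same" padding).
    pad = (len(kernel) - 1) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(kernel)] @ kernel for i in range(len(x))])

x = np.arange(6, dtype=float)              # toy input sequence
y = conv1d_same(x, np.array([1.0, 1.0, 1.0]))   # size-3 kernel, padding 1
```

Because input and output lengths match, the convolution output can later be added element-wise to the BERT features during fusion.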
the feature vectors of the text are further extracted through the convolutional neural network layer with the residual error structure, the convolutional neural network layer without the residual error structure comprises a one-dimensional convolutional layer, an activation layer and a full connection layer, and the one-dimensional convolutional layer, the activation layer and the full connection layer are sequentially combined to form a structural unit of the convolutional neural network layer without the residual error structure. The residual error is obtained from the input of one-dimensional convolution and combined with the full-connected layer inside the structural unit after the full-connected layer, the number of the structural units is the hyper-parameter of the convolutional neural network layer, the convolutional neural network layer is constructed by the same structural unit, and different activation functions are used in the final full-connected layer, such as ELU activation functions, namely exponential activation functions, and the formula is specifically as follows:
ELU(x) = x, if x > 0; ELU(x) = α(e^x − 1), if x ≤ 0
where x represents the input of the fully connected layer and α represents the output adjustment factor when x ≤ 0.
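The ELU formula above can be written directly in numpy; this is a minimal sketch with the common default α = 1.0:

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU: identity for x > 0, alpha * (exp(x) - 1) for x <= 0.
    # Note both branches agree at x = 0, where the output is 0.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```

Unlike ReLU, ELU produces small negative outputs for negative inputs, so its mean activation is closer to zero.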
S3 feature fusion: the initial features extracted by the BERT layer and the high-level features extracted by the CNN layer are fused to obtain the fusion features, as shown in FIG. 4. The specific steps are as follows: feature fusion is performed on the output of the BERT layer and the output of the CNN layer using the feature fusion formula, which is as follows:
Y = α₁X + α₂F(X)
where X is the feature tensor output by the BERT layer, F(X) is the output of the input X after passing through the CNN layer, and α₁, α₂ ∈ [−2, 2].
S4 text category output: as shown in fig. 5, the fused features are used as input and passed sequentially through the linear fully connected layer and the softmax classification layer to output the probability distribution over financial text categories, and the text category corresponding to the maximum probability value is selected as the final financial text category of the text for output.
S4.1 Linear fully connected layer processing
The fused features are passed as input through the linear fully connected layer;
S4.2 Softmax classification layer processing
Taking the output of the linear fully connected layer in S4.1 as input, the text classification probability distribution is obtained through the softmax classification layer, and the text category corresponding to the maximum probability value is selected as the final financial text category output of the text.
The calculation formula of softmax is
softmax(x_i) = exp(x_i) / Σ_{j=1..k} exp(x_j)
where k is the number of categories and x_i is the output of the fully connected layer for category i. The softmax layer normalizes outputs whose value range is the set of real numbers, and the model's estimated probability that the text belongs to the corresponding financial text category is softmax(x_i), which better matches intuitive human understanding.
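The softmax formula and the maximum-probability category selection of step S4 can be sketched together in numpy; the label list below is purely illustrative and not taken from the source:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; probabilities sum to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_category(logits, labels):
    # Select the category with the maximum probability as the final output.
    probs = softmax(np.asarray(logits, dtype=float))
    return labels[int(np.argmax(probs))]
```

Since softmax is monotonic, the argmax of the probabilities equals the argmax of the raw fully connected layer outputs; the probabilities are computed so that the model's confidence is human-interpretable.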
Example 2
As shown in fig. 6, the present embodiment provides a BERT-CNN-based financial text classification system, which includes: the device comprises a text preprocessing module, a model training module and a text classification module.
In this embodiment, the text preprocessing module is configured to perform preprocessing operation on financial text raw data to obtain financial text input data, where the preprocessing operation includes removing noise information, text processing, word segmentation processing, and removing stop words;
in this embodiment, the text classification module is configured to input a financial text input vector into the financial text category classification model; the vector passes sequentially through the embedding layer, the BERT layer, and the convolutional neural network layer of the model to extract and output the feature vector of each sentence in the text. Fusion features are obtained from the initial features extracted by the BERT layer and the high-level features extracted by the CNN layer; taking the fusion features as input, the probability distribution of the financial text classification is output sequentially through the linear fully connected layer and the softmax classification layer, and the text category corresponding to the maximum probability value is selected as the final financial text category for output;
in this embodiment, the model training module is configured to obtain a financial text category classification model through machine learning training using multiple sets of data, where each set of data in the multiple sets of data includes financial text preprocessing training data and a corresponding financial text category, and the financial text preprocessing training data is data obtained by processing financial text original training data through the text preprocessing module.
The specific implementation of each module in this embodiment may refer to Embodiment 1 and is not repeated here. It should be noted that the system provided in this embodiment is only illustrated by the division of the functional modules described above; in practical applications, the functions may be distributed among different functional modules as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A financial text classification method based on BERT-CNN is characterized by comprising the following steps:
s1 text preprocessing: preprocessing the financial text original data to obtain financial text input data, wherein the preprocessing operation comprises noise information removal, text processing, word segmentation processing and stop word removal;
s2, establishing a financial text category classification model based on BERT-CNN: generating a financial text input vector based on financial text input data, inputting the financial text input vector into a financial text category classification model, sequentially passing through an embedding layer, a BERT layer and a CNN layer of the model, and extracting and outputting a feature vector of each sentence in the text;
the embedded layer is used for converting each word in the text into corresponding vectorization expression, the BERT layer is used for extracting an initial feature vector, the CNN layer is used for extracting a high-level feature vector from the initial feature vector, the financial text classification model is obtained by using multiple groups of data through machine learning training, each group of data in the multiple groups of data comprises financial text preprocessing training data and corresponding financial text classifications, and the financial text preprocessing training data is data of the financial text original training data after text preprocessing in step S1;
performing feature fusion on the obtained advanced feature vector and the initial feature vector; obtaining the classification category of the text through a linear full-connection layer and a softmax classification layer;
and S3 feature fusion: performing feature fusion on the initial features and the advanced features to obtain fusion features;
S4 text category output: taking the fused features as input, sequentially passing through the linear fully connected layer and the softmax classification layer, outputting the probability distribution of the financial text classification, and selecting the text category corresponding to the maximum probability value as the final financial text category for output.
2. The BERT-CNN-based financial text classification method according to claim 1, wherein the removal of noise information specifically comprises: processing noise information irrelevant to the classification task, and filtering text noise information using regular expressions;
processing emoticon noise information in a code matching manner, specifically filtering directly by matching Unicode codes in the text, or filtering by matching against a list of allowed characters;
and processing the noise information of the URL and the nonsense character class by independently establishing a URL and a nonsense character library aiming at the training text.
3. The BERT-CNN-based financial text classification method according to claim 2, wherein the text processing is specifically: based on the specialty of the financial text data, a financial vocabulary replacement table for text processing is established, and the financial vocabulary replacement table is used for replacing the abbreviation and the abbreviation of the financial text professional term with the full name of the noun and replacing the stock code with the full name of the company.
4. The BERT-CNN-based financial text classification method according to claim 3, wherein the word segmentation process specifically comprises: selecting word segmentation granularity according to a demand scene, processing word segmentation of the text by using an nltk word segmentation tool for the text without noise information according to the word segmentation granularity, converting text data into corresponding word segmentation vectors after word segmentation, wherein the word segmentation granularity comprises sentence level, phrase level, word level and character level.
5. The BERT-CNN-based financial text classification method according to claim 4, wherein the removal of stop words specifically comprises: performing stop word removal on the participle vectors obtained through the word segmentation process using the stopwords in the nltk package.
6. The BERT-CNN-based financial text classification method according to claim 1, wherein in the embedding layer, the preprocessed text participle vectors are used as input units and adjusted to meet the requirements of BERT input participle vectors; initial word embedding vectors corresponding to the participle vectors of the text are obtained using the word embedding matrix of a pre-trained BERT, position vectors representing word positions are added to obtain word embedding vectors fusing the position information of each participle in the text, and segment embedding vectors are set to all 0s, so that each participle of the financial text preprocessed in step S1 is converted into a corresponding vectorized representation.
7. The BERT-CNN-based financial text classification method according to claim 1, wherein the BERT layer employs a structure in which a plurality of Transformer encoders are stacked;
in a Transformer encoder, each layer comprises a multi-head attention layer and a position-wise fully connected feed-forward network; the multi-head attention layer comes first, its key, value, and query vectors all come from the input of the encoder, a residual structure is added to the multi-head attention sub-layer, and at the output, the sum of the multi-head attention output and its input undergoes layer normalization;
the position-wise fully connected feed-forward network comprises a fully connected layer with the ReLU activation function and a fully connected layer without an activation function, and the residual sum of the input and output of the position-wise feed-forward network undergoes one more layer normalization to obtain the final output of the layer.
8. The BERT-CNN-based financial text classification method according to claim 1, wherein in the CNN layer, the feature vectors of the text are further extracted by the convolutional neural network layer to obtain high-level feature vectors, with the initial feature vectors as input;
the feature vector of the text is further extracted through a convolutional neural network layer with a residual structure; the basic (non-residual) structural unit of this layer comprises a one-dimensional convolutional layer, an activation layer, and a fully connected layer, combined in that order;
the residual connection is taken from the input of the one-dimensional convolution and added to the output of the fully connected layer inside the structural unit; the number of structural units is a hyper-parameter of the convolutional neural network layer, the layer is constructed from identical structural units, and the ELU activation function is used in the final fully connected layer.
9. The BERT-CNN-based financial text classification method according to claim 1, wherein in step S3, the concrete steps are: and performing feature fusion on the output of the BERT layer and the output of the CNN layer by adopting a feature fusion formula, wherein the feature fusion formula is as follows:
Y = α₁X + α₂F(X)
where X is the feature tensor output by the BERT layer, F(X) is the output of the input X after passing through the CNN layer, and α₁, α₂ ∈ [−2, 2].
10. A BERT-CNN based financial text classification system, comprising: the system comprises a text preprocessing module, a model training module and a text classification module;
the text preprocessing module is used for preprocessing financial text original data to obtain financial text input data, wherein the preprocessing operation comprises noise information removal, text processing, word segmentation processing and stop word removal;
the text classification module is used for inputting a financial text input vector into the financial text category classification model; the vector sequentially passes through the embedding layer, the BERT layer, and the convolutional neural network layer of the model to extract and output the feature vector of each sentence in the text; fusion features are obtained from the initial features extracted by the BERT layer and the high-level features extracted by the CNN layer; taking the fusion features as input, the probability distribution of the financial text classification is output sequentially through the linear fully connected layer and the softmax classification layer, and the text category corresponding to the maximum probability value is selected as the final financial text category of the text for output;
the model training module is used for obtaining a financial text type classification model by using multiple groups of data through machine learning training, each group of data in the multiple groups of data comprises financial text preprocessing training data and a corresponding financial text type, and the financial text preprocessing training data is data obtained by processing financial text original training data through the text preprocessing module.
CN202111175876.6A 2021-10-09 2021-10-09 Financial text classification method and system based on BERT-CNN Pending CN114064888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111175876.6A CN114064888A (en) 2021-10-09 2021-10-09 Financial text classification method and system based on BERT-CNN

Publications (1)

Publication Number Publication Date
CN114064888A true CN114064888A (en) 2022-02-18

Family

ID=80234437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111175876.6A Pending CN114064888A (en) 2021-10-09 2021-10-09 Financial text classification method and system based on BERT-CNN

Country Status (1)

Country Link
CN (1) CN114064888A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357720A (en) * 2022-10-20 2022-11-18 暨南大学 Multi-task news classification method and device based on BERT
CN115357720B (en) * 2022-10-20 2023-05-26 暨南大学 BERT-based multitasking news classification method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination