CN113239703A - Deep logical reasoning financial text analysis method and system based on multivariate factor fusion - Google Patents

Deep logical reasoning financial text analysis method and system based on multivariate factor fusion Download PDF

Info

Publication number
CN113239703A
CN113239703A
Authority
CN
China
Prior art keywords
vector
text
vectors
memory
financial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110562999.9A
Other languages
Chinese (zh)
Other versions
CN113239703B (en)
Inventor
Li Xin
Wang Zhi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202110562999.9A priority Critical patent/CN113239703B/en
Publication of CN113239703A publication Critical patent/CN113239703A/en
Application granted granted Critical
Publication of CN113239703B publication Critical patent/CN113239703B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep logical reasoning financial text analysis method and system based on multivariate factor fusion. The model comprises three modules: an encoding module, a slot memory module and a cross-attention module. After a financial text and a plurality of external features enter the model, text encoding, memory storage and interactive information understanding are carried out in sequence, and the deep information contained in the text is extracted step by step through these three processes, so that a reasonable inference is made.

Description

Deep logical reasoning financial text analysis method and system based on multivariate factor fusion
Technical Field
The invention relates to financial text analysis, and in particular to a deep logical reasoning financial text analysis method and system based on multivariate factor fusion.
Background
In financial text analysis, the mainstream technology has shifted from conventional methods centered on hand-crafted feature extraction to methods driven by neural networks. Traditional features, such as emotion word counts, key-segment matching and text-stream reconstruction, consume too much manpower to construct and must be rebuilt for texts in different fields, so their transferability is limited. Neural-network-based methods, by contrast, only require training on task-specific data and an expenditure of computing resources to perform the financial text analysis task well in a given scenario, which is why they have become the mainstream technology in recent years.
At present, most applied research on financial texts focuses on the choice of neural network architecture. Such methods concentrate on representation learning: they design neural networks to learn vectorized representations of texts, in the hope of finding an architecture that models text better. These methods have the following two problems:
1) They focus on surface-level representation learning of the text and neglect reasoning over text semantics. Financial text analysis, however, must serve financial business, so when building a neural network model, attention should be paid to the capability of the designed method for semantic understanding and logical reasoning.
2) Financial phenomena arise from the combined influence of multiple external factors, and inference from a single factor easily falls into a local optimum. Existing work concentrates on the text itself and reasons from the single textual factor, paying insufficient attention to potentially relevant external factors; the conclusions such models draw are therefore one-sided, and the models lack generalization on data mixed with noise.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a deep logical reasoning financial text analysis method and system based on multivariate factor fusion, which extract the deep information contained in a text so as to make a reasonable inference.
The technical problem of the invention is solved by the following technical scheme:
a deep logic reasoning financial text analysis method based on multi-element fusion is characterized by comprising the following steps: s1, converting the input text into semantic vector representation; s2, semantic parsing is carried out on the text through a coding module; s3, converting the external factors into memory vectors; and S4, calculating the cross representation form of the text vector and the external factors by adopting a cross attention module, and re-expressing the cross representation form into a higher-level semantic feature to realize abstract understanding of semantics.
In some embodiments, the following features are also included:
the step S1 includes: s1-1, converting the input text into word embedding vectors;
using a dual embedding vector as the initial word embedding vector; the dual embedding vector comprises two types: a general word embedding vector and a domain-specific word embedding vector, the latter distinguished by training word embeddings on a corpus of the specified domain.
The general word embedding vector is trained on internet text data, the domain-specific word embedding vector is trained on the financial text data set constructed by the method, and the two vectors are concatenated along a specified dimension.
During training, data are read in batches, and the lengths of all texts within the same batch are unified.
In step S2, the text is semantically parsed by an encoding module composed of a stacked recurrent neural network; the stacked recurrent neural network is a four-layer neural network structure, and each layer consists of a bidirectional stacked flow gate control unit (SFGU).
The SFGU body consists of two gate structures that control the flow directions of the input vector and the hidden-layer vector; understanding of the semantic information in the text is realized by adjusting the information interaction of these two vectors.
The calculation flow of the stacked flow gate control unit comprises: given the input vector x_t at the current time t and the memory cell c_{t-1} of the last time period t-1, the memory cell c_t of the current time period is calculated by:

z_t = W_i·x_t + U_i·c_{t-1}
c̃_t = tanh(z_t)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ c̃_t

where W_i and U_i are trainable parameters of the neural network, tanh is the hyperbolic tangent function, and ⊙ is the vector point (element-wise) product. f_t is the output of the forget gate, used to forget part of the unimportant historical memory information inside the SFGU unit, thereby compressing the information flow.
In step S3, a slot memory module encodes the external factors and stores them in a memory unit for use by subsequent modules; in step S4, when the high-level vector representation is calculated, weights are reassigned to each vector according to the importance of each position.
The invention also proposes a deep logical reasoning financial text analysis system based on multivariate factor fusion, comprising a processor and a memory, the memory storing a computer program executable by the processor to implement the method described above.
Compared with the prior art, the invention has the advantages that:
the deep logical reasoning model based on the multi-dimensional data fusion provided by the method respectively realizes semantic analysis, multi-factor fusion and deep logical reasoning through three different mechanisms, namely a stacking recurrent neural network, a slot memory module and a cross attention module. Through the three modules, after a financial text and a plurality of external features enter the model proposed by the method, text coding, memory storage and information interactive understanding are sequentially carried out, and deep information contained in the text is extracted step by step through three processes, so that a reasonable inference is made.
Drawings
FIG. 1 is a diagram of the deep logical reasoning model based on multivariate factor fusion according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the stacked flow gate control unit (SFGU) according to an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and preferred embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that orientation terms such as left, right, up, down, top and bottom in the present embodiments are only concepts relative to one another, or refer to the normal use state of the product, and should not be considered limiting.
To address the defects of the prior art, the method of the following embodiments designs a new technical scheme: from the perspective of exploiting various kinds of external information, it mines logical connections along a chain of cause-and-effect evidence and performs deep logical reasoning on financial texts, thereby reaching reliable and interpretable financial conclusions. The method designs a brand-new mode of interactive modeling between external factors and text, and performs deep logical reasoning through interactive fusion to solve the problem of cooperative understanding of external factors and text.
The deep logical reasoning model based on multivariate factor fusion proposed by the method mainly comprises three modules: an encoding module, a slot memory module and a cross-attention module; their connection relations and technical details are shown in FIG. 1.
1. Problem definition
The method focuses on the financial text analysis task with multivariate factor fusion: given multiple external factors, it performs deep logical reasoning over them to reach a final analysis conclusion. Specifically, given a financial text X containing n words, i.e. X = {w_i | i = 1, 2, 3, …, n}, and m external factors T = {t_1, t_2, t_3, …, t_m}, where each t_i can be a Chinese character string or a number, a conclusion Y is inferred from X and T, with Y ∈ {−1, 0, 1}. In mathematical notation the process is written Y = f(X, T); the method designs a neural network model M that maps the input <X, T> to the output Y.
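For illustration, this input/output contract can be sketched as follows (a minimal Python sketch; the names analyze and model, and the exact label semantics, are hypothetical and not specified by the patent):

    from typing import List, Union

    Factor = Union[str, float]  # each t_i may be a Chinese character string or a number

    def analyze(model, text_words: List[str], factors: List[Factor]) -> int:
        """The neural model M maps an input pair <X, T> to a conclusion Y in {-1, 0, 1}."""
        return model(text_words, factors)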
2. Technical scheme
1) Encoding module
The encoding module is responsible for converting the input text into a semantic vector representation, thereby realizing the semantic parsing of the text. The first step is to convert the input text into word embedding vectors. Word embedding is a common technique in the field of natural language processing, realized here with the word2vec algorithm. The method uses a dual embedding vector as the initial word embedding vector. The dual embedding vector contains two types: a general word embedding vector and a domain-specific word embedding vector, the latter obtained by training word embeddings on a corpus of the specified domain. In mathematical notation, each word w_i is initialized with a feature vector

e_i ∈ R^{d_G + d_D}

where d_G and d_D are the dimensions of the general word embedding vector e_i^G ∈ R^{d_G} and the domain-specific word embedding vector e_i^D ∈ R^{d_D} respectively, and |V| denotes the size of the vocabulary. The final input word embedding vector e_i is calculated by:

e_i = e_i^G ⊕ e_i^D

where ⊕ denotes the vector concatenation operation, splicing the two vectors together along the specified dimension. The method uses 300-dimensional GloVe word embeddings as the initialized word vector matrix; the general word embedding vectors are trained on internet text data, and the domain-specific word embedding vectors are trained on the financial text data set constructed by the method.
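A minimal sketch of this dual embedding, assuming PyTorch; glove_weights and finance_weights are hypothetical pre-trained weight matrices (of widths d_G = 300 and d_D respectively), not artifacts published with the patent:

    import torch
    import torch.nn as nn

    class DualEmbedding(nn.Module):
        def __init__(self, glove_weights, finance_weights):
            super().__init__()
            # general embedding e^G, pre-trained on internet text (GloVe)
            self.general = nn.Embedding.from_pretrained(glove_weights, freeze=False)
            # domain-specific embedding e^D, pre-trained on the financial corpus
            self.domain = nn.Embedding.from_pretrained(finance_weights, freeze=False)

        def forward(self, token_ids):
            # e_i = [e_i^G ; e_i^D]: concatenate along the feature dimension
            return torch.cat([self.general(token_ids), self.domain(token_ids)], dim=-1)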
When obtaining word embedding vectors for the entire input text, a special case needs to be handled: since the input text may contain words outside the vocabulary V, the method maps all words not in the vocabulary to a special identifier <unk>, and generates a word embedding vector for this identifier by random initialization.
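A sketch of this out-of-vocabulary handling, assuming a hypothetical word-to-index dictionary vocab with a reserved <unk> entry; the uniform range for the random row is an assumption borrowed from the parameter initialization described later:

    import torch

    UNK = "<unk>"

    def tokens_to_ids(words, vocab):
        # every word outside the vocabulary V maps to the <unk> identifier
        return [vocab.get(w, vocab[UNK]) for w in words]

    # the embedding row for <unk> is generated by random initialization
    unk_vector = torch.empty(300).uniform_(-0.01, 0.01)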
In addition, when the neural network model is trained, data are read in batches, and the lengths of the texts within the same batch must be unified. The unification works as follows: take all texts in the same batch and obtain the maximum sentence length L; a longest sentence length L_max is preset, and texts exceeding L_max are post-truncated, discarding the part beyond L_max. If a text is shorter than L, it is padded at its tail with zero vectors, ensuring that the padded length reaches L. Through the above conversion, the original input word sequence X = {w_i | i = 1, 2, 3, …, n} is converted into the word embedding vector sequence E = {e_i | i = 1, 2, 3, …, n}.
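A sketch of this batch length unification; the cap l_max and the padding index pad_id are hypothetical values standing in for the preset L_max and the zero-vector padding:

    def unify_batch(batch_ids, pad_id=0, l_max=256):
        # L = longest sentence in this batch, capped at the preset L_max
        L = min(max(len(s) for s in batch_ids), l_max)
        out = []
        for s in batch_ids:
            s = s[:L]                                 # post-truncate texts longer than L
            out.append(s + [pad_id] * (L - len(s)))   # pad at the tail up to length L
        return out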
After the vector sequence of the text is obtained, the method semantically parses the text through the encoding module. The encoding module consists of a stacked recurrent neural network (S-RNN); the S-RNN is a four-layer neural network structure, and each layer consists of a bidirectional stacked flow gate control unit (SFGU). The SFGU body is shown in FIG. 2.
The SFGU body consists of two gate structures that control the flow directions of the input vector and the hidden-layer vector; understanding of the semantic information in the text is realized by adjusting the information interaction of these two vectors. The calculation flow of the stacked flow gate control unit is substantially the same as that of a gated recurrent unit (GRU), with the following modifications: given the input vector x_t at the current time t and the memory cell c_{t-1} of the last time period t-1, the memory cell c_t of the current time period is calculated by:

z_t = W_i·x_t + U_i·c_{t-1}
c̃_t = tanh(z_t)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ c̃_t

where W_i and U_i are trainable parameters of the neural network, tanh is the hyperbolic tangent function, and ⊙ is the vector point (element-wise) product. f_t is the output of the forget gate, used to forget part of the unimportant historical memory information inside the SFGU unit, thereby compressing the information flow; it is calculated by:

f_t = σ(W_f·x_t + U_f·c_{t-1})

where W_f and U_f are trainable parameters of the neural network and σ is the Sigmoid function. The stacked flow gate control unit has another output, the hidden-layer vector h_t of the current time period, calculated by:

o_t = W_h·x_t + U_h·c_t
h̃_t = tanh(o_t)
h_t = s_t ⊙ h̃_t + (1 − s_t) ⊙ x_t

where s_t is the output of the stack flow gate, used to read the input text information in the SFGU unit; it dominates the information flow of the whole SFGU unit and is calculated by:

s_t = σ(W_s·x_t + U_s·c_{t-1})

where W_s and U_s are trainable parameters of the neural network. The forget gate and the stack flow gate together form a complete SFGU unit: the forget gate controls the information flow conducted from the last time segment to the current one and selectively filters part of the information, while the stack flow gate controls the information transmitted from the SFGU unit of the current layer to the SFGU unit of the next layer, understanding the input text and abstracting the text semantics.
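The published formula images are not reproduced in this text, so the following PyTorch sketch implements the GRU-style reading reconstructed above; it is an assumption-laden illustration rather than the patent's exact parametrization (the extra projection proj is an added assumption so the residual term matches the hidden size):

    import torch
    import torch.nn as nn

    class SFGUCell(nn.Module):
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.Wf = nn.Linear(input_size, hidden_size, bias=False)   # forget gate, input side
            self.Uf = nn.Linear(hidden_size, hidden_size, bias=False)  # forget gate, memory side
            self.Ws = nn.Linear(input_size, hidden_size, bias=False)   # stack flow gate, input side
            self.Us = nn.Linear(hidden_size, hidden_size, bias=False)  # stack flow gate, memory side
            self.Wi = nn.Linear(input_size, hidden_size, bias=False)   # candidate memory, input side
            self.Ui = nn.Linear(hidden_size, hidden_size, bias=False)  # candidate memory, memory side
            self.Wh = nn.Linear(input_size, hidden_size, bias=False)   # hidden output, input side
            self.Uh = nn.Linear(hidden_size, hidden_size, bias=False)  # hidden output, memory side
            self.proj = nn.Linear(input_size, hidden_size, bias=False) # maps x_t to the hidden size

        def forward(self, x_t, c_prev):
            f_t = torch.sigmoid(self.Wf(x_t) + self.Uf(c_prev))   # forget gate f_t
            s_t = torch.sigmoid(self.Ws(x_t) + self.Us(c_prev))   # stack flow gate s_t
            c_tilde = torch.tanh(self.Wi(x_t) + self.Ui(c_prev))  # candidate memory
            c_t = f_t * c_prev + (1 - f_t) * c_tilde              # selective forgetting of history
            h_tilde = torch.tanh(self.Wh(x_t) + self.Uh(c_t))     # candidate hidden output
            h_t = s_t * h_tilde + (1 - s_t) * self.proj(x_t)      # layer-to-layer information flow
            return h_t, c_t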
Each layer of the stacked recurrent neural network follows a bidirectional vector encoding scheme; this bidirectional representation is similar to a bidirectional long short-term memory network (LSTM). The method splices the hidden-layer vectors output by the forward and backward passes of the SFGU unit as the final output vector H_t of the encoding module:

H_t = h_t^fwd ⊕ h_t^bwd

where h_t^fwd and h_t^bwd differ only in the flow of the input information: one flows from left to right and the other from right to left. The method takes the last of the output vectors as the semantic vector representation C of the input text.
2) Slot memory module
The goal of the slot memory module is to encode the external factors T = {t_1, t_2, t_3, …, t_m} and store them in a memory unit for use by the subsequent modules; in essence it converts the external factors into memory vectors. Each piece of external information t_i may be a single word or a sequence of words; the method treats numbers and times as character strings as well, so t_i can be written as t_i = {c_{i,j} | j = 1, 2, 3, …, k}. The method uses an LSTM network to convert t_i into a vector representation m_i:

m_i = LSTM(c_{i,j}, h_{j-1})

where h_{j-1} is the previous hidden-layer output of the LSTM; a zero vector is used as the initialization vector of the LSTM, and the output of the last state of the LSTM is taken as the memory vector m_i. Each external factor is encoded by the same LSTM network, yielding a series of memory vectors M = {m_i | i = 1, 2, 3, …, m}.
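A sketch of the slot memory encoding, assuming each factor t_i has already been converted into a sequence of embeddings; one shared LSTM (as in the patent) encodes every factor, and the last hidden state is stored as m_i:

    import torch
    import torch.nn as nn

    class SlotMemory(nn.Module):
        def __init__(self, embed_dim, mem_dim):
            super().__init__()
            # a single LSTM shared by all external factors
            self.lstm = nn.LSTM(embed_dim, mem_dim, batch_first=True)

        def forward(self, factor_sequences):
            memories = []
            for seq in factor_sequences:        # seq: tensor of shape (1, k_i, embed_dim)
                # h_0 and c_0 default to zero vectors, matching the patent's initialization
                _, (h_last, _) = self.lstm(seq)
                memories.append(h_last[0, 0])   # m_i = output of the last LSTM state
            return torch.stack(memories)        # M: (m, mem_dim)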
3) Cross attention module
This section describes the cross-attention module proposed by the method. The basic idea of the attention mechanism is that, when computing a high-level vector representation, weights are reassigned to each vector according to the importance of each position. Based on this idea, the method proposes a cross-attention module that computes a cross representation of the text vector and the external factors and re-expresses it as a higher-level semantic feature, realizing an abstract understanding of the semantics. Intuitively, words at different positions in the input text differ in their semantic importance to the sentence; and viewed under different external factors, the importance of different words in the input text also differs.
Given the series of memory vectors M = {m_i | i = 1, 2, 3, …, m} and the semantic vector representation C of the input text, the cross-attention module selects from the memory unit the memory slice m_i most relevant to the text vector and outputs a continuous vector representation v, obtained as a weighted sum over the memory sequence:

v = Σ_{i=1}^{m} a_i · m_i

where m is the size of the memory sequence and a_i is a correlation weight coefficient with value range [0, 1] and Σ_i a_i = 1. To calculate the correlation coefficients a_i, the method designs a fully-connected neural network that reads the memory vector m_i and the text vector C and computes a similarity score:

g_i = W_a · (m_i ⊕ C)

where W_a is a trainable parameter of the neural network and ⊕ denotes the vector concatenation operation. Once all the weight coefficients g_1, g_2, …, g_m of the memory sequence are obtained, the method normalizes them with a Softmax function to obtain the correlation coefficients a_i, where exp is the exponential function with the natural constant e as its base:

a_i = exp(g_i) / Σ_{j=1}^{m} exp(g_j)
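A sketch of this cross-attention weighting, assuming memory vectors M of shape (m, mem_dim) and a text vector C of shape (text_dim,); the scorer Wa reads the concatenation [m_i ; C]:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossAttention(nn.Module):
        def __init__(self, mem_dim, text_dim):
            super().__init__()
            self.Wa = nn.Linear(mem_dim + text_dim, 1, bias=False)  # similarity scorer g_i

        def forward(self, M, C):
            # g_i = Wa([m_i ; C]) for every memory slot
            C_rep = C.unsqueeze(0).expand(M.size(0), -1)
            g = self.Wa(torch.cat([M, C_rep], dim=-1)).squeeze(-1)
            a = F.softmax(g, dim=-1)                  # a_i in [0, 1], sum_i a_i = 1
            return (a.unsqueeze(-1) * M).sum(dim=0)   # v = sum_i a_i * m_i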
the cross-attention module has two advantages: one is that it can assign an importance score to each memory vector based on the semantic relevance; the other is that it is differentiable, so it can be easily trained in an end-to-end fashion with other neural network modules.
After obtaining the memory vector representation v, the method uses a fully-connected layer to perform a final fusion of the text vector C and the memory vector representation, so as to make the final conclusion inference:

p = Softmax(W_o · (v + C))

where W_o is a trainable parameter of the neural network and p is the final output of the model; in the method, p is a three-dimensional vector, each dimension representing the probability value of the corresponding label.
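A sketch of this final fusion step, which assumes v and C share the same dimensionality (required by the element-wise sum v + C):

    import torch.nn as nn
    import torch.nn.functional as F

    class OutputLayer(nn.Module):
        def __init__(self, dim, num_labels=3):
            super().__init__()
            self.Wo = nn.Linear(dim, num_labels)   # one dimension per label in {-1, 0, 1}

        def forward(self, v, C):
            # p = Softmax(Wo(v + C)): probability value per label
            return F.softmax(self.Wo(v + C), dim=-1)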
4) Training method
The multi-factor deep reasoning model is an end-to-end neural network model, so the model is trained in a supervised fashion by minimizing the categorical cross-entropy error, with the loss function:

Loss = −Σ_{(X,T)∈D} Σ_{c∈C} y_c · log(p_c)

where D denotes all training instances, C is the set of categories, (X, T) is the input "text–factor set" pair, y_c is the correct label of the input text, taking the value 1 or 0 to indicate whether the current category c is the correct answer, and p_c is the probability value predicted by the model M(X, T).
The method uses the back-propagation algorithm to calculate the gradients of all parameters and updates the model parameters with a stochastic gradient descent (SGD) optimizer. All trainable parameters used in the neural network are randomly initialized from the uniform distribution U(−0.01, 0.01), and the learning rate of the optimizer is set to 0.01.
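A sketch of this training setup with the stated hyperparameters; the model interface, data iterable and label encoding are hypothetical, and the loss here takes logits rather than the Softmax output because that is what PyTorch's cross-entropy expects:

    import torch
    import torch.nn as nn

    def train(model, data, epochs=10):
        # all trainable parameters initialized from U(-0.01, 0.01)
        for p in model.parameters():
            nn.init.uniform_(p, -0.01, 0.01)
        opt = torch.optim.SGD(model.parameters(), lr=0.01)  # SGD, learning rate 0.01
        loss_fn = nn.CrossEntropyLoss()  # categorical cross entropy (expects logits)
        for _ in range(epochs):
            for X, T, y in data:         # (text, factor set, gold label index)
                opt.zero_grad()
                logits = model(X, T)     # forward pass over the pair <X, T>
                loss = loss_fn(logits.unsqueeze(0), torch.tensor([y]))
                loss.backward()          # back propagation of the gradients
                opt.step()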
In some variations, the slot memory module may be built in other ways: the method uses a long short-term memory network as the main structure of the slot memory module, but word-vector accumulation, a plain recurrent neural network or a gated recurrent unit are possible alternatives.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be considered limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such variants with the same properties or uses are considered to be within the scope of protection of the invention.

Claims (10)

1. A deep logical reasoning financial text analysis method based on multivariate factor fusion, characterized by comprising the following steps:
S1, converting the input text into a semantic vector representation;
S2, semantically parsing the text through an encoding module;
S3, converting external factors into memory vectors;
and S4, using a cross-attention module to compute a cross representation of the text vector and the external factors, and re-expressing it as a higher-level semantic feature to realize an abstract understanding of the semantics.
2. The deep logical reasoning financial text analysis method based on multivariate factor fusion as claimed in claim 1, wherein said step S1 comprises: first converting the input text into word embedding vectors.
3. The method of claim 2, wherein a dual embedding vector is used as the initial word embedding vector; the dual embedding vector comprises two types: a general word embedding vector and a domain-specific word embedding vector, the latter distinguished by training word embeddings on a corpus of the specified domain.
4. The method of claim 3, wherein the general word embedding vectors are trained on internet text data, the domain-specific word embedding vectors are trained using the financial text data set constructed by the method, and the two vectors are concatenated along a specified dimension.
5. The method as claimed in claim 4, wherein during training the data are read in batches, and the lengths of the texts within the same batch are unified.
6. The method according to claim 1, wherein in step S2 the text is semantically parsed by an encoding module composed of a stacked recurrent neural network; the stacked recurrent neural network is a four-layer neural network structure, and each layer consists of a bidirectional stacked flow gate control unit (SFGU).
7. The method as claimed in claim 6, wherein the SFGU body consists of two gate structures that control the flow directions of the input vector and the hidden-layer vector, and understanding of the semantic information in the text is realized by adjusting the information interaction of the two vectors.
8. The method as claimed in claim 7, wherein the calculation flow of the stacked flow gate control unit comprises: given the input vector x_t at the current time t and the memory cell c_{t-1} of the last time period t-1, the memory cell c_t of the current time period is calculated by:

z_t = W_i·x_t + U_i·c_{t-1}
c̃_t = tanh(z_t)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ c̃_t

where W_i and U_i are trainable parameters of the neural network, tanh is the hyperbolic tangent function, and ⊙ is the vector point (element-wise) product; f_t is the output of the forget gate, used to forget part of the unimportant historical memory information inside the SFGU unit, thereby compressing the information flow.
9. The method as claimed in claim 1, wherein in step S3 a slot memory module encodes the external factors and stores them in a memory unit for use by subsequent modules; and in step S4, when the high-level vector representation is calculated, weights are reassigned to each vector according to the importance of each position.
10. A deep logical inference financial text analysis system based on multivariate factor fusion, comprising a processor and a memory, the memory having stored therein a computer program executable by the processor for carrying out the method according to any one of claims 1-9.
CN202110562999.9A 2021-05-24 2021-05-24 Deep logic reasoning financial text analysis method and system based on multi-element factor fusion Active CN113239703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110562999.9A CN113239703B (en) 2021-05-24 2021-05-24 Deep logic reasoning financial text analysis method and system based on multi-element factor fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110562999.9A CN113239703B (en) 2021-05-24 2021-05-24 Deep logic reasoning financial text analysis method and system based on multi-element factor fusion

Publications (2)

Publication Number Publication Date
CN113239703A true CN113239703A (en) 2021-08-10
CN113239703B CN113239703B (en) 2023-05-02

Family

ID=77138520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110562999.9A Active CN113239703B (en) 2021-05-24 2021-05-24 Deep logic reasoning financial text analysis method and system based on multi-element factor fusion

Country Status (1)

Country Link
CN (1) CN113239703B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415977A (en) * 2018-02-09 2018-08-17 华南理工大学 One is read understanding method based on the production machine of deep neural network and intensified learning
CN108920587A (en) * 2018-06-26 2018-11-30 清华大学 Merge the open field vision answering method and device of external knowledge
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN110188167A (en) * 2019-05-17 2019-08-30 北京邮电大学 A kind of end-to-end session method and system incorporating external knowledge
CN111241807A (en) * 2019-12-31 2020-06-05 浙江大学 Machine reading understanding method based on knowledge-guided attention
CN111274820A (en) * 2020-02-20 2020-06-12 齐鲁工业大学 Intelligent medical named entity identification method and device based on neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ARTHUR SZLAM et al.: "End-To-End Memory Networks" *
LI XIN et al.: "Exploiting BERT for End-to-End Aspect-based Sentiment Analysis" *
RUIDAN HE et al.: "An interactive multi-task learning network for end-to-end aspect-based sentiment analysis" *
ZHANG Rui; YANG Xuchen; JU Shenggen; LIU Ningning; XIE Zhengwen; WANG Jingyan: "Textual entailment recognition based on a multi-level dynamic gated inference network" *

Also Published As

Publication number Publication date
CN113239703B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN110188176B (en) Deep learning neural network, and training and predicting method, system, device and medium
Sukhbaatar et al. End-to-end memory networks
CN111914067B (en) Chinese text matching method and system
CN110110324B (en) Biomedical entity linking method based on knowledge representation
CN108628935B (en) Question-answering method based on end-to-end memory network
CN110210032A (en) Text handling method and device
CN114443827A (en) Local information perception dialogue method and system based on pre-training language model
CN112784532B (en) Multi-head attention memory system for short text sentiment classification
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN112347756A (en) Reasoning reading understanding method and system based on serialized evidence extraction
CN111881292A (en) Text classification method and device
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Yang et al. Sequence-to-sequence prediction of personal computer software by recurrent neural network
Caglayan Multimodal machine translation
CN109948163B (en) Natural language semantic matching method for dynamic sequence reading
CN115186147A (en) Method and device for generating conversation content, storage medium and terminal
CN110633473A (en) Implicit discourse relation identification method and system based on conditional random field
Tekir et al. Deep learning: Exemplar studies in natural language processing and computer vision
CN116629361A (en) Knowledge reasoning method based on ontology learning and attention mechanism
CN114661874B (en) Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN113010662B (en) Hierarchical conversational machine reading understanding system and method
CN113239703A (en) Deep logical reasoning financial text analysis method and system based on multivariate factor fusion
Kandi Language Modelling for Handling Out-of-Vocabulary Words in Natural Language Processing
Shumsky et al. ADAM: a prototype of hierarchical neuro-symbolic AGI
Kvochick Fundamental aspects of pattern recognition in architectural drawing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant