CN115544238A - Financial affairs whitewash identification system fusing company news story line characteristics - Google Patents
- Publication number
- CN115544238A (application number CN202211133361.4A)
- Authority
- CN
- China
- Prior art keywords
- news
- company
- story line
- attention
- story
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention belongs to the technical field of financial whitewash identification, and in particular relates to a financial whitewash identification system fusing company news story line characteristics. The system comprises a company news story line feature representation module and a mixed attention classification module. The feature representation module obtains company news story lines with a multilayer clustering method and derives a story line vector representation by recursively weighting and summing news headline sentence vectors along the story line tree structure. The mixed attention classification module obtains different representations of the company news story lines with a self-attention mechanism and a company-index-dimension/news-story-line cross-attention mechanism, splices and fuses them into a company news feature vector, and from it outputs a company financial whitewash risk judgment. By explicitly discovering the association information among company news items, the method reduces model complexity and improves the accuracy of the output risk index; by mining company news risk signals from multiple angles, it reduces the influence of negative news unrelated to whitewash risk on the judgment result.
Description
Technical Field
The invention belongs to the technical field of financial whitewash identification, and particularly relates to a financial whitewash identification system fusing company news story line characteristics.
Background
A financial whitewash identification system fusing company news story line characteristics belongs to the text classification methods in the field of financial whitewash identification. Such methods extract company financial whitewash risk features from external news reports or annual-report text produced during the company's operating period, and judge the company's whitewash risk from the extracted features; their performance depends heavily on the choice of company text sources and the feature representation. Traditional whitewash identification based on emotional features reflects external evaluation of the company's condition, or the attitude of its senior management, through a highly summarized text sentiment index. Such indices can supplement company financial indicators and assist the identification of company financial whitewashing. However, a listed company generates a large number of reports through media supervision and self-disclosure during its operation, and highly summarized sentiment indicators can hardly capture the complex risk information reflected by that volume of news.
Recently, with the development of deep learning, text embedding features have been applied to many natural language processing tasks with good results. Text embedding learns the organizational relationships within a text and with its context, effectively reflecting the syntactic and semantic information of words and sentences. Compared with sentiment statistics, embedded representations can capture more complex interactions between texts. In applications with little training data, text embeddings pre-trained on a general corpus can also improve a downstream model by introducing implicit textual information. Models based on text embedding features mostly follow this framework: first train word embeddings on a general corpus, then feed the text vectors to the downstream task, either fine-tuning the pre-trained vectors on the task corpus or transferring them directly. When training data are sufficient, directly modeling the input word and sentence embedding sequence can work well. However, financial whitewash recognition based on company news is a special application: few training samples are available, and a single training sample corresponds to a large amount of text input. In such a multi-document joint classification task, a complex model that takes words or sentences as its basic input units is prone to under-fitting and can hardly learn the association between text features and the target. Therefore, capturing the interaction among a large number of company news items in advance, and thereby reducing the number of features the model must train, is an important factor for improving whitewash identification.
In addition, news during a company's operating period covers items such as publicity, supervision, and management; most items are noise texts unrelated to the company's whitewash risk, and this noise greatly affects the efficiency and quality of the risk identification model's judgments. Reducing the influence of noise is therefore another important factor for improving recognition. An attention mechanism can give more weight to important information in noisy data, but for each application the attention module must be designed so that it actually attends to the target information and captures effective problem features.
In summary, although existing work has made some progress on financial whitewash identification based on company-associated text information, news text features are not yet fully exploited and the identification effect can be improved further.
Disclosure of Invention
The invention aims to provide a financial whitewash identification system fusing company news story line characteristics, so as to solve the insufficient utilization of external news features in existing financial whitewash identification methods.
The invention provides a financial whitewash identification system fusing company news story line characteristics, comprising a company news story line feature representation module and a mixed attention classification module. The feature representation module extracts the associated news sets within a company's reporting period as news story lines, converts them into a story line vector sequence, and outputs the sequence to the mixed attention classification module. The mixed attention classification module obtains risk representation vectors of the company news story lines with a self-attention mechanism and an index-dimension/news mixed attention mechanism, and produces the company financial whitewash risk judgment through a fully connected classification layer. A news story line has a tree-shaped development structure, and this tree structure is defined as a story tree;
the company news story line characteristic representation module comprises a company news story line extraction submodule and a news story line quantity representation submodule;
the company news story line extraction submodule constructs a company news story line structure according to the similarity of company news in multiple aspects such as topics, entities, time and the like; the method specifically comprises the following steps:
Given company c's historical news set D^c = {d_1^c, d_2^c, ..., d_{|D^c|}^c}, where each element d_i^c is one historical news item, |D^c| is the total number of historical news items, and the superscript c denotes company c. A keyword graph is constructed from the co-occurrence relations of the news documents' keywords. Edges that contribute little to node connectivity are deleted from the keyword graph in order of edge betweenness centrality; the resulting keyword subgraphs are the topic keyword graphs.
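The topic-splitting step above can be sketched with a Girvan-Newman-style procedure: build a keyword co-occurrence graph, then repeatedly delete the edge with the highest betweenness until the graph splits into the desired number of topic subgraphs. This is an illustrative stand-in (pure Python, simplified one-path betweenness); the helper names `build_keyword_graph`, `edge_betweenness`, `split_topics` and the `min_cooccur` threshold are assumptions, not the patent's implementation.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_keyword_graph(doc_keywords, min_cooccur=1):
    """Edge between two keywords that co-occur in >= min_cooccur documents."""
    count = defaultdict(int)
    for kws in doc_keywords:
        for a, b in combinations(sorted(set(kws)), 2):
            count[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in count.items():
        if n >= min_cooccur:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def edge_betweenness(graph):
    """Simplified edge betweenness: count how often each edge lies on
    one BFS shortest path per (source, destination) pair."""
    score = defaultdict(int)
    for src in list(graph):
        pred, seen, q = {}, {src}, deque([src])
        while q:
            u = q.popleft()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    pred[v] = u
                    q.append(v)
        for dst in pred:              # walk the path back to src
            v = dst
            while v != src:
                u = pred[v]
                score[tuple(sorted((u, v)))] += 1
                v = u
    return score

def split_topics(graph, n_topics=2):
    """Delete highest-betweenness edges until n_topics components remain."""
    g = {u: set(vs) for u, vs in graph.items()}
    def components():
        seen, comps = set(), []
        for s in g:
            if s in seen:
                continue
            comp, q = {s}, deque([s])
            seen.add(s)
            while q:
                u = q.popleft()
                for v in g[u]:
                    if v not in seen:
                        seen.add(v)
                        comp.add(v)
                        q.append(v)
            comps.append(comp)
        return comps
    while len(components()) < n_topics:
        bc = edge_betweenness(g)
        if not bc:
            break
        a, b = max(bc, key=bc.get)
        g[a].discard(b)
        g[b].discard(a)
    return components()
```

For example, two keyword clusters bridged by a single co-occurrence edge split cleanly into two topic subgraphs, since the bridge edge carries the highest betweenness.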
News items are then divided into topics according to the similarity between topic keywords and news keywords, yielding company c's topic news sets T^c = {T_{t_1}^c, ..., T_{t_{|T^c|}}^c}, where each element is the news set of one topic, t_i denotes the i-th topic, |T^c| is the total number of topics, and the superscript c denotes company c.
A news document association graph is built for the news documents belonging to the same topic, extracting the publication time, publication place, and related entities of each document. Specifically, for news documents d_i^c and d_j^c in the same topic set, the comprehensive document similarity is computed from the place text similarity sim_loc, entity text similarity sim_ent, keyword similarity sim_key, and publication time similarity sim_time:

sim(d_i^c, d_j^c) = β_1·sim_loc + β_2·sim_ent + β_3·sim_key + β_4·sim_time, (1)

where the four similarities are computed with conventional methods (see, e.g., the study of Fan et al. [1]) and β_1, β_2, β_3, β_4 are self-defined similarity weights. Documents with high similarity are connected in the association graph. The Louvain community discovery algorithm [2] is then run on the document association graph to find the news document sets belonging to different stories: each element is the news set of one story, s_i denotes the i-th story, the set size is the total number of stories, and the superscript c denotes company c. For each company story news document set, a maximum spanning tree algorithm [3] is applied to its associated subgraph to obtain the tree-shaped development structure of the news story line; this tree structure is defined as the story tree.
The news story line vector representation submodule constructs the story embedding from the tree structure of the company story line, with the following rules. The event line with the longest time span in a story line is defined as the trunk event line; the rest are branch event lines. The initial vector of each node in the story line is the sentence vector of the corresponding news headline, and the vector representation of each branching node is defined as

e_i = h_i + α · Σ_{j=m_i}^{m_i+|B_i|−1} e_{b_j}, (2)

where B_i is the set of sub-branches of node v_i in the story line, m_i is the starting index of the sub-branches of v_i, e_{b_j} is the vector representation of branch b_j, α is the story line branch attenuation coefficient, and h_i is the headline sentence vector of news item i. The trunk event line is defined as the path between the news nodes with the largest difference in occurrence time in the story tree, denoted b_1. Based on formula (2), the vector representation e_i of each node on the trunk event line is obtained by recursive weighted combination, and the embedded representation of the whole news story tree is the mean of the node embeddings on the trunk event line:

e_s = (1/|b_1|) · Σ_{v_i ∈ b_1} e_i. (3)
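The recursive weighting of the story tree can be sketched as below. This is one plausible reading of the garbled equation (2) — node vector = headline vector plus α times the (recursively computed) branch vectors, with the story embedding as the trunk-node mean per equation (3); the dict-based tree format is an assumption for illustration.

```python
import numpy as np

def node_embedding(node, alpha=0.3):
    """Assumed reading of equation (2): fold each sub-branch into the
    node's headline vector, attenuated by the branch coefficient alpha."""
    e = np.asarray(node["title_vec"], dtype=float)
    for branch in node.get("branches", []):
        e = e + alpha * node_embedding(branch, alpha)
    return e

def story_embedding(trunk_nodes, alpha=0.3):
    """Equation (3): story vector = mean of the (branch-augmented)
    embeddings of the nodes on the trunk event line."""
    return np.mean([node_embedding(n, alpha) for n in trunk_nodes], axis=0)
```

For a two-node trunk where the second node carries one branch, the branch vector is scaled by α and added before averaging.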
the mixed attention classification module comprises a news story line self-attention submodule and a company index dimension-news story line cross-attention submodule; the news story line self-attention submodule leads the news story line of the company to a matrix formed by a quantitative sequenceInput self-attention network SATT [4] To obtain an updated company news story line feature representation vector: e.g. of a cylinder satt =SATT(S c ) Wherein n is sl Number of news story lines, d, representing company c semb Embedding dimensions for news story lines; the company index dimension-news story line cross attention submodule obtains story line attention distribution through a scaling dot product mode of a company index analysis dimension key word feature matrix and a news story line topic key word matrix, multiplies the attention distribution matrix and a story line quantity sequence, and obtains a company management dimension news representation vector through global averaging.
The specific process of the mixed attention classification module is as follows. From the perspective of company business analysis, the company indexes are divided into several dimensions a_1, ..., a_{n_asp}, where n_asp is the number of index dimensions. Each dimension corresponds to several dimension keywords; dimension a_k has keywords w_1^{a_k}, ..., w_{n_{a_k}}^{a_k} with word vectors v_1^{a_k}, ..., v_{n_{a_k}}^{a_k}, where n_{a_k} is the number of keywords of dimension a_k. The feature vector of dimension a_k is the mean of its keyword word vectors:

v_{a_k} = (1/n_{a_k}) · Σ_{j=1}^{n_{a_k}} v_j^{a_k}. (4)

The company index analysis dimension matrix is denoted A^c ∈ R^{n_asp × d_wemb}, where d_wemb is the word vector dimension and each row is a dimension feature vector v_{a_k} scaled by a weight g_{a_k}; for sample i, g_{a_k} is computed as the z-standardized mean of the company indexes under dimension a_k. During news story line construction, the topic t_p corresponding to each story line and its topic keywords are obtained; with keyword vectors v_1^{t_p}, ..., v_{n_{t_p}}^{t_p} (n_{t_p} the number of keywords of t_p), the topic vector of the news story line is the mean of the topic keyword word vectors:

p_{t_p} = (1/n_{t_p}) · Σ_{j=1}^{n_{t_p}} v_j^{t_p}. (5)
Given company c's news story line vector sequence S^c (defined above), news story line topic matrix P^c, and company index analysis dimension matrix A^c, with learned key projection matrices W_a, W_p of attention embedding dimension d_xatt:

First, the topic matrix P^c and the index analysis dimension matrix A^c are aligned by a scaled dot product:

f(A^c, P^c) = (A^c W_a)(P^c W_p)^T / √d_xatt. (6)

The similarities are mapped to attention weights with the softmax function, and the news story line feature matrix related to the financial indexes is obtained from these weights:

X_S = softmax(f(A^c, P^c)) S^c W_s, (7)

where W_s is a learned value projection matrix.
then obtaining a news cross attention feature vector e through global averaging xatt 。
The self-attention vector e_satt and the cross-attention vector e_xatt are each passed through a fully connected layer:

h_satt = W_satt · e_satt + b_satt,
h_xatt = W_xatt · e_xatt + b_xatt; (8)

where W_satt, W_xatt are the weight matrices and b_satt, b_xatt the bias vectors of the fully connected layers. Finally, the feature fusion layer concatenates the two vectors into the mixed story line representation:

h_matt = [h_satt, h_xatt]. (9)
The mixed representation is converted into a predictive probability distribution over the classes:

ŷ = softmax(W_o · h_matt + b_o), (10)

where W_o is the weight matrix and b_o the bias term of the output layer; this yields the judgment result of the company's financial whitewash risk.
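The classification head of equations (8)-(10) can be sketched end to end; the `params` dict holding the weight and bias arrays is a hypothetical container introduced for illustration.

```python
import numpy as np

def predict(e_satt, e_xatt, params):
    """Equations (8)-(10): project each attention vector through its own
    fully connected layer, concatenate, and map to class probabilities."""
    h_satt = params["W_satt"] @ e_satt + params["b_satt"]   # (8)
    h_xatt = params["W_xatt"] @ e_xatt + params["b_xatt"]
    h = np.concatenate([h_satt, h_xatt])                    # (9)
    z = params["W_o"] @ h + params["b_o"]                   # (10)
    z = z - z.max()                                         # stable softmax
    p = np.exp(z)
    return p / p.sum()
```

With a zero output layer the head returns the uniform distribution, which makes a convenient sanity check that the probabilities sum to one.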
The working process of the company financial whitewash identification system fusing news story line characteristics provided by the invention is as follows:
(I) the story line feature representation module takes the company news story lines as input and, combining the story line tree-shaped development structure, constructs the story line vector feature representation by recursing over the news headline sentence vectors;
(II) the mixed attention classification module takes the company news story line vector sequence as input, obtains the co-occurrence risk information of the company news stories through the self-attention submodule, obtains the story line vector features most relevant to the company operation analysis dimensions through the company index dimension-news story line cross-attention submodule, and, based on the fused features, produces the judgment of the company's financial whitewash risk through the fully connected layer.
The advantages of the invention include:
(1) Explicitly discovering association information among company news items reduces model complexity. Most existing multi-document news classification methods feed word or sentence vector sequences into the classification model, so the number of model parameters grows with the input sequence, and long input feature sequences require ample training data for good classification results. In the low-training-data scenario of company financial whitewash, using a vector sequence at news story line granularity as input alleviates under-fitting and improves the accuracy of the model's output risk index. Meanwhile, the story line structure also introduces the correlation information among news items;
(2) Company news risk signals are mined from multiple angles. News during a company's operating period covers items such as business, publicity, and supervision, and different news items carry different risk signals. Existing methods neither distinguish the risk indicators provided by the topics and contents of news, nor consider risk signals arising from news co-occurrence. In fact, news topics help the model judge the importance of related news, reducing the influence of negative news unrelated to whitewash risk on the judgment result. News co-occurrence can reflect hidden risk signals; for example, positive and negative reports appearing in the same period reduce the credibility of the company's public information. Designing a feature processing module that considers news topic information and captures co-occurrence signals in company news data is therefore key to improving the accuracy of the risk signal. The self-attention and cross-attention mechanisms can extract this information efficiently by assigning attention weights.
Drawings
Fig. 1 is a framework diagram of the present invention.
FIG. 2 is a diagram of a hybrid attention classification model framework.
Detailed Description
As noted in the Background, existing financial whitewash identification models based on text classification neither fully capture the complex interactions between news texts nor sufficiently separate noise signals from risk signals. The inventors studied the problem and attribute it to two causes: first, most existing methods use sentiment statistics or text sequences as classification input, so with long input sequences and associated events scattered across them, the model can hardly learn the complex interactions among news items from little training data; second, models that take serialized news text as input are relatively weak at separating important from unimportant information when the volume of noisy news is large.
To address these problems, the invention provides a financial whitewash identification system fusing company news story line characteristics. The method constructs a news story line vector representation from the company news story line structure, then obtains mixed attention features of the company news text through self-attention and cross-attention mechanisms, and feeds the mixed attention features into a fully connected layer to obtain the company's financial whitewash risk judgment.
The invention is further illustrated below with reference to the figures and examples.
As shown in fig. 1, the financial whitewash identification system fusing company news story line characteristics provided by this embodiment comprises: a company news story line feature representation module 01 and a mixed attention classification module 02. Module 01 obtains the company news story line list with a multilayer clustering method and, combining the story line tree-shaped development structure, generates the story line embedding sequence as the input of module 02. Module 02 acquires the co-occurrence information and the company operation information of the news story lines with a self-attention mechanism and a company index dimension-news cross-attention mechanism, generates the mixed attention feature vector of the news story lines, and outputs the company financial risk judgment through a fully connected layer.
In this embodiment, the data sets used are manually collected corporate fraud data sets, which are divided into first year fraud data sets and random year fraud data sets according to the selected corporate fraud year. 80% of the data in both data sets were randomly selected as training sets, and the remaining 20% were selected as test sets. The companies involved in the training set and the testing set in the two data sets are the same.
In this embodiment, the news location similarity weight β_1 is 0.2, the news entity similarity weight β_2 is 0.2, the news keyword similarity weight β_3 is 0.25, the news time similarity weight β_4 is 0.2, and the story line branch attenuation coefficient α is 0.3. The company news story line feature representation module comprises a company news story line extraction submodule and a news story line vector representation submodule; the extraction submodule extracts several company news story line structures from the company news set as follows. Given company c's historical news set D^c = {d_1^c, ..., d_{|D^c|}^c}, where |D^c| is the total number of historical news items and the superscript c denotes company c, the keywords of each news item are extracted and used as graph nodes, and edges are created between nodes whose co-occurrence count exceeds a threshold, forming the keyword graph. Edges that contribute little to node connectivity are deleted in order of edge betweenness centrality, and the resulting keyword subgraphs are the topic keyword graphs. News items are divided into topics by the similarity between topic keywords and news keywords, yielding the topic news sets T^c = {T_{t_1}^c, ...}, where t_i denotes the i-th topic. For the news documents of the same topic, a document association graph is built, extracting each document's publication time, publication place, and related entities; for documents d_i^c, d_j^c in the same topic set, the comprehensive similarity is computed from the place text similarity, entity text similarity, keyword similarity, and publication time similarity as in formula (1), where the four similarity calculations follow Fan et al. [1] and β_1, ..., β_4 are the self-defined similarity weights, and documents with high similarity are connected in the association graph. The Louvain community discovery algorithm [2] finds the news document sets belonging to different stories, and for each company story news document set, the maximum spanning tree algorithm [3] on its associated subgraph yields the tree-shaped development structure of the news story line. From the tree structure of a company story line, the story embedding is constructed as follows: the event line with the longest time span is the trunk event line and the rest are branch event lines; the initial vector of each node is the corresponding news headline sentence vector, and the vector of each branching node is given by formula (2), where B_i is the sub-branch set of node v_i, m_i the starting index of its sub-branches, e_{b_j} the vector representation of branch b_j, α the branch attenuation coefficient, and h_i the headline sentence vector of news i. The trunk event line is the path between the news nodes with the largest occurrence time difference in the story tree, denoted b_1; the vector of each node on it is obtained by recursive weighted combination, and the mean of these node vectors is the embedded representation of the whole story line, as in formula (3).
In this embodiment, the mixed attention classification module comprises the news story line self-attention submodule and the company index dimension-news story line cross-attention submodule of the company news story line feature representation module. The self-attention submodule feeds the matrix S_c, formed from the company's news story line vector sequence, into the self-attention network SATT [4]; after global averaging and a fully connected layer, the updated company news story line feature representation vector e_satt is obtained. The cross-attention submodule obtains the story line attention distribution through a scaled dot product between the company index analysis dimension keyword feature matrix and the news story line topic keyword matrix, multiplies the attention distribution matrix by the story line vector sequence, and obtains the company business-dimension news representation vector through global averaging. The steps are as follows: given company c's news story line vector sequence S_c, news story line topic matrix P_c, and company index analysis dimension matrix:
where ā_i is the weight of the feature vector of dimension a_i, computed as the mean of company c's z-standardized indices under dimension i; v_ai is the feature vector of dimension a_i, obtained as the mean of the word vectors of the keywords related to the indices under that dimension; and n_asp is the number of index dimensions. First, the cross-attention distribution is computed with a scaled dot product (scaling factor √d_wemb) and the softmax function, and the story line cross-attention weighting matrix is obtained by multiplying the attention distribution with the corresponding elements of the story line vector sequence. Second, the news story line cross-attention vector e_xatt is obtained from the weighting matrix by global averaging. Finally, the self-attention and cross-attention story line feature vectors are fused by concatenation and fed into a fully connected neural network layer to obtain the judgment result of the company's financial whitewashing risk.
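A minimal numpy sketch of the cross-attention step described above, with toy matrix sizes, random inputs, and the value projection W_s omitted for brevity (all names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sl, n_asp, d_wemb, d_semb = 4, 3, 8, 8      # toy sizes (assumed)
S = rng.normal(size=(n_sl, d_semb))           # story line vector sequence S_c
P = rng.normal(size=(n_sl, d_wemb))           # story line topic matrix P_c
A = rng.normal(size=(n_asp, d_wemb))          # index analysis dimension matrix A_c

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot product between dimension and topic keyword matrices -> attention.
attn = softmax(A @ P.T / np.sqrt(d_wemb))     # shape (n_asp, n_sl)
X = attn @ S                                  # attention-weighted story lines
e_xatt = X.mean(axis=0)                       # global average -> cross vector

print(e_xatt.shape)  # (8,)
```

Each row of `attn` sums to one, so each business dimension distributes its attention over the company's story lines before the global average pools them into a single vector.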
In this embodiment, the story line embedding dimension is 1024 and the word vector dimension is 300. The self-attention and cross-attention embedding dimensions are both set to 250; the self-attention fully connected layer dimension is set to 30 and the cross-attention fully connected layer dimension to 20. For both data sets, the batch size is set to 16 and Adam is used as the model optimizer.
To test the financial whitewashing identification system fusing company news story line features, this example uses a manually collected company fraud data set covering 42 companies penalized by the regulatory agency for financial whitewashing between 2007 and 2017, together with 42 matched negative-sample companies. The first-year fraud data set contains 8630 news items and the random-year fraud data set contains 9040 news items. The effectiveness and advantages of the system are verified from multiple angles through designed comparison experiments and comparisons with baselines. The baselines include Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), XGBoost (XGB), Support Vector Machine (SVM), and fully connected neural network (MLP) models based on company index features, news sentiment features, and combined company index-news sentiment features, as well as a convolutional neural network model based on news headline sentence vectors (S-CNN), a neural network model based on company news TF-IDF bag-of-words features (BW-NN), a self-attention model based on news story line embeddings (SANN), and a cross-attention model (XANN). Each reported result is the average of five training runs, with accuracy, recall, F1 score, and AUC score as the metrics. Compared with the existing baseline models, the mixed attention model (MANN) based on news story line embedding features outperforms all baselines on every metric in the random-year data set; the first-year data set results are shown in Table 1.
TABLE 1 Performance comparison of different baseline models and News storyline MANN models in a first year dataset
TABLE 2 Performance comparison of different baseline models and News storyline MANN model in random annual data set
To verify the actual contributions of the news self-attention module and the company index-news cross-attention module to the model's final results, we constructed two variants of MANN, SANN and XANN. SANN keeps the self-attention module and removes the company index-news cross-attention module, testing the effect of the self-attention mechanism alone; XANN keeps the cross-attention module and uses only it for news feature extraction. CNN and DNN serve as baseline models: CNN extracts news story line features with convolutional layers, and DNN extracts news story line features with global pooling. The comparison results in Table 3 show that both the self-attention module and the cross-attention module improve the experimental results.
TABLE 3 Mixed attention model Module contribution validation
In conclusion, the invention analyzes the characteristics of company news story lines and proposes a novel financial whitewashing identification system fusing these features: it extracts the story lines of news within the company's reporting period, embeds the story line tree structure to obtain company news story line features, and extracts the important features with self-attention and cross-attention mechanisms to judge whitewashing risk. The accuracy of the resulting whitewashing identification on random-year fraud data is higher than that of existing methods.
Although the present invention has been described in connection with preferred embodiments, it is not intended to be limited thereto. Any person skilled in the art may make variations and modifications using the methods and technical content disclosed above without departing from the spirit and scope of the invention; therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of the technical solution of the invention, falls within the protection scope of the technical solution of the invention.
References
[1] Fan Xiaobing, et al. Named-entity-sensitive hierarchical news storyline generation method [J]. Journal of Chinese Information Processing, 2021, 35(01): 113-124. (in Chinese)
[2]Blondel V D,Guillaume J L,Lambiotte R,et al.Fast unfolding of communities in large networks[J].Journal of statistical mechanics:theory and experiment,2008,2008(10):P10008.
[3] Zhou Wei, et al. Chinese dependency parsing combining maximum spanning tree and decision algorithms [J]. Journal of Chinese Information Processing, 2012, 26(03): 16-21. (in Chinese)
[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [J]. Advances in Neural Information Processing Systems, 2017, 30.
Claims (5)
1. A financial whitewashing identification system fusing company news story line features, characterized by comprising a company news story line feature representation module and a mixed attention classification module; the company news story line feature representation module extracts the associated news sets within the company's reporting period as news story lines, converts them into story line vector sequences, and outputs them to the mixed attention classification module; the mixed attention classification module obtains a risk representation vector of the company news story lines using a self-attention mechanism and an index dimension-news mixed attention mechanism, and obtains the company financial whitewashing risk judgment result through a fully connected classification layer; a news story line is a tree-shaped development structure, and the tree structure of a story line is defined as a story tree.
2. The financial whitewashing identification system as claimed in claim 1, wherein the company news story line feature representation module comprises a company news story line extraction submodule and a news story line vector representation submodule; wherein:
the company news story line extraction submodule constructs the company news story line structure according to the similarity of company news in topic, entity, and time; specifically:
given company c's historical newsletter:where each component represents historical news, | D c I represents the total number of the historical news, and the superscript c represents the corresponding company c; constructing a keyword graph according to the co-occurrence relation of the keywords of the news document; sequentially deleting the edges which contribute less to the connectivity of the nodes in the graph according to the betweenness centrality of the edges in the keyword graph, wherein the divided keyword subgraphs are topic keyword graphs;
news is divided into topics according to the similarity between topic keywords and news keywords, yielding company c's topic news sets, where each component represents a topic news set, t_i denotes the i-th topic, the count gives the total number of topic news sets, and the superscript c denotes company c;
a news document association graph is built for news documents belonging to the same topic, and the publication time, publication location, and related entities of each news document are extracted; specifically, for news documents d_i^c and d_j^c belonging to the same topic news set, the comprehensive document similarity is computed from the location text similarity sim_loc, the entity text similarity sim_ent, the keyword similarity sim_kw, and the publication time similarity sim_time:
sim(d_i^c, d_j^c) = β_1·sim_loc(d_i^c, d_j^c) + β_2·sim_ent(d_i^c, d_j^c) + β_3·sim_kw(d_i^c, d_j^c) + β_4·sim_time(d_i^c, d_j^c), (1)
where β_1, β_2, β_3, β_4 are user-defined similarity weights, and documents with high similarity are connected in the association graph; multiple news document sets belonging to different stories are discovered in the document association graph by the Louvain community discovery algorithm, where each component represents a story news set, s_i denotes the i-th story, and the superscript c denotes company c; for each company story news document set, the tree-shaped development structure of the news story line is obtained in its association subgraph by the maximum spanning tree algorithm, and the tree structure of the story line is defined as the story tree;
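The maximum spanning tree over a story's association subgraph can be sketched with Kruskal's algorithm run in descending weight order; the news identifiers and similarity weights below are hypothetical:

```python
# Maximum spanning tree over a story's association subgraph; edge weights
# are the composite document similarities (hypothetical toy values).
edges = [  # (similarity, news_i, news_j)
    (0.9, "n1", "n2"), (0.8, "n2", "n3"), (0.3, "n1", "n3"),
    (0.7, "n3", "n4"), (0.2, "n2", "n4"),
]

parent = {}
def find(x):
    """Union-find root lookup with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

tree = []
for w, u, v in sorted(edges, reverse=True):  # heaviest edges first
    ru, rv = find(u), find(v)
    if ru != rv:                             # keep the edge if it joins two parts
        parent[ru] = rv
        tree.append((u, v, w))

print(tree)  # [('n1', 'n2', 0.9), ('n2', 'n3', 0.8), ('n3', 'n4', 0.7)]
```

The resulting tree keeps the strongest similarity links between news documents, which is the tree-shaped development structure (story tree) the claim describes.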
the news story line quantity representation submodule constructs story embedding representation according to a tree structure of a company story line, and the specific construction rule is as follows: defining an event line with the longest time span in a story line as a main event line, taking the rest as branch event lines, defining an initial node vector of each node in the story line as a corresponding news headline vector, and defining the vector representation of the initial node of each branch event line as:
wherein,for branching nodes v in story lines i Corresponding set of sub-branches, m i Is a node v i The starting index of the corresponding sub-branch,is a branchIs expressed by a vector, alpha is a branch attenuation coefficient of a story line, h i A heading sentence vector representing news i; defining a main event line as a path between news nodes with the largest occurrence time difference in the story tree, and marking as b 1 Based on equation (2), vector representation of each node in the main event line of the story line can be obtained by recursively weighted combination calculation, and then embedding representation of the news story tree is the mean value of the embedded representation of each node on the main event line:
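A minimal numeric sketch of the story-tree embedding, assuming one plausible reading of equation (2) (the full formula image is not reproduced in the text): a node's representation is its headline vector plus α times the sum of its sub-branch representations, and the story embedding is the mean over the main event line. The tree, vectors, and main event line below are toy values:

```python
import numpy as np

ALPHA = 0.5     # branch attenuation coefficient (toy value)
DIM = 4

tree = {        # node -> list of child branch nodes (toy story tree)
    "root": ["a", "b"],
    "a": ["c"], "b": [], "c": [],
}
# Toy headline sentence vectors h_i, one constant vector per node.
h = {n: np.full(DIM, float(i + 1)) for i, n in enumerate(tree)}

def branch_repr(node):
    """Recursively combine a node's headline vector with its sub-branches."""
    out = h[node].copy()
    for k in tree[node]:
        out = out + ALPHA * branch_repr(k)
    return out

# Main event line (longest time-span path) assumed to be root -> a -> c here;
# the story embedding is the mean of the main-line node representations.
main_line = ["root", "a", "c"]
story_emb = np.mean([branch_repr(n) for n in main_line], axis=0)
print(story_emb)
```

This is a sketch of the recursion's shape only; the exact weighting in the patented equation (2) may combine the terms differently.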
3. The financial whitewashing identification system as claimed in claim 2, wherein the mixed attention classification module comprises a news story line self-attention submodule and a company index dimension-news story line cross-attention submodule; the news story line self-attention submodule feeds the matrix S^c, formed from the company's news story line vector sequence, into the self-attention network SATT, obtaining the updated company news story line feature representation vector e_satt = SATT(S^c), where n_sl is the number of news story lines of company c and d_semb is the news story line embedding dimension; the company index dimension-news story line cross-attention submodule obtains the story line attention distribution through a scaled dot product between the company index analysis dimension keyword feature matrix and the news story line topic keyword matrix, multiplies the attention distribution matrix by the story line vector sequence, and obtains the company business-dimension news representation vector through global averaging.
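The self-attention network SATT can be sketched as single-head scaled dot-product self-attention followed by global averaging; the matrix sizes and random inputs below are illustrative assumptions, not the patented configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sl, d_semb, d_att = 3, 6, 4                 # toy sizes (assumed)
S = rng.normal(size=(n_sl, d_semb))           # story line vector sequence S^c
Wq, Wk, Wv = (rng.normal(size=(d_semb, d_att)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = S @ Wq, S @ Wk, S @ Wv
attn = softmax(Q @ K.T / np.sqrt(d_att))      # story-to-story attention weights
e_satt = (attn @ V).mean(axis=0)              # global average -> e_satt

print(e_satt.shape)  # (4,)
```

Each story line attends over all of the company's story lines, so co-occurring risk signals across stories are pooled into the single vector e_satt.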
4. The financial whitewashing identification system as claimed in claim 3, wherein the specific processing flow of the mixed attention classification module is:
from the perspective of company business analysis, the company's indices are divided into several dimensions, denoted A = {a_1, …, a_n_asp}, where n_asp is the number of index dimensions; each dimension corresponds to several dimension keywords; let the keywords of dimension a_k be w_1^(a_k), …, w_(n_k)^(a_k) with their corresponding word vectors, where n_k is the number of keywords of dimension a_k, and compute the financial dimension feature as the mean of the keyword word vectors:
the company index analysis dimension matrix is denoted A^c, where d_wemb is the word vector dimension and v_(a_k) is the feature vector of dimension a_k; for the weighting, let sample i have n_(a_k) company indices under dimension a_k; the weight ā_k is computed as the mean of the z-standardized index values; in the construction of a news story line, the topic t_p corresponding to the story line and its topic keywords are obtained, with n_(t_p) the number of keywords of t_p and their corresponding keyword vectors; the topic vector of the news story line is computed as the mean of the topic keyword word vectors, giving the representation of topic t_p:
news storyline volume sequence S for a given company c c News story line topic matrix P c And company index analysis dimension matrix A c (ii) a Key matrix S c W s ,d xatt Embed dimensions for attention;
first, a topic matrix P is calculated by scaling the dot product as an alignment function c And index analysis dimension matrix A c Similarity of (2):
and mapping the similarity to attention weight by utilizing a softmax function, and obtaining a news story line characteristic matrix related to the financial index based on the similarity weight:
X S =softmax(f(A c ,P c ))S c W s , (7)
then global averaging is carried out to obtain a feature vector e of cross attention of news xatt ;
the self-attention vector e_satt and the cross-attention vector e_xatt are each passed through a fully connected layer:
h_satt = W_satt e_satt + b_satt,
h_xatt = W_xatt e_xatt + b_xatt; (8)
where W_satt and W_xatt are the weight matrices of the fully connected layers, and b_satt and b_xatt are their bias vectors;
finally, splicing the two vectors of the characteristic fusion layer to obtain story line mixed expression:
h matt =[h satt ,h xatt ], (9)
converting mixed representations into predictive probability distributions belonging to different classes
Wherein, W o As a weight matrix of the output layer, b o And obtaining the judgment result of the financial charting risk of the company as a bias item.
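The classification head of equations (8) and (9) plus the output softmax can be sketched end to end in numpy; all layer sizes and random weights below are illustrative assumptions, not the patent's settings:

```python
import numpy as np

rng = np.random.default_rng(2)
d_satt_in, d_xatt_in = 6, 6      # toy input sizes for e_satt / e_xatt
d_satt, d_xatt, n_cls = 5, 4, 2  # FC sizes are illustrative only

e_satt = rng.normal(size=d_satt_in)   # from the self-attention submodule
e_xatt = rng.normal(size=d_xatt_in)   # from the cross-attention submodule

# Eq. (8): one fully connected layer per attention branch.
W_satt, b_satt = rng.normal(size=(d_satt, d_satt_in)), np.zeros(d_satt)
W_xatt, b_xatt = rng.normal(size=(d_xatt, d_xatt_in)), np.zeros(d_xatt)
h_satt = W_satt @ e_satt + b_satt
h_xatt = W_xatt @ e_xatt + b_xatt

# Eq. (9): concatenation, then output layer + softmax -> risk probabilities.
h_matt = np.concatenate([h_satt, h_xatt])
W_o, b_o = rng.normal(size=(n_cls, d_satt + d_xatt)), np.zeros(n_cls)
logits = W_o @ h_matt + b_o
y_hat = np.exp(logits - logits.max())
y_hat /= y_hat.sum()                  # probability of whitewashing vs. not

print(y_hat.shape)  # (2,)
```

The two-class probability vector `y_hat` corresponds to the financial whitewashing risk judgment produced by the fully connected classification layer.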
5. The financial whitewashing identification system as claimed in claim 4, wherein the workflow is:
(I) the story line feature representation module takes company news story lines as input and, combining the tree-shaped development structure of the story lines, recursively constructs the story line vector feature representations;
(II) the mixed attention classification module takes the company news story line vector sequence as input, obtains the co-occurrence risk information of company news stories through the self-attention module, obtains the story line vector features most correlated with the company business analysis dimensions through the company index dimension-news story line cross-attention module, and, based on the fused features, obtains the judgment result of the company financial whitewashing risk through the fully connected layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211133361.4A CN115544238A (en) | 2022-09-17 | 2022-09-17 | Financial affairs whitewash identification system fusing company news story line characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211133361.4A CN115544238A (en) | 2022-09-17 | 2022-09-17 | Financial affairs whitewash identification system fusing company news story line characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115544238A true CN115544238A (en) | 2022-12-30 |
Family
ID=84728579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211133361.4A Pending CN115544238A (en) | 2022-09-17 | 2022-09-17 | Financial affairs whitewash identification system fusing company news story line characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115544238A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402630A (en) * | 2023-06-09 | 2023-07-07 | 深圳市迪博企业风险管理技术有限公司 | Financial risk prediction method and system based on characterization learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402630A (en) * | 2023-06-09 | 2023-07-07 | 深圳市迪博企业风险管理技术有限公司 | Financial risk prediction method and system based on characterization learning |
CN116402630B (en) * | 2023-06-09 | 2023-09-22 | 深圳市迪博企业风险管理技术有限公司 | Financial risk prediction method and system based on characterization learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chang et al. | Social media analytics: Extracting and visualizing Hilton hotel ratings and reviews from TripAdvisor | |
Li et al. | Document representation and feature combination for deceptive spam review detection | |
Zhao et al. | Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder | |
Chan et al. | A text-based decision support system for financial sequence prediction | |
Liu et al. | Detection of spam reviews through a hierarchical attention architecture with N-gram CNN and Bi-LSTM | |
CN109543034B (en) | Text clustering method and device based on knowledge graph and readable storage medium | |
Mewada et al. | Research on false review detection methods: A state-of-the-art review | |
KR20200007713A (en) | Method and Apparatus for determining a topic based on sentiment analysis | |
CN110287314B (en) | Long text reliability assessment method and system based on unsupervised clustering | |
CN114254201A (en) | Recommendation method for science and technology project review experts | |
Huang et al. | Identification of topic evolution: network analytics with piecewise linear representation and word embedding | |
Archchitha et al. | Opinion spam detection in online reviews using neural networks | |
CN115544238A (en) | Financial affairs whitewash identification system fusing company news story line characteristics | |
CN107609921A (en) | A kind of data processing method and server | |
JP5933863B1 (en) | Data analysis system, control method, control program, and recording medium | |
Zeng et al. | A framework for WWW user activity analysis based on user interest | |
Invernici et al. | Exploring the evolution of research topics during the COVID-19 pandemic | |
Zishumba | Sentiment Analysis Based on Social Media Data | |
Anastasopoulos et al. | Computational text analysis for public management research: An annotated application to county budgets | |
CN113869038A (en) | Attention point similarity analysis method for Baidu stick bar based on feature word analysis | |
Toraman | Early prediction of public reactions to news events using microblogs | |
Chen et al. | Towards accurate search for e-commerce in steel industry: a knowledge-graph-based approach | |
Hosaka et al. | An analytical model of website relationships based on browsing history embedding considerations of page transitions | |
Li et al. | Sensitivity of abacus and Chasdaq in the Chinese stock market through analysis of Weibo sentiment related to Corona-19 | |
Mirasdar et al. | Graph of Words Model for Natural Language Processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |