CN115544238A - Financial whitewashing identification system fusing company news story line features

Info

Publication number
CN115544238A
CN115544238A CN202211133361.4A CN202211133361A CN115544238A CN 115544238 A CN115544238 A CN 115544238A CN 202211133361 A CN202211133361 A CN 202211133361A CN 115544238 A CN115544238 A CN 115544238A
Authority
CN
China
Prior art keywords
news
company
story line
attention
story
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211133361.4A
Other languages
Chinese (zh)
Inventor
张涛
罗震
张玥杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Finance and Economics
Original Assignee
Shanghai University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Finance and Economics
Priority to CN202211133361.4A
Publication of CN115544238A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of financial whitewashing identification, and in particular relates to a financial whitewashing identification system that fuses company news story line features. The system comprises a company news story line feature representation module and a mixed attention classification module. The feature representation module obtains company news story lines by a multilayer clustering method, and obtains a vector representation of each news story line by recursively weighting and summing news headline sentence vectors according to the tree structure of the story line. The mixed attention classification module obtains different representations of the company news story lines through a self-attention mechanism and a company indicator dimension-news story line cross-attention mechanism, concatenates and fuses them into a company news feature vector, and from it derives the company's financial whitewashing risk judgment. By discovering the association information among company news items, the invention reduces model complexity and improves the accuracy of the risk indicator output by the model; by mining company news risk signals from multiple angles, it reduces the influence of negative news unrelated to whitewashing risk on the judgment.

Description

Financial whitewashing identification system fusing company news story line features
Technical Field
The invention belongs to the technical field of financial whitewashing identification, and in particular relates to a financial whitewashing identification system that fuses company news story line features.
Background
The financial whitewashing identification system fusing company news story line features is a text classification method in the field of financial whitewashing identification. The task is defined as extracting a company's financial whitewashing risk features from external news reports or annual report text within the company's operating period, and judging the company's whitewashing risk from those features. Its performance depends heavily on the choice of company text sources and on the feature representation. Traditional whitewashing identification methods based on sentiment features reflect external evaluations of the company's condition, or the attitude of its senior management, through highly generalized text sentiment indicators. Such methods can supplement company indicators and assist in identifying financial whitewashing. For company news features, however, a listed company generates a large number of reports from media supervision and from its own disclosures during its operation, and highly generalized sentiment indicators can hardly capture the complex risk information reflected in such a large volume of news.
Recently, with the development of deep learning, text embedding features have been applied to many natural language processing tasks with good results. Text embeddings learn the internal organization of a text and its relation to context, and can therefore effectively reflect the grammatical and semantic information of words and sentences. Compared with sentiment statistics, embedded text representations can capture more complex interactions between texts. In applications with little training data, pre-trained text embeddings based on a general corpus can also improve the application model by introducing implicit text information. Models based on text embedding features mostly adopt the following framework: word embeddings are first trained on a general corpus, the text vectors are then used as input to the downstream task, and the pre-trained vectors are either fine-tuned on the corpus of the application task or transferred directly. For tasks with sufficient training data, directly modeling the input sequence of word or sentence embeddings achieves good results. Financial whitewashing identification based on company news, however, is a special application: few training samples are available, and a single training sample corresponds to a large amount of input text. In such a multi-document joint classification task, a complex model that uses words or sentences directly as its basic input units is prone to under-fitting and struggles to learn the association between text features and the target. Therefore, capturing the interactions among a large number of company news items in advance, and thereby reducing the number of features the model has to learn, is an important factor in improving whitewashing identification.
In addition, news within a company's operating period covers many matters such as publicity, supervision and management. Most of it is noise text unrelated to the company's whitewashing risk, and this noise greatly affects both the efficiency and the quality of the risk identification model's judgments. Reducing the influence of noise is therefore another important factor in improving identification. An attention mechanism can focus on the important information in noisy data, but for each application the attention module still has to be designed so that it attends to the target information and captures features that are effective for the problem.
In summary, although existing work has made some progress on financial whitewashing identification based on company-related text, news text features are not yet fully exploited, and the identification performance needs further improvement.
Disclosure of Invention
The invention aims to provide a financial whitewashing identification system fusing company news story line features, so as to solve the problem that existing financial whitewashing identification methods make insufficient use of external news features.
The financial whitewashing identification system fusing company news story line features provided by the invention comprises a company news story line feature representation module and a mixed attention classification module. The company news story line feature representation module uses news story lines to extract the sets of associated news within a company's reporting period, converts them into a story line vector sequence, and outputs the sequence to the mixed attention classification module. The mixed attention classification module obtains a risk representation vector of the company news story lines through a self-attention mechanism and an indicator dimension-news mixed attention mechanism, and obtains the company's financial whitewashing risk judgment through a fully connected classification layer. A news story line has a tree-shaped development structure, and this tree structure is defined as a story tree.
The company news story line feature representation module comprises a company news story line extraction submodule and a news story line vector representation submodule.
The company news story line extraction submodule constructs the company news story line structure according to the similarity of company news items in several respects, such as topic, entities and time. Specifically:
Given the historical news set of company c:
$$D^c = \{d^c_1, d^c_2, \ldots, d^c_{|D^c|}\}$$
where each component is a historical news item, $|D^c|$ is the total number of historical news items, and the superscript c denotes company c. A keyword graph is constructed from the co-occurrence relations of the keywords of the news documents. Edges that contribute little to the connectivity of the nodes are deleted from the keyword graph in turn according to their betweenness centrality, and the resulting keyword subgraphs are the topic keyword graphs.
News items are assigned to topics according to the similarity between the topic keywords and the news keywords, giving the topic news sets of company c:
$$\{D^c_{t_1}, D^c_{t_2}, \ldots, D^c_{t_{n^c_t}}\}$$
where each component is a topic news set, $t_i$ denotes the i-th topic, $n^c_t$ is the total number of topic news sets, and the superscript c denotes company c.
A news document association graph is built for the news documents belonging to the same topic, and the release time, release location and related entities of each news document are extracted. Specifically, for two news documents $d_j$ and $d_k$ in the same topic news set $D^c_{t_i}$, the comprehensive similarity is computed from the location text similarity $sim_{loc}$, the entity text similarity $sim_{ent}$, the keyword similarity $sim_{kw}$ and the publication time similarity $sim_{time}$:
$$sim(d_j, d_k) = \beta_1\, sim_{loc}(d_j, d_k) + \beta_2\, sim_{ent}(d_j, d_k) + \beta_3\, sim_{time}(d_j, d_k) + \beta_4\, sim_{kw}(d_j, d_k) \quad (1)$$
The four similarities are computed by conventional methods (see, for example, the study of Fan et al. [1]), and $\beta_1, \beta_2, \beta_3, \beta_4$ are user-defined similarity weights. Documents with high similarity are connected in the association graph. The Louvain community discovery algorithm [2] is then applied to the document association graph to find several sets of news documents belonging to different stories:
$$\{D^c_{s_1}, D^c_{s_2}, \ldots, D^c_{s_{n^c_s}}\}$$
where each component is a story news set, $s_i$ denotes the i-th story, $n^c_s$ is the total number of story news sets, and the superscript c denotes company c. For each story news document set $D^c_{s_i}$ of the company, the maximum spanning tree algorithm [3] is applied to its association subgraph to obtain the tree-shaped development structure of the news story line, which is defined as a story tree.
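The association-graph and story-tree steps above can be illustrated with a minimal sketch. It assumes Jaccard overlap for the location, entity and keyword similarities, a linearly decaying time similarity, and a hypothetical edge threshold of 0.3; none of these choices is specified in the source. The weights follow the values given in the embodiment. The Louvain step is omitted, so all sample documents are treated as one story.

```python
from itertools import combinations

# Weights from the embodiment: location, entity, time, keyword.
BETAS = {"loc": 0.2, "ent": 0.2, "time": 0.2, "kw": 0.25}

def jaccard(a, b):
    """Set-overlap similarity (an assumed stand-in for the cited methods)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def time_similarity(day1, day2, scale_days=30.0):
    """Assumed form: decays linearly to 0 once items are scale_days apart."""
    return max(0.0, 1.0 - abs(day1 - day2) / scale_days)

def combined_similarity(d1, d2, betas=BETAS):
    """Weighted sum of the four similarities between two news documents."""
    return (betas["loc"] * jaccard(d1["loc"], d2["loc"])
            + betas["ent"] * jaccard(d1["ent"], d2["ent"])
            + betas["time"] * time_similarity(d1["day"], d2["day"])
            + betas["kw"] * jaccard(d1["kw"], d2["kw"]))

def association_edges(docs, threshold=0.3):
    """Connect document pairs whose similarity reaches the (hypothetical) threshold."""
    edges = []
    for i, j in combinations(range(len(docs)), 2):
        s = combined_similarity(docs[i], docs[j])
        if s >= threshold:
            edges.append((s, i, j))
    return edges

def maximum_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm on edges sorted by descending weight."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for s, i, j in sorted(edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, s))
    return tree

# Hypothetical news items of a single story.
docs = [
    {"loc": ["Shanghai"], "ent": ["A Corp"], "kw": ["audit", "revenue"], "day": 0},
    {"loc": ["Shanghai"], "ent": ["A Corp"], "kw": ["audit", "inquiry"], "day": 3},
    {"loc": ["Shanghai"], "ent": ["A Corp", "regulator"], "kw": ["inquiry", "penalty"], "day": 10},
]
edges = association_edges(docs)
tree = maximum_spanning_tree(len(docs), edges)
```

In this toy example the tree keeps the two strongest links (item 0 to item 1, item 1 to item 2) and drops the weaker direct link between items 0 and 2, which is what gives the story its chain-like development structure.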
The news story line vector representation submodule constructs the story embedding according to the tree structure of the company story line. The construction rule is as follows: the event line with the longest time span in the story line is defined as the main event line, and the remaining lines are branch event lines; the initial vector of each node in the story line is the sentence vector of the corresponding news headline; and the vector representation of the start node of each branch event line is defined as:
Figure BDA0003850873290000035
where $B_i$ is the set of sub-branches of branching node $v_i$ in the story line, $m_i$ is the starting index of the sub-branches of node $v_i$, $e_{b_j}$ is the vector representation of branch $b_j$, $\alpha$ is the branch attenuation coefficient of the story line, and $h_i$ is the headline sentence vector of news item $i$. The main event line is defined as the path between the two news nodes with the largest difference in occurrence time in the story tree, and is denoted $b_1$. Based on formula (2), the vector representation $e_i$ of each node on the main event line is obtained by recursive weighted combination, and the embedding of the news story tree is the mean of the embeddings of the nodes on the main event line:
$$e_s = \frac{1}{|b_1|} \sum_{v_i \in b_1} e_i \quad (3)$$
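The recursive construction can be sketched as follows. The exact form of formula (2) is not recoverable from the text, so this sketch assumes one plausible reading: a node's vector is its headline vector plus α times the vector of each sub-branch that starts at it, and a branch vector is the mean of its node vectors. The data layout (`h`, `branches`) and the sample values are hypothetical.

```python
ALPHA = 0.3  # story line branch attenuation coefficient from the embodiment

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def vec_scale(a, k):
    return [x * k for x in a]

def vec_mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def node_vector(node, alpha=ALPHA):
    """Assumed reading of formula (2): headline vector plus attenuated sub-branches."""
    e = list(node["h"])
    for branch in node.get("branches", []):
        e = vec_add(e, vec_scale(branch_vector(branch, alpha), alpha))
    return e

def branch_vector(branch, alpha=ALPHA):
    """Mean of the node vectors along an event line, as in formula (3)."""
    return vec_mean([node_vector(n, alpha) for n in branch])

# Hypothetical story tree: a main event line of three news items, with one
# side branch of two items starting at the second item.
main_line = [
    {"h": [1.0, 0.0]},
    {"h": [0.0, 1.0], "branches": [[{"h": [1.0, 1.0]}, {"h": [3.0, 1.0]}]]},
    {"h": [2.0, 2.0]},
]
story_vector = branch_vector(main_line)
```

Because sub-branches are folded into their branching node before the main line is averaged, every story is reduced to one fixed-size vector regardless of how many news items it contains.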
The mixed attention classification module comprises a news story line self-attention submodule and a company indicator dimension-news story line cross-attention submodule. The self-attention submodule feeds the matrix formed by the company's news story line vector sequence, $S_c \in \mathbb{R}^{n_{sl} \times d_{semb}}$, into the self-attention network SATT [4] to obtain an updated company news story line feature vector, $e_{satt} = \mathrm{SATT}(S_c)$, where $n_{sl}$ is the number of news story lines of company c and $d_{semb}$ is the news story line embedding dimension. The cross-attention submodule obtains the story line attention distribution as the scaled dot product of the company indicator analysis dimension keyword feature matrix and the news story line topic keyword matrix, multiplies the attention distribution matrix by the story line vector sequence, and obtains the company business dimension news representation vector by global averaging.
The mixed attention classification module works as follows. From the perspective of business analysis, the company's indicators are divided into several dimensions, denoted $\{a_1, \ldots, a_{n_{asp}}\}$, where $n_{asp}$ is the number of indicator dimensions. Each dimension corresponds to several dimension keywords. Denote the keywords of dimension $a_k$ as $\{q^{a_k}_1, \ldots, q^{a_k}_{n_{a_k}}\}$ and their word vectors as $\{\mathbf{u}^{a_k}_1, \ldots, \mathbf{u}^{a_k}_{n_{a_k}}\}$, where $n_{a_k}$ is the number of keywords of dimension $a_k$. The financial dimension feature is the mean of the keyword vectors:
$$\mathbf{a}_k = \frac{1}{n_{a_k}} \sum_{j=1}^{n_{a_k}} \mathbf{u}^{a_k}_j \quad (4)$$
The company indicator analysis dimension matrix is denoted
$$A_c = [\,w_1 \mathbf{a}_1;\ \ldots;\ w_{n_{asp}} \mathbf{a}_{n_{asp}}\,] \in \mathbb{R}^{n_{asp} \times d_{wemb}}$$
where $d_{wemb}$ is the word vector dimension and $w_k$ is the weight of the feature vector of dimension $a_k$. Let the company indicators of sample $i$ under dimension $a_k$ be $\{x^{a_k}_{i,1}, \ldots, x^{a_k}_{i,n^x_{a_k}}\}$, where $n^x_{a_k}$ is the number of indicators of dimension $a_k$; $w_k$ is the mean of these indicators after z-standardization. During the construction of the news story lines, the topic $t_p$ of each story line and its topic keywords $\{q^{t_p}_1, \ldots, q^{t_p}_{n_{t_p}}\}$ are available, where $n_{t_p}$ is the number of keywords of $t_p$ and the keyword vectors are $\{\mathbf{u}^{t_p}_1, \ldots, \mathbf{u}^{t_p}_{n_{t_p}}\}$. The mean of the topic keyword vectors is taken as the topic vector of the news story line, giving the representation of topic $t_p$:
$$\mathbf{p}_{t_p} = \frac{1}{n_{t_p}} \sum_{j=1}^{n_{t_p}} \mathbf{u}^{t_p}_j \quad (5)$$
The news story line topic matrix of company c is therefore:
$$P_c = [\,\mathbf{p}_{t_1};\ \ldots;\ \mathbf{p}_{t_{n_{sl}}}\,] \in \mathbb{R}^{n_{sl} \times d_{wemb}}$$
where $n_{sl}$ is the number of company news story lines.
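As an illustration of how the two matrices could be assembled, the following sketch averages keyword vectors per dimension and per topic. It assumes that each row of the dimension matrix is the dimension vector scaled by its weight, and that per-indicator means and standard deviations across samples are supplied by the caller; the field names and all numbers are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dimension_weight(values, means, stds):
    """Mean of a company's indicators in one dimension after z-standardization."""
    z = [(v - m) / s for v, m, s in zip(values, means, stds)]
    return sum(z) / len(z)

def dimension_matrix(dimensions):
    """One row per indicator dimension: weight times mean keyword vector (assumed form)."""
    rows = []
    for dim in dimensions:
        a_k = mean_vector(dim["keyword_vectors"])
        w_k = dimension_weight(dim["values"], dim["means"], dim["stds"])
        rows.append([w_k * x for x in a_k])
    return rows

def topic_matrix(topic_keyword_vectors):
    """One row per story line: mean of its topic keyword vectors."""
    return [mean_vector(vs) for vs in topic_keyword_vectors]

# Hypothetical 2-d word vectors and indicator statistics.
dimensions = [
    {"keyword_vectors": [[1.0, 0.0], [0.0, 1.0]],
     "values": [2.0, 5.0], "means": [1.0, 3.0], "stds": [1.0, 2.0]},
    {"keyword_vectors": [[2.0, 2.0]],
     "values": [0.0], "means": [1.0], "stds": [2.0]},
]
A_c = dimension_matrix(dimensions)
P_c = topic_matrix([[[1.0, 0.0], [3.0, 2.0]], [[0.0, 4.0]]])
```

Both matrices live in the word vector space, which is what allows indicator dimensions and story line topics to be compared directly in the cross-attention step.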
Given the news story line vector sequence $S_c$ of company c (obtained as described above), the news story line topic matrix $P_c$ and the company indicator analysis dimension matrix $A_c$, with the key matrix
Figure BDA0003850873290000412
where $d_{xatt}$ is the attention embedding dimension:
First, the similarity between the topic matrix $P_c$ and the indicator analysis dimension matrix $A_c$ is computed with the scaled dot product as the alignment function:
Figure BDA0003850873290000413
The similarity is then mapped to attention weights with the softmax function, and the news story line feature matrix related to the financial indicators is obtained from these weights:
$$X_S = \mathrm{softmax}(f(A_c, P_c))\, S_c W_s \quad (7)$$
where $W_s \in \mathbb{R}^{d_{semb} \times d_{xatt}}$ and $d_{xatt}$ is the attention embedding dimension.
The news cross-attention feature vector $e_{xatt}$ is then obtained by global averaging.
The self-attention vector $e_{satt}$ and the cross-attention vector $e_{xatt}$ are each passed through a fully connected layer:
$$h_{satt} = W_{satt}\, e_{satt} + b_{satt}$$
$$h_{xatt} = W_{xatt}\, e_{xatt} + b_{xatt} \quad (8)$$
where $W_{satt}$ and $W_{xatt}$ are the weight matrices of the fully connected layers, and $b_{satt}$ and $b_{xatt}$ are their bias vectors.
Finally, the feature fusion layer concatenates the two vectors to obtain the mixed story line representation:
$$h_{matt} = [h_{satt}, h_{xatt}] \quad (9)$$
The mixed representation is converted into a predicted probability distribution $\hat{y}$ over the classes:
$$\hat{y} = \mathrm{softmax}(W_o\, h_{matt} + b_o) \quad (10)$$
where $W_o$ is the weight matrix of the output layer and $b_o$ is the bias term. This yields the company's financial whitewashing risk judgment.
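The cross-attention and fusion steps can be sketched in plain Python as below. This is an illustration under assumptions rather than the patented implementation: the alignment function is taken to be a scaled dot product with separate query and key projections (`W_q` and `W_k` are hypothetical names), the output layer is assumed to use softmax, and the self-attention vector is supplied directly instead of being computed by SATT.

```python
import math

def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(A, P, S, W_q, W_k, W_s):
    """Attend from indicator dimensions (A) to story line topics (P) over story vectors (S)."""
    d_xatt = len(W_q[0])
    scores = matmul(matmul(A, W_q), transpose(matmul(P, W_k)))
    attn = [softmax([x / math.sqrt(d_xatt) for x in row]) for row in scores]
    X = matmul(matmul(attn, S), W_s)               # n_asp x d_xatt
    return [sum(col) / len(X) for col in zip(*X)]  # global average over dimensions

def dense(W, x, b):
    """Fully connected layer: W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def classify(e_satt, e_xatt, params):
    h_satt = dense(params["W_satt"], e_satt, params["b_satt"])
    h_xatt = dense(params["W_xatt"], e_xatt, params["b_xatt"])
    h_matt = h_satt + h_xatt  # list concatenation, i.e. feature splicing
    return softmax(dense(params["W_o"], h_matt, params["b_o"]))

I2 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0]]              # two indicator dimensions
P = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # three story lines with identical topics
S = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # story line vectors
e_xatt = cross_attention(A, P, S, I2, I2, I2)

params = {"W_satt": I2, "b_satt": [0.0, 0.0],
          "W_xatt": I2, "b_xatt": [0.0, 0.0],
          "W_o": [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]], "b_o": [0.0, 0.0]}
probs = classify([0.5, -0.5], e_xatt, params)
```

With identical topic rows the attention is uniform, so `e_xatt` reduces to the plain mean of the story line vectors; differing topics would shift weight toward the story lines closest to each indicator dimension.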
The working process of the company financial whitewashing identification system fusing news story line features provided by the invention is as follows:
(I) The story line feature representation module takes the company news story lines as input and, following the tree-shaped development structure of each story line, constructs the story line vector representation by recursively combining news headline sentence vectors.
(II) The mixed attention classification module takes the company news story line vector sequence as input. It obtains the co-occurrence risk information of the company news stories through the self-attention module, obtains the story line vector features most relevant to the company's business analysis dimensions through the company indicator dimension-news story line cross-attention module, and, after feature fusion, obtains the company's financial whitewashing risk judgment through the fully connected layer.
The advantages of the invention include:
(1) The association information among company news items is discovered explicitly, which reduces model complexity. Most existing multi-document joint classification methods for news take sequences of word or sentence vectors as classifier input, so the number of model parameters grows with the length of the input sequence, and when the input feature sequence is long the model needs sufficient training data to classify well. For company financial whitewashing, where little training data is available, using a vector sequence at news story line granularity as input alleviates under-fitting and improves the accuracy of the risk indicator output by the model. The story line structure also introduces the association information among news items.
(2) Company news risk signals are mined from multiple angles. News within a company's operating period covers matters such as business operations, publicity and supervision, and different news items carry different risk signals. Existing methods do not distinguish the risk signals by news topic and content, and they also ignore the risk signals generated by news co-occurrence. In fact, the topic of a company news item helps the model judge the importance of the related news and reduces the influence of negative news unrelated to whitewashing risk on the judgment. News co-occurrence can reveal hidden risk signals; for example, positive and negative news appearing in the same period reduces the credibility of the company's public information. Designing a feature processing module that takes news topic information into account and captures co-occurrence signals in company news data is therefore the key to improving the accuracy of the risk signals. Self-attention and cross-attention mechanisms can extract this information efficiently by assigning attention weights.
Drawings
Fig. 1 is a framework diagram of the present invention.
FIG. 2 is a diagram of a hybrid attention classification model framework.
Detailed Description
As noted in the Background, existing financial whitewashing identification models based on text classification do not fully capture the complex interactions between news texts and do not adequately separate noise signals from risk signals. The inventors studied this problem and attribute it to two causes. First, most existing methods use sentiment statistics or text sequences as classifier input, so when the input sequence is long and associated events are scattered through it, the model can hardly learn the complex interactions among news items from little training data. Second, a model that takes serialized news text as input is relatively weak at distinguishing important from unimportant information when the volume of noise news is large.
To address these problems, the invention provides a financial whitewashing identification system fusing company news story line features. The system constructs news story line vector representations according to the structure of the company news story lines, and then obtains the mixed attention features of the company news text through a self-attention mechanism and a cross-attention mechanism. The mixed attention features are fed into a fully connected layer to obtain the company's financial whitewashing risk judgment.
The invention is further illustrated below with reference to the figures and examples.
As shown in Fig. 1, the financial whitewashing identification system fusing company news story line features provided by this embodiment comprises a company news story line feature representation module 01 and a mixed attention classification module 02. The feature representation module 01 obtains the list of company news story lines by a multilayer clustering method and, following the tree-shaped development structure of each story line, generates a sequence of story line embeddings as the input of the mixed attention classification module 02. The mixed attention classification module 02 obtains the co-occurrence information and the company business information of the news story lines through a self-attention mechanism and a company indicator dimension-news cross-attention mechanism, generates the mixed attention feature vector of the news story lines, and outputs the company's financial risk judgment through a fully connected layer.
In this embodiment, the data sets used are manually collected corporate fraud data sets, which are divided into first year fraud data sets and random year fraud data sets according to the selected corporate fraud year. 80% of the data in both data sets were randomly selected as training sets, and the remaining 20% were selected as test sets. The companies involved in the training set and the testing set in the two data sets are the same.
In this embodiment, the news similarity weights are set as follows: the news location similarity weight $\beta_1$ is 0.2, the news entity similarity weight $\beta_2$ is 0.2, the news time similarity weight $\beta_3$ is 0.2, and the news keyword similarity weight $\beta_4$ is 0.25; the story line branch attenuation coefficient $\alpha$ is 0.3. The company news story line feature representation module comprises a company news story line extraction submodule and a news story line vector representation submodule. The extraction submodule extracts several company news story line structures from the company news set, as follows. Given the historical news set of company c:
$$D^c = \{d^c_1, d^c_2, \ldots, d^c_{|D^c|}\}$$
where each component is a historical news item, $|D^c|$ is the total number of historical news items, and the superscript c denotes company c. The keywords of each news item are extracted and used as graph nodes, and an edge is created between nodes whose co-occurrence count exceeds a threshold, forming a keyword graph. Edges that contribute little to the connectivity of the nodes are deleted from the keyword graph in turn according to their betweenness centrality, and the resulting keyword subgraphs are the topic keyword graphs. News items are assigned to topics according to the similarity between the topic keywords and the news keywords, giving the topic news sets of company c:
$$\{D^c_{t_1}, D^c_{t_2}, \ldots, D^c_{t_{n^c_t}}\}$$
where each component is a topic news set, $t_i$ denotes the i-th topic, $n^c_t$ is the total number of topic news sets, and the superscript c denotes company c. A news document association graph is built for the news documents belonging to the same topic, and the release time, release location and related entities of each news document are extracted. For two news documents $d_j$ and $d_k$ in the same topic news set $D^c_{t_i}$, the comprehensive similarity is computed from the location text similarity $sim_{loc}$, the entity text similarity $sim_{ent}$, the keyword similarity $sim_{kw}$ and the publication time similarity $sim_{time}$:
$$sim(d_j, d_k) = \beta_1\, sim_{loc}(d_j, d_k) + \beta_2\, sim_{ent}(d_j, d_k) + \beta_3\, sim_{time}(d_j, d_k) + \beta_4\, sim_{kw}(d_j, d_k)$$
where the four similarities are computed as in the study of Fan et al. [1], and $\beta_1, \beta_2, \beta_3, \beta_4$ are user-defined similarity weights. Documents with high similarity are connected in the association graph. The Louvain community discovery algorithm [2] is applied to the document association graph to find several news document sets belonging to different stories:
$$\{D^c_{s_1}, D^c_{s_2}, \ldots, D^c_{s_{n^c_s}}\}$$
For each story news document set $D^c_{s_i}$ of the company, the maximum spanning tree algorithm [3] is applied to its association subgraph to obtain the tree-shaped development structure of the news story line. The story embedding is then constructed according to the tree structure of the company story line. The construction rule is as follows: the event line with the longest time span in the story line is defined as the main event line and the remaining lines as branch event lines; the initial vector of each node in the story line is the sentence vector of the corresponding news headline; and the vector representation of the start node of a branch event line is defined as:
Figure BDA0003850873290000076
where $B_i$ is the set of sub-branches of branching node $v_i$ in the story line, $m_i$ is the starting index of the sub-branches of node $v_i$, $e_{b_j}$ is the vector representation of branch $b_j$, $\alpha$ is the branch attenuation coefficient of the story line, and $h_i$ is the headline sentence vector of news item $i$. The main event line is defined as the path between the two news nodes with the largest difference in occurrence time in the story tree, and is denoted $b_1$. The vector representation of each node on the main event line is obtained by recursive weighted combination, and the mean of these node vectors,
$$e_s = \frac{1}{|b_1|} \sum_{v_i \in b_1} e_i$$
is taken as the embedding of the story line as a whole.
In this embodiment, the mixed attention classification module includes a news story line self-attention submodule and a company index dimension-news story line cross-attention submodule in the company news story line feature representation module, where the news story line self-attention submodule constructs a matrix S composed of a company news story line into a volume sequence c Input self-attention network SATT [4] Obtaining updated company news story line feature representation vector e through full connection layer after global averaging satt (ii) a The company index dimension-news story line cross attention submodule obtains story line attention distribution by scaling a company index analysis dimension key word feature matrix and a news story line topic key word matrix in a dot product mode, and obtains a story line attention distribution matrix and a storyMultiplying the linear quantity sequences and obtaining a company business dimension news characterization vector through global averaging; the method comprises the following steps: news storyline volume sequence S for a given company c c News story line topic matrix P c And a company index analysis dimension matrix:
Figure BDA00038508732900000711
where
Figure BDA00038508732900000712
is the weight of the feature vector of dimension $a_i$, calculated as the mean of the z-standardized indexes of company $c$ in that dimension;
Figure BDA00038508732900000713
is the feature vector of dimension $a_i$, calculated as the mean of the keyword vectors of the indexes under that dimension; and $n_{asp}$ is the number of index dimensions. First, the cross-attention distribution is calculated using a scaled dot product and a softmax function:
Figure BDA0003850873290000081
where $d_{wemb}$ is the word vector dimension, and the story line cross-attention weighted matrix is obtained by multiplying the attention distribution by the story line vector sequence. Second, the news story line cross-attention vector $e_{xatt}$ is obtained from the story line cross-attention weighted matrix by global averaging. Finally, the self-attention story line feature vector and the cross-attention story line feature vector are fused by concatenation and used as input to a fully connected neural network layer, which outputs the judgment of the company's financial whitewashing risk.
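The cross-attention step can be illustrated with a short pure-Python sketch. It assumes the attention distribution is softmax(A P^T / sqrt(d_wemb)) applied row by row, that the rows of `A` already carry the dimension weights, and that the learned projection of the story line vectors is omitted (treated as the identity); the function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Plain matrix product X @ Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X: Matrix) -> Matrix:
    return [list(col) for col in zip(*X)]


def cross_attention(A: Matrix, P: Matrix, S: Matrix) -> Tuple[Matrix, List[float]]:
    """A: n_asp x d_wemb dimension features, P: n_sl x d_wemb topic vectors,
    S: n_sl x d_semb story line vectors. Returns (attention, e_xatt)."""
    d_wemb = len(A[0])
    scores = matmul(A, transpose(P))                       # n_asp x n_sl
    attn = [softmax([s / math.sqrt(d_wemb) for s in row]) for row in scores]
    weighted = matmul(attn, S)                             # n_asp x d_semb
    n_asp = len(weighted)
    e_xatt = [sum(col) / n_asp for col in zip(*weighted)]  # global average
    return attn, e_xatt
```

Each row of the attention matrix sums to 1, so every index dimension distributes its attention over the company's story lines before the rows are averaged into a single vector.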
In this embodiment, the story line embedding dimension is 1024 and the word vector dimension is 300. The self-attention and cross-attention embedding dimensions are both set to 250. The self-attention fully connected layer dimension is set to 30 and the cross-attention fully connected layer dimension to 20. On both data sets, the batch size is set to 16 and Adam is used as the optimizer.
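For reference, these settings can be collected in a small configuration object. The sketch below is illustrative only: the class and field names are hypothetical, and `fused_dim` assumes that the two fully connected outputs are concatenated before the output layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MannConfig:
    """Hyperparameters stated in this embodiment (field names are hypothetical)."""
    storyline_embed_dim: int = 1024   # story line embedding dimension
    word_vector_dim: int = 300        # word vector dimension
    self_attn_embed_dim: int = 250    # self-attention embedding dimension
    cross_attn_embed_dim: int = 250   # cross-attention embedding dimension
    self_attn_fc_dim: int = 30        # self-attention fully connected layer
    cross_attn_fc_dim: int = 20       # cross-attention fully connected layer
    batch_size: int = 16
    optimizer: str = "adam"

    @property
    def fused_dim(self) -> int:
        # Width of the concatenated representation fed to the output layer.
        return self.self_attn_fc_dim + self.cross_attn_fc_dim
```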
To test the financial whitewashing identification system fusing company news story line features, this example runs experiments on a manually collected company fraud data set. The data set covers 42 companies penalized by regulators between 2007 and 2017 for whitewashing their annual financial reports and 42 matched negative-sample companies. The first-year fraud data set includes 8630 news items and the random-year fraud data set includes 9040 news items. The effectiveness and advantages of the system are verified from multiple angles through comparison experiments against baselines. The baselines are decision trees (DT), random forests (RF), logistic regression (LR), XGBoost (XGB), support vector machines (SVM) and fully connected neural network models (MLP), each based on company index features, news sentiment features, and combined company index-news sentiment features; a convolutional neural network model (S-CNN) based on news headline sentence vectors; a neural network model (BW-NN) based on company news TF-IDF bag-of-words features; and a self-attention model (SANN) and a cross-attention model (XANN) based on news story line embeddings. Each reported result is the average over five training runs of the model, and accuracy, recall, F1 score and AUC are reported as metrics. On the random-year data set, the mixed attention model (MANN) based on news story line embedding features outperforms the baseline models on all metrics. The results on the first-year data set are shown in Table 1 and those on the random-year data set in Table 2.
Table 1. Performance comparison of the baseline models and the news story line MANN model on the first-year data set
Figure BDA0003850873290000082
Figure BDA0003850873290000091
Table 2. Performance comparison of the baseline models and the news story line MANN model on the random-year data set
Figure BDA0003850873290000092
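The evaluation protocol used for these tables (accuracy, recall, F1 and AUC, averaged over five training runs) can be sketched as follows. The helper names are hypothetical, the AUC is computed by the standard pairwise-ranking definition, and nothing here reproduces the reported numbers.

```python
from typing import Dict, List, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, recall and F1 for labels in {0, 1} (1 = whitewashing)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_runs(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over repeated training runs (five in the source)."""
    return {k: sum(r[k] for r in runs) / len(runs) for k in runs[0]}
```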
To verify the actual contributions of the news self-attention module and the company index-news cross-attention module to the final results of the model, two MANN variants, SANN and XANN, were constructed. SANN keeps the self-attention module unchanged and removes the company index-news cross-attention module, thereby testing the actual effect of the self-attention mechanism. XANN keeps the cross-attention module unchanged and removes the self-attention module, thereby testing the actual effect of the cross-attention mechanism. CNN and DNN are baseline models: CNN uses convolution layers to extract news story line features, and DNN extracts them by global pooling. The comparison results are shown in Table 3: both the self-attention module and the cross-attention module improve the experimental results.
Table 3. Validation of module contributions in the mixed attention model
Figure BDA0003850873290000101
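The relationship between the three attention variants in this ablation can be expressed as a choice of which feature vectors reach the classifier. The sketch below is an interpretation, not the source's implementation: it reads SANN as keeping only the self-attention features and XANN as keeping only the cross-attention features, and `fuse_features` is a hypothetical helper.

```python
from typing import List


def fuse_features(e_satt: List[float], e_xatt: List[float], variant: str = "MANN") -> List[float]:
    """Select the story line features passed to the classification layer."""
    if variant == "MANN":      # full model: concatenate both feature vectors
        return list(e_satt) + list(e_xatt)
    if variant == "SANN":      # ablation: self-attention features only
        return list(e_satt)
    if variant == "XANN":      # ablation: cross-attention features only
        return list(e_xatt)
    raise ValueError(f"unknown variant: {variant}")
```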
In conclusion, the invention analyzes and explores the features of company news story lines and provides a novel financial whitewashing identification system that fuses these features. The system extracts the story lines of news published within the company's reporting period, embeds them to obtain company news story line features, and extracts the important features with a self-attention mechanism and a cross-attention mechanism to judge whitewashing risk. On the random-year fraud data, the resulting whitewashing identification results are more accurate than those of existing methods.
Although the present invention has been described in connection with the preferred embodiments, it is not limited thereto. Any person skilled in the art can make possible variations and modifications to the invention using the methods and technical content disclosed above without departing from the spirit and scope of the invention. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, provided it does not depart from the content of the technical solution of the invention, falls within the protection scope of the technical solution of the invention.
References
[1] Fan laughing ice, bouon, king large, li rui xiang, et al. A named-entity-sensitive hierarchical news story line generation method[J]. Journal of Chinese Information Processing, 2021, 35(01): 113-124.
[2] Blondel V D, Guillaume J L, Lambiotte R, et al. Fast unfolding of communities in large networks[J]. Journal of Statistical Mechanics: Theory and Experiment, 2008, 2008(10): P10008.
[3] Zhouwei, huangde, gaojie, et al. Chinese dependency parsing combining maximum spanning tree and decision-based algorithms[J]. Journal of Chinese Information Processing, 2012, 26(03): 16-21.
[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.

Claims (5)

1. A financial whitewashing identification system fusing company news story line features, characterized by comprising a company news story line feature representation module and a mixed attention classification module; the company news story line feature representation module uses news story lines to extract the sets of associated news within the company's reporting period, converts them into a story line vector sequence, and outputs the sequence to the mixed attention classification module; the mixed attention classification module obtains the risk representation vector of the company news story lines using a self-attention mechanism and an index dimension-news mixed attention mechanism, and obtains the judgment of the company's financial whitewashing risk through a fully connected classification layer; the news story line is a tree-shaped development structure, and the tree structure of the story line is defined as a story tree.
2. The financial whitewashing identification system as claimed in claim 1, wherein the company news story line feature representation module comprises a company news story line extraction submodule and a news story line vector representation submodule; wherein:
the company news story line extraction submodule constructs the company news story line structure according to the similarity of company news in terms of topic, entity and time; specifically:
given the historical news set of company c:
Figure FDA0003850873280000011
where each component represents one historical news item, $|D_c|$ is the total number of historical news items, and the superscript $c$ denotes the corresponding company $c$; a keyword graph is constructed from the co-occurrence relations of the keywords of the news documents; edges that contribute little to node connectivity are deleted in turn according to the betweenness centrality of the edges in the keyword graph, and the resulting keyword subgraphs are the topic keyword graphs;
news is divided into topics according to the similarity between the topic keywords and the news keywords, giving the topic news sets of company c:
Figure FDA0003850873280000012
where each component represents one topic news set, $t_i$ denotes the $i$-th topic,
Figure FDA0003850873280000013
is the total number of topic news sets, and the superscript $c$ denotes the corresponding company $c$;
a news document association graph is built for the news documents belonging to the same topic, and the release time, release location and related entities of each news document are extracted; specifically, for the news set belonging to the same topic
Figure FDA0003850873280000014
and its news documents
Figure FDA0003850873280000015
the comprehensive similarity of the news documents is calculated from the location text similarity
Figure FDA0003850873280000016
the entity text similarity
Figure FDA0003850873280000017
the keyword similarity
Figure FDA0003850873280000018
and the publication time similarity
Figure FDA0003850873280000019
as follows:
Figure FDA00038508732800000110
where $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are user-defined similarity weights, and documents with high similarity are connected in the association graph; several news document sets belonging to different stories are discovered in the document association graph by the Louvain community discovery algorithm:
Figure FDA00038508732800000111
where each component represents one story news set, $s_i$ denotes the $i$-th story,
Figure FDA00038508732800000112
is the total number of story news sets, and the superscript $c$ denotes the corresponding company $c$; for a company story news document set
Figure FDA0003850873280000021
the tree-shaped development structure of the news story line is obtained from the associated subgraph by the maximum spanning tree algorithm, and the tree structure of the story line is defined as a story tree;
the news story line vector representation submodule constructs the story embedding representation according to the tree structure of the company story line, with the following construction rule: the event line with the longest time span in a story line is defined as the main event line and the rest as branch event lines; the initial node vector of each node in the story line is defined as the corresponding news headline vector; and the vector representation of the starting node of each branch event line is defined as:
Figure FDA0003850873280000022
where
Figure FDA0003850873280000023
is the set of sub-branches of branching node $v_i$ in the story line, $m_i$ is the starting index of the sub-branches of node $v_i$,
Figure FDA0003850873280000024
is the vector representation of branch
Figure FDA0003850873280000025
$\alpha$ is the branch attenuation coefficient of the story line, and $h_i$ is the headline sentence vector of news $i$; the main event line is defined as the path between the two news nodes with the largest difference in occurrence time in the story tree, denoted $b_1$; based on equation (2), the vector representation of each node on the main event line is obtained by recursive weighted combination, and the embedded representation of the news story tree is the mean of the embedded representations of the nodes on the main event line:
Figure FDA0003850873280000026
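Two steps of the story line extraction in claim 2 lend themselves to a short illustration: the comprehensive document similarity and the maximum spanning tree that yields the story tree. The Python sketch below makes two assumptions that are not confirmed by the source: the comprehensive similarity is taken to be a plain weighted sum of the four component similarities, and the spanning tree is built with Kruskal's algorithm over edges sorted by descending weight. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def combined_similarity(loc: float, ent: float, kw: float, time: float,
                        betas: Sequence[float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """ASSUMED form: weighted sum of location, entity, keyword and time similarity."""
    b1, b2, b3, b4 = betas
    return b1 * loc + b2 * ent + b3 * kw + b4 * time


def maximum_spanning_tree(n_nodes: int,
                          edges: List[Tuple[float, int, int]]) -> List[Tuple[int, int, float]]:
    """Kruskal's algorithm on (weight, u, v) edges, heaviest edges first."""
    parent = list(range(n_nodes))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:               # keep the edge only if it joins two components
            parent[ru] = rv
            tree.append((u, v, w))
    return tree
```

In this reading, nodes are the news documents of one story, edge weights are their comprehensive similarities, and the resulting tree keeps the strongest links that still connect every document.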
3. The financial whitewashing identification system as claimed in claim 2, wherein the mixed attention classification module comprises a news story line self-attention submodule and a company index dimension-news story line cross-attention submodule; the news story line self-attention submodule feeds the matrix formed from the company's news story line vector sequence
Figure FDA0003850873280000027
into the self-attention network SATT to obtain the updated company news story line feature vector $e_{satt} = \mathrm{SATT}(S_c)$, where $n_{sl}$ is the number of news story lines of company $c$ and $d_{semb}$ is the news story line embedding dimension; the company index dimension-news story line cross-attention submodule obtains the story line attention distribution as the scaled dot product of the company index analysis dimension keyword feature matrix and the news story line topic keyword matrix, multiplies the attention distribution matrix by the story line vector sequence, and obtains the company business dimension news representation vector by global averaging.
4. The financial whitewashing identification system as claimed in claim 3, wherein the specific processing flow of the mixed attention classification module is:
from the perspective of company business analysis, the company indexes are divided into several dimensions, and the index dimensions are denoted
Figure FDA0003850873280000028
where $n_{asp}$ is the number of index dimensions; each dimension corresponds to several dimension keywords; the keywords of dimension $a_k$ are denoted
Figure FDA0003850873280000029
with corresponding word vectors
Figure FDA00038508732800000210
where
Figure FDA00038508732800000211
is the number of keywords of dimension $a_k$; the financial dimension feature is obtained as the mean of the keyword vectors:
Figure FDA00038508732800000212
the company index analysis dimension matrix is denoted
Figure FDA0003850873280000031
where $d_{wemb}$ is the word vector dimension and
Figure FDA0003850873280000032
is the weight of the feature vector of dimension $a_k$; let the company indexes of sample $i$ in dimension $a_k$ be
Figure FDA0003850873280000033
where
Figure FDA0003850873280000034
is the number of indexes in dimension $a_k$; the weight
Figure FDA0003850873280000035
is calculated from
Figure FDA0003850873280000036
as the mean after z-standardization; during construction of the news story line, the topic $t_p$ corresponding to the story line and its topic keywords
Figure FDA0003850873280000037
are obtained, where
Figure FDA0003850873280000038
is the number of keywords of $t_p$ and the keyword vectors are
Figure FDA0003850873280000039
the mean of the topic keyword word vectors is taken as the news story line topic vector, giving the representation of topic $t_p$:
Figure FDA00038508732800000310
thus, the news story line topic matrix of company c is:
Figure FDA00038508732800000311
Figure FDA00038508732800000312
where $n_{sl}$ is the number of news story lines of the company;
given the news story line vector sequence $S_c$ of company $c$, the news story line topic matrix $P_c$ and the company index analysis dimension matrix $A_c$, the key matrix is $S_c W_s$, where
Figure FDA00038508732800000313
and $d_{xatt}$ is the attention embedding dimension;
first, the similarity between the topic matrix $P_c$ and the index analysis dimension matrix $A_c$ is calculated using the scaled dot product as the alignment function:
Figure FDA00038508732800000314
the similarity is then mapped to attention weights by the softmax function, and the news story line feature matrix related to the financial indexes is obtained based on the similarity weights:
$X_S = \mathrm{softmax}(f(A_c, P_c))\, S_c W_s$, (7)
where
Figure FDA00038508732800000315
and $d_{xatt}$ is the attention embedding dimension;
global averaging is then applied to obtain the news cross-attention feature vector $e_{xatt}$;
the self-attention vector $e_{satt}$ and the cross-attention vector $e_{xatt}$ are each passed through a fully connected layer to obtain:
$h_{satt} = W_{satt} e_{satt} + b_{satt}$
$h_{xatt} = W_{xatt} e_{xatt} + b_{xatt}$; (8)
where $W_{satt}$ and $W_{xatt}$ are the weight matrices of the fully connected layers and $b_{satt}$ and $b_{xatt}$ are their bias vectors;
finally, the feature fusion layer concatenates the two vectors to obtain the mixed story line representation:
$h_{matt} = [h_{satt}, h_{xatt}]$, (9)
the mixed representation is converted into the predicted probability distribution over the classes
Figure FDA00038508732800000316
Figure FDA00038508732800000317
where $W_o$ is the weight matrix of the output layer and $b_o$ is the bias term, giving the judgment of the company's financial whitewashing risk.
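The fusion and output steps of equations (8) and (9) can be sketched in pure Python. The output equation is available only as an image in the source, so the sketch assumes the common form softmax(W_o h_matt + b_o); all function and parameter names are hypothetical, and no activation function is applied between layers because none is stated.

```python
import math
from typing import List

Matrix = List[List[float]]


def linear(W: Matrix, x: List[float], b: List[float]) -> List[float]:
    """Fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax over the class logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mann_head(e_satt: List[float], e_xatt: List[float],
              W_satt: Matrix, b_satt: List[float],
              W_xatt: Matrix, b_xatt: List[float],
              W_o: Matrix, b_o: List[float]) -> List[float]:
    """Equations (8)-(9) followed by an ASSUMED softmax output layer."""
    h_satt = linear(W_satt, e_satt, b_satt)      # eq. (8), self-attention branch
    h_xatt = linear(W_xatt, e_xatt, b_xatt)      # eq. (8), cross-attention branch
    h_matt = h_satt + h_xatt                     # eq. (9): list concatenation
    return softmax(linear(W_o, h_matt, b_o))     # class probabilities
```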
5. The financial whitewashing identification system as claimed in claim 4, wherein the workflow is:
(I) the story line feature representation module takes the company news story lines as input and, in combination with the tree-shaped development structure of the story line, recursively constructs story line vectors to represent the story line features;
(II) the mixed attention classification module takes the company news story line vector sequence as input, obtains the co-occurrence risk information of the company news stories through the self-attention module, obtains the story line vector features that are more strongly correlated with the company business analysis dimensions through the company index dimension-news story line cross-attention module, and obtains the judgment of the company's financial whitewashing risk through the fully connected layer based on feature fusion.
CN202211133361.4A 2022-09-17 2022-09-17 Financial affairs whitewash identification system fusing company news story line characteristics Pending CN115544238A (en)

Publications (1)

Publication Number Publication Date
CN115544238A true CN115544238A (en) 2022-12-30

Family

ID=84728579

Country Status (1)

Country Link
CN (1) CN115544238A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402630A (en) * 2023-06-09 2023-07-07 深圳市迪博企业风险管理技术有限公司 Financial risk prediction method and system based on characterization learning
CN116402630B (en) * 2023-06-09 2023-09-22 深圳市迪博企业风险管理技术有限公司 Financial risk prediction method and system based on characterization learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination