CN115544238A - Financial affairs whitewash identification system fusing company news story line characteristics - Google Patents
- Publication number
- CN115544238A (application number CN202211133361.4A)
- Authority
- CN
- China
- Prior art keywords
- news
- company
- story line
- attention
- story
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention belongs to the technical field of financial whitewash identification, and in particular relates to a financial whitewash identification system fusing company news story line characteristics. The system comprises a company news story line feature representation module and a mixed attention classification module. The feature representation module obtains company news story lines with a multilayer clustering method and derives a story line vector representation by recursively weighting and summing news headline sentence vectors along the story line tree structure. The mixed attention classification module obtains different representations of the company news story lines with a self-attention mechanism and a company-index-dimension/news-story-line cross-attention mechanism, splices and fuses them into a company news feature vector, and from it outputs a company financial whitewash risk judgment. By explicitly discovering the association information among company news items, the method reduces model complexity and improves the accuracy of the output risk index; by mining company news risk signals from multiple angles, it reduces the influence of negative news unrelated to whitewash risk on the judgment result.
Description
Technical Field
The invention belongs to the technical field of financial whitewash identification, and particularly relates to a financial whitewash identification system fusing company news story line characteristics.
Background
A financial whitewash identification system fusing company news story line characteristics belongs to the text classification methods in the field of financial whitewash identification. Such methods extract company financial whitewash risk features from external news reports or annual-report text produced during the company's operating period, and judge the company's whitewash risk from the extracted features; their performance depends heavily on the choice of company text sources and the feature representation. Traditional whitewash identification based on emotional features reflects external evaluation of the company's condition, or the attitude of its senior management, through a highly summarized text sentiment index. Such indices can supplement company financial indicators and assist the identification of company financial whitewashing. However, a listed company generates a large number of reports through media supervision and self-disclosure during its operation, and highly summarized sentiment indicators can hardly capture the complex risk information reflected by that volume of news.
Recently, with the development of deep learning, text embedding features have been applied to many natural language processing tasks with good results. Text embedding learns the organizational relationships within a text and with its context, effectively reflecting the syntactic and semantic information of words and sentences. Compared with sentiment statistics, embedded representations can capture more complex interactions between texts. In applications with little training data, text embeddings pre-trained on a general corpus can also improve a downstream model by introducing implicit textual information. Models based on text embedding features mostly follow this framework: first train word embeddings on a general corpus, then feed the text vectors to the downstream task, either fine-tuning the pre-trained vectors on the task corpus or transferring them directly. When training data are sufficient, directly modeling the input word and sentence embedding sequence can work well. However, financial whitewash recognition based on company news is a special application: few training samples are available, and a single training sample corresponds to a large amount of text input. In such a multi-document joint classification task, a complex model that takes words or sentences as its basic input units is prone to under-fitting and can hardly learn the association between text features and the target. Therefore, capturing the interaction among a large number of company news items in advance, and thereby reducing the number of features the model must train, is an important factor for improving whitewash identification.
In addition, news during a company's operating period covers items such as publicity, supervision, and management; most items are noise texts unrelated to the company's whitewash risk, and this noise greatly affects the efficiency and quality of the risk identification model's judgments. Reducing the influence of noise is therefore another important factor for improving recognition. An attention mechanism can give more weight to important information in noisy data, but for each application the attention module must be designed so that it actually attends to the target information and captures effective problem features.
In summary, although existing work has made some progress on financial whitewash identification based on company-associated text information, news text features are not yet fully exploited and the identification effect can be improved further.
Disclosure of Invention
The invention aims to provide a financial whitewash identification system fusing company news story line characteristics, so as to solve the insufficient utilization of external news features in existing financial whitewash identification methods.
The invention provides a financial whitewash identification system fusing company news story line characteristics, comprising a company news story line feature representation module and a mixed attention classification module. The feature representation module extracts the associated news sets within a company's reporting period as news story lines, converts them into a story line vector sequence, and outputs the sequence to the mixed attention classification module. The mixed attention classification module obtains risk representation vectors of the company news story lines with a self-attention mechanism and an index-dimension/news mixed attention mechanism, and produces the company financial whitewash risk judgment through a fully connected classification layer. A news story line has a tree-shaped development structure, and this tree structure is defined as a story tree;
the company news story line characteristic representation module comprises a company news story line extraction submodule and a news story line quantity representation submodule;
the company news story line extraction submodule constructs a company news story line structure according to the similarity of company news in multiple aspects such as topics, entities, time and the like; the method specifically comprises the following steps:
Given company c's historical news set D^c = {d_1^c, d_2^c, ..., d_{|D^c|}^c}, where each element d_i^c is one historical news item, |D^c| is the total number of historical news items, and the superscript c denotes company c. A keyword graph is constructed from the co-occurrence relations of the news documents' keywords. Edges that contribute little to node connectivity are deleted from the keyword graph in order of edge betweenness centrality; the resulting keyword subgraphs are the topic keyword graphs.
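The topic-splitting step above can be sketched with a Girvan-Newman-style procedure: build a keyword co-occurrence graph, then repeatedly delete the edge with the highest betweenness until the graph splits into the desired number of topic subgraphs. This is an illustrative stand-in (pure Python, simplified one-path betweenness); the helper names `build_keyword_graph`, `edge_betweenness`, `split_topics` and the `min_cooccur` threshold are assumptions, not the patent's implementation.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_keyword_graph(doc_keywords, min_cooccur=1):
    """Edge between two keywords that co-occur in >= min_cooccur documents."""
    count = defaultdict(int)
    for kws in doc_keywords:
        for a, b in combinations(sorted(set(kws)), 2):
            count[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in count.items():
        if n >= min_cooccur:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def edge_betweenness(graph):
    """Simplified edge betweenness: count how often each edge lies on
    one BFS shortest path per (source, destination) pair."""
    score = defaultdict(int)
    for src in list(graph):
        pred, seen, q = {}, {src}, deque([src])
        while q:
            u = q.popleft()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    pred[v] = u
                    q.append(v)
        for dst in pred:              # walk the path back to src
            v = dst
            while v != src:
                u = pred[v]
                score[tuple(sorted((u, v)))] += 1
                v = u
    return score

def split_topics(graph, n_topics=2):
    """Delete highest-betweenness edges until n_topics components remain."""
    g = {u: set(vs) for u, vs in graph.items()}
    def components():
        seen, comps = set(), []
        for s in g:
            if s in seen:
                continue
            comp, q = {s}, deque([s])
            seen.add(s)
            while q:
                u = q.popleft()
                for v in g[u]:
                    if v not in seen:
                        seen.add(v)
                        comp.add(v)
                        q.append(v)
            comps.append(comp)
        return comps
    while len(components()) < n_topics:
        bc = edge_betweenness(g)
        if not bc:
            break
        a, b = max(bc, key=bc.get)
        g[a].discard(b)
        g[b].discard(a)
    return components()
```

For example, two keyword clusters bridged by a single co-occurrence edge split cleanly into two topic subgraphs, since the bridge edge carries the highest betweenness.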
News items are then divided into topics according to the similarity between topic keywords and news keywords, yielding company c's topic news sets T^c = {T_{t_1}^c, ..., T_{t_{|T^c|}}^c}, where each element is the news set of one topic, t_i denotes the i-th topic, |T^c| is the total number of topics, and the superscript c denotes company c.
A news document association graph is built for the news documents belonging to the same topic, extracting the publication time, publication place, and related entities of each document. Specifically, for news documents d_i^c and d_j^c in the same topic set, the comprehensive document similarity is computed from the place text similarity sim_loc, entity text similarity sim_ent, keyword similarity sim_key, and publication time similarity sim_time:

sim(d_i^c, d_j^c) = β_1·sim_loc + β_2·sim_ent + β_3·sim_key + β_4·sim_time, (1)

where the four similarities are computed with conventional methods (see, e.g., the study of Fan et al. [1]) and β_1, β_2, β_3, β_4 are self-defined similarity weights. Documents with high similarity are connected in the association graph. The Louvain community discovery algorithm [2] is then run on the document association graph to find the news document sets belonging to different stories: each element is the news set of one story, s_i denotes the i-th story, the set size is the total number of stories, and the superscript c denotes company c. For each company story news document set, a maximum spanning tree algorithm [3] is applied to its associated subgraph to obtain the tree-shaped development structure of the news story line; this tree structure is defined as the story tree.
The news story line vector representation submodule constructs the story embedding from the tree structure of the company story line, with the following rules. The event line with the longest time span in a story line is defined as the trunk event line; the rest are branch event lines. The initial vector of each node in the story line is the sentence vector of the corresponding news headline, and the vector representation of each branching node is defined as

e_i = h_i + α · Σ_{j=m_i}^{m_i+|B_i|−1} e_{b_j}, (2)

where B_i is the set of sub-branches of node v_i in the story line, m_i is the starting index of the sub-branches of v_i, e_{b_j} is the vector representation of branch b_j, α is the story line branch attenuation coefficient, and h_i is the headline sentence vector of news item i. The trunk event line is defined as the path between the news nodes with the largest difference in occurrence time in the story tree, denoted b_1. Based on formula (2), the vector representation e_i of each node on the trunk event line is obtained by recursive weighted combination, and the embedded representation of the whole news story tree is the mean of the node embeddings on the trunk event line:

e_s = (1/|b_1|) · Σ_{v_i ∈ b_1} e_i. (3)
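The recursive weighting of the story tree can be sketched as below. This is one plausible reading of the garbled equation (2) — node vector = headline vector plus α times the (recursively computed) branch vectors, with the story embedding as the trunk-node mean per equation (3); the dict-based tree format is an assumption for illustration.

```python
import numpy as np

def node_embedding(node, alpha=0.3):
    """Assumed reading of equation (2): fold each sub-branch into the
    node's headline vector, attenuated by the branch coefficient alpha."""
    e = np.asarray(node["title_vec"], dtype=float)
    for branch in node.get("branches", []):
        e = e + alpha * node_embedding(branch, alpha)
    return e

def story_embedding(trunk_nodes, alpha=0.3):
    """Equation (3): story vector = mean of the (branch-augmented)
    embeddings of the nodes on the trunk event line."""
    return np.mean([node_embedding(n, alpha) for n in trunk_nodes], axis=0)
```

For a two-node trunk where the second node carries one branch, the branch vector is scaled by α and added before averaging.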
the mixed attention classification module comprises a news story line self-attention submodule and a company index dimension-news story line cross-attention submodule; the news story line self-attention submodule leads the news story line of the company to a matrix formed by a quantitative sequenceInput self-attention network SATT [4] To obtain an updated company news story line feature representation vector: e.g. of a cylinder satt =SATT(S c ) Wherein n is sl Number of news story lines, d, representing company c semb Embedding dimensions for news story lines; the company index dimension-news story line cross attention submodule obtains story line attention distribution through a scaling dot product mode of a company index analysis dimension key word feature matrix and a news story line topic key word matrix, multiplies the attention distribution matrix and a story line quantity sequence, and obtains a company management dimension news representation vector through global averaging.
The specific process of the mixed attention classification module is as follows. From the perspective of company business analysis, the company indexes are divided into several dimensions a_1, ..., a_{n_asp}, where n_asp is the number of index dimensions. Each dimension corresponds to several dimension keywords; dimension a_k has keywords w_1^{a_k}, ..., w_{n_{a_k}}^{a_k} with word vectors v_1^{a_k}, ..., v_{n_{a_k}}^{a_k}, where n_{a_k} is the number of keywords of dimension a_k. The feature vector of dimension a_k is the mean of its keyword word vectors:

v_{a_k} = (1/n_{a_k}) · Σ_{j=1}^{n_{a_k}} v_j^{a_k}. (4)

The company index analysis dimension matrix is denoted A^c ∈ R^{n_asp × d_wemb}, where d_wemb is the word vector dimension and each row is a dimension feature vector v_{a_k} scaled by a weight g_{a_k}; for sample i, g_{a_k} is computed as the z-standardized mean of the company indexes under dimension a_k. During news story line construction, the topic t_p corresponding to each story line and its topic keywords are obtained; with keyword vectors v_1^{t_p}, ..., v_{n_{t_p}}^{t_p} (n_{t_p} the number of keywords of t_p), the topic vector of the news story line is the mean of the topic keyword word vectors:

p_{t_p} = (1/n_{t_p}) · Σ_{j=1}^{n_{t_p}} v_j^{t_p}. (5)
Given company c's news story line vector sequence S^c (defined above), news story line topic matrix P^c, and company index analysis dimension matrix A^c, with learned key projection matrices W_a, W_p of attention embedding dimension d_xatt:

First, the topic matrix P^c and the index analysis dimension matrix A^c are aligned by a scaled dot product:

f(A^c, P^c) = (A^c W_a)(P^c W_p)^T / √d_xatt. (6)

The similarities are mapped to attention weights with the softmax function, and the news story line feature matrix related to the financial indexes is obtained from these weights:

X_S = softmax(f(A^c, P^c)) S^c W_s, (7)

where W_s is a learned value projection matrix.
then obtaining a news cross attention feature vector e through global averaging xatt 。
The self-attention vector e_satt and the cross-attention vector e_xatt are each passed through a fully connected layer:

h_satt = W_satt · e_satt + b_satt,
h_xatt = W_xatt · e_xatt + b_xatt; (8)

where W_satt, W_xatt are the weight matrices and b_satt, b_xatt the bias vectors of the fully connected layers. Finally, the feature fusion layer concatenates the two vectors into the mixed story line representation:

h_matt = [h_satt, h_xatt]. (9)
The mixed representation is converted into a predictive probability distribution over the classes:

ŷ = softmax(W_o · h_matt + b_o), (10)

where W_o is the weight matrix and b_o the bias term of the output layer; this yields the judgment result of the company's financial whitewash risk.
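The classification head of equations (8)-(10) can be sketched end to end; the `params` dict holding the weight and bias arrays is a hypothetical container introduced for illustration.

```python
import numpy as np

def predict(e_satt, e_xatt, params):
    """Equations (8)-(10): project each attention vector through its own
    fully connected layer, concatenate, and map to class probabilities."""
    h_satt = params["W_satt"] @ e_satt + params["b_satt"]   # (8)
    h_xatt = params["W_xatt"] @ e_xatt + params["b_xatt"]
    h = np.concatenate([h_satt, h_xatt])                    # (9)
    z = params["W_o"] @ h + params["b_o"]                   # (10)
    z = z - z.max()                                         # stable softmax
    p = np.exp(z)
    return p / p.sum()
```

With a zero output layer the head returns the uniform distribution, which makes a convenient sanity check that the probabilities sum to one.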
The working process of the company financial whitewash identification system fusing news story line characteristics provided by the invention is as follows:
(I) the story line feature representation module takes the company news story lines as input and, combining the story line tree-shaped development structure, constructs the story line vector feature representation by recursing over the news headline sentence vectors;
(II) the mixed attention classification module takes the company news story line vector sequence as input, obtains the co-occurrence risk information of the company news stories through the self-attention submodule, obtains the story line vector features most relevant to the company operation analysis dimensions through the company index dimension-news story line cross-attention submodule, and, based on the fused features, produces the judgment of the company's financial whitewash risk through the fully connected layer.
The advantages of the invention include:
(1) Explicitly discovering association information among company news items reduces model complexity. Most existing multi-document news classification methods feed word or sentence vector sequences into the classification model, so the number of model parameters grows with the input sequence, and long input feature sequences require ample training data for good classification results. In the low-training-data scenario of company financial whitewash, using a vector sequence at news story line granularity as input alleviates under-fitting and improves the accuracy of the model's output risk index. Meanwhile, the story line structure also introduces the correlation information among news items;
(2) Company news risk signals are mined from multiple angles. News during a company's operating period covers items such as business, publicity, and supervision, and different news items carry different risk signals. Existing methods neither distinguish the risk indicators provided by the topics and contents of news, nor consider risk signals arising from news co-occurrence. In fact, news topics help the model judge the importance of related news, reducing the influence of negative news unrelated to whitewash risk on the judgment result. News co-occurrence can reflect hidden risk signals; for example, positive and negative reports appearing in the same period reduce the credibility of the company's public information. Designing a feature processing module that considers news topic information and captures co-occurrence signals in company news data is therefore key to improving the accuracy of the risk signal. The self-attention and cross-attention mechanisms can extract this information efficiently by assigning attention weights.
Drawings
Fig. 1 is a framework diagram of the present invention.
FIG. 2 is a diagram of a hybrid attention classification model framework.
Detailed Description
As noted in the Background, existing financial whitewash identification models based on text classification neither fully capture the complex interactions between news texts nor sufficiently separate noise signals from risk signals. The inventors studied the problem and attribute it to two causes: first, most existing methods use sentiment statistics or text sequences as classification input, so with long input sequences and associated events scattered across them, the model can hardly learn the complex interactions among news items from little training data; second, models that take serialized news text as input are relatively weak at separating important from unimportant information when the volume of noisy news is large.
To address these problems, the invention provides a financial whitewash identification system fusing company news story line characteristics. The method constructs a news story line vector representation from the company news story line structure, then obtains mixed attention features of the company news text through self-attention and cross-attention mechanisms, and feeds the mixed attention features into a fully connected layer to obtain the company's financial whitewash risk judgment.
The invention is further illustrated below with reference to the figures and examples.
As shown in fig. 1, the financial whitewash identification system fusing company news story line characteristics provided by this embodiment comprises: a company news story line feature representation module 01 and a mixed attention classification module 02. Module 01 obtains the company news story line list with a multilayer clustering method and, combining the story line tree-shaped development structure, generates the story line embedding sequence as the input of module 02. Module 02 acquires the co-occurrence information and the company operation information of the news story lines with a self-attention mechanism and a company index dimension-news cross-attention mechanism, generates the mixed attention feature vector of the news story lines, and outputs the company financial risk judgment through a fully connected layer.
In this embodiment, the data sets used are manually collected corporate fraud data sets, which are divided into first year fraud data sets and random year fraud data sets according to the selected corporate fraud year. 80% of the data in both data sets were randomly selected as training sets, and the remaining 20% were selected as test sets. The companies involved in the training set and the testing set in the two data sets are the same.
In this embodiment, the news location similarity weight β_1 is 0.2, the news entity similarity weight β_2 is 0.2, the news keyword similarity weight β_3 is 0.25, the news time similarity weight β_4 is 0.2, and the story line branch attenuation coefficient α is 0.3. The company news story line feature representation module comprises a company news story line extraction submodule and a news story line vector representation submodule; the extraction submodule extracts several company news story line structures from the company news set as follows. Given company c's historical news set D^c = {d_1^c, ..., d_{|D^c|}^c}, where |D^c| is the total number of historical news items and the superscript c denotes company c, the keywords of each news item are extracted and used as graph nodes, and edges are created between nodes whose co-occurrence count exceeds a threshold, forming the keyword graph. Edges that contribute little to node connectivity are deleted in order of edge betweenness centrality, and the resulting keyword subgraphs are the topic keyword graphs. News items are divided into topics by the similarity between topic keywords and news keywords, yielding the topic news sets T^c = {T_{t_1}^c, ...}, where t_i denotes the i-th topic. For the news documents of the same topic, a document association graph is built, extracting each document's publication time, publication place, and related entities; for documents d_i^c, d_j^c in the same topic set, the comprehensive similarity is computed from the place text similarity, entity text similarity, keyword similarity, and publication time similarity as in formula (1), where the four similarity calculations follow Fan et al. [1] and β_1, ..., β_4 are the self-defined similarity weights, and documents with high similarity are connected in the association graph. The Louvain community discovery algorithm [2] finds the news document sets belonging to different stories, and for each company story news document set, the maximum spanning tree algorithm [3] on its associated subgraph yields the tree-shaped development structure of the news story line. From the tree structure of a company story line, the story embedding is constructed as follows: the event line with the longest time span is the trunk event line and the rest are branch event lines; the initial vector of each node is the corresponding news headline sentence vector, and the vector of each branching node is given by formula (2), where B_i is the sub-branch set of node v_i, m_i the starting index of its sub-branches, e_{b_j} the vector representation of branch b_j, α the branch attenuation coefficient, and h_i the headline sentence vector of news i. The trunk event line is the path between the news nodes with the largest occurrence time difference in the story tree, denoted b_1; the vector of each node on it is obtained by recursive weighted combination, and the mean of these node vectors is the embedded representation of the whole story line, as in formula (3).
In this embodiment, the mixed attention classification module comprises the news story line self-attention submodule and the company index dimension-news story line cross-attention submodule of the company news story line feature representation module. The self-attention submodule feeds the matrix S_c, formed from the company's news story line vector sequence, into the self-attention network SATT [4]; after global averaging and a fully connected layer, the updated company news story line feature representation vector e_satt is obtained. The cross-attention submodule obtains the story line attention distribution through a scaled dot product between the company index analysis dimension keyword feature matrix and the news story line topic keyword matrix, multiplies the attention distribution matrix by the story line vector sequence, and obtains the company business-dimension news representation vector through global averaging. The steps are as follows: given company c's news story line vector sequence S_c, news story line topic matrix P_c, and company index analysis dimension matrix:
where ā_i is the weight of the feature vector of dimension a_i, computed as the mean of company c's z-standardized indices under dimension i; v_ai is the feature vector of dimension a_i, obtained as the mean of the word vectors of the keywords related to the indices under that dimension; and n_asp is the number of index dimensions. First, the cross-attention distribution is computed with a scaled dot product (scaling factor √d_wemb) and the softmax function, and the story line cross-attention weighting matrix is obtained by multiplying the attention distribution with the corresponding elements of the story line vector sequence. Second, the news story line cross-attention vector e_xatt is obtained from the weighting matrix by global averaging. Finally, the self-attention and cross-attention story line feature vectors are fused by concatenation and fed into a fully connected neural network layer to obtain the judgment result of the company's financial whitewashing risk.
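A minimal numpy sketch of the cross-attention step described above, with toy matrix sizes, random inputs, and the value projection W_s omitted for brevity (all names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sl, n_asp, d_wemb, d_semb = 4, 3, 8, 8      # toy sizes (assumed)
S = rng.normal(size=(n_sl, d_semb))           # story line vector sequence S_c
P = rng.normal(size=(n_sl, d_wemb))           # story line topic matrix P_c
A = rng.normal(size=(n_asp, d_wemb))          # index analysis dimension matrix A_c

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot product between dimension and topic keyword matrices -> attention.
attn = softmax(A @ P.T / np.sqrt(d_wemb))     # shape (n_asp, n_sl)
X = attn @ S                                  # attention-weighted story lines
e_xatt = X.mean(axis=0)                       # global average -> cross vector

print(e_xatt.shape)  # (8,)
```

Each row of `attn` sums to one, so each business dimension distributes its attention over the company's story lines before the global average pools them into a single vector.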
In this embodiment, the story line embedding dimension is 1024 and the word vector dimension is 300. The self-attention and cross-attention embedding dimensions are both set to 250; the self-attention fully connected layer dimension is set to 30 and the cross-attention fully connected layer dimension to 20. For both data sets, the batch size is set to 16 and Adam is used as the model optimizer.
To test the financial whitewashing identification system fusing company news story line features, this example uses a manually collected company fraud data set covering 42 companies penalized by the regulatory agency for financial whitewashing between 2007 and 2017, together with 42 matched negative-sample companies. The first-year fraud data set contains 8630 news items and the random-year fraud data set contains 9040 news items. The effectiveness and advantages of the system are verified from multiple angles through designed comparison experiments and comparisons with baselines. The baselines include Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), XGBoost (XGB), Support Vector Machine (SVM), and fully connected neural network (MLP) models based on company index features, news sentiment features, and combined company index-news sentiment features, as well as a convolutional neural network model based on news headline sentence vectors (S-CNN), a neural network model based on company news TF-IDF bag-of-words features (BW-NN), a self-attention model based on news story line embeddings (SANN), and a cross-attention model (XANN). Each reported result is the average of five training runs, with accuracy, recall, F1 score, and AUC score as the metrics. Compared with the existing baseline models, the mixed attention model (MANN) based on news story line embedding features outperforms all baselines on every metric in the random-year data set; the first-year data set results are shown in Table 1.
TABLE 1 Performance comparison of different baseline models and News storyline MANN models in a first year dataset
TABLE 2 Performance comparison of different baseline models and News storyline MANN model in random annual data set
To verify the actual contributions of the news self-attention module and the company index-news cross-attention module to the model's final results, we constructed two variants of MANN, SANN and XANN. SANN keeps the self-attention module and removes the company index-news cross-attention module, testing the effect of the self-attention mechanism alone; XANN keeps the cross-attention module and uses only it for news feature extraction. CNN and DNN serve as baseline models: CNN extracts news story line features with convolutional layers, and DNN extracts news story line features with global pooling. The comparison results in Table 3 show that both the self-attention module and the cross-attention module improve the experimental results.
TABLE 3 Mixed attention model Module contribution validation
In conclusion, the invention analyzes the characteristics of company news story lines and proposes a novel financial whitewashing identification system fusing these features: it extracts the story lines of news within the company's reporting period, embeds the story line tree structure to obtain company news story line features, and extracts the important features with self-attention and cross-attention mechanisms to judge whitewashing risk. The accuracy of the resulting whitewashing identification on random-year fraud data is higher than that of existing methods.
Although the present invention has been described in connection with preferred embodiments, it is not intended to be limited thereto. Any person skilled in the art may make variations and modifications using the methods and technical content disclosed above without departing from the spirit and scope of the invention; therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of the technical solution of the invention, falls within the protection scope of the technical solution of the invention.
References
[1] Fan Xiaobing, et al. Named-entity-sensitive hierarchical news storyline generation method [J]. Journal of Chinese Information Processing, 2021, 35(01): 113-124. (in Chinese)
[2]Blondel V D,Guillaume J L,Lambiotte R,et al.Fast unfolding of communities in large networks[J].Journal of statistical mechanics:theory and experiment,2008,2008(10):P10008.
[3] Zhou Wei, et al. Chinese dependency parsing combining maximum spanning tree and decision algorithms [J]. Journal of Chinese Information Processing, 2012, 26(03): 16-21. (in Chinese)
[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [J]. Advances in Neural Information Processing Systems, 2017, 30.
Claims (5)
1. A financial whitewashing identification system fusing company news story line features, characterized by comprising a company news story line feature representation module and a mixed attention classification module; the company news story line feature representation module extracts the associated news sets within the company's reporting period as news story lines, converts them into story line vector sequences, and outputs them to the mixed attention classification module; the mixed attention classification module obtains a risk representation vector of the company news story lines using a self-attention mechanism and an index dimension-news mixed attention mechanism, and obtains the company financial whitewashing risk judgment result through a fully connected classification layer; a news story line is a tree-shaped development structure, and the tree structure of a story line is defined as a story tree.
2. The financial whitewashing identification system as claimed in claim 1, wherein the company news story line feature representation module comprises a company news story line extraction submodule and a news story line vector representation submodule; wherein:
the company news story line extraction submodule constructs the company news story line structure according to the similarity of company news in topic, entity, and time; specifically:
given company c's historical newsletter:where each component represents historical news, | D c I represents the total number of the historical news, and the superscript c represents the corresponding company c; constructing a keyword graph according to the co-occurrence relation of the keywords of the news document; sequentially deleting the edges which contribute less to the connectivity of the nodes in the graph according to the betweenness centrality of the edges in the keyword graph, wherein the divided keyword subgraphs are topic keyword graphs;
news is divided into topics according to the similarity between topic keywords and news keywords, yielding company c's topic news sets, where each component represents a topic news set, t_i denotes the i-th topic, the count gives the total number of topic news sets, and the superscript c denotes company c;
a news document association graph is built for news documents belonging to the same topic, and the publication time, publication location, and related entities of each news document are extracted; specifically, for news documents d_i^c and d_j^c belonging to the same topic news set, the comprehensive document similarity is computed from the location text similarity sim_loc, the entity text similarity sim_ent, the keyword similarity sim_kw, and the publication time similarity sim_time:
sim(d_i^c, d_j^c) = β_1·sim_loc(d_i^c, d_j^c) + β_2·sim_ent(d_i^c, d_j^c) + β_3·sim_kw(d_i^c, d_j^c) + β_4·sim_time(d_i^c, d_j^c), (1)
where β_1, β_2, β_3, β_4 are user-defined similarity weights, and documents with high similarity are connected in the association graph; multiple news document sets belonging to different stories are discovered in the document association graph by the Louvain community discovery algorithm, where each component represents a story news set, s_i denotes the i-th story, and the superscript c denotes company c; for each company story news document set, the tree-shaped development structure of the news story line is obtained in its association subgraph by the maximum spanning tree algorithm, and the tree structure of the story line is defined as the story tree;
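The maximum spanning tree over a story's association subgraph can be sketched with Kruskal's algorithm run in descending weight order; the news identifiers and similarity weights below are hypothetical:

```python
# Maximum spanning tree over a story's association subgraph; edge weights
# are the composite document similarities (hypothetical toy values).
edges = [  # (similarity, news_i, news_j)
    (0.9, "n1", "n2"), (0.8, "n2", "n3"), (0.3, "n1", "n3"),
    (0.7, "n3", "n4"), (0.2, "n2", "n4"),
]

parent = {}
def find(x):
    """Union-find root lookup with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

tree = []
for w, u, v in sorted(edges, reverse=True):  # heaviest edges first
    ru, rv = find(u), find(v)
    if ru != rv:                             # keep the edge if it joins two parts
        parent[ru] = rv
        tree.append((u, v, w))

print(tree)  # [('n1', 'n2', 0.9), ('n2', 'n3', 0.8), ('n3', 'n4', 0.7)]
```

The resulting tree keeps the strongest similarity links between news documents, which is the tree-shaped development structure (story tree) the claim describes.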
the news story line quantity representation submodule constructs story embedding representation according to a tree structure of a company story line, and the specific construction rule is as follows: defining an event line with the longest time span in a story line as a main event line, taking the rest as branch event lines, defining an initial node vector of each node in the story line as a corresponding news headline vector, and defining the vector representation of the initial node of each branch event line as:
wherein,for branching nodes v in story lines i Corresponding set of sub-branches, m i Is a node v i The starting index of the corresponding sub-branch,is a branchIs expressed by a vector, alpha is a branch attenuation coefficient of a story line, h i A heading sentence vector representing news i; defining a main event line as a path between news nodes with the largest occurrence time difference in the story tree, and marking as b 1 Based on equation (2), vector representation of each node in the main event line of the story line can be obtained by recursively weighted combination calculation, and then embedding representation of the news story tree is the mean value of the embedded representation of each node on the main event line:
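A minimal numeric sketch of the story-tree embedding, assuming one plausible reading of equation (2) (the full formula image is not reproduced in the text): a node's representation is its headline vector plus α times the sum of its sub-branch representations, and the story embedding is the mean over the main event line. The tree, vectors, and main event line below are toy values:

```python
import numpy as np

ALPHA = 0.5     # branch attenuation coefficient (toy value)
DIM = 4

tree = {        # node -> list of child branch nodes (toy story tree)
    "root": ["a", "b"],
    "a": ["c"], "b": [], "c": [],
}
# Toy headline sentence vectors h_i, one constant vector per node.
h = {n: np.full(DIM, float(i + 1)) for i, n in enumerate(tree)}

def branch_repr(node):
    """Recursively combine a node's headline vector with its sub-branches."""
    out = h[node].copy()
    for k in tree[node]:
        out = out + ALPHA * branch_repr(k)
    return out

# Main event line (longest time-span path) assumed to be root -> a -> c here;
# the story embedding is the mean of the main-line node representations.
main_line = ["root", "a", "c"]
story_emb = np.mean([branch_repr(n) for n in main_line], axis=0)
print(story_emb)
```

This is a sketch of the recursion's shape only; the exact weighting in the patented equation (2) may combine the terms differently.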
3. The financial whitewashing identification system as claimed in claim 2, wherein the mixed attention classification module comprises a news story line self-attention submodule and a company index dimension-news story line cross-attention submodule; the news story line self-attention submodule feeds the matrix S^c, formed from the company's news story line vector sequence, into the self-attention network SATT, obtaining the updated company news story line feature representation vector e_satt = SATT(S^c), where n_sl is the number of news story lines of company c and d_semb is the news story line embedding dimension; the company index dimension-news story line cross-attention submodule obtains the story line attention distribution through a scaled dot product between the company index analysis dimension keyword feature matrix and the news story line topic keyword matrix, multiplies the attention distribution matrix by the story line vector sequence, and obtains the company business-dimension news representation vector through global averaging.
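The self-attention network SATT can be sketched as single-head scaled dot-product self-attention followed by global averaging; the matrix sizes and random inputs below are illustrative assumptions, not the patented configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sl, d_semb, d_att = 3, 6, 4                 # toy sizes (assumed)
S = rng.normal(size=(n_sl, d_semb))           # story line vector sequence S^c
Wq, Wk, Wv = (rng.normal(size=(d_semb, d_att)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = S @ Wq, S @ Wk, S @ Wv
attn = softmax(Q @ K.T / np.sqrt(d_att))      # story-to-story attention weights
e_satt = (attn @ V).mean(axis=0)              # global average -> e_satt

print(e_satt.shape)  # (4,)
```

Each story line attends over all of the company's story lines, so co-occurring risk signals across stories are pooled into the single vector e_satt.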
4. The financial whitewashing identification system as claimed in claim 3, wherein the specific processing flow of the mixed attention classification module is:
from the perspective of company business analysis, the company's indices are divided into several dimensions, denoted A = {a_1, …, a_n_asp}, where n_asp is the number of index dimensions; each dimension corresponds to several dimension keywords; let the keywords of dimension a_k be w_1^(a_k), …, w_(n_k)^(a_k) with their corresponding word vectors, where n_k is the number of keywords of dimension a_k, and compute the financial dimension feature as the mean of the keyword word vectors:
the company index analysis dimension matrix is denoted A^c, where d_wemb is the word vector dimension and v_(a_k) is the feature vector of dimension a_k; for the weighting, let sample i have n_(a_k) company indices under dimension a_k; the weight ā_k is computed as the mean of the z-standardized index values; in the construction of a news story line, the topic t_p corresponding to the story line and its topic keywords are obtained, with n_(t_p) the number of keywords of t_p and their corresponding keyword vectors; the topic vector of the news story line is computed as the mean of the topic keyword word vectors, giving the representation of topic t_p:
news storyline volume sequence S for a given company c c News story line topic matrix P c And company index analysis dimension matrix A c (ii) a Key matrix S c W s ,d xatt Embed dimensions for attention;
first, a topic matrix P is calculated by scaling the dot product as an alignment function c And index analysis dimension matrix A c Similarity of (2):
and mapping the similarity to attention weight by utilizing a softmax function, and obtaining a news story line characteristic matrix related to the financial index based on the similarity weight:
X S =softmax(f(A c ,P c ))S c W s , (7)
then global averaging is carried out to obtain a feature vector e of cross attention of news xatt ;
the self-attention vector e_satt and the cross-attention vector e_xatt are each passed through a fully connected layer:
h_satt = W_satt e_satt + b_satt,
h_xatt = W_xatt e_xatt + b_xatt; (8)
where W_satt and W_xatt are the weight matrices of the fully connected layers, and b_satt and b_xatt are their bias vectors;
finally, splicing the two vectors of the characteristic fusion layer to obtain story line mixed expression:
h matt =[h satt ,h xatt ], (9)
converting mixed representations into predictive probability distributions belonging to different classes
Wherein, W o As a weight matrix of the output layer, b o And obtaining the judgment result of the financial charting risk of the company as a bias item.
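The classification head of equations (8) and (9) plus the output softmax can be sketched end to end in numpy; all layer sizes and random weights below are illustrative assumptions, not the patent's settings:

```python
import numpy as np

rng = np.random.default_rng(2)
d_satt_in, d_xatt_in = 6, 6      # toy input sizes for e_satt / e_xatt
d_satt, d_xatt, n_cls = 5, 4, 2  # FC sizes are illustrative only

e_satt = rng.normal(size=d_satt_in)   # from the self-attention submodule
e_xatt = rng.normal(size=d_xatt_in)   # from the cross-attention submodule

# Eq. (8): one fully connected layer per attention branch.
W_satt, b_satt = rng.normal(size=(d_satt, d_satt_in)), np.zeros(d_satt)
W_xatt, b_xatt = rng.normal(size=(d_xatt, d_xatt_in)), np.zeros(d_xatt)
h_satt = W_satt @ e_satt + b_satt
h_xatt = W_xatt @ e_xatt + b_xatt

# Eq. (9): concatenation, then output layer + softmax -> risk probabilities.
h_matt = np.concatenate([h_satt, h_xatt])
W_o, b_o = rng.normal(size=(n_cls, d_satt + d_xatt)), np.zeros(n_cls)
logits = W_o @ h_matt + b_o
y_hat = np.exp(logits - logits.max())
y_hat /= y_hat.sum()                  # probability of whitewashing vs. not

print(y_hat.shape)  # (2,)
```

The two-class probability vector `y_hat` corresponds to the financial whitewashing risk judgment produced by the fully connected classification layer.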
5. The financial whitewashing identification system as claimed in claim 4, wherein the workflow is:
(I) the story line feature representation module takes company news story lines as input and, combining the tree-shaped development structure of the story lines, recursively constructs the story line vector feature representations;
(II) the mixed attention classification module takes the company news story line vector sequence as input, obtains the co-occurrence risk information of company news stories through the self-attention module, obtains the story line vector features most correlated with the company business analysis dimensions through the company index dimension-news story line cross-attention module, and, based on the fused features, obtains the judgment result of the company financial whitewashing risk through the fully connected layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211133361.4A CN115544238A (en) | 2022-09-17 | 2022-09-17 | Financial affairs whitewash identification system fusing company news story line characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211133361.4A CN115544238A (en) | 2022-09-17 | 2022-09-17 | Financial affairs whitewash identification system fusing company news story line characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115544238A true CN115544238A (en) | 2022-12-30 |
Family
ID=84728579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211133361.4A Pending CN115544238A (en) | 2022-09-17 | 2022-09-17 | Financial affairs whitewash identification system fusing company news story line characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115544238A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402630A (en) * | 2023-06-09 | 2023-07-07 | 深圳市迪博企业风险管理技术有限公司 | Financial risk prediction method and system based on characterization learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402630A (en) * | 2023-06-09 | 2023-07-07 | 深圳市迪博企业风险管理技术有限公司 | Financial risk prediction method and system based on characterization learning |
CN116402630B (en) * | 2023-06-09 | 2023-09-22 | 深圳市迪博企业风险管理技术有限公司 | Financial risk prediction method and system based on characterization learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chang et al. | Social media analytics: Extracting and visualizing Hilton hotel ratings and reviews from TripAdvisor | |
Li et al. | Document representation and feature combination for deceptive spam review detection | |
Zhao et al. | Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder | |
Chan et al. | A text-based decision support system for financial sequence prediction | |
Liu et al. | Detection of spam reviews through a hierarchical attention architecture with N-gram CNN and Bi-LSTM | |
CN109543034B (en) | Text clustering method and device based on knowledge graph and readable storage medium | |
Mewada et al. | Research on false review detection methods: A state-of-the-art review | |
KR20200007713A (en) | Method and Apparatus for determining a topic based on sentiment analysis | |
CN110287314B (en) | Long text reliability assessment method and system based on unsupervised clustering | |
CN114254201A (en) | Recommendation method for science and technology project review experts | |
Huang et al. | Identification of topic evolution: network analytics with piecewise linear representation and word embedding | |
Archchitha et al. | Opinion spam detection in online reviews using neural networks | |
CN115544238A (en) | Financial affairs whitewash identification system fusing company news story line characteristics | |
CN107609921A (en) | A kind of data processing method and server | |
JP5933863B1 (en) | Data analysis system, control method, control program, and recording medium | |
Zeng et al. | A framework for WWW user activity analysis based on user interest | |
Invernici et al. | Exploring the evolution of research topics during the COVID-19 pandemic | |
Zishumba | Sentiment Analysis Based on Social Media Data | |
Anastasopoulos et al. | Computational text analysis for public management research: An annotated application to county budgets | |
CN113869038A (en) | Attention point similarity analysis method for Baidu stick bar based on feature word analysis | |
Toraman | Early prediction of public reactions to news events using microblogs | |
Chen et al. | Towards accurate search for e-commerce in steel industry: a knowledge-graph-based approach | |
Hosaka et al. | An analytical model of website relationships based on browsing history embedding considerations of page transitions | |
Li et al. | Sensitivity of abacus and Chasdaq in the Chinese stock market through analysis of Weibo sentiment related to Corona-19 | |
Mirasdar et al. | Graph of Words Model for Natural Language Processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |