CN111324602A - Method for realizing financial big data oriented analysis visualization - Google Patents
Method for realizing financial big data oriented analysis visualization
- Publication number
- CN111324602A (application CN202010106706.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- management
- analysis
- model
- checking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
- G06F16/26—Visual data mining; Browsing structured data
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
Abstract
The invention discloses a visualization method for financial big data analysis, and relates to the technical field of financial big data analysis. The method comprises the following steps: S01, metadata collection management; S02, data quality management; S03, data standardization; S04, data warehouse management; S05, data visualization; S06, data analysis. By aggregating mass data, the invention mines weak correlations among the data and generates more data value; through data governance it supports calculation and analysis of the acquired data and upgrades of the algorithms and models, and enables the system's multidimensional and dynamic analysis to be displayed visually.
Description
Technical Field
The invention belongs to the technical field of financial big data analysis, and particularly relates to a method for realizing visualization oriented to financial big data analysis.
Background
With the rapid growth of data on the internet, data with uncertain relationships is difficult to manage, describe and analyze by traditional means. Big data is not simply the statistics and analysis of informatization; rather, massive data is aggregated to mine weak correlations among the data and thereby generate more data value.
To better perform calculation and analysis on the acquired data and to upgrade the algorithms and models, data governance is critical: a good data governance system is the key link for multidimensional and dynamic analysis in the financial industry.
The construction of a financial big data base platform mainly comprises modules for metadata management, data quality management, data standardization, data warehouse management, data visualization and data analysis.
Metadata is the basic information describing data. For example, to establish a personal information base, basic information such as name, sex, date of birth, height and weight is needed; each such item is metadata, and effective, complete data information consists of many metadata items. Owing to the massive growth of data, declining data quality, lack of timeliness and imperfect algorithm models, the fusion analysis and data situation prediction of existing financial big data clusters need to be improved: individuals and groups are analyzed, classified and graded, and key points distinguished, so as to achieve the expected predictive analysis, intelligence and automation.
In the current financial field, financial data grows exponentially, and data with uncertain relationships is difficult to manage, describe and analyze by traditional means. Likewise, financial data is not simply counted and analyzed; massive data is aggregated to mine weak correlations and generate more data value. Since data governance is the key link for the system's multidimensional and dynamic analysis, realizing the visualization of financial big data analysis is of great significance.
Disclosure of Invention
The invention provides a method for realizing visualization oriented to financial big data analysis, aiming to solve the above problems.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention discloses a method for realizing financial big data analysis-oriented visualization, which comprises the following steps:
s01, metadata collection management: the system is used for interacting with a financial system, acquiring initial data, and maintaining and acquiring financial metadata according to basic information, attributes and association of financial attribute data; the system provides life cycle management of metadata and version management function, thereby ensuring the quality of the metadata and the authority and reliability of a subsequent metadata system;
s02, data quality management: the system is used for managing data quality problems, comprises data quality index management, index scheduling and execution management, problem management and data quality analysis management, and is realized by an interface layer, a response layer, a functional layer, a system management layer and a storage layer;
s03, data standardization: the system is used for standardizing the acquired financial metadata to acquire standardized data, and adopts a comprehensive standardized or progressive standardized mode, and comprises the steps of standardized object selection, word standardization, domain standardization and expression standardization;
s04, data warehouse management: a step for storing the financial metadata subjected to data standardization in the form of a data warehouse, including data extraction, data cleansing conversion and data warehouse log and warning transmission;
s05, data visualization: the background is used for acquiring data in the data warehouse according to the data source, the execution condition of the data acquisition task and the backup condition of the system data; specifically, data are acquired from a background through a front-end h5 technology, analysis data are processed, a JSONObject is used for returning a result, the result returned by the background is acquired by the front end for judgment, the result is analyzed by the front end, the last link entering the background is determined, and relevant information of a data extraction task is displayed to an interface through an Echarts visualization technology;
s06, data analysis: the convolutional neural network and the text information processing network of the deep learning network model are utilized to analyze the time complexity, the space complexity and the influence on the model from three parties.
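The six steps above can be sketched as a minimal pipeline skeleton. All function names and the sample record below are illustrative placeholders, not the patented implementation; only a non-null quality rule and a key-normalization step are shown.

```python
# Minimal sketch of the S01-S06 pipeline; function names are hypothetical.

def collect_metadata(source):
    """S01: gather basic information and attributes as metadata."""
    return {"name": source["name"], "attributes": source.get("attributes", {})}

def check_quality(record):
    """S02: apply a simple quality rule (non-null checking shown)."""
    return all(v is not None for v in record.values())

def standardize(record):
    """S03: normalize key names toward a standard word dictionary."""
    return {k.strip().lower(): v for k, v in record.items()}

def run_pipeline(sources):
    warehouse = []                    # S04: standardized records are stored
    for src in sources:
        meta = collect_metadata(src)
        if check_quality(meta):
            warehouse.append(standardize(meta))
    return warehouse                  # S05/S06 would visualize and analyze

records = run_pipeline([{"name": "loan_table", "attributes": {"rows": 100}}])
```

Each stage here stands in for a whole module of the platform; the point is only the ordering of collection, quality checking, standardization and warehousing.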
Further, in the step S01, the step of acquiring initial data includes:
T01, establishing adapters: including but not limited to JDBC, EXCEL, template, Hive and DB data dictionary adapters;
T02, establishing a suspension (hang) point: a meta-model is selected to match the adapter;
T03, creating a data source: selecting the adapter, adapter version, acquisition mode, suspension point and parameter configuration; judging whether an audit is needed; if so, proceeding to step T04, and if not, to step T05;
T04, interacting with the quality retrieval system;
T05, acquisition template management: template customization and template mapping;
T06, manual acquisition: selecting a data source through acquisition task management, uploading the template, and starting acquisition; the acquisition log can be checked, with errors and progress displayed; for automatic acquisition, the task can be configured by setting the acquisition time, running immediately, and ending the acquisition process;
T07, view distribution: the distribution of collected data to different levels of the user view is managed in the view.
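Step T03 (creating a data source) can be illustrated as follows; the adapter set mirrors T01, and the audit branch decides whether control passes to T04 or T05. Function and field names are hypothetical.

```python
# Sketch of data-source creation (step T03); names are illustrative only.

ADAPTERS = {"JDBC", "EXCEL", "template", "hive", "DB data dictionary"}

def create_data_source(adapter, version, mode, hang_point, params, needs_audit):
    """Select adapter, version, acquisition mode, suspension point and params."""
    if adapter not in ADAPTERS:
        raise ValueError(f"unknown adapter: {adapter}")
    source = {"adapter": adapter, "version": version, "mode": mode,
              "hang_point": hang_point, "params": params}
    # Audit branch of T03: audited sources go to the quality system (T04),
    # the rest proceed directly to template management (T05).
    source["next_step"] = "T04" if needs_audit else "T05"
    return source

src = create_data_source("JDBC", "1.0", "manual", "finance_meta", {},
                         needs_audit=False)
```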
Further, the data quality index management specifies checking rules for six typical data quality problems according to the eight major elements of data quality, including: non-null checking, uniqueness checking, primary/foreign key checking, length checking, code checking and consistency checking; the eight major elements of data quality checking are integrity, legality, uniqueness, consistency, accuracy, timeliness, security and extensibility;
The index scheduling and execution management is the process of executing the checking indexes, i.e. inspecting the source system for data quality problems, which are found in an automatic or manual manner;
The problem management is realized by a quality problem management module and is divided into automatic management and manual management of checked problems; lineage (blood-relationship) analysis, influence analysis, detail checking, export functions and process management are provided for the checked problems;
The data quality analysis management performs distributed analysis on the data quality checking results, including index query, viewing of trend analysis views, and viewing and downloading of data quality reports; through a graphical chart interface, the causes of problems and their historical trends are quickly located, assisting data administrators in solving data quality problems.
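Five of the six typical checking rules can be sketched as simple predicates over rows (consistency checking would compare related columns in the same way); the rule functions and sample rows are illustrative, not the patented rule engine.

```python
# Illustrative checking rules from the data quality index management.

def non_null(rows, col):
    """Non-null checking: no missing values in the column."""
    return all(r.get(col) is not None for r in rows)

def unique(rows, col):
    """Uniqueness checking: no duplicated values in the column."""
    vals = [r[col] for r in rows]
    return len(vals) == len(set(vals))

def foreign_key(rows, col, parent_keys):
    """Primary/foreign key checking: every value exists in the parent table."""
    return all(r[col] in parent_keys for r in rows)

def length(rows, col, max_len):
    """Length checking: values do not exceed the declared length."""
    return all(len(str(r[col])) <= max_len for r in rows)

def code(rows, col, code_set):
    """Code checking: values come from an agreed code table."""
    return all(r[col] in code_set for r in rows)

rows = [{"id": 1, "ccy": "CNY"}, {"id": 2, "ccy": "USD"}]
report = {
    "non_null_id": non_null(rows, "id"),
    "unique_id": unique(rows, "id"),
    "code_ccy": code(rows, "ccy", {"CNY", "USD", "EUR"}),
}
```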
Further, the standardized object selection in step S03 comprises: specifying a detailed execution plan, determining standardization principles, defining standardization guidelines including naming rules, selecting the standardized objects, and collecting source data;
The word standardization comprises: selecting a reference dictionary, analyzing morphemes, defining words, grouping English and abbreviated synonyms, and constructing a standard word dictionary;
The domain standardization comprises analyzing data types, classifying and selecting domains, and defining domains and the data types and lengths of the defined domains; the standard word dictionary is completed after the standard domain dictionary is constructed;
The expression standardization comprises applying the standards to the data model, judging the compliance of expressions, defining expressions and constructing a standard expression dictionary, after which the standard word dictionary is completed.
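Word standardization can be illustrated as mapping the morphemes of a raw column name onto a standard word dictionary; the dictionary entries below are hypothetical examples.

```python
# Sketch of word standardization (S03): synonyms and English abbreviations
# are grouped and mapped onto standard words; entries are illustrative.

STANDARD_WORDS = {
    "cust": "customer", "customer": "customer",
    "amt": "amount", "amount": "amount",
    "dt": "date", "date": "date",
}

def standardize_name(raw_name):
    """Split a raw column name into morphemes, map each to its standard word."""
    parts = raw_name.lower().split("_")
    return "_".join(STANDARD_WORDS.get(p, p) for p in parts)

std = standardize_name("CUST_AMT")
```

Domain and expression standardization would layer the standard domain and expression dictionaries on top of the same lookup idea.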
Further, the data extraction step in the data warehouse management of step S04 is as follows: first clarify which business systems the data comes from, which DBMS the database server of each business system runs, whether manual data exists and how large it is, and whether unstructured data exists; after this information is collected, the data extraction can be designed, as follows:
For data sources handled by the same DBMS as the one storing the DW: the DBMS (SQL Server, Oracle) provides a database link function; a direct link is established between the DW database server and the original business system, so that Select statements can be written for direct access;
For data sources different from the DW database system: a database link is established, e.g. between SQL Server and Oracle via ODBC; if a database link cannot be established, two ways remain: exporting the source data to txt or xls files through a tool and then importing these source files into the ODS, or going through a program interface;
For file-type data sources (.txt, .xls): business personnel are trained to import the data into a specified database using database tools, after which the data is extracted from that database;
For the problem of incremental updates: the business system records the time at which each transaction occurs and uses it as the increment marker; before each extraction, the maximum time recorded in the ODS is determined, and then all records later than that time are extracted.
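The incremental-update rule can be sketched with an in-memory database: find the maximum business timestamp already in the ODS, then extract only newer source records. Table and column names are hypothetical.

```python
# Sketch of incremental extraction by maximum timestamp (step S04).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_txn (id INTEGER, occurred_at TEXT)")
conn.execute("CREATE TABLE src_txn (id INTEGER, occurred_at TEXT)")
conn.executemany("INSERT INTO ods_txn VALUES (?, ?)",
                 [(1, "2020-01-01"), (2, "2020-01-02")])
conn.executemany("INSERT INTO src_txn VALUES (?, ?)",
                 [(1, "2020-01-01"), (2, "2020-01-02"), (3, "2020-01-03")])

# 1) maximum business time already recorded in the ODS
(max_t,) = conn.execute("SELECT MAX(occurred_at) FROM ods_txn").fetchone()
# 2) extract every source record later than that time
new_rows = conn.execute(
    "SELECT id, occurred_at FROM src_txn WHERE occurred_at > ?", (max_t,)
).fetchall()
```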
Further, the data cleansing in the data warehouse management of step S04 filters out data that does not meet the requirements, including incomplete data, erroneous data and duplicate data; the filtered results are sent to the responsible business department, which decides whether they are discarded or corrected by the business unit before being extracted again;
The data conversion in the data warehouse management of step S04 performs inconsistent-data conversion, data granularity conversion and business rule calculations;
The data warehouse logs in the data warehouse management of step S04 include execution process logs, error logs and general logs, used to monitor the operation of the data warehouse at any time and to find errors in time;
The warning transmission in step S04: when the data warehouse encounters an error, an error log is formed and a warning is sent to the system administrator; the warning may be sent as a mail to the system administrator with the error information attached, making it convenient for the administrator to check the error.
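The cleansing step can be sketched as routing incomplete, erroneous and duplicate records to a review list for the business department instead of loading them; the rules and field names shown are illustrative.

```python
# Sketch of data cleansing (S04): filter non-compliant records and route
# them to a review list; rules and field names are hypothetical.

def cleanse(rows, required=("id", "amount")):
    clean, review, seen = [], [], set()
    for r in rows:
        if any(r.get(c) is None for c in required):
            review.append((r, "incomplete"))          # missing required field
        elif not isinstance(r["amount"], (int, float)) or r["amount"] < 0:
            review.append((r, "erroneous"))           # invalid business value
        elif r["id"] in seen:
            review.append((r, "duplicate"))           # repeated primary key
        else:
            seen.add(r["id"])
            clean.append(r)
    return clean, review

clean, review = cleanse([
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 20.0},   # duplicate id
    {"id": 2, "amount": None},   # incomplete record
])
```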
Further, the data visualization in step S05 is implemented by high-dimensional visualization components, including but not limited to trend graphs, radar charts, 3D scatter plots, network graphs, hierarchy graphs and word cloud graphs.
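Such components are typically driven by Echarts options (step S05); a minimal trend-graph option, built as a Python dict and serialized to the JSON a front end would receive, might look as follows. The series data and titles are illustrative.

```python
# Sketch of a backend producing an Echarts line-chart (trend graph) option
# as JSON; the option structure follows standard Echarts conventions, and
# all data shown is illustrative.
import json

option = {
    "title": {"text": "Daily transaction volume"},
    "xAxis": {"type": "category", "data": ["Mon", "Tue", "Wed"]},
    "yAxis": {"type": "value"},
    "series": [{"type": "line", "data": [120, 200, 150]}],
}
# Round-trip through JSON, as a background service would return it.
payload = json.loads(json.dumps(option))
```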
Further, the data analysis of step S06 adopts a time complexity model comprising a single convolutional layer, a space complexity model comprising the memory access amount, training/prediction time determined by the time complexity, and a pointer network model; the pointer network model comprises a basic encoder-decoder model, an attention-mechanism encoder-decoder model, a self-attention encoder-decoder model and a constructed text summarization model.
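The single-convolutional-layer complexities can be estimated with the standard formulas (multiply-accumulates proportional to M^2 * K^2 * C_in * C_out, weights proportional to K^2 * C_in * C_out); the patent does not give these formulas, so they are stated here as an assumption.

```python
# Standard complexity estimates for one square convolutional layer,
# stated as an assumption (the patent does not spell out the formulas).

def conv_layer_flops(m_out, kernel, c_in, c_out):
    """Time complexity: multiply-accumulate count, M^2 * K^2 * C_in * C_out."""
    return (m_out ** 2) * (kernel ** 2) * c_in * c_out

def conv_layer_params(kernel, c_in, c_out):
    """Space complexity contribution: weight count, ignoring biases."""
    return (kernel ** 2) * c_in * c_out

flops = conv_layer_flops(m_out=32, kernel=3, c_in=16, c_out=32)
params = conv_layer_params(kernel=3, c_in=16, c_out=32)
```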
The invention has the following beneficial effects:
1. Through metadata management, the invention maintains, retrieves, updates and manages the life cycle of financial metadata, and imports and exports metadata; all data to be exported can be exported from the financial metadata tree to a file; the analysis results mainly include influence analysis, pedigree (lineage) analysis and data warehouse mapping analysis, and support data export and picture export. Version management imposes a strict process on the whole-life-cycle management, release, deletion and state change of financial data, provides data life-cycle management, guarantees the quality of the stabilized data, and ensures the reliability of subsequent metadata-consuming systems.
2. Through metadata quality management, the invention specifies checking rules according to the data quality indexes, executes the checking indexes to find data quality problems in the system, provides lineage analysis, influence analysis, data panorama analysis, a data engine and process management for the checked problems, and finally performs distributed analysis on the data quality management results to help business experts locate the problems. Data quality problems fall into four problem domains (data, management, process and technical), covering the most frequent cases, and the financial data quality information in these four domains can be improved.
3. The invention standardizes data from the source through data standardization, so that the meaning, expression and value range of each data item are standardized and unified across the processing and operation of every system, ensuring consistency from the data's origin and enabling smooth information exchange and sharing. The data standardization makes the financial data same-directional and dimensionless. Same-direction processing addresses indicators of different natures: directly summing such indicators cannot correctly reflect the combined result of different acting forces, so the data properties of inverse indicators are first changed, making all indicators act in the same direction with respect to the evaluation scheme, after which summation yields a correct result. Dimensionless processing addresses the comparability of data; Z-score normalization is used. Through this standardization, the original data are converted into dimensionless index evaluation values, i.e. all index values are on the same order of magnitude, improving the efficiency of comprehensive evaluation analysis.
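The Z-score normalization mentioned above follows the standard formula z = (x - mean) / std; the sample values below are illustrative.

```python
# Z-score normalization, as used for the dimensionless processing;
# population standard deviation is assumed here.
from statistics import mean, pstdev

def z_score(values):
    """Map values onto the same scale: z = (x - mean) / std."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

z = z_score([10.0, 20.0, 30.0])
```

After this transformation all index values share mean 0 and unit variance, so indicators of different magnitudes become directly comparable.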
4. Through data warehouse management, financial data of different sources, formats and characteristics are physically and organically integrated in the governance process and stored centrally, forming standard data storage and comprehensively improving data governance efficiency. Part of the data integrated by data warehouse management is business data, mostly text data that rarely changes, which is stored directly in the storage-layer database; data processing that requires computing power is transmitted to the cluster of the computing layer, which performs big data analysis and statistics, real-time data processing and data model analysis; structured processing results are stored in a relational database, while images and similar information are stored in a non-relational database. During data processing, the data is normalized and standardized according to the data governance requirements, and fused and integrated into standard data stored in the formal database.
5. Through data visualization governance and the analysis of characteristics among the data, the invention dynamically displays the association relationships among metadata and shows the value that the data provide to the business, thereby providing a basis for multidimensional and intelligent data analysis.
6. Through data analysis, the invention addresses the high data analysis and computing requirements of existing deep learning network models; the network models mainly used comprise the convolutional neural network and the text information processing network, and the model complexity of the convolutional network and the pointer network is analyzed.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is an overall step diagram of a method for implementing visualization oriented to financial big data analysis according to the present invention;
FIG. 2 is a schematic diagram of the step of collecting initial data of step S01 in FIG. 1;
FIG. 3 is a system configuration diagram of the data quality management in step S02 of FIG. 1;
FIG. 4 is a flowchart of the data normalization in step S03 of FIG. 1;
FIG. 5 is a schematic flow chart of the data warehouse management in step S04 of FIG. 1;
FIG. 6 is a detailed flow diagram of metadata collection, metadata management, and metadata analysis of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-6, a method for implementing visualization oriented to financial big data analysis according to the present invention includes the following steps:
s01, metadata collection management: the system is used for interacting with a financial system, acquiring initial data, and maintaining and acquiring financial metadata according to basic information, attributes and association of financial attribute data; the system provides life cycle management of metadata and version management function, thereby ensuring the quality of the metadata and the authority and reliability of a subsequent metadata system;
s02, data quality management: the system is used for managing data quality problems, comprises data quality index management, index scheduling and execution management, problem management and data quality analysis management, and is realized by an interface layer, a response layer, a functional layer, a system management layer and a storage layer;
s03, data standardization: the system is used for standardizing the acquired financial metadata to acquire standardized data, and adopts a comprehensive standardized or progressive standardized mode, and comprises the steps of standardized object selection, word standardization, domain standardization and expression standardization;
s04, data warehouse management: a step for storing the financial metadata subjected to data standardization in the form of a data warehouse, including data extraction, data cleansing conversion and data warehouse log and warning transmission;
s05, data visualization: the background is used for acquiring data in the data warehouse according to the data source, the execution condition of the data acquisition task and the backup condition of the system data; specifically, data are acquired from a background through a front-end h5 technology, analysis data are processed, a JSONObject is used for returning a result, the result returned by the background is acquired by the front end for judgment, the result is analyzed by the front end, the last link entering the background is determined, and relevant information of a data extraction task is displayed to an interface through an Echarts visualization technology;
s06, data analysis: the convolutional neural network and the text information processing network of the deep learning network model are utilized to analyze the time complexity, the space complexity and the influence on the model from three parties.
Further, in the step S01, the step of acquiring initial data includes:
t01, establishing adapter: including but not limited to JDBC, EXCEL, template, hive, DB data dictionary adapters;
t02, establishing a suspension point: a meta-model needs to be selected to be matched with the adapter;
t03, creating data source: selecting an adapter, an adapter version, an acquisition mode, a suspension point and parameter configuration; judging whether the audit is needed, if so, entering a step T04, and if not, entering a step T05;
t04, interacting with a quality retrieval system;
t05, acquisition template management: template customization and template mapping;
t06, manual acquisition: selecting a data source through collection-task management, uploading a template, and starting collection; the resulting collection log can be checked, with errors and progress displayed; for automatic acquisition, a task may be configured: setting the acquisition time, running immediately, and ending the acquisition process;
t07, view distribution: the distribution of collected data to different levels of the user view is managed at the view.
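The data-source creation flow of steps T01 to T05 can be sketched as follows. This is a minimal illustration only: the adapter names come from step T01, but the function signature, field names and audit routing are assumptions, not the patent's actual interface.

```python
# Hypothetical sketch of steps T03-T05: register a data source against a known
# adapter and route it either to the quality retrieval system (T04) or directly
# to acquisition-template management (T05).
ADAPTERS = {"JDBC", "EXCEL", "template", "hive", "DB data dictionary"}

def create_data_source(name, adapter, version, mode, suspension_point, params, needs_audit):
    """Register a data source (step T03); the audit flag decides the next step."""
    if adapter not in ADAPTERS:
        raise ValueError(f"unknown adapter: {adapter}")
    source = {
        "name": name,
        "adapter": adapter,
        "version": version,
        "mode": mode,                        # e.g. "manual" or "scheduled"
        "suspension_point": suspension_point,
        "params": params,
    }
    # T04: interact with the quality retrieval system before first use;
    # otherwise proceed straight to template management (T05).
    source["next_step"] = "quality_audit" if needs_audit else "template_management"
    return source

src = create_data_source("core_ledger", "JDBC", "1.0", "scheduled",
                         "finance_meta", {"url": "jdbc:..."}, needs_audit=True)
print(src["next_step"])  # quality_audit
```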
Further, data quality index management specifies, according to the eight major elements of data quality, the checking rules for six typical data quality problems: non-null checking, uniqueness checking, primary/foreign key checking, length checking, code checking and consistency checking. The eight major elements of data quality checking are: integrity, legality, uniqueness, consistency, accuracy, timeliness, security and extensibility;
the index scheduling and execution management is the process of executing the checking indexes, i.e. the process of checking the source system for data quality problems; it finds the data quality problems existing in the system in an automatic or manual mode;
the problem management is realized by a quality-problem management module and is divided into automatic management and manual management of checking problems; lineage analysis, impact analysis, detail viewing, export functions and process management are provided for the checking problems;
the data quality analysis management performs distributed analysis on the data quality checking results, including index query, viewing of trend-analysis views, and viewing and downloading of data quality reports; through a graphical chart interface, the causes of a problem and its historical trend can be quickly located, helping data administrators resolve data quality problems.
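The six checking rules named above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the record layout and sample values are assumptions.

```python
# Sketch of the six data-quality checking rules: non-null, uniqueness,
# primary/foreign key, length, code, and consistency. Each checker returns
# the offending records so they can be routed to problem management.
def check_non_null(rows, field):
    return [r for r in rows if r.get(field) in (None, "")]

def check_unique(rows, field):
    seen, dups = set(), []
    for r in rows:
        if r[field] in seen:
            dups.append(r)
        else:
            seen.add(r[field])
    return dups

def check_foreign_key(rows, field, parent_keys):
    return [r for r in rows if r[field] not in parent_keys]

def check_length(rows, field, max_len):
    return [r for r in rows if len(str(r[field])) > max_len]

def check_code(rows, field, code_set):
    return [r for r in rows if r[field] not in code_set]

def check_consistency(rows, rule):
    """rule: a predicate that must hold for every record."""
    return [r for r in rows if not rule(r)]

rows = [{"id": 1, "branch": "BJ", "cust": "A001"},
        {"id": 1, "branch": "XX", "cust": ""}]
print(len(check_unique(rows, "id")))                   # 1 duplicate id
print(len(check_non_null(rows, "cust")))               # 1 missing customer
print(len(check_code(rows, "branch", {"BJ", "SH"})))   # 1 invalid branch code
```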
Further, the selecting of the standardized object in the step S03 includes the steps of: specifying a detailed execution plan, determining standardized principles, defining standardized guidelines including naming rules, selecting standardized objects, collecting source data;
the word normalization comprises the steps of: selecting a reference dictionary, analyzing morphemes, defining words, grouping English and abbreviation naming synonyms, and constructing a standard word dictionary;
the domain standardization comprises analyzing data types, classifying and selecting domains, defining domains, defining the data type and length of each domain, and constructing a standard domain dictionary;
the expression standardization comprises applying the standards to the data model, judging the compliance of expressions, defining expressions, and constructing a standard expression dictionary;
at present, two approaches to data standardization are adopted in the enterprise informatization process: comprehensive standardization and progressive standardization. Comprehensive standardization first implements an independent, comprehensive data standardization project, which essentially completes the work of Information Resource Planning (IRP) across the whole enterprise, establishes a long-term stable subject database system, and builds each subsystem on this stable "information resource platform";
progressive standardization first establishes the enterprise's data standardization framework, completes the standardization of the business data and part of the management data related to a pilot subsystem in step with that subsystem's construction, then completes the data standardization of each subsequent subsystem project under a unified principle, incorporating the standardization results into the enterprise data resource platform. Generally the data standardization system is built progressively: the data standardization process proceeds in parallel with information-project construction, which guarantees construction speed while adhering to the standardization principles, supports full sharing of enterprise information resources and integration of the subsystems, combines speed with standards, keeps data standardization practical, and prevents it from becoming hollow or a mere formality.
Further, in the step S04, part of the data in database management is business data, while most of it is text data that rarely changes and is stored directly in the storage-layer database. Data processing that requires significant computing power is sent to the cluster of the operation layer, which performs big-data statistical analysis, real-time data processing, data-model analysis and the like; the structured results of processing are stored in a relational database, while images and similar information are stored in a non-relational database. During data processing, the data are normalized and standardized according to the data governance requirements, then fused and integrated to form standard, compliant data stored in the formal database;
(1) data extraction (Extract)
A great deal of this work must be done in the investigation phase. First, clarify which business systems the data come from; which DBMS each business system's database server runs; whether manual data exist, and how large the volume of manual data is; whether unstructured data exist; and so on. Only after this information is collected can data extraction be designed.
a. Processing method for data source same as database system storing DW
This type of data source is relatively easy to design for. Generally the DBMS (SQL Server, Oracle) provides a database-link function: a direct link is established between the DW database server and the original business system, so that data can be accessed directly by writing Select statements.
b. Processing method for data source different from DW database system
For this kind of data source, a database link can usually still be established in ODBC mode, for example between SQL Server and Oracle. If a database link cannot be established, there are two alternatives: one is to export the source data as txt or xls files by means of a tool and then import these source files into the ODS; the other is through a program interface.
c. For a file type data source (. txt,. xls), business personnel may be trained to import the data into a specified database using a database tool and then extract the data from the specified database.
d. Problem of incremental updates
For systems with large amounts of data, incremental extraction must be considered. Typically the business system records the time at which each transaction occurred, and this timestamp can be used to mark the increment: before each extraction, first determine the maximum time recorded in the ODS, then extract from the business system all records later than that time.
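The incremental-extraction rule above can be sketched with an in-memory database. The table and column names (`orders`, `biz_time`, `ods_orders`) are assumptions for illustration only.

```python
# Sketch of timestamp-based incremental extraction: read the maximum business
# timestamp already loaded into the ODS, then pull only newer records from the
# business system.
import sqlite3

biz = sqlite3.connect(":memory:")  # stand-in for the business system
biz.execute("CREATE TABLE orders (id INTEGER, amount REAL, biz_time TEXT)")
biz.executemany("INSERT INTO orders VALUES (?,?,?)",
                [(1, 10.0, "2020-01-01"), (2, 20.0, "2020-01-03")])

ods = sqlite3.connect(":memory:")  # stand-in for the ODS
ods.execute("CREATE TABLE ods_orders (id INTEGER, amount REAL, biz_time TEXT)")
ods.execute("INSERT INTO ods_orders VALUES (1, 10.0, '2020-01-01')")

# 1. maximum timestamp already present in the ODS
(max_time,) = ods.execute("SELECT MAX(biz_time) FROM ods_orders").fetchone()
# 2. extract only records strictly newer than that from the business system
delta = biz.execute("SELECT * FROM orders WHERE biz_time > ?", (max_time,)).fetchall()
ods.executemany("INSERT INTO ods_orders VALUES (?,?,?)", delta)
print(len(delta))  # 1 new record extracted
```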
(2) Cleaning and conversion of data (Cleaning, Transform)
Generally, the data warehouse is divided into the ODS and the DW. The general method is to clean the data from the business system into the ODS, filtering out dirty and incomplete data, and then, in the process from the ODS to the DW, to convert the data and perform calculation and aggregation of some business rules.
a. Data cleansing
The task of data cleansing is to filter out the data that do not meet the requirements; the filtered results are sent to the responsible business department, which confirms whether the data should be discarded or corrected by the business unit and then re-extracted.
The unqualified data fall mainly into three categories: incomplete data, erroneous data and duplicate data.
Incomplete data: the main problem with this kind of data is missing necessary information, such as a missing supplier name or branch name, missing customer region information, or failure to match the master and detail tables in the business system. Such data are filtered out and written into separate Excel files according to the missing content, then submitted to the client with a requirement to complete them within a specified time; once completed, they are written into the data warehouse.
Erroneous data: such errors arise because the business system is not robust enough and writes input directly into the background database without validation; for example, numeric data entered as full-width characters, a carriage return after string data, an incorrect date format, or a date out of range. This type of data also needs to be classified. Problems such as full-width characters or invisible characters before and after the data can be found only by writing SQL statements; the client is then asked to re-extract after the business system has been corrected. Errors such as an incorrect date format or an out-of-range date cause the data warehouse job to fail; these must be selected from the business system database by SQL, submitted to the responsible business department with a deadline for correction, and extracted after correction.
Duplicate data: this type of data occurs particularly in dimension tables. All fields of the duplicate records are exported and passed to the customer for confirmation and collation.
Data cleansing is an iterative process that requires constant correction of problems and is difficult to accomplish in a short time. Whether to filter or to correct generally requires customer confirmation. The filtered data are written into an Excel file or a data table; in the early stage of data warehouse development, e-mails about the filtered data can be sent to the business units every day to prompt them to correct the errors as soon as possible, and the filtered data can also serve as a basis for future data verification. Data cleansing must take care not to filter out useful data: verify each filtering rule carefully and have the user confirm it.
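The three cleansing categories above can be sketched in one routing function. The field names, the required-field list and the full-width-digit repair are illustrative assumptions, not the patent's actual rules.

```python
# Sketch of data cleansing: incomplete records are set aside for the business
# unit to complete, malformed numeric values are repaired (full-width digits,
# stray whitespace) or flagged as errors, and duplicate records are dropped.
import unicodedata

REQUIRED = ("supplier", "region")

def clean(rows):
    incomplete, errors, seen, good = [], [], set(), []
    for r in rows:
        if any(not r.get(f) for f in REQUIRED):
            incomplete.append(r)            # send back for completion
            continue
        amt = unicodedata.normalize("NFKC", str(r["amount"]))  # fix full-width digits
        try:
            r["amount"] = float(amt.strip())  # strip stray whitespace / carriage return
        except ValueError:
            errors.append(r)                # submit for correction, re-extract later
            continue
        key = (r["supplier"], r["region"])
        if key in seen:                     # duplicate: export for confirmation
            continue
        seen.add(key)
        good.append(r)
    return good, incomplete, errors

rows = [{"supplier": "S1", "region": "East", "amount": "１２３"},  # full-width digits
        {"supplier": "",   "region": "East", "amount": "5"},      # incomplete
        {"supplier": "S1", "region": "East", "amount": "123"}]    # duplicate
good, inc, err = clean(rows)
print(len(good), len(inc), len(err))  # 1 1 0
```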
b. Data conversion
The task of data transformation is mainly to perform inconsistent data transformation, transformation of data granularity, and computation of some rules.
Inconsistent data conversion: this is an integration process that unifies the same type of data from different business systems; for example, the code of one supplier is XX0001 in the settlement system and YY0001 in the CRM, so after extraction the data are unified and converted into a single code.
Conversion of data granularity: business systems typically store very detailed data, and the data in the data warehouse is used for analysis and need not be very detailed. Typically, business system data is aggregated at a data warehouse granularity.
Business rule calculation: different enterprises have different business rules and different data indexes, and sometimes these indexes cannot be produced by simple operations. In such cases the data indexes must be computed in the data warehouse and then stored there for analysis.
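The two conversions described above (code unification and granularity aggregation) can be sketched as follows. The code map and sample records are assumptions; they mirror the XX0001/YY0001 supplier example in the text.

```python
# Sketch of data conversion: map each system-specific supplier code to one
# unified code, then aggregate detail rows up to the warehouse granularity.
from collections import defaultdict

CODE_MAP = {"XX0001": "SUP-1", "YY0001": "SUP-1"}  # settlement and CRM codes unified

detail = [{"supplier": "XX0001", "amount": 100.0},
          {"supplier": "YY0001", "amount": 50.0}]

# Inconsistent-data conversion: rewrite every source code to the unified code.
for row in detail:
    row["supplier"] = CODE_MAP[row["supplier"]]

# Granularity conversion: aggregate the detailed rows per supplier for the DW.
totals = defaultdict(float)
for row in detail:
    totals[row["supplier"]] += row["amount"]
print(dict(totals))  # {'SUP-1': 150.0}
```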
(3) Data warehouse log and alert sending
a. Data warehouse logs
Data warehouse logs are divided into three categories:
one type is an execution process log. This partial log records each step performed during the execution of the data warehouse: and recording the starting time of each step of each operation, and influencing the data of the rows, the flow accounting form and the like.
One type is an error log. When a certain module has errors, the time of each error, the module with the errors, the information of the errors and the like are recorded.
The third type of log is a global log. Only information on the start time, end time and success or not of the data warehouse is recorded. If a data warehouse tool is used, the data warehouse tool automatically generates logs, which may also be part of the data warehouse log.
The purpose of recording the log is to know the operation condition of the data warehouse at any time and find errors in time.
b. Warning transmission
If the data warehouse fails, not only should a data warehouse error log be written, but a warning should also be sent to the system administrator. There are many ways to send the warning; usually an e-mail with the error information attached is sent to the system administrator so that the error can be checked conveniently.
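The alert rule above can be sketched with the standard library. The addresses, module name and mail host are assumptions; the actual SMTP send is shown as a comment because it is environment-specific.

```python
# Sketch of warning transmission: on a warehouse job failure, write an
# error-log entry and compose an e-mail with the error details attached.
from datetime import datetime
from email.message import EmailMessage

def build_alert(module, error):
    log_entry = {"time": datetime.now().isoformat(), "module": module, "error": error}
    msg = EmailMessage()
    msg["Subject"] = f"[DW ALERT] {module} failed"
    msg["From"] = "dw@example.com"          # assumed sender
    msg["To"] = "admin@example.com"         # assumed administrator address
    msg.set_content(f"Module {module} failed at {log_entry['time']}:\n{error}")
    # smtplib.SMTP("mailhost").send_message(msg)  # actual send, site-specific
    return log_entry, msg

log, mail = build_alert("cleansing", "date out of range in orders")
print(mail["Subject"])  # [DW ALERT] cleansing failed
```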
2. Implementation of data warehouse
There are many ways to implement a data warehouse; three are commonly used. One relies on data warehouse tools (such as Oracle's OWB, SQL Server 2000's DTS, SQL Server 2005's SSIS service, Informatica and the like); the second is implemented purely in SQL; and the third combines data warehouse tools with SQL. The first two each have advantages and disadvantages: tools make it possible to establish the data warehouse project quickly, shield complex coding tasks, effectively improve operational efficiency and reduce implementation difficulty, but they lack flexibility. The SQL method is flexible enough and can improve the operational efficiency of the data warehouse, but the coding is complex and the technical requirements are high. The third approach combines the advantages of the first two and can greatly improve the development speed and efficiency of the data warehouse.
Further, in the step S05, data visualization relies on data governance: by analyzing the characteristics of the data, the associations between metadata are displayed dynamically, indicating the business value carried between the data and providing a basis for multidimensional and intelligent analysis of the data; high-dimensional visualization components are used to realize the data visualization, including but not limited to trend graphs, radar charts, 3D scatter diagrams, network diagrams, hierarchical diagrams and word-cloud diagrams. According to the various data sources, the execution status of the data acquisition tasks and the backup status of the system data, after the base-level data have undergone data governance, the front end obtains data from the background through the H5 technology, the data are processed and analyzed, and the results are returned as JSONObjects. The front end receives and parses the result returned by the background and decides the final background step to enter. Finally, the relevant information of the data extraction task is displayed on the interface through the ECharts visualization technology;
wherein the trend graph, which may also be called a statistical chart, presents the development trend of something or of some information data in statistical-chart form. It displays measurements taken over a time interval such as a day, week or month: the measured quantity is plotted on the vertical axis and time on the horizontal axis, resembling a changing scoreboard. Its primary use is to determine whether various types of problems exhibit significant temporal patterns, so that their causes can be investigated.
Radar chart: also called a spider-web chart, it is one kind of analysis report. By dividing the important items over a circular fixed grid to represent their ratios, the user can clearly see the variation and trend of each index.
Scatter diagram: displays a series as a set of points; values are represented by the positions of the points in the chart and categories by different markers. Scatter plots are typically used to compare aggregated data across categories.
Word-cloud diagram: a visual depiction of keywords, used to summarize user-generated tags or the text content of a website. The tags are generally single words, often arranged alphabetically, with their importance expressed through font size or color, so the tag cloud can be browsed flexibly either in word order or by popularity. Most tags are themselves hyperlinks, pointing directly to a series of entries associated with the tag.
Bubble diagram: the data arranged in the columns of the worksheet (x values listed in the first column and corresponding y values and bubble size values listed in the adjacent columns) may be plotted in a bubble map.
In addition, the high-dimensional visualization component integrates the display modes of multiple components and displays them in superposition, so that the data can be presented comprehensively from every angle, every dimension and multiple directions.
Further, the data analysis of step S06 considers time complexity (including that of a single convolutional layer), space complexity (including the memory footprint), the training/prediction time determined by the time complexity, and a pointer network model; the pointer network model comprises a basic encoder-decoder model, an attention-mechanism encoder-decoder model, a self-attention-mechanism encoder-decoder model and a constructed text summarization model;
the data analysis and calculation aspect is mainly that the deep learning network model has high requirements on data analysis and calculation capacity. The network models mainly used include a convolutional neural network and a text information processing network. The complexity of the network model of the convolutional neural network and the pointer network is mainly analyzed.
The convolutional neural network is a commonly used network model in deep learning, and mainly analyzes the time complexity, the space complexity and the influence on the model in three parties;
time complexity
The time complexity of a single convolutional layer, i.e., the number of model operations, can be measured by FLOPs, i.e., the number of floating-point operations.
Time ~ O(M^2 · K^2 · C_in · C_out)
M: the side length of the output feature map (Feature Map) of each convolution kernel;
K: the side length of each convolution kernel (Kernel);
C_in: the number of channels of each convolution kernel, i.e. the number of input channels, which equals the number of output channels of the previous layer;
C_out: the number of convolution kernels the convolutional layer has, i.e. the number of output channels.
It can be seen that the time complexity of each convolutional layer is completely determined by the output feature-map area M^2, the convolution-kernel area K^2, the input channel number C_in and the output channel number C_out.
The output feature-map size is itself determined by four parameters: the input matrix size X, the convolution kernel size K, the Padding and the Stride, and is expressed as follows:
M = (X − K + 2 · Padding) / Stride + 1
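The two formulas above, the output feature-map side length and the single-layer time complexity, can be checked numerically. The sample sizes (32×32 input, 3×3 kernel, 64 input and 128 output channels) are illustrative.

```python
# Sketch of the per-layer complexity formulas: output feature-map side length
# M from (X, K, Padding, Stride), and single-layer FLOPs M^2 * K^2 * Cin * Cout.
def out_size(X, K, padding, stride):
    return (X - K + 2 * padding) // stride + 1

def conv_flops(M, K, c_in, c_out):
    return M * M * K * K * c_in * c_out

M = out_size(X=32, K=3, padding=1, stride=1)   # "same" convolution keeps size
print(M)                                       # 32
print(conv_flops(M, K=3, c_in=64, c_out=128))  # 32*32*9*64*128 = 75497472
```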
Time complexity of the convolutional neural network as a whole
Time ~ O(Σ_{l=1..D} M_l^2 · K_l^2 · C_{l−1} · C_l)
This is the existing, fundamental principle for computing time complexity, not an algorithm in itself; it is mainly used to compute the time complexity of the data analysis calculations, which determines the training/prediction time of the model. If the complexity is too high, model training and prediction consume a great deal of time: ideas cannot be verified quickly, the model cannot be improved, and fast prediction cannot be realized.
D: the number of convolutional layers the neural network has, i.e. the depth of the network;
l: the l-th convolutional layer of the neural network;
C_l: the number of output channels C_out of the l-th convolutional layer, which is also its number of convolution kernels; for the l-th convolutional layer, the number of input channels C_in is the number of output channels of the (l−1)-th convolutional layer.
It can be seen that the overall time complexity of the CNN is not mysterious: it is simply the accumulation of the time complexities of all the convolutional layers; multiplication within a layer, accumulation across layers.
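The layer-wise accumulation above can be sketched directly: each layer contributes M_l^2 · K_l^2 · C_{l−1} · C_l, with the input channel count of one layer chained from the output of the previous layer. The two-layer network here is illustrative.

```python
# Sketch of the whole-network time complexity: sum the per-layer FLOPs,
# chaining each layer's input channels from the previous layer's output.
def total_flops(layers, c_in):
    """layers: list of (M, K, c_out) tuples, one per convolutional layer."""
    total = 0
    for M, K, c_out in layers:
        total += M * M * K * K * c_in * c_out  # intra-layer multiplication
        c_in = c_out                           # inter-layer chaining: C_in = C_{l-1}
    return total

layers = [(32, 3, 64), (16, 3, 128)]  # (feature-map side, kernel side, out channels)
print(total_flops(layers, c_in=3))    # 1769472 + 18874368 = 20643840
```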
Spatial complexity
Space complexity (memory footprint) comprises two parts: the total number of parameters plus the output feature maps of every layer.
Space ~ O(Σ_{l=1..D} K_l^2 · C_{l−1} · C_l + Σ_{l=1..D} M_l^2 · C_l)
Parameter count: the sum of the weight parameters of all parameterized layers of the model, i.e. the model volume (the first summation above).
Feature maps: the size of the output feature map computed by each layer while the model runs (the second summation above).
The total parameter count depends only on the size K of the convolution kernels, the channel counts C and the number of layers D; it is independent of the size of the input data.
The space occupied by the output feature maps is relatively simple to compute: the product of the spatial size M^2 and the channel count C.
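The space-complexity decomposition above can be checked numerically in the same style. Bias terms are omitted for simplicity, and the layer list is illustrative.

```python
# Sketch of space complexity: total weight parameters (K^2 * C_{l-1} * C_l per
# layer) plus the output feature-map cells (M^2 * C_l per layer).
def space_complexity(layers, c_in):
    """layers: list of (M, K, c_out). Returns (param_count, feature_map_cells)."""
    params, fmaps = 0, 0
    for M, K, c_out in layers:
        params += K * K * c_in * c_out  # this layer's weights (model volume)
        fmaps += M * M * c_out          # this layer's output feature map
        c_in = c_out
    return params, fmaps

p, f = space_complexity([(32, 3, 64), (16, 3, 128)], c_in=3)
print(p, f)  # note: p depends only on K and the channel counts, not input size
```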
Effect of complexity on model
The temporal complexity determines the training/prediction time of the model. If the complexity is too high, a lot of time is consumed for model training and prediction, and the idea cannot be quickly verified and the model cannot be improved, and the quick prediction cannot be realized.
The space complexity determines the number of parameters of the model. Owing to the curse of dimensionality, the more parameters a model has, the more data are needed to train it; since real-world data sets are usually not large, a model with more parameters overfits more easily.
When a model needs to be pruned, since the spatial size of the convolution kernel is usually very small (3×3) and the network depth is closely tied to the model's representational power, neither is suitable for aggressive pruning; therefore the first place to prune a model is usually the number of channels.
Pointer network
The pointer network model takes 6 days and 14 hours to train to the best result on a 50k dictionary, and 8 days and 20 hours with a 150k dictionary. The training time of the pointer model is shorter: the model converges after 600,000 training iterations, taking 6 days and 7 hours. Pointer models learn faster in the early stages of training; the last 30,000 iterations took approximately 8 hours.
The pointer network is subjected to encoding and decoding operations of information during training.
(1) Basic codec model
The input sentence is read by one recurrent neural network, which compresses the whole sentence into a vector of fixed dimension; another recurrent neural network reads this vector and decodes it into an output sentence in the target language. These two recurrent neural networks are called the Encoder and the Decoder, respectively, so the seq2seq model is also called the encoder-decoder model. The structure diagram of the encoder and decoder is shown in the following figure: A, B, C, D represent the word vectors of the words input to the network model; h1, h2, h3, h4 represent the hidden states output by the encoder's hidden layer; s1, s2, s3, s4 represent the hidden states of the decoder; and X, Y, Z are the prediction outputs of the decoder.
(2) Encoder decoder model of attention mechanism
In the seq2seq model, the encoder condenses the complete input sentence into a fixed-dimension vector and passes it to the decoder, which predicts the output sentence from that vector. However, when the input sentence is long, a fixed-dimension intermediate vector can hardly store enough information, and this becomes the bottleneck of the basic seq2seq model. The Attention mechanism was proposed to address this problem: it allows the decoder to look at the words or segments of the input sentence in the encoder at any time, so the intermediate vector no longer has to store all the information. The structure diagram of the attention mechanism is shown in the figure, in which A, B, C, D and E represent the word vectors input to the hidden layer, the rectangular blocks represent the hidden layers of the bidirectional recurrent neural network, and h0, h1, h2, h3 represent the output states of the encoder's hidden layer. t_{j−1}, t_j, t_{j+1} also denote rectangular hidden-layer blocks; besides the output of the previous moment, the hidden state output by the hidden unit at the previous moment is also used as the input of the hidden unit at the next moment.
The decoder takes the hidden state as the input of the query in each decoding step, inputs the hidden state into the encoder to query the hidden state of the encoder, calculates a weight of the degree of correlation with the query at each input position, then calculates a weighted average of the hidden state of each input position according to the weight, and obtains a vector called a context vector after weighted average, which represents the original text information most relevant to the currently output word. When the next word is decoded, the context vector is input as additional information into the recurrent neural network of the decoder, so that the decoder can read the most relevant original information to the current output at any time without completely depending on the hidden state at the previous moment.
The mathematical definition of the attention mechanism, according to the calculation in the attention-mechanism model, is:
e_{t,i} = f(s_t, h_i)
a_t = softmax(e_t)
c_t = Σ_i a_{t,i} · h_i
where f is a function that computes the degree of correlation between each word of the source text and the current decoding state s_t; the most commonly used f is a small neural network with a single hidden layer. The weights a_t are computed by the softmax function, and the context vector c_t is computed by weighted averaging; this context vector can be regarded as a fixed-dimension vector of information read from the source text at each step.
This algorithm is mainly used for semantic extraction: it can clean and classify the raw data in the system, formatting non-compliant text information into standard-compliant metadata information for subsequent management.
From the principle of the attention mechanism, the text information input to the encoder is converted into word vectors and fed into the hidden layer; the output state of the hidden layer is h_i, and a_i represents the weight between the encoder hidden state and the hidden state output by the decoder's hidden layer. t_{j−1}, the output of the decoder at time j−1, is also used as the decoder's input at time j. From the internal details of the conventional mechanism, after the context vector of step j has been computed it is fed in at time j+1 as the input of the next moment; through the context vector the decoder can query the information most relevant to the source text at every decoding step, which avoids the information-bottleneck problem of the basic seq2seq model.
As the figure shows, the encoder with the attention mechanism uses a bidirectional recurrent neural network. Choosing a bidirectional network is essential in the attention model: when the decoder predicts a word through attention, it usually needs some of the information surrounding that word; with a unidirectional recurrent network, the hidden state of each word would contain only the textual information to its left and none to its right. Using a bidirectional recurrent neural network lets the hidden state of each word carry information from both sides at once.
The attention mechanism, besides using the bidirectional recurrent neural network, also removes the direct connection between the encoder and the decoder: the decoder relies entirely on the attention mechanism to obtain the source information, so the encoder and decoder can each freely choose their neural network model. The attention mechanism is an efficient way to obtain information: it allows the decoder to query the most relevant input information at each step, greatly shortening the distance over which information flows.
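The attention computation defined above (correlation scores e_t, softmax weights a_t, weighted-average context vector c_t) can be sketched numerically. Dot-product scoring is used here as a simple stand-in for the single-hidden-layer scoring network f; the two encoder states are illustrative.

```python
# Minimal numeric sketch of the attention formulas: e_{t,i} = f(s_t, h_i),
# a_t = softmax(e_t), and the context vector c_t as a weighted average of the
# encoder hidden states.
import math

def attention(s_t, H):
    e = [sum(si * hi for si, hi in zip(s_t, h)) for h in H]  # e_{t,i}: dot-product f
    mx = max(e)
    exp = [math.exp(x - mx) for x in e]                      # numerically stable
    a = [x / sum(exp) for x in exp]                          # a_t = softmax(e_t)
    ctx = [sum(a[i] * H[i][d] for i in range(len(H)))        # c_t: weighted average
           for d in range(len(H[0]))]
    return a, ctx

H = [[1.0, 0.0], [0.0, 1.0]]       # encoder hidden states h_1, h_2
a, ctx = attention([1.0, 0.0], H)  # decoder state s_t aligned with h_1
print(round(a[0], 3))              # larger weight on the matching encoder state
```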
(3) Self-attention mechanism encoder decoder model
The encoder-decoder model with the basic attention mechanism lets the decoder extract semantic features of the input through the internal attention computation, but the extracted semantic features are weak and the correlations within the input information cannot be captured, so the generated text summary is weak as well. To better extract the semantic relations inside the input text, the self-attention mechanism is introduced for news-summary generation, in the expectation of achieving better semantic extraction.
(4) Constructing text abstract model
The hidden-layer output of a text summarization model encoder with a self-attention mechanism differs from that of the basic encoder-decoder model with attention: the self-attention mechanism computes the degree of correlation between every pair of words in the input sequence, extracting more semantic information. The structure diagram of the self-attention mechanism is shown in the following figure:
therefore, in order to better perform relevant calculation analysis on the acquired data and upgrade the algorithm and the model, a corresponding calculation server needs to be added for support on the basis of the first-stage construction.
As shown in fig. 6:
System access flow: after data extraction, the data source is stored in the data platform and the obtained metadata are placed in the metadata subsystem, where developers can query the related metadata. Data-change impact evaluation can be applied to the data service system: the metadata subsystem performs impact analysis on the data change and feeds the impact back to the data service system through a metadata administrator. Problems in downstream systems are located by lineage analysis.
Lineage, impact, full-chain and table-association analysis: data are obtained from different data sources and then converted as required; the extracted data are transformed according to pre-designed rules, unnecessary or unusable data are cleaned out, originally heterogeneous data formats are unified, and, combined with the data's metadata, statistical analysis is performed to reach regular, conclusive analyses.
Lineage analysis: knowing, across tools, the source and destination of the flow and variation of data in the system.
Impact analysis: tracking the enterprise-wide impact of data changes across tools.
The system provides lifecycle management of metadata and version management functions, which ensure the quality of metadata and the authority and reliability of subsequent use of the metadata system.
The overall technical scheme of the invention is as follows:
1. Initial data are acquired through data interfaces.
2. The data are checked against the different required standards to ensure their availability and uniqueness.
3. Metadata are extracted, then summarized and sorted, for example by formatting them or by attaching tag-like metadata to the data, to facilitate subsequent processing.
4. By comparing all values of the same metadata field across a large amount of information and computing statistics, regular conclusions can be obtained.
5. Specific data relationships are searched for according to requirements and can be represented visually with front-end visualization technologies, enhancing the acceptability of the data.
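Step 4 above can be sketched as a simple frequency statistic over one metadata field. The `category` tag and the record layout are hypothetical illustrations, not structures defined by the patent:

```python
from collections import Counter

def metadata_statistics(tagged_records, field):
    """Compare all values of the same metadata field across many records
    and count them, yielding a regular (pattern-like) conclusion."""
    return Counter(rec["meta"][field]
                   for rec in tagged_records
                   if field in rec["meta"])

records = [
    {"data": "...", "meta": {"category": "loan"}},
    {"data": "...", "meta": {"category": "loan"}},
    {"data": "...", "meta": {"category": "deposit"}},
]
stats = metadata_statistics(records, "category")
print(stats.most_common(1))  # the dominant category and its count
```

The resulting counts are exactly the kind of aggregate a front-end chart component could then render in step 5.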
Financial metadata are maintained, retrieved, updated, lifecycle-managed, and imported and exported. Metadata maintenance mainly refers to querying, modifying and deleting the basic information, attributes, associations and other information of the financial attribute data. Metadata retrieval screens, according to search conditions, the metadata that the user's financial-data access permissions allow. Custom multi-dimensional analysis and export mainly provide export of the financial system metadata and of analysis results: all data to be exported are written from the financial metadata tree to a file, and the analysis results (chiefly impact analysis, lineage analysis and data warehouse mapping analysis) support both data export and picture export. Version management applies strict whole-lifecycle processes to the release, deletion and state changes of financial data, providing data lifecycle management that ensures stable data quality and the reliability of systems that subsequently use the metadata.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims (8)
1. A method for realizing financial big data analysis-oriented visualization is characterized by comprising the following steps:
s01, metadata collection and management: a step for interacting with the financial system to acquire initial data, and for maintaining and acquiring financial metadata according to the basic information, attributes and associations of the financial attribute data; the system provides metadata lifecycle management and version management functions, thereby ensuring the quality of the metadata and the authority and reliability of the subsequent metadata system;
s02, data quality management: the system is used for managing data quality problems, comprises data quality index management, index scheduling and execution management, problem management and data quality analysis management, and is realized by an interface layer, a response layer, a functional layer, a system management layer and a storage layer;
s03, data standardization: a step for standardizing the acquired financial metadata to obtain standardized data, adopting either a comprehensive or a progressive standardization mode and comprising standardized-object selection, word standardization, domain standardization and expression standardization;
s04, data warehouse management: a step for storing the standardized financial metadata in the form of a data warehouse, including data extraction, data cleansing and conversion, and data warehouse logging and warning transmission;
s05, data visualization: a step for acquiring, from the background, data in the data warehouse according to the data source, the execution condition of data acquisition tasks and the backup condition of system data; specifically, the front end acquires data from the background through the H5 technology, the background processes the analysis data and returns the result with a JSONObject, the front end obtains and parses the returned result, determines the last link entering the background, and displays the relevant information of the data extraction task on the interface through the ECharts visualization technology;
s06, data analysis: the convolutional neural network and the text-information processing network of a deep learning network model are analyzed from three aspects: time complexity, space complexity, and their influence on the model.
2. The method for implementing visualization oriented to financial big data analysis according to claim 1, wherein in the step of S01, the step of collecting initial data includes:
t01, establishing adapter: including but not limited to JDBC, EXCEL, template, hive, DB data dictionary adapters;
t02, establishing a suspension point: a meta-model needs to be selected to be matched with the adapter;
t03, creating data source: selecting an adapter, an adapter version, an acquisition mode, a suspension point and parameter configuration; judging whether the audit is needed, if so, entering a step T04, and if not, entering a step T05;
t04, interacting with a quality retrieval system;
t05, acquisition template management: template customization and template mapping;
t06, manual acquisition: selecting a data source through collection-task management, uploading a template, and starting collection; viewing the generated collection log, which displays errors and progress; for automatic acquisition, task configuration may be performed: setting the acquisition time, running immediately, and ending the acquisition process;
t07, view distribution: the distribution of collected data to different levels of the user view is managed at the view.
3. The method for implementing visualization oriented to financial big data analysis according to claim 1, wherein the data quality index management specifies checking rules for six typical data quality problems according to the eight-major-element specification of data quality, comprising: non-null checking, uniqueness checking, primary/foreign key checking, length checking, code checking and consistency checking; the data quality check covers eight major elements: integrity, legality, uniqueness, consistency, accuracy, timeliness, safety and extensibility;
the index scheduling and execution management is the process of executing the checking indexes, i.e. of checking for data quality problems in the source system, and finds such problems in an automatic or manual manner;
the problem management is realized by a quality problem management module and is divided into automatic management and manual management of checking problems; lineage analysis, impact analysis, detail viewing, export functions and process management are provided for the checking problems;
the data quality analysis management is used for performing distributed analysis on the data quality checking results, including index query, viewing of trend analysis views, and viewing and downloading of data quality reports; through a graphical chart interface, the causes and historical trends of a queried problem are quickly located, assisting data managers in solving data quality problems.
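Three of the checking rules listed in this claim (non-null, uniqueness and length checking) can be sketched as follows. The record layout and thresholds are hypothetical illustrations, not definitions from the patent:

```python
# Each check returns the rows that violate the rule, so the quality
# problem management module can route them for automatic or manual handling.

def non_null_check(rows, field):
    """Rows whose value for `field` is missing (null)."""
    return [r for r in rows if r.get(field) is None]

def uniqueness_check(rows, field):
    """Rows whose value for `field` duplicates an earlier row."""
    seen, dups = set(), []
    for r in rows:
        if r[field] in seen:
            dups.append(r)
        seen.add(r[field])
    return dups

def length_check(rows, field, max_len):
    """Rows whose value for `field` exceeds the allowed length."""
    return [r for r in rows if len(str(r[field])) > max_len]

rows = [{"id": "A1"}, {"id": "A1"}, {"id": None}, {"id": "TOO_LONG_ID"}]
print(len(non_null_check(rows, "id")),
      len(uniqueness_check(rows, "id")),
      len(length_check(rows, "id", 8)))
```

Primary/foreign key, code and consistency checking would follow the same pattern: a rule function that returns the violating rows.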
4. The method for implementing visualization oriented to financial big data analysis as claimed in claim 1, wherein the selecting of standardized objects in the step of S03 includes the steps of: specifying a detailed execution plan, determining standardized principles, defining standardized guidelines including naming rules, selecting standardized objects, collecting source data;
the word normalization comprises the steps of: selecting a reference dictionary, analyzing morphemes, defining words, grouping English and abbreviation naming synonyms, and constructing a standard word dictionary;
the domain standard comprises an analysis data type, a domain classification and selection standard, a definition domain, a data type and a length of the definition domain, and a standard word dictionary is constructed after a standard domain dictionary is constructed;
the expression standardization comprises the steps of applying standardization to a data model, judging the compliance of the expression, defining the expression, constructing a standard expression dictionary, and then constructing a standard word dictionary.
5. The method for implementing visualization oriented to financial big data analysis according to claim 1, wherein the data extraction in the data warehouse management of step S04 proceeds as follows: first clarify how many business systems the data come from, which DBMS each business system's database server runs, whether manual data exist and how large their volume is, and whether unstructured data exist; only after this information is collected can the data extraction be designed, which comprises:
for data sources using the same DBMS as the database storing the DW: the DBMS (SQL Server, Oracle) provides a database link function, so a direct link is established between the DW database server and the original business system and a Select statement can be written for direct access;
for data sources different from the DW database system: a database link, e.g. between SQL Server and Oracle, is established through ODBC; if no database link can be established, the task is completed in one of two ways: exporting the source data to txt or xls files with a tool and then importing these source-system files into the ODS, or using a program interface;
for file-type data sources (.txt, .xls): business personnel are trained to import the data into a specified database with a database tool, from which the data are then extracted;
for the problem of incremental updates: the business system records the time at which each transaction occurs, which serves as the increment marker; before each extraction, the maximum time recorded in the ODS is determined, and all records later than that time are then extracted.
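The incremental-update rule in this claim can be sketched as the following query pattern. The table and column names are hypothetical, and an in-memory sqlite3 database stands in for the business DBMS purely for illustration:

```python
import sqlite3

# Hypothetical ODS and source tables; a real system would use the
# business database's own DBMS rather than sqlite3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_trades(id INTEGER, occurred_at TEXT);
    CREATE TABLE src_trades(id INTEGER, occurred_at TEXT);
    INSERT INTO ods_trades VALUES (1, '2020-02-20 09:00:00');
    INSERT INTO src_trades VALUES (1, '2020-02-20 09:00:00'),
                                  (2, '2020-02-21 10:00:00');
""")

# 1. Determine the maximum time already recorded in the ODS ...
(max_time,) = conn.execute("SELECT MAX(occurred_at) FROM ods_trades").fetchone()

# 2. ... then extract every source record later than that time.
new_rows = conn.execute(
    "SELECT id, occurred_at FROM src_trades WHERE occurred_at > ?",
    (max_time,),
).fetchall()
print(new_rows)  # only the record newer than the ODS high-water mark
```

Because ISO-formatted timestamps sort lexicographically, a plain string comparison implements the "later than this time" rule.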
6. The method for implementing visualization oriented to financial big data analysis according to claim 1, wherein the data cleansing in the data warehouse management of step S04 is used for filtering out unqualified data, comprising incomplete data, erroneous data and repeated data; the filtered results are sent to the responsible business department, which decides whether they are discarded or corrected by the business unit before being extracted;
the data conversion in the data warehouse management of step S04 is used to perform inconsistent-data conversion, data-granularity conversion and business-rule calculation;
the data warehouse logs in the data warehouse management of step S04 comprise execution-process logs, error logs and general logs, and are used for monitoring the operation of the data warehouse at any time and finding errors promptly;
the warning transmission in step S04 is used for forming a data warehouse error log when an error occurs in the data warehouse and sending a warning to the system administrator; the warning is sent as a mail to the system administrator with the error information attached, so that the administrator can conveniently check the error.
7. The method for implementing visualization oriented to financial big data analysis as claimed in claim 1, wherein the data visualization in step S05 is implemented by high-dimensional visualization components including but not limited to trend graph, radar graph, 3D scatter diagram, network diagram, hierarchy diagram, word cloud diagram.
8. The method for implementing visualization oriented to financial big data analysis of claim 1, wherein the data analysis of step S06 employs a time complexity model covering a single convolutional layer, a space complexity model covering the access amount, the model training/prediction time determined by the time complexity, and a pointer network model; the pointer network model comprises a basic encoder-decoder model, an attention-mechanism encoder-decoder model, a self-attention-mechanism encoder-decoder model and the constructed text summarization model.
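As a rough illustration of the single-convolutional-layer time complexity model mentioned in this claim, the per-layer computational cost is commonly estimated as output-map area times kernel area times input and output channel counts. This standard estimate and the example layer sizes below are not formulas given in the patent:

```python
def conv_layer_flops(out_size, kernel_size, c_in, c_out):
    """Approximate multiply-accumulate count of one convolutional layer.

    Standard estimate O(M^2 * K^2 * C_in * C_out), where M is the side
    length of the output feature map and K the kernel side length.
    """
    return out_size ** 2 * kernel_size ** 2 * c_in * c_out

def network_flops(layers):
    """Total time-complexity estimate: the sum over all convolutional layers."""
    return sum(conv_layer_flops(*layer) for layer in layers)

# A small hypothetical two-layer network:
# 32x32 output, 3x3 kernel, 3 -> 16 channels, then 16x16, 3x3, 16 -> 32.
layers = [(32, 3, 3, 16), (16, 3, 16, 32)]
print(network_flops(layers))
```

The time complexity in turn bounds the model's training and prediction time, which is why the claim ties the two together.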
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010106706.1A CN111324602A (en) | 2020-02-21 | 2020-02-21 | Method for realizing financial big data oriented analysis visualization |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111324602A true CN111324602A (en) | 2020-06-23 |
Family
ID=71167094
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010106706.1A Pending CN111324602A (en) | 2020-02-21 | 2020-02-21 | Method for realizing financial big data oriented analysis visualization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111324602A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001033468A1 (en) * | 1999-11-03 | 2001-05-10 | Accenture Llp | Data warehouse computing system |
US20030204487A1 (en) * | 2002-04-26 | 2003-10-30 | Sssv Muni Kumar | A System of reusable components for implementing data warehousing and business intelligence solutions |
CN109597850A (en) * | 2018-11-22 | 2019-04-09 | 四川省烟草公司成都市公司 | Tobacco integrated information data mart modeling stores platform and data processing method |
Non-Patent Citations (1)
Title |
---|
Jiang Ying; Huang Hui; Lu Wenda; Luo Weiyi: "Research on a Power Full-Business Data Operation Management Platform Based on Big Data Technology" * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111881224A (en) * | 2020-08-06 | 2020-11-03 | 广东省信息工程有限公司 | Multidimensional data analysis method and system |
CN111949642A (en) * | 2020-08-13 | 2020-11-17 | 中国工商银行股份有限公司 | Data quality control method and device |
CN111949642B (en) * | 2020-08-13 | 2024-07-09 | 中国工商银行股份有限公司 | Method and device for data quality control |
CN111949259A (en) * | 2020-08-14 | 2020-11-17 | 中国工商银行股份有限公司 | Risk decision configuration method, system, electronic equipment and storage medium |
CN112052298A (en) * | 2020-09-11 | 2020-12-08 | 武汉众腾智创信息技术有限公司 | Multidimensional data acquisition and accurate correlation system and method thereof |
CN112052298B (en) * | 2020-09-11 | 2024-03-15 | 武汉众腾智创信息技术有限公司 | Multidimensional data acquisition and accurate association system and method thereof |
CN112256806A (en) * | 2020-11-04 | 2021-01-22 | 成都市食品药品检验研究院 | Method and system for constructing risk information base in whole course of food production and operation |
CN112256806B (en) * | 2020-11-04 | 2021-05-18 | 成都市食品药品检验研究院 | Method and system for constructing risk information base in whole course of food production and operation |
CN114647344A (en) * | 2020-12-21 | 2022-06-21 | 京东科技控股股份有限公司 | Data processing method and device |
CN112465075B (en) * | 2020-12-31 | 2021-05-25 | 杭银消费金融股份有限公司 | Metadata management method and system |
CN112699100A (en) * | 2020-12-31 | 2021-04-23 | 天津浪淘科技股份有限公司 | Management and analysis system based on metadata |
CN112465075A (en) * | 2020-12-31 | 2021-03-09 | 杭银消费金融股份有限公司 | Metadata management method and system |
CN113032495A (en) * | 2021-03-23 | 2021-06-25 | 深圳市酷开网络科技股份有限公司 | Multi-layer data storage system based on data warehouse, processing method and server |
CN113204684A (en) * | 2021-03-23 | 2021-08-03 | 厦门速言科技有限公司 | Intelligent question-answer interaction robot |
CN113032495B (en) * | 2021-03-23 | 2023-08-01 | 深圳市酷开网络科技股份有限公司 | Multi-layer data storage system, processing method and server based on data warehouse |
CN112947263A (en) * | 2021-04-20 | 2021-06-11 | 南京云玑信息科技有限公司 | Management control system based on data acquisition and coding |
CN113239188A (en) * | 2021-04-21 | 2021-08-10 | 上海快确信息科技有限公司 | Financial transaction conversation information analysis technical scheme |
CN113610113A (en) * | 2021-07-09 | 2021-11-05 | 中国银行股份有限公司 | Data visualization method and device |
CN113849546A (en) * | 2021-09-08 | 2021-12-28 | 国家电网公司东北分部 | System based on electric power K line analysis data |
CN114490602A (en) * | 2022-01-10 | 2022-05-13 | 杭州数查科技有限公司 | Multidimensional data management method based on data analysis and database system |
CN115190026A (en) * | 2022-05-09 | 2022-10-14 | 广州中南网络技术有限公司 | Internet digital circulation method |
CN115794798A (en) * | 2022-12-12 | 2023-03-14 | 江苏省工商行政管理局信息中心 | Market supervision informationized standard management and dynamic maintenance system and method |
CN115794798B (en) * | 2022-12-12 | 2023-09-15 | 江苏省工商行政管理局信息中心 | Market supervision informatization standard management and dynamic maintenance system and method |
CN116644151A (en) * | 2023-05-15 | 2023-08-25 | 绵阳市商业银行股份有限公司 | Intelligent system for applying NLP and ML to data standard alignment |
CN116644151B (en) * | 2023-05-15 | 2024-03-22 | 绵阳市商业银行股份有限公司 | Intelligent system for applying NLP and ML to data standard alignment |
CN116665020A (en) * | 2023-07-31 | 2023-08-29 | 国网浙江省电力有限公司 | Image recognition method, device, equipment and storage medium based on operator fusion |
CN116665020B (en) * | 2023-07-31 | 2024-04-12 | 国网浙江省电力有限公司 | Image recognition method, device, equipment and storage medium based on operator fusion |
CN117648388A (en) * | 2024-01-29 | 2024-03-05 | 成都七柱智慧科技有限公司 | Visual safe real-time data warehouse implementation method and system |
CN117648388B (en) * | 2024-01-29 | 2024-04-12 | 成都七柱智慧科技有限公司 | Visual safe real-time data warehouse implementation method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111324602A (en) | Method for realizing financial big data oriented analysis visualization | |
US10740396B2 (en) | Representing enterprise data in a knowledge graph | |
US7165036B2 (en) | System and method for managing a procurement process | |
AU2020200130A1 (en) | Ai-driven transaction management system | |
US20040167870A1 (en) | Systems and methods for providing a mixed data integration service | |
US20100070463A1 (en) | System and method for data provenance management | |
CN112000656A (en) | Intelligent data cleaning method and device based on metadata | |
CN111611458A (en) | Method for realizing system data architecture combing based on metadata and data analysis technology in big data management | |
WO2012057728A1 (en) | Providing information management | |
US20170139966A1 (en) | Automated analysis of data reports to determine data structure and to perform automated data processing | |
CN114880405A (en) | Data lake-based data processing method and system | |
JP6375029B2 (en) | A metadata-based online analytical processing system that analyzes the importance of reports | |
US8688499B1 (en) | System and method for generating business process models from mapped time sequenced operational and transaction data | |
CN117892820A (en) | Multistage data modeling method and system based on large language model | |
Li | Data quality and data cleaning in database applications | |
CN108549672A (en) | A kind of intelligent data analysis method and system | |
US11893008B1 (en) | System and method for automated data harmonization | |
Toivonen | Big data quality challenges in the context of business analytics | |
US20090012919A1 (en) | Explaining changes in measures thru data mining | |
CN115982429A (en) | Knowledge management method and system based on flow control | |
CN114926082A (en) | Artificial intelligence-based data fluctuation early warning method and related equipment | |
Ahmed et al. | Generating data warehouse schema | |
Ahuja et al. | Data: Its Nature and Modern Data Analytical Tools | |
CN112115699B (en) | Method and system for analyzing data | |
CN118820325A (en) | Account period data processing method, system, equipment and medium based on Microsoft 365 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned |
Effective date of abandoning: 20240419 |