CN117762900A - Data modeling method for improving the computing performance of an issuer risk screening system - Google Patents

Publication number
CN117762900A
Authority
CN
China
Prior art keywords
data
enterprise
database
risk
model
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311812725.6A
Other languages
Chinese (zh)
Inventor
吴哲锐
楼轶川
李姣
刘林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minsheng Securities Co ltd
Original Assignee
Minsheng Securities Co ltd
Application filed by Minsheng Securities Co ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application belongs to the technical field of big data and relates in particular to a data modeling method for improving the computing performance of an issuer risk screening system. The method comprises the following steps: S1, analyzing the data requirements by clarifying the data object relationships of the industrial and commercial enterprise data and clarifying the risk screening path; S2, selecting a data architecture; S3, constructing the data model by first building a traditional star dimensional model, then building a highly coupled data model on top of it, then remodeling according to the usage scenarios, and finally optimizing the table structures according to the characteristics of the HTAP database. The method is a new modeling approach that combines the HTAP-architecture database Greenplum, and the complex nested data structures that database provides, with the traditional star dimensional modeling theory of data warehouses. Applied to an issuer risk screening project, it makes it possible to look up related-party enterprises quickly and to calculate their related risk data quickly.

Description

Data modeling method for improving the computing performance of an issuer risk screening system
Technical Field
The invention relates to the technical field of big data, and in particular to a data modeling method for improving the computing performance of an issuer risk screening system.
Background
Issuer risk screening in investment banking business requires quickly looking up the relationships between enterprises in massive industrial and commercial enterprise data and calculating the related-party transaction risk data. The volume of industrial and commercial enterprise data in China is very large: it covers more than 60 million enterprises, the related records for shareholders, directors, supervisors and senior managers, legal representatives, subsidiaries and so on total more than a billion, and the storage volume exceeds 400 GB. To improve the efficiency and user experience of the staff who screen issuer risk, the risk screening system must be able to retrieve all related-party data of any enterprise quickly and, at the same time, finish calculating the six related risk indicators for all of that enterprise's related parties within a short time. This places high demands on the query speed and computing performance of the data model used by the risk screening system.
Two data modeling methods are in common use today: normal-form modeling and star dimensional modeling. Normal-form modeling is mainly applied in the back-office management information systems of consumer-facing (toC) and business-facing (toB) applications, and it is the preferred modeling scheme for most back-end technologies. It mainly uses a conventional RDBMS (for example MySQL or Oracle) as the data carrier. A conventional RDBMS guarantees transactional properties such as ACID, uses B-tree indexes, and relies on normal-form modeling to eliminate redundancy, so that, once the object relationships are clearly defined, fast queries and small inserts complete at the millisecond level. However, because the theoretical goal of normal-form modeling is to reduce redundancy and make the relationships between data objects explicit, an SQL query against a normal-form model often has to join several database tables. This is acceptable when the SQL statement touches little data. When the SQL involves heavy computation, for example aggregation, comparison or recursive calculation over tens of thousands to hundreds of thousands of rows, a conventional RDBMS is not optimized for the workload, and the normal-form model makes the performance problem worse. In the issuer risk screening scenario, which involves computation and analysis over large data volumes, normal-form modeling on a conventional RDBMS therefore cannot meet the performance requirement for calculating risk screening factors.
The star dimensional model is mainly applied to computation, analysis and operations over massive data sets, and it generally runs on a distributed database or a distributed big data processing framework (such as Hadoop). Its technical characteristic is that the source data objects (that is, the normal-form model data objects) are analyzed and then remodeled according to the relationship between facts and dimensions. The technical frameworks used with this method usually store data by column, and they speed up querying, computing and analyzing massive data through data compression and the properties of a distributed system. In a risk screening project, all related parties of the issuer must first be found quickly in the industrial and commercial enterprise data, and the amount of data involved is mostly in the thousands to tens of thousands of rows. When such a small amount of data has to be selected from a massive data set, common distributed frameworks cannot deliver reasonable performance because they have no indexes. In addition, a star dimensional model must be built on clearly identified facts and dimensions, but industrial and commercial enterprise data has no obvious transaction-log-style facts and consists mostly of one-to-many, slowly changing master data. If a star dimensional model alone is used, joins between large dimension tables still cannot be avoided, so its computing performance still cannot meet the requirement for calculating risk screening factors.
In summary, issuer risk screening has two sides: the query and filtering scenario is close to a traditional back-office application scenario, whereas the calculation of risk factors is closer to a data computation scenario. Modeling the industrial and commercial enterprise data with conventional third normal form modeling, with a relational database such as Oracle as the data carrier, cannot meet users' performance requirements for calculating risk screening factors; using the data warehouse star dimensional modeling method alone cannot meet the performance requirement for quickly finding an enterprise's related parties.
There is therefore a need for a new data modeling method for issuer risk screening, such that a risk screening system modeled with it can find the thousands to tens of thousands of related parties of any enterprise within 1-3 seconds and can finish calculating the relevant risk indicators for all of that enterprise's related parties within 30 seconds.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and to provide a data modeling method for improving the computing performance of an issuer risk screening system, one that can store massive industrial and commercial enterprise data and can quickly query related-party enterprises and their related risk data.
The technical scheme adopted by the invention is as follows: a data modeling method for improving the computing performance of an issuer risk screening system, comprising the following steps:
S1, data requirements analysis
Data requirements analysis completes two tasks: clarifying the data object relationships of the industrial and commercial enterprise data, and clarifying the risk screening path.
S11, clarifying the data object relationships of the industrial and commercial enterprise data
The industrial and commercial enterprise data required by the issuer risk screening project comprises about 20 types of enterprise-related data: enterprise annual reports, operating enterprises, industrial and commercial shareholders, industrial and commercial senior management, listed companies, New Third Board (NEEQ) companies, suspected actual controllers, subordinate controlled enterprises, A-share shareholders, top ten shareholders of New Third Board companies, tenure of A-share directors, supervisors and senior managers, tenure of New Third Board directors, supervisors and senior managers, New Third Board system constant personnel names, registered address change records, closed enterprises, suspended enterprises, deregistered enterprises, national five-level administrative division codes, full industrial and commercial ancillary information, former enterprise names, and so on. The relationships among these data sets need to be clarified.
S12, clarifying the risk screening path
According to the risk screening requirements, derive the query relationships and paths, down to the corresponding tables, for each step of the risk screening process.
S121, query the issuer's related parties and their basic enterprise information, such as enterprise mailbox, telephone, registered capital and annual reports;
S122, based on the related-party information obtained, query the data required by the risk factors, such as the related parties' shareholders, actual controllers, legal persons, and lists of directors, supervisors and senior managers.
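The two-step path in S121 and S122 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the enterprise codes, field names and sample rows are hypothetical, an in-memory dictionary stands in for a database index, and only the shareholder relation is followed, whereas the path described in the text also covers actual controllers, legal persons and other relations.

```python
from collections import defaultdict

# Hypothetical (enterprise_code, shareholder_code) rows; the real shareholder
# table described in the text holds far more rows than this.
SHAREHOLDER_ROWS = [
    ("E001", "E100"),
    ("E001", "E200"),
    ("E100", "E300"),
]

# Hypothetical basic enterprise information keyed by enterprise code.
ENTERPRISE_INFO = {
    "E001": {"name": "Issuer Co", "registered_capital": 5000},
    "E100": {"name": "Holder A", "registered_capital": 800},
    "E200": {"name": "Holder B", "registered_capital": 1200},
    "E300": {"name": "Holder C", "registered_capital": 300},
}


def build_shareholder_index(rows):
    """Group shareholder codes by enterprise code, standing in for a DB index."""
    index = defaultdict(list)
    for enterprise_code, shareholder_code in rows:
        index[enterprise_code].append(shareholder_code)
    return index


def find_related_parties(issuer_code, index, info):
    """S121: look up the issuer's related parties and their basic information."""
    return {code: info[code] for code in index.get(issuer_code, [])}


def collect_risk_inputs(related_parties, index):
    """S122: for each related party, fetch the data the risk factors need."""
    return {code: sorted(index.get(code, [])) for code in related_parties}


index = build_shareholder_index(SHAREHOLDER_ROWS)
related = find_related_parties("E001", index, ENTERPRISE_INFO)
risk_inputs = collect_risk_inputs(related, index)
print(sorted(related))  # related-party enterprise codes
print(risk_inputs)      # per related party: its own shareholders
```

The first step is a keyed lookup over a small result set, while the second fans out over every related party, which is why the two steps stress a database in different ways.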
S2, data architecture selection
Because neither a conventional RDBMS nor a distributed database alone can meet all of the performance requirements of risk screening, a database product that combines the advantages of both must be selected.
The application selects the data architecture according to two principles: first, a new hybrid database architecture is required; second, complex nested or semi-structured data structures must be supported.
For the first principle: from the results of the data requirements analysis for issuer risk screening, the first step, finding the issuer's related parties, requires locating the issuer's enterprise code among 60 million enterprise records and then using that code to find related-party enterprises in tables of shareholders, actual controllers and so on that hold more than 100 million rows. This kind of query is what RDBMS indexes are good at. The second step, calculating the risk factors, requires aggregating tens of thousands or even hundreds of thousands of risk factors to assess risk, which is what a distributed MPP database is good at. A database with both characteristics is therefore needed, namely the now-emerging hybrid (HTAP) database; well-known examples today include OceanBase, PingCAP's TiDB, and Greenplum from VMware in the United States.
For the second principle: from the results of the data requirements analysis stage, the data volume of shareholders, directors, supervisors and senior managers, annual reports and the like is so large that, even after optimization with indexes and similar means, query performance is still unsatisfactory in some extreme cases. The data object relationships therefore need to be coupled further through complex nested or semi-structured data structures. Most current mainstream databases support complex nested data structures or semi-structured data structures such as JSON. For example, OceanBase supports multi-model data structures, TiDB supports the JSON data structure, and Greenplum supports custom data types and the JSON data structure.
After investigating several options and weighing a number of indicators and compatibility, the invention selects the Greenplum database as the base database. The final data architecture uses the industrial and commercial enterprise information data provided by field of view Jin Fu (an industrial and commercial data vendor); the data is updated incrementally every day, and the daily updates are transferred over FTP. Data transfer and processing use the ETL tool Kettle, which chains an FTP acquisition component, a data processing component, a database loading component and an SQL execution component into an acquisition pipeline for each data table. The data warehouse uses the Greenplum database for data landing and data processing, and it serves both ad hoc queries and risk calculation. To iterate quickly as business requirements change, the data service uses a service framework with data orchestration capability, and a visual orchestration component that combines SQL code and Python code allows data service interfaces to be released and iterated quickly.
S3, data model construction
The modeling method extends the traditional star dimensional model, so the modeling process is divided into two steps: building the traditional star dimensional model and building the highly coupled data model. The advantage is good compatibility: although the data model built by this method modifies the star dimensional model, it remains backward compatible with it.
S31, construction of the traditional star dimensional model
The data used for risk screening falls mainly into basic enterprise information, shareholder information, information on directors, supervisors and senior managers, annual report information, and actual controller information, so the modeling steps are as follows:
a. define the master data: basic enterprise information is the master data object, and shareholders, directors, supervisors and senior managers, and annual reports are its subsidiary objects;
b. merge data on the same subject: merge the multiple database tables belonging to each master or subsidiary object, such as basic enterprise information, shareholders, directors, supervisors and senior managers, and annual reports;
c. design redundancy into the model: according to how the business uses the data, add fields during data processing that help performance, such as a latest-annual-report flag, to improve online performance.
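Step c can be illustrated with a minimal Python sketch of how a latest-annual-report flag might be computed during data processing. The field names `enterprise_code`, `report_year` and `is_latest` are assumptions made for this example; the source does not name the actual columns.

```python
def mark_latest_annual_report(reports):
    """Return copies of the report rows with an added boolean `is_latest` field."""
    latest_year = {}
    for row in reports:
        code = row["enterprise_code"]
        if code not in latest_year or row["report_year"] > latest_year[code]:
            latest_year[code] = row["report_year"]
    return [
        {**row, "is_latest": row["report_year"] == latest_year[row["enterprise_code"]]}
        for row in reports
    ]


# Hypothetical annual-report rows for two enterprises.
reports = [
    {"enterprise_code": "E001", "report_year": 2021},
    {"enterprise_code": "E001", "report_year": 2022},
    {"enterprise_code": "E002", "report_year": 2020},
]
flagged = mark_latest_annual_report(reports)
latest_rows = [row for row in flagged if row["is_latest"]]
print(latest_rows)
```

With the flag stored on each row, an online query can filter on `is_latest` directly and does not need to compare report years across rows.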
S32, construction of the highly coupled data model
After the traditional star dimensional model is defined, the invention rebuilds the base model by combining the characteristics of semi-structured data and the MPP database. That is, by combining a binary JSON data structure with column storage, 1-to-N data relationships are merged into composite data objects that are easier to query and compute on.
For example, the relationship between an enterprise and its current-period shareholders and directors, supervisors and senior managers is a simple one-to-many relationship: an enterprise's latest shareholder list and its latest list of directors, supervisors and senior managers each consist of multiple objects. The invention therefore introduces a custom data structure into the basic enterprise information to store these two latest lists.
As another example, the relationship between an enterprise and its historical shareholders and directors, supervisors and senior managers is a complex one-to-many relationship: an enterprise has multiple historical lists of each kind, and each list in turn consists of multiple shareholders or multiple directors, supervisors and senior managers. The invention therefore expresses and stores such complex historical data with an Array<custom data type> complex data structure.
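The folding of 1-to-N relationships into one composite object per enterprise can be sketched as follows. This is an illustration only: it uses plain Python dictionaries and JSON text in place of Greenplum's custom data types and binary JSON, and all field names (`current_shareholders`, `shareholder_history`, `period`, `ratio`) are hypothetical.

```python
import json
from collections import defaultdict


def build_coupled_rows(enterprises, shareholder_rows):
    """Fold shareholder rows (one per enterprise, period and shareholder)
    into a single nested record per enterprise."""
    by_period = defaultdict(lambda: defaultdict(list))
    for row in shareholder_rows:
        by_period[row["enterprise_code"]][row["period"]].append(
            {"name": row["shareholder"], "ratio": row["ratio"]}
        )
    coupled = []
    for ent in enterprises:
        periods = by_period.get(ent["enterprise_code"], {})
        latest = max(periods) if periods else None
        coupled.append({
            "enterprise_code": ent["enterprise_code"],
            "name": ent["name"],
            # Simple 1-to-many: the latest list is stored directly on the row.
            "current_shareholders": periods.get(latest, []),
            # Complex 1-to-many: an array of lists, one entry per past period.
            "shareholder_history": [
                {"period": p, "shareholders": periods[p]}
                for p in sorted(periods) if p != latest
            ],
        })
    return coupled


# Hypothetical sample data.
enterprises = [{"enterprise_code": "E001", "name": "Issuer Co"}]
shareholder_rows = [
    {"enterprise_code": "E001", "period": "2022", "shareholder": "Holder A", "ratio": 0.6},
    {"enterprise_code": "E001", "period": "2022", "shareholder": "Holder B", "ratio": 0.4},
    {"enterprise_code": "E001", "period": "2023", "shareholder": "Holder A", "ratio": 1.0},
]
coupled = build_coupled_rows(enterprises, shareholder_rows)
document = json.dumps(coupled[0], ensure_ascii=False)
print(document)
```

Reading one enterprise's shareholders then touches a single row instead of joining the enterprise table to a shareholder table with more than 100 million rows.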
S33, remodeling according to usage scenarios
The data model obtained from the highly coupled design may still run into problems in practical use. For example, to obtain an enterprise's latest annual report information, the information can be read from the coupled annual report array, but the latest annual report period must first be calculated before the query can be made, so query efficiency still has room for improvement.
Scenario-based remodeling moves complex calculations that would otherwise run at query time into an offline calculation step after modeling; the table structure then adds the required fields so that queries can fetch the results directly.
In the remodeling process, the queried fields are analyzed, and some fields inside the custom data types are extracted on their own and placed in an enterprise information redundancy table. For the latest report period, the latest (current) data and the historical data are stored in separate fields, which avoids recalculating the latest report period at query time.
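A minimal Python sketch of this remodeling step follows, assuming a coupled row that carries a nested `annual_reports` array. The field names and the `revenue` attribute are hypothetical and chosen only to show the latest period being resolved once, offline, rather than on every query.

```python
def remodel_annual_reports(coupled_row):
    """Offline step: split the nested annual-report array into a latest
    report field and a historical reports field."""
    reports = sorted(coupled_row["annual_reports"], key=lambda r: r["report_year"])
    remodeled = {k: v for k, v in coupled_row.items() if k != "annual_reports"}
    remodeled["latest_annual_report"] = reports[-1] if reports else None
    remodeled["historical_annual_reports"] = reports[:-1]
    return remodeled


def query_latest_revenue(remodeled_row):
    """Query time: read the precomputed field; no period calculation is needed."""
    latest = remodeled_row["latest_annual_report"]
    return None if latest is None else latest["revenue"]


# Hypothetical coupled row with an unsorted annual-report array.
coupled_row = {
    "enterprise_code": "E001",
    "annual_reports": [
        {"report_year": 2022, "revenue": 120},
        {"report_year": 2021, "revenue": 100},
        {"report_year": 2023, "revenue": 150},
    ],
}
remodeled = remodel_annual_reports(coupled_row)
print(query_latest_revenue(remodeled))
```

The trade-off is extra work and storage in the offline load in exchange for a constant-time read at query time.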
S34, MPP database table structure optimization
The actual usage scenarios of the invention have the characteristics of both OLAP and OLTP, so the last step of model design is to optimize the table structure for each usage scenario according to the characteristics of the HTAP database.
In addition to the enterprise information redundancy table, the model of the invention also creates a table for storing and calculating risk factors; the structure of this risk factor storage table is shown in Table 1 below:
TABLE 1
The usage scenario of the enterprise information redundancy table is closer to a traditional OLTP scenario, whereas calculating risk factors is closer to an OLAP scenario, so the optimized table structure design is as shown in Table 2 below:
TABLE 2
Table name                              | Index           | Distribution                       | Storage
Enterprise information redundancy table | Enterprise code | Distributed by enterprise code     | Heap storage
Risk factor storage table               | No index        | Distributed by risk factor content | Column storage
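The choices in Table 2 could be expressed as Greenplum DDL roughly as sketched below. The Python helpers only build the SQL text and do not connect to a database. The table names, column names and column types are hypothetical, since the source does not give them, and the storage clause assumes Greenplum's append-optimized, column-oriented table options.

```python
def create_table_ddl(name, columns, distributed_by, column_store=False):
    """Build a Greenplum-style CREATE TABLE statement as a string.

    A table without a storage clause is a heap table by default; the
    column_store option adds append-optimized, column-oriented storage.
    """
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    storage = " WITH (appendonly=true, orientation=column)" if column_store else ""
    return f"CREATE TABLE {name} ({cols}){storage} DISTRIBUTED BY ({distributed_by});"


def create_index_ddl(table, column):
    """Build a CREATE INDEX statement for point lookups on one column."""
    return f"CREATE INDEX idx_{table}_{column} ON {table} ({column});"


statements = [
    # OLTP-like table: heap storage, indexed and distributed by enterprise code.
    create_table_ddl(
        "enterprise_info_redundant",
        [("enterprise_code", "text"), ("name", "text")],
        "enterprise_code",
    ),
    create_index_ddl("enterprise_info_redundant", "enterprise_code"),
    # OLAP-like table: column storage, no index, distributed by factor content.
    create_table_ddl(
        "risk_factor_store",
        [("enterprise_code", "text"), ("factor_content", "text")],
        "factor_content",
        column_store=True,
    ),
]
for statement in statements:
    print(statement)
```

Heap storage with an index suits frequent single-enterprise lookups, while column storage without an index suits scans and aggregations over many risk factor rows.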
Related technologies and background used by the invention:
Greenplum is an open-source distributed MPP database that also has HTAP capabilities. An HTAP database is a new kind of hybrid database that combines the massive-data aggregation and analysis capability of a traditional OLAP database with the high-concurrency transaction capability of a traditional OLTP database. Choosing Greenplum solves the mass storage problem of industrial and commercial enterprise data, and its OLAP and OLTP capabilities together address the pain points of fast lookup and fast calculation.
Star dimensional modeling theory is a classical theory of data warehouse modeling; its goal is to solve the problems of poor analysis performance and difficult business understanding in analytical scenarios. Unlike the normal-form modeling theory of traditional relational databases, a star dimensional model is built only after analyzing the business scenario, and it is commonly used in data analysis and data operations applications.
A semi-structured data structure, that is, a data structure without a specific predefined schema or a complex nested data structure, can greatly improve the flexibility and query performance of a model without increasing the complexity of the data relationships in the database; this is the technical advantage of NoSQL databases and of semi-structured data structures such as JSON. A major characteristic of industrial and commercial enterprise data is that the relationships between data objects are very complex. When modeling with the star dimensional model, semi-structured data is therefore used to model the data redundantly, which speeds up data loading and calculation enough to meet the project requirements.
The beneficial effects of the invention are as follows: the data model construction method provided by the invention is a new modeling method that combines the HTAP-architecture database Greenplum, and the complex nested data structures it provides, with the traditional star dimensional modeling theory of data warehouses. Applied to an issuer risk screening project, it solves the performance problem of achieving second-level lookup and large-scale data calculation over billion-scale industrial and commercial enterprise data.
Drawings
FIG. 1 is an E-R diagram of the source industrial and commercial enterprise data in the invention;
FIG. 2 is a schematic diagram of the query path for related parties in risk screening in the invention;
FIG. 3 is a schematic diagram of the path for querying risk factors from the related parties in the invention;
FIG. 4 is the data back-end technical architecture diagram of the invention;
FIG. 5 is a schematic diagram of the star dimensional model in the invention;
FIG. 6 is a schematic diagram of the data model of the invention after the highly coupled design;
FIG. 7 is a schematic diagram of the model of the invention after remodeling according to usage scenarios;
FIG. 8 compares test results for extracting related parties and calculating their risk factors;
FIG. 9 compares test results for querying an issuer and calculating all of its risk factors.
Detailed Description
The invention is described in further detail below with reference to a specific embodiment; methods or functional elements not specifically described are prior art.
Examples
As shown in FIGS. 1-7, this embodiment provides a data modeling method for improving the computing performance of an issuer risk screening system, comprising the following steps:
S1, data requirements analysis
Data requirements analysis completes two tasks: clarifying the data object relationships of the industrial and commercial enterprise data, and clarifying the risk screening path.
S11, clarifying the data object relationships of the industrial and commercial enterprise data
The industrial and commercial enterprise data required by the issuer risk screening project comprises about 20 types of enterprise-related data: enterprise annual reports, operating enterprises, industrial and commercial shareholders, industrial and commercial senior management, listed companies, New Third Board (NEEQ) companies, suspected actual controllers, subordinate controlled enterprises, A-share shareholders, top ten shareholders of New Third Board companies, tenure of A-share directors, supervisors and senior managers, tenure of New Third Board directors, supervisors and senior managers, New Third Board system constant personnel names, registered address change records, closed enterprises, suspended enterprises, deregistered enterprises, national five-level administrative division codes, full industrial and commercial ancillary information, former enterprise names, and so on. The relationships among these data sets are shown in FIG. 1. Although the relationships in the source data are fairly complex, the relationships between data objects (tables) are fairly clear, and the source data model is a normal-form model.
S12, clarifying the risk screening path
According to the risk screening requirements, derive the query relationships and paths, down to the specific tables, for each step of the risk screening process.
S121, query the issuer's related parties and their basic enterprise information, such as enterprise mailbox, telephone, registered capital and annual reports, as shown in FIG. 2, where the shaded content is the query path.
S122, based on the related-party information obtained, query the data required by the risk factors, such as the related parties' shareholders, actual controllers, legal persons, and lists of directors, supervisors and senior managers, as shown in FIG. 3.
S2, data architecture selection
Because neither a conventional RDBMS nor a distributed database alone can meet all of the performance requirements of risk screening, a database product that combines the advantages of both must be selected. The data architecture is selected according to two principles: first, a hybrid database architecture is required; second, complex nested or semi-structured data structures must be supported.
For the first principle: from the results of the data requirements analysis for issuer risk screening, the first step, finding the issuer's related parties, requires locating the issuer's enterprise code among 60 million enterprise records and then using that code to find related-party enterprises in tables of shareholders, actual controllers and so on that hold more than 100 million rows. This kind of query is what RDBMS indexes are good at. The second step, calculating the risk factors, requires aggregating tens of thousands or even hundreds of thousands of risk factors to assess risk, which is what a distributed MPP database is good at. A database with both characteristics is therefore needed, namely the now-emerging hybrid (HTAP) database; well-known examples today include OceanBase, PingCAP's TiDB, and Greenplum from VMware in the United States.
For the second principle: from the results of the data requirements analysis stage, the data volume of shareholders, directors, supervisors and senior managers, annual reports and the like is so large that, even after optimization with indexes and similar means, query performance is still unsatisfactory in some extreme cases. The data object relationships therefore need to be coupled further through complex nested or semi-structured data structures. Most current mainstream databases support complex nested data structures or semi-structured data structures such as JSON. For example, OceanBase supports multi-model data structures, TiDB supports the JSON data structure, and Greenplum supports custom data types and the JSON data structure.
After investigating several options and weighing a number of indicators and compatibility, the invention selects the Greenplum database as the base database. The final data architecture uses the industrial and commercial enterprise information data provided by field of view Jin Fu (an industrial and commercial data vendor); the data is updated incrementally every day, and the daily updates are transferred over FTP. Data transfer and processing use the ETL tool Kettle, which chains an FTP acquisition component, a data processing component, a database loading component and an SQL execution component into an acquisition pipeline for each data table. The data warehouse uses the Greenplum database for data landing and data processing, and it serves both ad hoc queries and risk calculation. To iterate quickly as business requirements change, the data service uses a service framework with data orchestration capability, and a visual orchestration component that combines SQL code and Python code allows data service interfaces to be released and iterated quickly. The data architecture finally designed and adopted by this scheme is shown in FIG. 4.
S3, data model construction
The modeling method extends the traditional star dimensional model, so the whole modeling process can be divided into two steps: building the traditional star dimensional model and building the highly coupled data model. The advantage is good compatibility: although the model built by this method modifies the star dimensional model, the base model remains backward compatible with it.
S31, construction of the traditional star dimensional model
As can be seen from FIG. 2, the data used for risk screening falls mainly into basic enterprise information, shareholder information, information on directors, supervisors and senior managers, annual report information, and actual controller information, so the modeling steps are as follows:
a. define the master data: basic enterprise information is the master data object, and shareholders, directors, supervisors and senior managers, and annual reports are its subsidiary objects;
b. merge data on the same subject: merge the multiple database tables belonging to each master or subsidiary object, such as basic enterprise information, shareholders, directors, supervisors and senior managers, and annual reports;
c. design redundancy into the model: according to how the business uses the data, add fields during data processing that help performance, such as a latest-annual-report flag, to improve online performance.
The specific model relationships are shown in FIG. 5.
S32, construction of high coupling data model
After the traditional star-shaped dimension model is well defined, the invention re-builds the basic model (called as integrating the data model in the text) by combining the characteristics of the semi-structured data and the MMP database, namely, a composite data object which is more convenient to inquire and calculate is created through the thought of combining a binary JSON data structure and column storage by 1 pair of N data relations.
For example, the relationship between the current date of stakeholders and Dong Jiangao and the enterprise is a simple 1-to-many relationship, namely, the latest stakeholder list and Dong Jiangao list of one enterprise are composed of a plurality of objects, so that the invention introduces a custom data structure concept in the basic information of the enterprise, and is used for storing the latest stakeholder list and Dong Jiangao list; as another example, the relationship between historical stakeholders, dong Jiangao data and an enterprise is a complex 1-to-many relationship, and an enterprise has multiple historical related stakeholders and Dong Jiangao lists, each consisting of multiple stakeholders and Dong Jiangao, so the present invention expresses and stores complex data such as historical stakeholders, dong Jiangao, etc. by introducing an Array < custom data type > complex data structure.
A model of the preliminary design based on high coupling is shown in fig. 6.
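The nesting described above can be sketched as follows. This is only an illustration under stated assumptions: the function `build_coupled_enterprise`, the key names and the sample data are hypothetical, and plain JSON text stands in for the binary JSON column type (for example a PostgreSQL-style `jsonb` column), which the patent does not name.

```python
import json


def build_coupled_enterprise(basic, current_shareholders, current_management, history):
    """Fold 1-to-N child rows into one nested enterprise record.

    history maps a period label to a (shareholders, management) pair of lists.
    """
    return {
        **basic,
        # Simple 1-to-many: one list per enterprise.
        "current_shareholders": list(current_shareholders),
        "current_management": list(current_management),
        # Complex 1-to-many: an array of custom objects, each holding its own lists.
        "history": [
            {"period": period, "shareholders": list(sh), "management": list(mg)}
            for period, (sh, mg) in sorted(history.items())
        ],
    }


record = build_coupled_enterprise(
    {"enterprise_code": "E001", "enterprise_name": "Example Co"},
    current_shareholders=[{"name": "Holder A", "ratio": 0.6}],
    current_management=[{"name": "Person B", "role": "director"}],
    history={
        "2022": ([{"name": "Holder A", "ratio": 0.5}], [{"name": "Person B", "role": "director"}]),
        "2021": ([{"name": "Holder C", "ratio": 1.0}], []),
    },
)
payload = json.dumps(record)  # what would be written to a JSON-typed column
print(payload)
```

With this shape, a single row read returns the enterprise together with all of its lists, so no join against separate shareholder or management tables is needed at query time.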
S33, re-modeling according to usage scenarios
The data model obtained by modeling according to the high-coupling approach may still run into problems in practical use. Take obtaining the latest annual report information of an enterprise: although it can be read from the coupled annual report array structure, the latest reporting period must first be calculated before the report can be looked up, so query efficiency still has room for improvement.
The scenario-based re-modeling approach moves the complex calculations performed during querying into an offline computation step after modeling, and adds the required fields to the table structure so that query results can be fetched directly.
During modeling, the query fields are analyzed, and some fields inside the nested data types are extracted and placed in an enterprise information redundancy table. For reporting period data, the latest period and the historical periods are stored in separate fields, so that the latest reporting period does not have to be recalculated at query time.
An example of the model after scenario-based re-modeling is shown in fig. 7.
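A minimal sketch of this offline step, assuming the coupled record carries an `annual_reports` array, is given below. The function names and keys (`remodel_for_queries`, `latest_annual_report`, `historical_annual_reports`) are hypothetical; only the split into a latest field and a historical field follows the description.

```python
def remodel_for_queries(coupled):
    """Offline step: split the annual report array into latest and historical fields."""
    reports = sorted(coupled["annual_reports"], key=lambda r: r["report_year"])
    latest = reports[-1] if reports else None
    return {
        "enterprise_code": coupled["enterprise_code"],
        "enterprise_name": coupled["enterprise_name"],
        "latest_annual_report": latest,            # read directly at query time
        "historical_annual_reports": reports[:-1],  # everything before the latest period
    }


def latest_report_at_query_time(redundant_row):
    """Online step: no sorting or max() over the array, just a field read."""
    return redundant_row["latest_annual_report"]


coupled_row = {
    "enterprise_code": "E001",
    "enterprise_name": "Example Co",
    "annual_reports": [
        {"report_year": 2022, "net_profit": 12.5},
        {"report_year": 2021, "net_profit": -3.0},
    ],
}
redundant_row = remodel_for_queries(coupled_row)
print(latest_report_at_query_time(redundant_row))
```

The sorting cost is paid once per data load rather than once per query, which is the trade the re-modeling step makes.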
S34, MPP database table structure optimization
The actual usage scenario of the invention has both OLAP and OLTP characteristics, so the last step of model design optimizes the table structures for the different usage scenarios in combination with the characteristics of the HTAP database.
In the model of the invention, in addition to the enterprise information redundancy table, a table for storing and calculating risk factors is created. The structure of the risk factor storage table is shown in table 1:
TABLE 1
The usage scenario of the enterprise information redundancy table is closer to a traditional OLTP scenario, while risk factor calculation is closer to an OLAP scenario. The resulting table structure design is shown in table 2:
TABLE 2
Table name | Index | Distribution | Storage
Enterprise information redundancy table | Enterprise code | Distributed by enterprise code | Heap storage
Risk factor storage table | No index | Distributed by risk factor content | Column storage
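The choices in table 2 could be expressed as Greenplum DDL roughly as sketched below. The sketch only builds the statements as strings. The table names, column names and column types are assumptions, and the `WITH (appendonly=true, orientation=column)` and `DISTRIBUTED BY` clauses are standard Greenplum syntax used here for illustration, not statements taken from the patent.

```python
def create_table_ddl(name, columns, distributed_by, column_oriented=False):
    """Build a Greenplum-style CREATE TABLE statement as a string."""
    cols = ",\n    ".join(f"{col} {typ}" for col, typ in columns)
    # Append-optimized, column-oriented storage suits OLAP-style scans;
    # omitting the clause leaves the default heap storage for OLTP-style lookups.
    storage = " WITH (appendonly=true, orientation=column)" if column_oriented else ""
    return (
        f"CREATE TABLE {name} (\n    {cols}\n)"
        f"{storage} DISTRIBUTED BY ({distributed_by});"
    )


# Enterprise information redundancy table: heap storage, indexed and
# distributed by enterprise code (hypothetical column names and types).
enterprise_ddl = create_table_ddl(
    "enterprise_info_redundant",
    [
        ("enterprise_code", "text"),
        ("enterprise_name", "text"),
        ("latest_annual_report", "jsonb"),
    ],
    distributed_by="enterprise_code",
)
enterprise_index_ddl = (
    "CREATE INDEX idx_enterprise_code ON enterprise_info_redundant (enterprise_code);"
)

# Risk factor storage table: no index, column storage, distributed by
# risk factor content.
risk_factor_ddl = create_table_ddl(
    "risk_factor",
    [
        ("project_id", "text"),
        ("risk_type", "text"),
        ("risk_factor_content", "text"),
        ("enterprise_code", "text"),
    ],
    distributed_by="risk_factor_content",
    column_oriented=True,
)

for statement in (enterprise_ddl, enterprise_index_ddl, risk_factor_ddl):
    print(statement)
```

Point lookups by enterprise code benefit from the index and heap storage, while the risk factor table is read in bulk, a pattern column storage without an index generally suits better.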
Performance testing
The applicant carried out performance tests on specific cases, comparing the modeling method of the invention against directly querying the source enterprise database in Oracle without any modeling, in order to demonstrate the technical innovation and advancement of the method.
1. The test environment is shown in table 3 below.
TABLE 3
Item | Original scheme | Scheme of the invention
Database | Oracle | Greenplum
Hardware nodes | 2-node RAC mode | 2 masters, 4 segments
Operating system | Linux | Linux
2. Overview of test scenario
The application scenario of the invention is the risk factor calculation process of an issuer risk screening project, which can be roughly divided into the following steps:
1) Extract and query the associated-party enterprises of the issuer
2) Query the basic information of the associated parties
3) Extract the risk factors of the associated parties and calculate them
4) Query all risk factor calculation results of the issuer
Steps 1) and 2) involve only the issuer's associated-party information, so the amount of data queried is small. Slow queries occur only sporadically, in scenarios where the issuer's associated parties include very large domestic enterprises or large asset managers. Steps 3) and 4), however, must extract tens of thousands or even hundreds of thousands of risk factors from thousands of associated parties and run the risk calculations at the same time, so the test focuses on 3) and 4).
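Steps 3) and 4) can be sketched as follows. The row fields follow those named for the risk factor storage table (project id, risk type, risk factor content, associated enterprise code). The two screening rules, the field `abnormal_operation` and all sample values are purely hypothetical, since the text does not list the actual risk factors.

```python
from collections import Counter


def extract_risk_factors(project_id, associated_parties):
    """Step 3: turn pre-modeled enterprise rows into risk factor rows."""
    rows = []
    for party in associated_parties:
        code = party["enterprise_code"]
        report = party.get("latest_annual_report")
        # Hypothetical rule 1: latest annual report shows a loss.
        if report is not None and report["net_profit"] < 0:
            rows.append({"project_id": project_id, "risk_type": "finance",
                         "risk_factor_content": "negative net profit",
                         "enterprise_code": code})
        # Hypothetical rule 2: enterprise is flagged for abnormal operation.
        if party.get("abnormal_operation"):
            rows.append({"project_id": project_id, "risk_type": "operation",
                         "risk_factor_content": "abnormal operation record",
                         "enterprise_code": code})
    return rows


def summarize_project(risk_rows, project_id):
    """Step 4: query all risk factor results of one issuance project."""
    return Counter(r["risk_type"] for r in risk_rows if r["project_id"] == project_id)


parties = [
    {"enterprise_code": "E001", "latest_annual_report": {"net_profit": -3.0}},
    {"enterprise_code": "E002", "latest_annual_report": {"net_profit": 8.0},
     "abnormal_operation": True},
    {"enterprise_code": "E003", "latest_annual_report": None},
]
risk_rows = extract_risk_factors("P47", parties)
summary = summarize_project(risk_rows, "P47")
print(dict(summary))
```

Because the latest annual report is already a field of each row, the extraction loop does no per-enterprise lookup of the latest reporting period.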
3. Test procedure and results
The test uses actual data from real issuance projects. The data sample consists of 47 issuance projects for which extraction of the associated parties has already been completed. The relevant data are shown in table 4 below:
TABLE 4
Number of projects | Maximum number of associated parties per project | Minimum number of associated parties per project | Average number of associated parties per project
47 | 9107 | 2885 | 1057
The performance of the screening system under the following two models was tested separately:
1) No modeling by the method of the invention; calculation is performed directly on the source data;
2) After modeling by using the method, risk factor calculation and query are carried out.
In the step of extracting and calculating the risk factors of the associated parties, the test results of the two models are shown in the following table 5 and fig. 8:
TABLE 5
In the step of querying all risk factor calculation results of the issuer, the test results of the two models are shown in the following table 6 and fig. 9:
TABLE 6
The test results are summarized below:
Comparing the test result tables shows that, with the invention, the maximum calculation time does not exceed 4 seconds and the average calculation time stays at about 1.4 seconds. Without the invention, the average calculation time is nearly 3 times as long, and in as many as 6 cases the calculation time exceeds 10 seconds. Risk factor queries also benefit from the model of the invention: the average query time improves by more than 10 times, and the maximum after optimization does not exceed 1 second. These results demonstrate the performance advantage of the invention.
The above are only some embodiments of the invention and are not intended to limit it. It will be apparent to those skilled in the art that the technical features of the invention may be combined and modified in various ways, and such combinations and modifications fall within the scope of the invention provided they do not depart from its spirit and scope.

Claims (10)

1. A data modeling method capable of improving the calculation performance of an issuer risk screening system, comprising the following steps:
S1, data demand analysis
The data demand analysis completes two tasks: defining the data object relationships of the industrial and commercial enterprise data, and clarifying the risk screening path;
S2, data architecture selection
The data architecture is selected according to the following two principles: on the one hand, a novel hybrid database architecture is needed; on the other hand, complex nested or semi-structured data structures must be supported;
S3, data model construction
The method comprises a traditional star-shaped dimension model construction process and a high-coupling data model construction process; the construction process of the traditional star-shaped dimension model comprises the following steps:
a. defining the main data: enterprise basic information is determined as the main data object, and shareholders, directors, supervisors and senior management, and annual reports are determined as auxiliary objects of the main data;
b. merging data of the same subject: the multiple database tables belonging to the main data and auxiliary objects (enterprise basic information, shareholders, directors, supervisors and senior management, annual reports and the like) are merged;
c. model redundancy design: according to the characteristics of the business usage process, fields that help performance are additionally added during data processing;
in the process of constructing the high-coupling data model, 1-to-N data relationships are folded together by means of a binary JSON data structure, so that composite data objects that are more convenient to query and to compute on are created;
after the high-coupling data model is built, re-modeling is performed according to the usage scenario: complex calculations in the query process are moved into an offline computation step after modeling, and the required fields are added to the table structure so that query results can be fetched directly;
finally, table structure optimization is carried out for the different usage scenarios in combination with the characteristics of the HTAP database.
2. The data modeling method of claim 1, wherein in step S1, clarifying the risk screening path comprises two steps: first, querying the associated parties of the issuer and their basic enterprise information; second, querying the data required for the risk factors according to the associated-party information obtained.
3. The data modeling method of claim 1, wherein in step S2, a Greenplum database is selected as the base database.
4. The data modeling method of claim 1, wherein a field of a latest annual report flag is added in the model redundancy design.
5. The data modeling method of claim 1, wherein: in the process of constructing the high-coupling data model, a custom data structure is introduced into the enterprise basic information and is used for storing the latest shareholder list and the latest list of directors, supervisors and senior management.
6. The data modeling method of claim 5, wherein: in the process of constructing the high-coupling data model, historical data on shareholders and on directors, supervisors and senior management are expressed and stored by introducing an Array<custom data type> complex data structure.
7. The data modeling method of claim 1, wherein in step S3, the specific process of re-modeling according to the usage scenario includes: analyzing the query fields, extracting some fields from the nested data types and placing them in an enterprise information redundancy table; and, for reporting period data, storing the latest period and the historical periods in separate fields, so that repeated calculation of the latest reporting period during querying is avoided.
8. The data modeling method of claim 1, wherein in step S3, when the table structure is optimized in combination with the characteristics of the HTAP database, a risk factor storage table for storing and calculating risk factors is created, and the fields in the table include project id, risk type, risk factor content and associated enterprise code.
9. The data modeling method of claim 8, wherein in step S3, when the table structure of the enterprise information redundancy table is optimized, the index of the enterprise information redundancy table is set to the enterprise code, the distribution type is set to distribution by enterprise code, and the storage mode is set to heap storage.
10. The data modeling method of claim 8, wherein in step S3, when the table structure of the risk factor storage table is optimized, the risk factor storage table is given no index, the distribution type is set to distribution by risk factor content, and the storage mode is set to column storage.
CN202311812725.6A 2023-12-26 2023-12-26 Data modeling method capable of improving calculation performance of risk screening system of publisher Pending CN117762900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311812725.6A CN117762900A (en) 2023-12-26 2023-12-26 Data modeling method capable of improving calculation performance of risk screening system of publisher

Publications (1)

Publication Number Publication Date
CN117762900A true CN117762900A (en) 2024-03-26

Family

ID=90325500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311812725.6A Pending CN117762900A (en) 2023-12-26 2023-12-26 Data modeling method capable of improving calculation performance of risk screening system of publisher

Country Status (1)

Country Link
CN (1) CN117762900A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination