WO2023178767A1 - Enterprise risk detection method and device based on an enterprise credit big data knowledge graph - Google Patents

Enterprise risk detection method and device based on an enterprise credit big data knowledge graph

Info

Publication number
WO2023178767A1
Authority
WO
WIPO (PCT)
Prior art keywords
enterprise
big data
information
enterprise credit
data
Prior art date
Application number
PCT/CN2022/087210
Other languages
English (en)
French (fr)
Inventor
宋美娜
刘毓
鄂海红
欧中洪
张光卫
于勰
董亚飞
李国英
冯煜
国晓雪
郭京荆
Original Assignee
北京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京邮电大学
Publication of WO2023178767A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Definitions

  • the present disclosure relates to the field of enterprise risk detection, and in particular to an enterprise risk detection method and device based on the enterprise credit big data knowledge graph.
  • The mainstream method is to extract the attributes of enterprise nodes in the knowledge graph as basic attribute features, extract the relationships between the enterprise and other enterprise entities in the knowledge graph as association relationship features, and use the combination of basic attribute features and association relationship features as the input of a subsequent risk control model.
  • In one prior approach, network features of an enterprise, including the number and proportion of defaulting enterprises among its first-order and second-order neighbors, were extracted as the relationship features of the enterprise, combined with the basic attribute features of the enterprise, and input into a gradient boosting decision tree classification model.
  • the knowledge graph network consists of enterprise upstream and downstream, investment and financing, and closely related knowledge graphs, and uses community discovery algorithms to obtain the close relationships between enterprises.
  • the features used in the method are mainly divided into two categories.
  • The first category is basic attribute features (mainly enterprise data in the financial and judicial fields), and the second category is association relationship features (reflecting how closely an enterprise entity is related to other enterprise entities in the knowledge graph).
  • the present disclosure aims to solve one of the technical problems in the related art, at least to a certain extent.
  • this disclosure proposes an enterprise risk detection method based on the enterprise credit big data knowledge graph, including:
  • a unified information model of enterprise credit big data is obtained based on multiple dispersed data sub-domains, wherein the unified information model of enterprise credit big data includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture;
  • the relationship between key persons and enterprises is extracted through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, to realize the cross-domain connection of enterprise credit big data;
  • based on the unified information model of enterprise credit big data that realizes the cross-domain connection, a top-down approach is used to construct the first enterprise credit big data domain ontology; and a bottom-up construction method is used to perform entity extraction and relationship extraction on the data in the enterprise credit big data field, and high-quality new words are selected to expand the scale of the first enterprise credit big data domain ontology so as to construct the second enterprise credit big data domain ontology;
  • based on the second enterprise credit big data domain ontology, the enterprise credit big data is used to construct an enterprise credit big data knowledge graph, which is stored in a graph database;
  • the enterprise credit big data knowledge graph is used to obtain enterprise characteristic data, the acquired enterprise characteristic data is input into a trained risk control model for calculation and classification, and the classification results are output.
  • According to the enterprise risk detection method based on the enterprise credit big data knowledge graph, strict top-down concept definition restrictions and relationship restrictions are combined with a bottom-up approach that expands the ontology scale. This greatly improves the accuracy of the knowledge graph ontology in the enterprise credit field and lays a solid foundation for the subsequent generation of high-quality knowledge graphs. The method also introduces the characteristics of enterprise R&D and innovation capabilities as input to the risk control model, which improves the performance of the risk control model.
  • The hierarchical enterprise information architecture of the enterprise credit big data unified information model includes several of the following sub-domains: enterprise basic information, enterprise personnel information, enterprise operating information, enterprise asset information, enterprise intellectual property information, enterprise financial information, enterprise equity information, judicial data, enterprise risk information, and auxiliary reference information.
  • Using the bottom-up construction method to perform entity extraction and relationship extraction on the data in the enterprise credit big data field, and selecting high-quality new words to expand the scale of the first enterprise credit big data domain ontology so as to construct the second enterprise credit big data domain ontology, includes: performing entity extraction and relationship extraction on the data in the enterprise credit big data field using a bottom-up construction method; based on the entity extraction and relationship extraction, identifying named entities and relationship instances in the data, and making quality judgments on the named entities and relationship instances that cannot be identified; and determining a quality ranking based on the quality judgments, selecting high-quality new words, and expanding the first enterprise credit big data domain ontology to construct the second enterprise credit big data domain ontology.
  • The acquisition of enterprise characteristic data includes: acquiring the enterprise's basic attribute features, association relationship features, and R&D innovation capability features; wherein the basic attribute features and the R&D innovation capability features of the enterprise are obtained from the enterprise credit big data knowledge graph; and enterprise relationship features are extracted through four types of relationships, and network features in the enterprise credit big data knowledge graph are extracted through a shortest path algorithm and a community discovery algorithm, to obtain the association relationship features of the enterprise; wherein the four types of relationships include equity participation relationships, investment relationships, transaction relationships and litigation relationships.
  • the risk control model includes: data preprocessing, feature processing engineering, and result classification.
  • The data preprocessing includes: preprocessing the obtained enterprise characteristic data, converting date data into character variables, converting all character variables into numerical data, and extracting the IV value, WOE, efficiency and rate of the numerical data.
  • the formula for IV value, WOE, efficiency and rate is:
  • Good_i and Bad_i represent the number of non-defaulting companies and the number of defaulting companies in bin i, respectively.
  • Good_T and Bad_T represent the total number of non-defaulting companies and the total number of defaulting companies, respectively.
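The formula itself is not reproduced in this text. As a hedged reference only, the conventional credit-scoring definitions of WOE and IV can be written with the symbols defined above as follows. The sign convention (good over bad) is an assumption, and "efficiency" and "rate" are omitted because their definitions cannot be recovered from the surrounding text.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{Good_i / Good_T}{Bad_i / Bad_T}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{Good_i}{Good_T} - \frac{Bad_i}{Bad_T}\right)\mathrm{WOE}_i
```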
  • The feature processing engineering includes: deleting features with more than 50% missing values, features containing only a single unique value, features whose correlation with other features is higher than 60%, features with a feature importance of 0.0 in the gradient boosting machine, and low-importance features from the gradient boosting machine that do not contribute to 99% of the cumulative feature importance.
  • The classification of results includes: obtaining enterprise characteristic data samples and enterprise labels; training the LightGBM classification model in a supervised manner using the enterprise characteristic data samples and enterprise labels to obtain a trained LightGBM classification model; and inputting the features processed by the feature processing engineering into the trained LightGBM classification model to obtain the classification results by calculation and classification; wherein the classification results are divided into default and normal.
  • this disclosure proposes an enterprise risk detection device based on the enterprise credit big data knowledge graph, including:
  • An information acquisition module is used to obtain a unified information model of enterprise credit big data based on multiple dispersed data subdomains; wherein the unified information model of enterprise credit big data includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture;
  • the relationship connection module is used to extract the relationship between key persons and enterprises through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, so as to realize the cross-domain connection of enterprise credit big data;
  • the ontology building module is used to build the first enterprise credit big data domain ontology using a top-down approach, based on the enterprise credit big data unified information model that realizes the cross-domain connection; and, through a bottom-up construction method, to perform entity extraction and relationship extraction on the data in the enterprise credit big data field, select high-quality new words, and expand the scale of the first enterprise credit big data domain ontology to build the second enterprise credit big data domain ontology;
  • a graph building module configured to use the enterprise credit big data to construct an enterprise credit big data knowledge graph based on the second enterprise credit big data domain ontology and store it in the graph database;
  • the calculation classification module is used to obtain enterprise characteristic data using the enterprise credit big data knowledge graph, input the acquired enterprise characteristic data into the trained risk control model, perform calculation and classification, and output the classification results.
  • The enterprise risk detection device based on the enterprise credit big data knowledge graph in the disclosed embodiment adopts strict top-down concept definition restrictions and relationship restrictions and combines them with a bottom-up approach that expands the ontology scale. This greatly improves the accuracy of the knowledge graph ontology in the enterprise credit field and lays a solid foundation for the subsequent generation of high-quality knowledge graphs. The device also introduces the characteristics of enterprise R&D and innovation capabilities as input to the risk control model, which improves the performance of the risk control model.
  • Another embodiment of the present disclosure provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor. When the computer program is executed by the processor, the above enterprise risk detection method based on the enterprise credit big data knowledge graph is implemented.
  • Another aspect of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the above enterprise risk detection method based on the enterprise credit big data knowledge graph is implemented.
  • Another aspect of the present disclosure provides a computer program product, which includes computer instructions. When the computer instructions are executed by at least one processor, the above enterprise risk detection method based on the enterprise credit big data knowledge graph is implemented.
  • The enterprise credit big data knowledge graph construction technology proposed in this disclosure addresses the problem of missing information in existing enterprise credit graphs.
  • The risk control model proposed in this disclosure, which introduces the characteristics of enterprise R&D and innovation capabilities, outperforms traditional risk control models based on enterprise credit knowledge graphs, making it easier to identify defaulting companies in advance and reduce risk.
  • Figure 1 is a schematic diagram of the enterprise risk detection architecture based on the enterprise credit big data knowledge graph according to an embodiment of the present disclosure
  • Figure 2 is a flow chart of an enterprise risk detection method based on the enterprise credit big data knowledge graph according to an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of the architecture of hierarchical enterprise information of the enterprise credit big data unified information model according to an embodiment of the present disclosure
  • Figure 4(a) and Figure 4(b) are schematic diagrams of the secondary architecture of enterprise financial information of the enterprise credit big data unified information model according to an embodiment of the present disclosure
  • Figure 5 is a schematic diagram of the hierarchical key personnel architecture of the enterprise credit big data unified information model according to an embodiment of the present disclosure
  • Figure 6 is a schematic flow chart of the enterprise credit big data knowledge graph ontology supplemented by top-down and bottom-up according to an embodiment of the present disclosure
  • Figure 7 is a schematic flow chart of risk control model design according to an embodiment of the present disclosure.
  • Figure 8 is a schematic structural diagram of an enterprise risk detection device based on the enterprise credit big data knowledge graph according to an embodiment of the present disclosure.
  • the overall process of the enterprise risk detection method based on the enterprise credit big data knowledge graph according to the disclosed embodiment is shown in Figure 1.
  • the embodiment of the present disclosure adds the characteristics of enterprise R&D innovation capabilities to increase the level and dimension of the characteristics.
  • The enterprise risk control model in the enterprise risk detection method of the embodiment of the present disclosure uses LightGBM. Because LightGBM is a gradient boosting framework based on decision tree algorithms, it can also obtain the importance of each feature to the model during training, which can be used to evaluate the impact of different features on whether a company defaults.
  • Figure 2 is a flow chart of an enterprise risk detection method based on the enterprise credit big data knowledge graph according to an embodiment of the present disclosure.
  • the enterprise risk detection method based on the enterprise credit big data knowledge graph includes the following steps:
  • Step S1: Obtain a unified information model of enterprise credit big data based on multiple dispersed data sub-domains, wherein the unified information model of enterprise credit big data includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture.
  • This disclosed embodiment uses expert knowledge to study a series of relevant enterprise credit data standard systems and surveys papers and patents related to enterprise credit knowledge graphs. From existing dispersed data sub-domains such as government affairs, industry and commerce, justice, and public opinion, an "Enterprise-Key Personnel" joint framework is extracted, and a hierarchical enterprise information architecture and a key personnel architecture are designed for the enterprise credit big data scenario. The relationships between the various entities serve as connecting edges, realizing global entity association for enterprise credit big data.
  • The hierarchical enterprise information architecture of the enterprise credit big data unified information model consists of 10 information sub-domains: enterprise basic information, enterprise personnel information, enterprise operating information, enterprise asset information, enterprise intellectual property information, enterprise financial information, enterprise equity information, judicial data, enterprise risk information, and auxiliary reference information. These sub-domains jointly support the hierarchical enterprise information architecture, as shown in Figure 3.
  • enterprise financial data is taken as an example to show a fine-grained view of the enterprise information architecture.
  • Step S2: Extract the relationship between key persons and enterprises through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, to realize the cross-domain connection of enterprise credit big data.
  • The view of the hierarchical key personnel architecture of the enterprise credit big data unified information model is composed of four information sub-domains: basic information, work information, social relations, and historical risks.
  • The enterprise information in the key personnel architecture and the enterprise personnel information in the enterprise information architecture break through the association barriers between the two architectures and form a mapping relationship between entity objects. This makes the "enterprise-key personnel" data of credit big data hierarchical and correlated, and provides an initial solution to the difficulty of cross-domain connection of enterprise credit big data.
  • Figure 5 shows a view of the hierarchical key personnel architecture of the enterprise credit big data unified information model.
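The cross-domain connection described in Step S2 can be illustrated with a minimal sketch. The record layout (`person_id`, `work_info`, `personnel`, `role`, `enterprise_id`) and the function name are hypothetical and are not taken from the disclosure; the sketch only shows how person-to-enterprise edges could be collected from both architectures and merged.

```python
def link_persons_to_enterprises(person_records, enterprise_personnel_records):
    """Collect (person_id, role, enterprise_id) edges from both architectures.

    person_records: entries from the key personnel architecture (hypothetical layout),
        each listing the enterprises a person works for under "work_info".
    enterprise_personnel_records: entries from the enterprise information architecture
        (hypothetical layout), each listing the persons of one enterprise.
    """
    edges = set()
    # Edges visible from the key personnel side
    for rec in person_records:
        for job in rec["work_info"]:
            edges.add((rec["person_id"], job["role"], job["enterprise_id"]))
    # Edges visible from the enterprise side; duplicates collapse in the set
    for rec in enterprise_personnel_records:
        for member in rec["personnel"]:
            edges.add((member["person_id"], member["role"], rec["enterprise_id"]))
    return sorted(edges)
```

Because the edges are keyed on shared identifiers, a relationship recorded in only one sub-domain still ends up in the merged edge list.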
  • Step S3: Based on the unified information model of enterprise credit big data that realizes the cross-domain connection, use a top-down approach to build the first enterprise credit big data domain ontology; and use a bottom-up construction method to perform entity extraction and relationship extraction on the data in the enterprise credit big data field, and select high-quality new words to expand the scale of the first enterprise credit big data domain ontology so as to build the second enterprise credit big data domain ontology.
  • the first step in building a high-quality corporate credit big data knowledge graph is to define an accurate and clear knowledge schema, that is, to provide an ontology that describes the basic cognitive framework in the field of corporate credit reporting.
  • Traditional construction methods that focus only on the "top-down method" rely heavily on domain experts.
  • For the "bottom-up method", massive, multi-source, heterogeneous data poses huge challenges for the bottom-up construction of the ontology and for subsequent knowledge integration.
  • An enterprise credit big data knowledge graph ontology construction method based on "top-down mainly, bottom-up supplementary" is therefore used. It constrains concepts and relationships through the top-down method and integrates the bottom-up method to expand the scale of the ontology, which greatly improves the accuracy and refinement of the knowledge graph ontology and lays a solid foundation for the subsequent generation of high-quality knowledge graphs.
  • the specific construction process is shown in Figure 6.
  • Domain knowledge bases include but are not limited to Internet knowledge bases, encyclopedia websites, industry authoritative guides, metadata national standards and relational databases in the field.
  • the "enterprise-key personnel system" mentioned in the embodiment of this disclosure based on the hierarchical enterprise information architecture and key personnel information architecture summarizes the massive data resources in the field of enterprise credit big data in an orderly manner. From this label system, high-quality concepts and attributes in the field of corporate credit reporting can be screened out, as well as the relationships between concepts, and a prototype of the domain ontology can be constructed.
  • the domain ontology created using a top-down approach has been able to guide the construction of an enterprise credit big data knowledge graph instance library.
  • the ontology model of the enterprise credit area constructed only in a top-down manner is limited in scale and cannot meet the needs of subsequent knowledge graph construction technologies (such as knowledge extraction and knowledge fusion).
  • The bottom-up construction method is therefore also an important part of ontology and data construction for the enterprise credit big data knowledge graph.
  • The bottom-up auxiliary construction process starts with entity extraction and relationship extraction on data in the enterprise credit field, extracts the named entities and relationship instances in the data, and performs quality judgment on the named entities and relationship instances that could not be identified.
  • Credit experts determine whether the new words with high quality ranking are high-quality phrases and expand the current ontology structure of the enterprise credit field.
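The bottom-up expansion loop above can be sketched as follows. The disclosure does not state how quality is scored, so term frequency is used here purely as a stand-in assumption; the function names and the `approved` set (representing the credit experts' decision) are likewise hypothetical.

```python
from collections import Counter


def rank_new_terms(extracted_terms, ontology_terms, top_k=3):
    """Rank extracted terms that are missing from the ontology.

    Frequency is an assumed proxy for the quality judgment described in the text.
    """
    known = {t.lower() for t in ontology_terms}
    counts = Counter(t.lower() for t in extracted_terms if t.lower() not in known)
    # Highest count first; ties broken alphabetically so the ranking is deterministic
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


def expand_ontology(ontology_terms, ranked, approved):
    """Add only those ranked candidates that an expert reviewer approved."""
    accepted = {term for term, _ in ranked if term in approved}
    return sorted(set(ontology_terms) | accepted)
```

Keeping the expert approval as an explicit input mirrors the text: automatic ranking only proposes candidates, and the ontology grows only with terms that experts confirm.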
  • Step S4: Based on the second enterprise credit big data domain ontology, use the enterprise credit big data to construct an enterprise credit big data knowledge graph and store it in the graph database.
  • the existing enterprise credit big data is used to construct the knowledge graph and stored in the Neo4j graph database to provide a data basis for subsequent risk control models.
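As a hedged illustration of loading relationships into Neo4j, the sketch below builds a parameterized Cypher `MERGE` statement. The node label, the relationship type names and the `id` property are assumptions made for the example, not the schema of the disclosure. Labels and relationship types cannot be passed as Cypher parameters, so they are validated before being interpolated.

```python
import re

# Conservative pattern for labels and relationship types (an assumption for safety)
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def merge_relationship_query(src_label, rel_type, dst_label):
    """Build a Cypher statement that upserts two nodes and one relationship.

    Node ids and relationship properties are left as $parameters so that the
    database driver, not string formatting, handles the values.
    """
    for name in (src_label, rel_type, dst_label):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"MERGE (a:{src_label} {{id: $src_id}}) "
        f"MERGE (b:{dst_label} {{id: $dst_id}}) "
        f"MERGE (a)-[r:{rel_type}]->(b) "
        "SET r += $props"
    )
```

With the official Neo4j Python driver, the returned string could be passed to a session together with `src_id`, `dst_id` and `props` parameters; that step needs a running database and is not shown here.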
  • Step S5: Use the enterprise credit big data knowledge graph to obtain enterprise characteristic data, input the acquired enterprise characteristic data into the trained risk control model for calculation and classification, and output the classification results.
  • The basic attribute features, association relationship features, and R&D innovation capability features of the enterprise are obtained from the enterprise credit big data knowledge graph, processed, and used together as the input of the risk control model, with which the LightGBM classification model undergoes supervised training.
  • the processing flow of the embodiment of the present disclosure is shown in Figure 7, including:
  • the enterprise's basic attribute capability characteristics and R&D innovation capability characteristics exist in the form of enterprise node attributes, which can be directly exported from the Neo4j graph database.
  • The association relationship features of an enterprise reflect how closely the enterprise entity is connected to defaulting enterprise entities. Because a heterogeneous network contains many types of nodes and edges, extracting graph features from it is more difficult. The proposal therefore restricts the enterprise credit big data knowledge graph to a homogeneous network: the nodes at both ends of every relationship are limited to enterprises, and person nodes are folded away to reduce their interference with the network and to ensure that every relationship is between enterprises.
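A minimal sketch of this folding step and of two simple network features follows. Treating enterprises that share a person as direct neighbors is one plausible reading of "folding" and is an assumption here, as are all function names. Breadth-first search stands in for the shortest path algorithm, and community discovery is not covered.

```python
from collections import defaultdict, deque
from itertools import combinations


def fold_person_nodes(person_enterprise_edges, enterprise_edges):
    """Build an enterprise-only adjacency map.

    Enterprises linked through the same person become direct neighbors,
    so no person node remains in the resulting network.
    """
    adj = defaultdict(set)
    for a, b in enterprise_edges:
        adj[a].add(b)
        adj[b].add(a)
    by_person = defaultdict(set)
    for person, enterprise in person_enterprise_edges:
        by_person[person].add(enterprise)
    for enterprises in by_person.values():
        for a, b in combinations(sorted(enterprises), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def hops_to_nearest_default(adj, start, defaulted):
    """Shortest-path length (in hops) from start to the nearest other defaulting enterprise."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in defaulted and node != start:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no defaulting enterprise is reachable


def default_neighbor_stats(adj, node, defaulted):
    """Number and proportion of defaulting enterprises among first-order neighbors."""
    neighbors = adj.get(node, set())
    n_bad = len(neighbors & defaulted)
    return n_bad, (n_bad / len(neighbors) if neighbors else 0.0)
```

Restricting the graph to one node type keeps features such as hop counts and neighbor ratios directly comparable across enterprises.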
  • Table 2 Enterprise association relationship table
  • Enterprise data contains many attributes in pure string format, such as enterprise type, industry category and other code data of a specific length. It also contains date-type data such as establishment date and approval date. Date data is first converted into numerical data in seconds and then into character format. All character variables are then converted into numerical data, and their IV (Information Value), WOE, efficiency, and rate are extracted.
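The two conversions can be sketched as below. The date format, the use of UTC epoch seconds, and the sorted integer coding of strings are all assumptions, since the text does not specify how character variables are mapped to numbers.

```python
from datetime import datetime, timezone


def date_to_seconds_string(date_text, fmt="%Y-%m-%d"):
    """Convert a date string to epoch seconds (UTC assumed), returned as a string."""
    dt = datetime.strptime(date_text, fmt).replace(tzinfo=timezone.utc)
    return str(int(dt.timestamp()))


def encode_categories(values):
    """Map each distinct string to an integer code, assigned in sorted order.

    Returns the encoded list and the code table so the mapping can be reused.
    """
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes
```

Returning the code table alongside the encoded values allows the same mapping to be applied at prediction time.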
  • Good_i and Bad_i represent the number of non-defaulting companies and the number of defaulting companies in bin i, respectively.
  • Good_T and Bad_T represent the total number of non-defaulting companies and the total number of defaulting companies, respectively.
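A minimal sketch of how WOE and IV could be computed from per-bin counts is given below, assuming the conventional credit-scoring definitions WOE_i = ln((Good_i/Good_T)/(Bad_i/Bad_T)) and IV = sum over bins of (Good_i/Good_T - Bad_i/Bad_T) * WOE_i. The smoothing constant `eps` for empty cells is an added assumption, and "efficiency" and "rate" are not computed because their definitions are not given here.

```python
import math


def woe_iv(bins, eps=0.5):
    """Compute per-bin WOE and the total IV.

    bins: list of (good_count, bad_count) tuples, one per bin.
    eps: count substituted for an empty cell to avoid log(0) (an assumption).
    """
    good_total = sum(good for good, _ in bins)
    bad_total = sum(bad for _, bad in bins)
    if good_total == 0 or bad_total == 0:
        raise ValueError("both classes must be present")
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = (good if good else eps) / good_total
        bad_share = (bad if bad else eps) / bad_total
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv
```

A bin with mostly non-defaulting companies gets a positive WOE under this sign convention, and a feature whose bins separate the two classes well accumulates a larger IV.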
  • In the feature engineering process, the features first need to be processed to deal with problems such as a large number of missing values in the original data and excessive correlation between features.
  • The main steps are to delete features with more than 50% missing values, features that contain only a single unique value, features whose correlation with other features exceeds 60%, features with a feature importance of 0.0 in the gradient boosting machine (gbm), and low-importance features from the gbm that do not contribute to 99% of the cumulative feature importance.
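The first three deletion rules can be sketched in plain Python as follows. The thresholds come from the text; the data layout (a dict of columns with `None` for missing values), the use of Pearson correlation, and the choice to keep the earlier of two correlated features are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0


def filter_features(columns, max_missing=0.5, max_corr=0.6):
    """Return the names of features that survive the three filters.

    columns: dict mapping feature name to a list of numbers (None = missing).
    """
    kept = []
    for name, values in columns.items():
        if not values:
            continue
        present = [v for v in values if v is not None]
        if 1 - len(present) / len(values) > max_missing:
            continue  # more than 50% missing
        if len(set(present)) <= 1:
            continue  # only a single unique value
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # too strongly correlated with an already-kept feature
        kept.append(name)
    return kept
```

The importance-based rules need a trained gradient boosting model and are therefore not part of this sketch.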
  • This module uses the LightGBM algorithm.
  • The features processed by the feature engineering module are input into the model to obtain the classification results.
  • The results are classified into two categories: default and normal. Because LightGBM is a gradient boosting framework based on decision tree algorithms, it can obtain the importance of each feature to the model during training. The feature importances can be used to evaluate the impact of different features on whether a company defaults.
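The feature importances mentioned above also drive the zero-importance and low-importance rules of the feature engineering step. The sketch below applies the 99% cumulative-importance cutoff to a plain mapping of feature name to importance. How that mapping is produced is left open; with LightGBM's scikit-learn interface it could, for instance, be built from the fitted model's `feature_importances_` attribute. The function name and the example feature names are hypothetical.

```python
def select_by_cumulative_importance(importances, threshold=0.99):
    """Keep the most important features until they reach `threshold` of total importance.

    importances: dict mapping feature name to a non-negative importance score.
    Features with importance 0.0 are always dropped.
    """
    positive = {name: value for name, value in importances.items() if value > 0}
    total = sum(positive.values())
    kept, running = [], 0.0
    # Most important first; ties broken by name so the result is deterministic
    for name, value in sorted(positive.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= threshold * total:
            break
        kept.append(name)
        running += value
    return kept
```

Features that fall after the cutoff contribute only to the last 1% of cumulative importance and are the "low-importance" features removed by the rule.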
  • This lays a solid foundation for the knowledge graph, and the characteristics of enterprise R&D and innovation capabilities are innovatively introduced as the input of the risk control model, which improves the accuracy of the knowledge graph ontology in the enterprise credit field and improves the performance of the risk control model.
  • This embodiment also provides an enterprise risk detection device 10 based on the enterprise credit big data knowledge graph.
  • The device 10 includes: an information acquisition module 100, a relationship connection module 200, an ontology building module 300, a graph building module 400, and a calculation classification module 500.
  • the information acquisition module 100 is used to obtain a unified information model of enterprise credit big data based on multiple dispersed data sub-domains; wherein the unified information model of enterprise credit big data includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture.
  • the relationship connection module 200 is used to extract the relationship between key persons and enterprises through the enterprise information of the hierarchical key personnel structure and the enterprise personnel information of the hierarchical enterprise information structure, so as to realize cross-domain connection of enterprise credit big data.
  • The ontology building module 300 is used to build the first enterprise credit big data domain ontology using a top-down approach, based on the unified information model of enterprise credit big data that realizes the cross-domain connection; and, through a bottom-up construction method, to perform entity extraction and relationship extraction on data in the enterprise credit big data field, select high-quality new words, and expand the scale of the first enterprise credit big data domain ontology to build the second enterprise credit big data domain ontology.
  • the graph construction module 400 is used to construct an enterprise credit big data knowledge graph based on the second enterprise credit big data domain ontology using the enterprise credit big data and store it in the graph database.
  • the calculation and classification module 500 is used to obtain enterprise characteristic data using the enterprise credit big data knowledge graph, input the acquired enterprise characteristic data into the trained risk control model, perform calculation and classification, and output the classification results.
  • According to the enterprise risk detection device based on the enterprise credit big data knowledge graph of the embodiment of the present disclosure, strict top-down concept definition restrictions and relationship restrictions are combined with a bottom-up approach that expands the ontology scale. This greatly improves the accuracy of the knowledge graph ontology in the enterprise credit field and lays a solid foundation for the subsequent generation of high-quality knowledge graphs. The device also introduces the characteristics of enterprise R&D and innovation capabilities as the input of the risk control model, which improves the performance of the risk control model.
  • the embodiment of the present application proposes a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • the computer program is executed by the processor, the above-mentioned steps are implemented.
  • Enterprise risk detection method based on enterprise credit big data knowledge graph.
  • the embodiment of the present application proposes a non-transitory computer-readable storage medium on which a computer program is stored.
  • the computer program is executed by a processor, the enterprise risk detection based on the enterprise credit big data knowledge graph is implemented as described above. method.
  • the embodiment of the present application proposes a computer program product, which includes computer instructions.
  • when the computer instructions are executed by at least one processor, the enterprise risk detection method based on the enterprise credit big data knowledge graph is implemented as described above.
  • a "computer-readable medium” may be any device that can contain, store, communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a non-exhaustive list of computer-readable media includes the following: electrical connections with one or more wires (electronic device), portable computer disk cartridges (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CDROM).
  • the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.
  • various parts of the present disclosure may be implemented in hardware, software, firmware, or combinations thereof.
  • various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system.
  • for example, if it is implemented in hardware, as in another embodiment, it can be implemented by any one or a combination of the following technologies known in the art: discrete logic circuits with logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gate circuits, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), etc.
  • the program can be stored in a computer-readable storage medium.
  • each functional unit in various embodiments of the present disclosure may be integrated into one processing module, each unit may exist physically alone, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • the storage media mentioned above can be read-only memory, magnetic disks or optical disks, etc.
  • first and second are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Therefore, features defined as “first” and “second” may explicitly or implicitly include at least one of these features.
  • “plurality” means at least two, such as two, three, etc., unless otherwise expressly and specifically limited.
  • references to the terms “one embodiment,” “some embodiments,” “an example,” “specific examples,” or “some examples” and the like mean that specific features, structures, materials, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present disclosure. In this specification, the schematic expressions of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification, provided they do not contradict each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Technology Law (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

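The present disclosure discloses an enterprise risk detection method and device based on an enterprise credit big data knowledge graph. The method includes: constructing a unified information model of enterprise credit big data from the data of dispersed data subdomains; based on the unified information model, constructing a first enterprise credit big data domain ontology in a top-down manner; performing entity extraction and relation extraction on data in the enterprise credit big data domain in a bottom-up manner, and selecting high-quality new terms to expand the scale of the first domain ontology, so as to construct a second enterprise credit big data domain ontology; and, based on the constructed ontology, constructing an enterprise credit big data knowledge graph from enterprise credit big data, obtaining features from the knowledge graph, and inputting the obtained feature data into a trained risk control model, which outputs a classification result used to classify enterprises. The present disclosure improves the accuracy of the knowledge graph ontology in the enterprise credit domain and improves the performance of the risk control model.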

Description

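Enterprise risk detection method and device based on enterprise credit big data knowledge graph
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 202210302732.0, filed on March 24, 2022, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of enterprise risk detection, and in particular to an enterprise risk detection method and device based on an enterprise credit big data knowledge graph.
Background
At present, in knowledge-graph-based enterprise risk detection methods, the mainstream approach is to extract the attributes of enterprise nodes in the knowledge graph as basic attribute features, extract the relationships between an enterprise and other enterprise entities in the knowledge graph as association features, and input both together as features of the subsequent risk control model. Some extract an enterprise's feature information in the network, including the number and proportion of defaulting enterprises among its first-order and second-order neighbors, as the enterprise's relationship features, combine them with the enterprise's basic attribute features, and input them into a gradient boosting decision tree classification model. Some define, according to the business and data background, three knowledge graphs related to enterprise risk, namely the upstream-downstream, investment and financing, and close-association knowledge graphs, and use a community detection algorithm to obtain the closeness of relationships between enterprises. Some comprehensively mine enterprise associations from data such as equity relationships and personnel relationships to construct an enterprise credit knowledge graph, and build two models on the graph, an enterprise association analysis model and an enterprise group association risk model, to help commercial banks identify enterprise risk throughout the credit process.
As described above, the features used in current knowledge-graph-based enterprise risk detection methods fall into two categories: the first is basic attribute features (mainly enterprise data from the financial and judicial domains), and the second is association features (reflecting the close relationships between an enterprise entity and other enterprise entities in the knowledge graph).
However, because credit data is highly private, different industries cannot share it, so credit data faces the challenges of incompleteness and information silos. Enterprise credit data is the basis for constructing an enterprise credit graph, so existing enterprise credit graphs all suffer from problems such as missing information. The attributes of enterprise entities in such graphs come only from domains such as finance and justice and can hardly represent an enterprise's credit status in full; the data dimensions need to be increased and the model performance improved.
Summary of the invention
The present disclosure aims to solve, at least to some extent, one of the technical problems in the related art.
One aspect of the present disclosure proposes an enterprise risk detection method based on an enterprise credit big data knowledge graph, including:
obtaining a unified information model of enterprise credit big data based on multiple dispersed data subdomains, wherein the unified information model includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture; extracting the relationships between key persons and enterprises through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, so as to realize cross-domain connection of enterprise credit big data; based on the unified information model with the cross-domain connection realized, constructing a first enterprise credit big data domain ontology in a top-down manner; in a bottom-up manner, performing entity extraction and relation extraction on data in the enterprise credit big data domain and selecting high-quality new terms to expand the scale of the first enterprise credit big data domain ontology, so as to construct a second enterprise credit big data domain ontology; based on the second enterprise credit big data domain ontology, constructing an enterprise credit big data knowledge graph from enterprise credit big data and storing it in a graph database; and obtaining enterprise feature data using the enterprise credit big data knowledge graph, inputting the obtained enterprise feature data into a trained risk control model for calculation and classification, and outputting a classification result.
According to the enterprise risk detection method based on the enterprise credit big data knowledge graph of the embodiments of the present disclosure, strict top-down constraints on concept definitions and relations are combined with a bottom-up approach to expanding the ontology scale, which greatly improves the accuracy of the knowledge graph ontology in the enterprise credit domain and lays a solid foundation for subsequently generating a high-quality knowledge graph. In addition, enterprise R&D and innovation capability features are introduced as input to the risk control model, which improves the performance of the risk control model.
In some embodiments, the hierarchical enterprise information architecture of the unified information model of enterprise credit big data includes several of the following subdomains: basic enterprise information, enterprise personnel information, enterprise operation information, enterprise asset information, enterprise intellectual property information, enterprise financial information, enterprise equity information, judicial data, enterprise risk information, and auxiliary reference information.
In some embodiments, performing entity extraction and relation extraction on data in the enterprise credit big data domain in a bottom-up manner and selecting high-quality new terms to expand the scale of the first enterprise credit big data domain ontology, so as to construct the second enterprise credit big data domain ontology, includes: performing entity extraction and relation extraction on data in the enterprise credit big data domain in a bottom-up manner; based on the entity extraction and relation extraction, recognizing the named entities and relation instances in the data, and performing quality assessment on named entities and relation instances that could not be recognized; and determining a quality ranking based on the quality assessment, selecting high-quality new terms, and extending the first enterprise credit big data domain ontology to construct the second enterprise credit big data domain ontology.
In some embodiments, obtaining the enterprise feature data includes: obtaining the basic attribute features, association features, and R&D and innovation capability features of an enterprise; wherein the basic attribute features and the R&D and innovation capability features of the enterprise are obtained from the enterprise credit big data knowledge graph; and enterprise relationship features are extracted from four types of relationships, and network features in the enterprise credit big data knowledge graph are extracted by a shortest path algorithm and a community detection algorithm to obtain the association features of the enterprise; wherein the four types of relationships are shareholding relationships, investment relationships, transaction relationships, and litigation relationships.
In some embodiments, the risk control model includes: data preprocessing, feature engineering, and result classification.
In some embodiments, the data preprocessing includes: preprocessing the obtained enterprise feature data by converting date-type data into character-type variables, then converting all character-type variables to obtain numeric data, and extracting the IV value, WOE, efficiency, and rate of the numeric data.
In some embodiments, the formulas for the IV value, WOE, efficiency, and rate are: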
Figure PCTCN2022087210-appb-000001
Figure PCTCN2022087210-appb-000002
Figure PCTCN2022087210-appb-000003
Figure PCTCN2022087210-appb-000004
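where Good_i and Bad_i denote the numbers of non-defaulting and defaulting enterprises counted in each bin, and Good_T and Bad_T denote the total numbers of non-defaulting and defaulting enterprises, respectively.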
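The formula images referenced above are not reproduced in this text. For orientation only, the WOE and IV definitions commonly used in credit scorecards are given below, written with the symbols defined above. This is an assumption about what the images contain: the sign convention (good over bad) may differ from the original, and the efficiency and rate formulas are not reconstructed here.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{\mathrm{Good}_i / \mathrm{Good}_T}{\mathrm{Bad}_i / \mathrm{Bad}_T}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{\mathrm{Good}_i}{\mathrm{Good}_T} - \frac{\mathrm{Bad}_i}{\mathrm{Bad}_T}\right)\mathrm{WOE}_i
```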
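In some embodiments, the feature engineering includes deleting: features with more than 50% missing values; features with only a single unique value; features whose correlation with other features exceeds 60%; features whose feature importance in the gradient boosting machine is 0.0; and low-importance features that do not contribute to 99% of the cumulative feature importance from the gradient boosting machine.
In some embodiments, the result classification includes: obtaining enterprise feature data samples and enterprise labels; training a LightGBM classification model in a supervised manner using the enterprise feature data samples and enterprise labels to obtain a trained LightGBM classification model; and inputting the features processed by the feature engineering into the trained LightGBM classification model for calculation and classification to obtain a classification result, wherein the classification result is either default or normal.
Another aspect of the present disclosure proposes an enterprise risk detection device based on an enterprise credit big data knowledge graph, including:
an information acquisition module, configured to obtain a unified information model of enterprise credit big data based on multiple dispersed data subdomains, wherein the unified information model includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture;
a relationship connection module, configured to extract the relationships between key persons and enterprises through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, so as to realize cross-domain connection of enterprise credit big data;
an ontology construction module, configured to, based on the unified information model with the cross-domain connection realized, determine the enterprise credit big data domain and construct a first enterprise credit big data domain ontology in a top-down manner; and, in a bottom-up manner, perform entity extraction and relation extraction on data in the enterprise credit big data domain, select high-quality new terms, and expand the scale of the first enterprise credit big data domain ontology, so as to construct a second enterprise credit big data domain ontology;
a graph construction module, configured to construct, based on the second enterprise credit big data domain ontology, an enterprise credit big data knowledge graph from enterprise credit big data and store it in a graph database;
a calculation and classification module, configured to obtain enterprise feature data using the enterprise credit big data knowledge graph, input the obtained enterprise feature data into a trained risk control model for calculation and classification, and output a classification result.
The enterprise risk detection device based on the enterprise credit big data knowledge graph of the embodiments of the present disclosure combines strict top-down constraints on concept definitions and relations with a bottom-up approach to expanding the ontology scale, which greatly improves the accuracy of the knowledge graph ontology in the enterprise credit domain and lays a solid foundation for subsequently generating a high-quality knowledge graph. In addition, enterprise R&D and innovation capability features are introduced as input to the risk control model, which improves the performance of the risk control model.
An embodiment of another aspect of the present disclosure proposes a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the enterprise risk detection method based on the enterprise credit big data knowledge graph as described above.
An embodiment of another aspect of the present disclosure proposes a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the enterprise risk detection method based on the enterprise credit big data knowledge graph as described above.
An embodiment of another aspect of the present disclosure proposes a computer program product, including computer instructions which, when executed by at least one processor, implement the enterprise risk detection method based on the enterprise credit big data knowledge graph as described above.
The enterprise credit big data knowledge graph construction technique proposed in the present disclosure addresses problems such as missing information in existing enterprise credit graphs.
The risk control model proposed in the present disclosure, which introduces enterprise R&D and innovation capability features, outperforms traditional risk control models based on enterprise credit knowledge graphs, making it easier to identify defaulting enterprises in advance and reduce risk.
Additional aspects and advantages of the present disclosure will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present disclosure.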
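Brief description of the drawings
The above and/or additional aspects and advantages of the present disclosure will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Figure 1 is a schematic diagram of the enterprise risk detection architecture based on the enterprise credit big data knowledge graph according to an embodiment of the present disclosure;
Figure 2 is a flowchart of the enterprise risk detection method based on the enterprise credit big data knowledge graph according to an embodiment of the present disclosure;
Figure 3 is a schematic diagram of the hierarchical enterprise information architecture of the unified information model of enterprise credit big data according to an embodiment of the present disclosure;
Figures 4(a) and 4(b) are schematic diagrams of the second-level architecture of enterprise financial information in the unified information model of enterprise credit big data according to an embodiment of the present disclosure;
Figure 5 is a schematic diagram of the hierarchical key personnel architecture of the unified information model of enterprise credit big data according to an embodiment of the present disclosure;
Figure 6 is a schematic flowchart of constructing the enterprise credit big data knowledge graph ontology primarily top-down, with bottom-up as a supplement, according to an embodiment of the present disclosure;
Figure 7 is a schematic flowchart of the risk control model design according to an embodiment of the present disclosure;
Figure 8 is a schematic structural diagram of the enterprise risk detection device based on the enterprise credit big data knowledge graph according to an embodiment of the present disclosure.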
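Detailed description
It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with each other. The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort shall fall within the scope of protection of the present disclosure.
The enterprise risk detection method and device based on the enterprise credit big data knowledge graph proposed according to the embodiments of the present disclosure are described below with reference to the accompanying drawings, beginning with the method.
The overall flow of the enterprise risk detection method based on the enterprise credit big data knowledge graph of the embodiment of the present disclosure is shown in Figure 1. On the basis of the original risk control model, the embodiment adds enterprise R&D and innovation capability features to increase the levels and dimensions of the features. In addition to the added features, the enterprise risk control model in the enterprise risk detection method of this embodiment uses LightGBM. Because LightGBM is a gradient boosting framework based on decision tree algorithms, the importance of each feature to the model can also be obtained during training and used to evaluate how strongly different features influence whether an enterprise defaults.
Figure 2 is a flowchart of the enterprise risk detection method based on the enterprise credit big data knowledge graph according to one embodiment of the present disclosure.
As shown in Figure 2, the enterprise risk detection method based on the enterprise credit big data knowledge graph includes the following steps:
Step S1: obtain a unified information model of enterprise credit big data based on multiple dispersed data subdomains, wherein the unified information model includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture.
By applying expert knowledge, studying a series of relevant enterprise credit data standard systems, and surveying papers and patents related to enterprise credit knowledge graphs, the embodiment of the present disclosure distills an "enterprise-key person" joint framework from existing dispersed data subdomains such as government affairs, industry and commerce, justice, and public opinion, and designs a hierarchical enterprise information architecture and key personnel architecture for enterprise credit big data scenarios. With the relationships between the various entities as connecting edges, entity association across the whole domain of enterprise credit big data is realized.
The hierarchical enterprise information architecture of the unified information model is supported jointly by 10 information subdomains: basic enterprise information, enterprise personnel information, enterprise operation information, enterprise asset information, enterprise intellectual property information, enterprise financial information, enterprise equity information, judicial data, enterprise risk information, and auxiliary reference information. The architecture is shown in Figure 3.
Figures 4(a) and 4(b) take enterprise financial data as an example to show a fine-grained view of the enterprise information architecture.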
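Step S2: extract the relationships between key persons and enterprises through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, so as to realize cross-domain connection of enterprise credit big data.
It can be understood that the view of the hierarchical key personnel architecture of the unified information model consists of four information subdomains: basic information, work information, social relationships, and historical risk. Using the enterprise information in the key personnel architecture and the enterprise personnel information in the enterprise information architecture, the barrier between this architecture and the enterprise architecture can be broken through and mappings between entity objects formed, so that credit big data is organized hierarchically and associatively around "enterprise-key personnel". This provides an initial solution to the difficulty of cross-domain connection of enterprise credit big data.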
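The linking idea in step S2 can be illustrated with a minimal sketch. The record layout, field names, and role labels below are hypothetical and chosen only for illustration; the source does not specify a data format.

```python
from collections import defaultdict

# Hypothetical records from the enterprise information architecture
# (enterprise personnel information subdomain).
enterprise_personnel = [
    {"enterprise": "E1", "person": "P1", "role": "legal_representative"},
    {"enterprise": "E1", "person": "P2", "role": "director"},
    {"enterprise": "E2", "person": "P3", "role": "supervisor"},
]

# Hypothetical records from the key personnel architecture
# (enterprise information held on the person side).
personnel_work = [
    {"person": "P1", "enterprise": "E2", "role": "shareholder"},
    {"person": "P2", "enterprise": "E1", "role": "director"},
]


def extract_person_enterprise_relations(enterprise_side, person_side):
    """Merge both views into de-duplicated (person, role, enterprise) triples."""
    triples = set()
    for rec in enterprise_side + person_side:
        triples.add((rec["person"], rec["role"], rec["enterprise"]))
    return triples


def enterprises_linked_by_person(triples):
    """Map each key person to the enterprises they connect (two or more)."""
    by_person = defaultdict(set)
    for person, _role, enterprise in triples:
        by_person[person].add(enterprise)
    return {p: sorted(es) for p, es in by_person.items() if len(es) > 1}


triples = extract_person_enterprise_relations(enterprise_personnel, personnel_work)
links = enterprises_linked_by_person(triples)
```

Here person P1 appears in both views and therefore connects enterprises E1 and E2, which is the kind of cross-domain link the step describes.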
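Figure 5 shows the view of the hierarchical key personnel architecture of the unified information model of enterprise credit big data.
The hierarchical enterprise information architecture and key person information architecture for enterprise credit big data scenarios aim to realize whole-domain entity association of enterprise credit big data in a "dual-core" manner, which requires defining the relationships between entities. The entity relationship settings are shown in Table 1.
Table 1: Entity relationship design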
Figure PCTCN2022087210-appb-000005
Figure PCTCN2022087210-appb-000006
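Step S3: based on the unified information model with the cross-domain connection realized, construct a first enterprise credit big data domain ontology in a top-down manner; and, in a bottom-up manner, perform entity extraction and relation extraction on data in the enterprise credit big data domain and select high-quality new terms to expand the scale of the first enterprise credit big data domain ontology, so as to construct a second enterprise credit big data domain ontology.
The first step in building a high-quality enterprise credit big data knowledge graph is to define an accurate and clear knowledge schema, that is, an ontology that describes the basic cognitive framework of the enterprise credit domain. However, traditional construction methods that focus only on the "top-down" approach depend heavily on domain experts. For the "bottom-up" approach, the massive, multi-source, heterogeneous data poses a great challenge to bottom-up ontology construction and subsequent knowledge fusion.
Given the shortcomings of relying on a single ontology construction method, an ontology construction method for the enterprise credit big data knowledge graph that is "primarily top-down, with bottom-up as a supplement" is used. The top-down method constrains the concepts and relations, and the bottom-up method is integrated to expand the scale of the ontology. This greatly improves the accuracy and granularity of the knowledge graph ontology and lays a solid foundation for subsequently generating a high-quality knowledge graph. The specific construction flow is shown in Figure 6.
Forming the domain ontology top-down requires mining the knowledge in domain knowledge bases and consulting domain experts. Domain knowledge bases include, but are not limited to, Internet knowledge bases for the domain, encyclopedia websites, authoritative industry guides, national metadata standards, and relational databases. For example, the "enterprise-key personnel system" mentioned in the embodiment of the present disclosure, summarized from the hierarchical enterprise information architecture and key personnel information architecture, organizes the massive data resources of the enterprise credit big data domain in an orderly way. From this label system, high-quality concepts and attributes of the enterprise credit domain, together with the relationships between concepts, can be selected to build a prototype of the domain ontology.
The domain ontology created top-down can already guide the construction of the instance base of the enterprise credit big data knowledge graph. However, as data resources in the enterprise credit domain grow, an ontology model built only top-down is limited in scale and cannot meet the needs of subsequent knowledge graph construction techniques (such as knowledge extraction and knowledge fusion). If the multi-source, massive, heterogeneous data resources of the enterprise credit domain can be organized, used, and improved, they can provide a strong data-driven impetus for knowledge graph construction in this domain; the bottom-up approach is therefore also an important part of building the ontology and data of the enterprise credit big data knowledge graph. The auxiliary bottom-up construction flow first performs entity extraction and relation extraction on data in the enterprise credit domain to extract the named entities and relation instances in the data, and performs quality assessment on named entities and relation instances that could not be recognized. Credit experts then judge whether the top-ranked new terms are high-quality phrases and extend the current enterprise credit domain ontology structure accordingly.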
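The bottom-up expansion described above (rank unrecognized candidates, let experts confirm the top ones, extend the ontology) might be sketched as follows. The quality score used here is plain frequency, which is a stand-in assumption; the source does not say how quality is measured. The function names and the `expert_approves` callback are likewise hypothetical.

```python
from collections import Counter


def rank_new_terms(candidate_mentions, ontology_terms, top_k=3):
    """Rank candidate terms that the current ontology does not cover.

    Frequency is used as the quality score (an assumption for illustration).
    """
    counts = Counter(m for m in candidate_mentions if m not in ontology_terms)
    # Descending count, then alphabetical, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def extend_ontology(ontology_terms, ranked_terms, expert_approves):
    """Return the second ontology: the first plus expert-approved new terms."""
    approved = {term for term, _score in ranked_terms if expert_approves(term)}
    return set(ontology_terms) | approved


first_ontology = {"enterprise", "shareholder", "patent"}
mentions = [
    "patent", "utility model", "utility model",
    "equity pledge", "equity pledge", "equity pledge", "misc",
]
ranked = rank_new_terms(mentions, first_ontology, top_k=2)
second_ontology = extend_ontology(first_ontology, ranked, lambda term: term != "misc")
```

Keeping the expert decision as a callback mirrors the text: automatic ranking only proposes terms, and a human judgment decides which of them enter the ontology.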
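Step S4: based on the second enterprise credit big data domain ontology, construct an enterprise credit big data knowledge graph from enterprise credit big data and store it in a graph database.
After the ontology of the enterprise credit big data knowledge graph has been constructed with the above method, the knowledge graph is built from the existing enterprise credit big data and stored in the Neo4j graph database, providing the data basis for the subsequent risk control model.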
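As a rough illustration of loading triples into Neo4j, the sketch below only builds parameterized Cypher `MERGE` statements; it does not open a database connection, and the labels, relationship type, and `id` property are hypothetical names rather than the schema used in the source. Node labels and relationship types cannot be passed as Cypher parameters, so they are interpolated and are assumed to come from the trusted ontology.

```python
def node_statement(label, node_id):
    """Return (cypher, params) that creates the node only if it is missing."""
    return f"MERGE (n:{label} {{id: $id}})", {"id": node_id}


def relation_statement(src_label, src_id, rel_type, dst_label, dst_id):
    """Return (cypher, params) that links two existing nodes idempotently."""
    cypher = (
        f"MATCH (a:{src_label} {{id: $src}}), (b:{dst_label} {{id: $dst}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return cypher, {"src": src_id, "dst": dst_id}


def triple_to_statements(triple):
    """Expand one (label, id, relation, label, id) triple into three statements."""
    src_label, src_id, rel_type, dst_label, dst_id = triple
    return [
        node_statement(src_label, src_id),
        node_statement(dst_label, dst_id),
        relation_statement(src_label, src_id, rel_type, dst_label, dst_id),
    ]


statements = triple_to_statements(
    ("Person", "P1", "SHAREHOLDER_OF", "Enterprise", "E1")
)
```

Using `MERGE` rather than `CREATE` keeps repeated loads from duplicating nodes or relationships; each (cypher, params) pair could then be executed through a Neo4j client session.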
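Step S5: obtain enterprise feature data using the enterprise credit big data knowledge graph, input the obtained enterprise feature data into a trained risk control model for calculation and classification, and output a classification result.
In the enterprise risk control model module, the basic attribute features, association features, and R&D and innovation capability features of enterprises are obtained from the enterprise credit big data knowledge graph, processed, and used together as the input of the risk control model to train the LightGBM classification model in a supervised manner. Introducing the enterprise R&D and innovation capability features improves the performance of the risk control model. The processing flow of the embodiment of the present disclosure is shown in Figure 7 and includes:
(1) Data acquisition module:
In the enterprise credit big data knowledge graph, both the basic attribute features and the R&D and innovation capability features of an enterprise exist as attributes of the enterprise node and can be exported directly from the Neo4j graph database. The enterprise association features reflect how closely the enterprise entity is related to defaulting enterprise entities. Because the node and edge types in a heterogeneous network are highly varied, extracting graph features becomes more difficult. This proposal therefore restricts the enterprise credit big data knowledge graph to a homogeneous network: the nodes at both ends of a relationship may only be enterprises, and person nodes are folded and reduced to lessen the interference of persons with the network, ensuring that every relationship lies between enterprises. Based on the available data and conventional reasoning, four types of higher-risk enterprise relationships are retained: shareholding relationships, investment relationships, transaction relationships, and litigation relationships. Enterprise relationship features are extracted from these four types of relationships, and the network features of the knowledge graph are extracted using a shortest path algorithm and a community detection algorithm.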
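The folding of person nodes and a shortest-path feature can be sketched as below. This is an illustrative assumption rather than the source's implementation: edges are treated as undirected and untyped, the four relationship types are not distinguished, and the feature shown is simply the hop count to the nearest defaulting enterprise found by breadth-first search.

```python
from collections import defaultdict, deque
from itertools import combinations


def fold_persons(enterprise_edges, person_links):
    """Build an enterprise-only (homogeneous) adjacency map.

    enterprise_edges: iterable of (enterprise_a, enterprise_b) pairs.
    person_links: mapping person -> enterprises tied to that person; each
    person node is replaced by direct edges between those enterprises.
    """
    adj = defaultdict(set)

    def connect(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for a, b in enterprise_edges:
        connect(a, b)
    for enterprises in person_links.values():
        for a, b in combinations(sorted(set(enterprises)), 2):
            connect(a, b)
    return adj


def distance_to_nearest_default(adj, start, defaulted):
    """Hops from `start` to the closest other defaulting enterprise, or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in defaulted and node != start:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


adj = fold_persons([("A", "B"), ("B", "C")], {"P1": ["C", "D"]})
hops = distance_to_nearest_default(adj, "A", {"D"})
isolated = distance_to_nearest_default(adj, "Z", {"D"})
```

In this toy graph, person P1 is folded into a direct C-D edge, so enterprise A reaches the defaulting enterprise D in three hops, while the unconnected enterprise Z has no path.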
The extracted network features are shown in Table 2:
Table 2: Enterprise association relationships
Figure PCTCN2022087210-appb-000007
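The R&D and innovation capability features are shown in Table 3:
Table 3: R&D and innovation capability category
Figure PCTCN2022087210-appb-000008
(2) Data preprocessing module:
Using a credit scorecard, the IV (Information Value), WOE, efficiency, and rate of the non-numeric data are extracted as new features of the model for subsequent processing.
Enterprise data contains many attributes in pure string format, for example fixed-length code data such as enterprise type and industry category. It also contains date-type data such as the date of establishment and the date of approval. Date-type data is first converted uniformly into numeric data in units of seconds and then converted into character format. All character-type variables are then converted into numeric data, and their IV (Information Value), WOE, efficiency, and rate are extracted.
The formulas for WOE, IV, efficiency, and rate are as follows: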
Figure PCTCN2022087210-appb-000009
Figure PCTCN2022087210-appb-000010
Figure PCTCN2022087210-appb-000011
Figure PCTCN2022087210-appb-000012
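Here, Good_i and Bad_i denote the numbers of non-defaulting and defaulting enterprises counted in each bin, and Good_T and Bad_T denote the total numbers of non-defaulting and defaulting enterprises, respectively.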
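A minimal sketch of computing WOE and IV from binned counts follows. It assumes the common scorecard convention WOE_i = ln((Good_i/Good_T)/(Bad_i/Bad_T)), which may differ in sign from the formula images that are not reproduced here, and it assumes every bin has non-zero counts. The efficiency and rate quantities are not sketched because their definitions are not recoverable from this text.

```python
import math


def woe_iv(bins):
    """Compute per-bin WOE and the total IV.

    bins: list of (good_count, bad_count) per bin, all counts > 0.
    Convention (an assumption): WOE_i = ln((Good_i/Good_T) / (Bad_i/Bad_T)).
    """
    good_total = sum(good for good, _ in bins)
    bad_total = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / good_total
        bad_share = bad / bad_total
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        # Each bin contributes (share difference) * WOE, which is never negative.
        iv += (good_share - bad_share) * woe
    return woes, iv


# Two bins: one dominated by non-defaulting enterprises, one by defaulting ones.
woes, iv = woe_iv([(80, 20), (20, 80)])
```

In practice bins with zero good or zero bad counts need smoothing or merging before the logarithm is taken; that handling is left out of this sketch.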
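(3) Feature engineering module:
In the feature engineering stage, the features must first be processed to deal with the large number of missing values in the raw data and the excessive correlation between features. The main steps are to delete: features with more than 50% missing values; features with only a single unique value; features whose correlation with other features exceeds 60%; features whose feature importance in the gradient boosting machine (gbm) is 0.0; and low-importance features that do not contribute to 99% of the cumulative feature importance from the gbm.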
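The deletion rules above can be sketched in plain Python. The thresholds come from the text; everything else is an assumption for illustration: correlation is taken to be Pearson correlation, a feature is dropped when it is collinear with a feature already kept, and the importance values are hypothetical numbers standing in for those a trained gradient boosting model would report.

```python
import math


def missing_fraction(column):
    """Share of missing (None) entries in a column."""
    return sum(v is None for v in column) / len(column)


def pearson(x, y):
    """Pearson correlation over rows where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def filter_features(table, max_missing=0.5, max_corr=0.6):
    """table: dict of feature name -> list of numbers (None = missing)."""
    kept = []
    for name, column in table.items():
        if missing_fraction(column) > max_missing:
            continue  # more than 50% missing
        if len({v for v in column if v is not None}) <= 1:
            continue  # only a single unique value
        if any(abs(pearson(column, table[k])) > max_corr for k in kept):
            continue  # too correlated with a feature already kept
        kept.append(name)
    return kept


def drop_low_importance(importances, cumulative=0.99):
    """Keep the top features that together reach the cumulative importance share."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: (-kv[1], kv[0]))
    kept, running = [], 0.0
    for name, value in ranked:
        if value <= 0 or running >= cumulative * total:
            break  # zero importance, or the 99% share is already covered
        kept.append(name)
        running += value
    return kept


table = {
    "capital": [1.0, 2.0, 3.0, 4.0],
    "capital_copy": [2.0, 4.0, 6.0, 8.0],
    "mostly_missing": [None, None, None, 1.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
    "patents": [3.0, 1.0, 4.0, 1.0],
}
kept = filter_features(table)
important = drop_low_importance(
    {"capital": 60.0, "patents": 39.5, "age": 0.5, "unused": 0.0}
)
```

Which member of a correlated pair to drop is a design choice the source leaves open; keeping the first one seen is the simplest deterministic rule.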
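(4) Classification module:
This module uses the LightGBM algorithm. Inputting the features processed by the feature engineering module into the model yields the classification result, which falls into one of two classes: default or normal. Because LightGBM is a gradient boosting framework based on decision tree algorithms, the importance of each feature to the model can be obtained during training, and this importance can be used to evaluate how strongly different features influence whether an enterprise defaults.
Through the above steps, strict top-down constraints on concept definitions and relations are combined with a bottom-up approach to expanding the ontology scale, which greatly improves the accuracy of the knowledge graph ontology in the enterprise credit domain and lays a solid foundation for subsequently generating a high-quality knowledge graph. In addition, enterprise R&D and innovation capability features are introduced as input to the risk control model, which improves the performance of the risk control model.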
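To implement the above embodiment, as shown in Figure 8, this embodiment further provides an enterprise risk detection device 10 based on an enterprise credit big data knowledge graph. The device 10 includes: an information acquisition module 100, a relationship connection module 200, an ontology construction module 300, a graph construction module 400, and a calculation and classification module 500.
The information acquisition module 100 is configured to obtain a unified information model of enterprise credit big data based on multiple dispersed data subdomains, wherein the unified information model includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture.
The relationship connection module 200 is configured to extract the relationships between key persons and enterprises through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, so as to realize cross-domain connection of enterprise credit big data.
The ontology construction module 300 is configured to, based on the unified information model with the cross-domain connection realized, determine the enterprise credit big data domain and construct a first enterprise credit big data domain ontology in a top-down manner; and, in a bottom-up manner, perform entity extraction and relation extraction on data in the enterprise credit big data domain, select high-quality new terms, and expand the scale of the first enterprise credit big data domain ontology, so as to construct a second enterprise credit big data domain ontology.
The graph construction module 400 is configured to construct, based on the second enterprise credit big data domain ontology, an enterprise credit big data knowledge graph from enterprise credit big data and store it in a graph database.
The calculation and classification module 500 is configured to obtain enterprise feature data using the enterprise credit big data knowledge graph, input the obtained enterprise feature data into a trained risk control model for calculation and classification, and output a classification result.
The enterprise risk detection device based on the enterprise credit big data knowledge graph according to the embodiment of the present disclosure combines strict top-down constraints on concept definitions and relations with a bottom-up approach to expanding the ontology scale, which greatly improves the accuracy of the knowledge graph ontology in the enterprise credit domain and lays a solid foundation for subsequently generating a high-quality knowledge graph. In addition, enterprise R&D and innovation capability features are introduced as input to the risk control model, which improves the performance of the risk control model.
It should be noted that the foregoing explanation of the method embodiment also applies to the enterprise risk detection device based on the enterprise credit big data knowledge graph of this embodiment and is not repeated here.
An embodiment of the present application proposes a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the enterprise risk detection method based on the enterprise credit big data knowledge graph as described above.
An embodiment of the present application proposes a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the enterprise risk detection method based on the enterprise credit big data knowledge graph as described above.
An embodiment of the present application proposes a computer program product, including computer instructions which, when executed by at least one processor, implement the enterprise risk detection method based on the enterprise credit big data knowledge graph as described above.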
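Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the embodiments of the present disclosure includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present disclosure belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logic functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection with one or more wires (electronic device), a portable computer disk cartridge (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.
It should be understood that the various parts of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following technologies known in the art: discrete logic circuits with logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gate circuits, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
A person of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "multiple" means at least two, for example two or three, unless otherwise expressly and specifically limited.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present disclosure. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples, and the features of the different embodiments or examples, described in this specification.
Although embodiments of the present disclosure have been shown and described above, it can be understood that the above embodiments are exemplary and cannot be construed as limiting the present disclosure. A person of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present disclosure.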

Claims (13)

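  1. An enterprise risk detection method based on an enterprise credit big data knowledge graph, including:
    obtaining a unified information model of enterprise credit big data based on multiple dispersed data subdomains, wherein the unified information model of enterprise credit big data includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture;
    extracting the relationships between key persons and enterprises through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, so as to realize cross-domain connection of enterprise credit big data;
    based on the unified information model of enterprise credit big data with the cross-domain connection realized, constructing a first enterprise credit big data domain ontology in a top-down manner; and, in a bottom-up manner, performing entity extraction and relation extraction on data in the enterprise credit big data domain and selecting high-quality new terms to expand the scale of the first enterprise credit big data domain ontology, so as to construct a second enterprise credit big data domain ontology;
    based on the second enterprise credit big data domain ontology, constructing an enterprise credit big data knowledge graph from enterprise credit big data and storing it in a graph database;
    obtaining enterprise feature data using the enterprise credit big data knowledge graph, inputting the obtained enterprise feature data into a trained risk control model for calculation and classification, and outputting a classification result.
  2. The method according to claim 1, wherein the hierarchical enterprise information architecture of the unified information model of enterprise credit big data includes:
    several of the following subdomains: basic enterprise information, enterprise personnel information, enterprise operation information, enterprise asset information, enterprise intellectual property information, enterprise financial information, enterprise equity information, judicial data, enterprise risk information, and auxiliary reference information.
  3. The method according to claim 1 or 2, wherein performing entity extraction and relation extraction on data in the enterprise credit big data domain in a bottom-up manner and selecting high-quality new terms to expand the scale of the first enterprise credit big data domain ontology, so as to construct the second enterprise credit big data domain ontology, includes:
    performing entity extraction and relation extraction on data in the enterprise credit big data domain in a bottom-up manner;
    based on the entity extraction and relation extraction, recognizing the named entities and relation instances in the data, and performing quality assessment on named entities and relation instances that could not be recognized;
    determining a quality ranking based on the quality assessment, selecting high-quality new terms, and extending the first enterprise credit big data domain ontology to construct the second enterprise credit big data domain ontology.
  4. The method according to any one of claims 1 to 3, wherein obtaining the enterprise feature data includes: obtaining the basic attribute features, association features, and R&D and innovation capability features of an enterprise; wherein
    the basic attribute features and the R&D and innovation capability features of the enterprise are obtained from the enterprise credit big data knowledge graph; and enterprise relationship features are extracted from four types of relationships, and network features in the enterprise credit big data knowledge graph are extracted by a shortest path algorithm and a community detection algorithm to obtain the association features of the enterprise; wherein the four types of relationships are shareholding relationships, investment relationships, transaction relationships, and litigation relationships.
  5. The method according to any one of claims 1 to 4, wherein the risk control model includes: data preprocessing, feature engineering, and result classification.
  6. The method according to claim 5, wherein the data preprocessing includes:
    preprocessing the obtained enterprise feature data by converting date-type data into character-type variables, then converting all character-type variables to obtain numeric data, and extracting the IV value, WOE, efficiency, and rate of the numeric data.
  7. The method according to claim 6, wherein the formulas for the IV value, WOE, efficiency, and rate are: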
    Figure PCTCN2022087210-appb-100001
    Figure PCTCN2022087210-appb-100002
    Figure PCTCN2022087210-appb-100003
    Figure PCTCN2022087210-appb-100004
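    where Good_i and Bad_i denote the numbers of non-defaulting and defaulting enterprises counted in each bin, and Good_T and Bad_T denote the total numbers of non-defaulting and defaulting enterprises, respectively.
  8. The method according to any one of claims 5 to 7, wherein the feature engineering includes:
    deleting features with more than 50% missing values, features with only a single unique value, features whose correlation with other features exceeds 60%, features whose feature importance in the gradient boosting machine is 0.0, and low-importance features that do not contribute to 99% of the cumulative feature importance from the gradient boosting machine.
  9. The method according to any one of claims 5 to 8, wherein the result classification includes:
    obtaining enterprise feature data samples and enterprise labels;
    training a LightGBM classification model in a supervised manner using the enterprise feature data samples and enterprise labels to obtain a trained LightGBM classification model;
    inputting the features processed by the feature engineering into the trained LightGBM classification model for calculation and classification to obtain a classification result, wherein the classification result is either default or normal.
  10. An enterprise risk detection device based on an enterprise credit big data knowledge graph, including:
    an information acquisition module, configured to obtain a unified information model of enterprise credit big data based on multiple dispersed data subdomains, wherein the unified information model of enterprise credit big data includes a hierarchical enterprise information architecture and a hierarchical key personnel architecture;
    a relationship connection module, configured to extract the relationships between key persons and enterprises through the enterprise information of the hierarchical key personnel architecture and the enterprise personnel information of the hierarchical enterprise information architecture, so as to realize cross-domain connection of enterprise credit big data;
    an ontology construction module, configured to, based on the unified information model of enterprise credit big data with the cross-domain connection realized, determine the enterprise credit big data domain and construct a first enterprise credit big data domain ontology in a top-down manner; and, in a bottom-up manner, perform entity extraction and relation extraction on data in the enterprise credit big data domain, select high-quality new terms, and expand the scale of the first enterprise credit big data domain ontology, so as to construct a second enterprise credit big data domain ontology;
    a graph construction module, configured to construct, based on the second enterprise credit big data domain ontology, an enterprise credit big data knowledge graph from enterprise credit big data and store it in a graph database;
    a calculation and classification module, configured to obtain enterprise feature data using the enterprise credit big data knowledge graph, input the obtained enterprise feature data into a trained risk control model for calculation and classification, and output a classification result.
  11. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the enterprise risk detection method based on the enterprise credit big data knowledge graph according to any one of claims 1 to 9.
  12. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the enterprise risk detection method based on the enterprise credit big data knowledge graph according to any one of claims 1 to 9.
  13. A computer program product, including computer instructions which, when executed by at least one processor, implement the enterprise risk detection method based on the enterprise credit big data knowledge graph according to any one of claims 1 to 9.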
PCT/CN2022/087210 2022-03-24 2022-04-15 Enterprise risk detection method and device based on enterprise credit big data knowledge graph WO2023178767A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210302732.0A CN114817557A (zh) 2022-03-24 2022-03-24 Enterprise risk detection method and device based on enterprise credit big data knowledge graph
CN202210302732.0 2022-03-24

Publications (1)

Publication Number Publication Date
WO2023178767A1 true WO2023178767A1 (zh) 2023-09-28

Family

ID=82529928

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087210 WO2023178767A1 (zh) 2022-03-24 2022-04-15 Enterprise risk detection method and device based on enterprise credit big data knowledge graph

Country Status (2)

Country Link
CN (1) CN114817557A (zh)
WO (1) WO2023178767A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115934963B (zh) * 2022-12-26 2023-07-18 深度(山东)数字科技集团有限公司 Commercial bill big data analysis method and application graph for enterprise financial customer acquisition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131275A (zh) * 2020-09-23 2020-12-25 中国科学技术大学智慧城市研究院(芜湖) Enterprise portrait construction method based on holographic city big data model and knowledge graph
US20210166167A1 (en) * 2019-12-02 2021-06-03 Asia University Artificial intelligence and blockchain-based inter-enterprise credit rating and risk assessment method and system
CN113537796A (zh) * 2021-07-22 2021-10-22 大路网络科技有限公司 Enterprise risk assessment method, apparatus, and device
CN114066242A (zh) * 2021-11-11 2022-02-18 北京道口金科科技有限公司 Enterprise risk early-warning method and device
CN114202223A (zh) * 2021-12-16 2022-03-18 深圳前海微众银行股份有限公司 Enterprise credit risk scoring method, apparatus, device, and storage medium

Also Published As

Publication number Publication date
CN114817557A (zh) 2022-07-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22932826

Country of ref document: EP

Kind code of ref document: A1