Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a flowchart of an implementation of a method for credit rating based on a knowledge graph is provided in an embodiment of the present application, and the method is applied to an electronic device capable of data processing, such as a computer or a server. The technical scheme in the embodiment is mainly used for rating the credit of a target object such as a business or an individual.
Specifically, the method in this embodiment may include the following steps:
Step 101: obtaining a target corpus.
The target corpus comprises a plurality of sentences. For example, the target corpus may be a news segment, a summary report, a speech manuscript, or the like.
It should be noted that the sentences in the target corpus describe target objects to be rated, such as enterprises or individuals, and in addition, the sentences in the target corpus also describe contents related to the target objects. For example, the statements in the target corpus describe the related content of a certain enterprise in multiple information dimensions, such as the related content on the business status, the registered capital, the financial index, etc. in the business situation dimension, and further such as the related content on the corporate shareholder change, the external investment, etc. in the business information dimension.
Step 102: performing word extraction on each sentence in the target corpus by using a pre-constructed knowledge graph to obtain a plurality of feature words corresponding to the target corpus.
The plurality of feature words corresponding to the target corpus comprise feature words of at least one target object on at least one information dimension. For example, the feature words corresponding to the target corpus include feature words of the enterprise a in the business situation dimension and feature words of the enterprise B in the business situation and the business information dimension.
Specifically, in this embodiment, a knowledge graph containing a plurality of triples may be constructed in advance. A triple may be a relational triple, such as an entity-relationship-entity triple, or an attribute triple, such as an entity-attribute-value triple, and the triple data covers a plurality of enterprises in a plurality of information dimensions. On this basis, the triple data in the knowledge graph is used to perform word extraction on each sentence in the target corpus, thereby obtaining the plurality of feature words corresponding to the target corpus. Examples include the relational triple in which "enterprise A" has an "investment" in "enterprise B", and the attribute triples in which "enterprise A" is a "sales" type company with a sales amount of "1 million".
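The word extraction described above can be illustrated with a minimal sketch. It assumes that each triple is stored together with its information dimension and that a triple matches a sentence when the sentence mentions both its subject and its object; the names TRIPLES and extract_feature_words, the matching rule, and the sample data are hypothetical and are not prescribed by this embodiment.

```python
from collections import defaultdict

# Hypothetical knowledge graph: (subject, predicate, object, information dimension).
TRIPLES = [
    ("Enterprise A", "investment", "Enterprise B", "business information"),
    ("Enterprise A", "company type", "sales", "business situation"),
    ("Enterprise B", "business status", "active", "business situation"),
]


def extract_feature_words(sentences, triples):
    """Return {target object: {information dimension: [feature words]}}."""
    features = defaultdict(lambda: defaultdict(list))
    for sentence in sentences:
        text = sentence.lower()
        for subj, pred, obj, dim in triples:
            # Assumed rule: a triple matches when the sentence mentions
            # both its subject and its object.
            if subj.lower() in text and obj.lower() in text:
                for word in (pred, obj):
                    if word not in features[subj][dim]:
                        features[subj][dim].append(word)
    return {subj: dict(by_dim) for subj, by_dim in features.items()}


corpus = [
    "Enterprise A made an investment in Enterprise B last year.",
    "Enterprise A is a sales company.",
]
result = extract_feature_words(corpus, TRIPLES)
```

In this sketch the feature words are grouped by target object and by information dimension, which is the shape needed when each dimension is later handled by its own risk identification model.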
Step 103: performing risk identification on the feature words of the target object in each information dimension by using the risk identification model corresponding to that information dimension, to obtain a credit rating result of the target object in each information dimension.
The risk identification model is obtained by utilizing a plurality of training feature word sets with credit rating labels for training, and the credit rating result of the target object on one information dimension finally obtained represents the credit risk of the corresponding target object on the corresponding information dimension.
For example, in this embodiment, a plurality of risk identification models are constructed in advance, each corresponding to one information dimension, such as a risk identification model for the business situation dimension and a risk identification model for the business information dimension. The risk identification model for each information dimension is then trained with a plurality of training feature word sets that carry credit rating labels for that dimension. The trained model can rate the credit of the target object in the corresponding information dimension, and the resulting credit rating result represents the credit risk of the target object in that dimension. For example, the credit rating result of enterprise A in the business situation dimension may indicate that its credit risk in that dimension is high, while the credit rating result of enterprise B in the business information dimension may indicate that its credit risk in that dimension is low.
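The one-model-per-dimension arrangement can be sketched as a simple dispatch. The sketch assumes that each model is a callable taking a list of feature words; the two keyword rules below are hypothetical stand-ins for trained risk identification models, and all names and sample data are illustrative only.

```python
def rate_by_dimension(features, models):
    """Apply the model of each information dimension to that dimension's feature words."""
    ratings = {}
    for target, by_dim in features.items():
        ratings[target] = {
            dim: models[dim](words) for dim, words in by_dim.items() if dim in models
        }
    return ratings


# Hypothetical stand-ins for trained models: simple keyword rules.
def business_situation_model(words):
    return "high risk" if "revoked" in words else "low risk"


def business_information_model(words):
    return "high risk" if "shareholder change" in words else "low risk"


MODELS = {
    "business situation": business_situation_model,
    "business information": business_information_model,
}

features = {
    "Enterprise A": {"business situation": ["business status", "revoked"]},
    "Enterprise B": {"business information": ["investment", "Enterprise C"]},
}
ratings = rate_by_dimension(features, MODELS)
```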
In one implementation, the risk identification model in this embodiment may be a deep learning model, such as a model constructed based on a convolutional neural network (CNN).
According to the above scheme, in the knowledge graph-based credit rating method provided in the embodiment of the present application, after a target corpus is obtained, a pre-constructed knowledge graph is used to perform word extraction on each sentence in the target corpus to obtain a plurality of feature words corresponding to the target corpus, where the feature words include feature words of at least one target object in at least one information dimension, and thus, a risk identification model corresponding to each information dimension is used to perform risk identification on the feature words of the target object in each information dimension to obtain a credit rating result corresponding to each information dimension of the target object, so as to represent the credit risk of the corresponding target object in the corresponding information dimension. Therefore, in the embodiment, the feature words in multiple information dimensions are extracted by using the knowledge graph, so that the feature content input into the deep learning model is enriched, and the accuracy of the obtained credit rating result is improved.
In one implementation, the knowledge-graph in the present embodiment may be obtained by the following method, as shown in fig. 2:
Step 201: reading structured data stored in a relational database.
The relational database is a database storing structured data related to target objects, for example, the registration database includes structured data of stores, brands, users, and the like, and the structured data is related to at least one target object, such as a business, an individual, and the like.
Specifically, in this embodiment, structured data such as tables and columns in the relational database may be read in a stack or queue manner.
Step 202: converting the structured data into triple data by using a preset mapping relation between structured data and triples, so as to obtain the knowledge graph.
In a specific implementation, the preset mapping relationship in this embodiment may be understood as a mapping specification mapped from the relational database to the semantic data, and specifically, a visual specification configuration tool may be used to configure the preset mapping relationship between the structured data and the triple. Specifically, in this embodiment, by analyzing the basic structure in the structured data and the structure of the triple of the knowledge graph, for example, analyzing the meaning of each table, the association between tables, the entity and the entity attribute in the triple, and the like, a preset mapping relationship between the structured data and the triple is configured, for example, the user table in the database corresponds to the concept of a person in the knowledge graph, the phone field in the table in the database corresponds to the attribute of the contact manner defined on the person in the knowledge graph, and the like. Based on this, when the structured data is converted into the triple data, the preset mapping relation is utilized to map the elements in the rows and columns in the table to the elements such as the entities, the entity relations or the entity attributes in the triple, so that the triple data is obtained, and the knowledge graph is further formed.
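As a minimal sketch of this conversion, the following uses an in-memory SQLite database in place of the relational database. The MAPPING dictionary layout, the "instance of" predicate, and the sample row are assumptions made for illustration; the embodiment itself only requires that tables, rows, and columns be mapped to entities, relations, or attributes.

```python
import sqlite3

# Hypothetical mapping: table -> concept, entity-name column, column -> attribute name.
MAPPING = {
    "user": {
        "concept": "Person",
        "name_col": "name",
        "attributes": {"phone": "contact information"},
    }
}


def rows_to_triples(conn, mapping):
    """Convert each mapped row into (entity, predicate, value) triples."""
    triples = []
    for table, spec in mapping.items():
        cols = [spec["name_col"], *spec["attributes"]]
        # Table and column names come from the trusted mapping, not from user input.
        query = f'SELECT {", ".join(cols)} FROM "{table}"'
        for row in conn.execute(query):
            entity = row[0]
            triples.append((entity, "instance of", spec["concept"]))
            for value, attr in zip(row[1:], spec["attributes"].values()):
                if value is not None:
                    triples.append((entity, attr, value))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "user" (name TEXT, phone TEXT)')
conn.execute('INSERT INTO "user" VALUES (?, ?)', ("Zhang San", "555-0100"))
triples = rows_to_triples(conn, MAPPING)
```

Here the user table yields one triple stating that the entity is a person and one attribute triple per non-empty mapped column.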
In one implementation, the knowledge-graph in this embodiment can be supplemented or enriched by the following means, as shown in fig. 3:
Step 301: acquiring a target page related to the target object in an industry website by using a preset word corresponding to the at least one target object.
In this embodiment, based on a preset seed vocabulary, that is, a preset word, which can represent the industry where the target object is located, a search engine or a search interface or the like may be used to perform a page search on an industry website (including a page of an industry knowledge base) to obtain a target page related to the target object.
In an implementation manner, the target page may include only a first page associated with a preset word, such as a page directly including the preset word, or the target page may further include a second page obtained by performing in-station acquisition on the first page, that is, a page corresponding to a link included in the first page, and so on.
Specifically, in this embodiment, a search engine or a search interface may be used to search for a first page including a preset word, and then the first page is acquired in-station, and the maximum depth of acquisition is set to 3 layers, that is, from the first page, a depth-first acquisition policy is used to acquire 3 layers in total. In other implementations, the acquisition depth may also be set to other values, such as 2-layer or 4-layer, etc.
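The depth-limited, depth-first in-site acquisition can be sketched as follows. The sketch replaces real pages with a hypothetical in-memory link graph (LINKS) and assumes that the first page counts as layer 1; no network access is performed, and the function name collect_pages is illustrative.

```python
# Hypothetical in-memory link graph standing in for real pages and their links.
LINKS = {
    "first": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": [],
}


def collect_pages(start, links, max_depth=3):
    """Depth-first collection; the start page counts as layer 1 (an assumption)."""
    collected = []
    seen = set()

    def visit(page, depth):
        # Stop when the depth limit is exceeded or the page was already collected.
        if depth > max_depth or page in seen:
            return
        seen.add(page)
        collected.append(page)
        for target in links.get(page, []):
            visit(target, depth + 1)

    visit(start, 1)
    return collected


pages = collect_pages("first", LINKS)
```

With a maximum depth of 3, page "d" is not collected because it would lie on the fourth layer; passing max_depth=2 or max_depth=4 corresponds to the other depth settings mentioned above.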
Step 302: reading the page content in the target page.
In this embodiment, a crawler or other technologies may be used to obtain page content in the target page to obtain content such as text therein.
Step 303: generating triple data according to the page content to obtain the knowledge graph.
The page content can be subjected to triple extraction by using a pre-constructed and trained triple extraction model to obtain triple data, so that the knowledge graph is formed. The triple extraction model can be a model constructed based on a deep learning algorithm, and training is performed by using training sentence samples with triple labels, so that the trained triple extraction model can perform triple extraction on the sentences to obtain corresponding triple data, and the triple data is added to the knowledge graph.
In an implementation manner, the risk recognition model in this embodiment may be obtained by training in the following manner:
First, a plurality of training feature word sets with credit rating labels are obtained, where each training feature word set may be a feature word set obtained by using the knowledge graph to perform word extraction on the sentences in a corresponding training corpus.
It should be noted that the training feature word set here includes training feature words in a plurality of information dimensions.
Then, the training feature words in each information dimension are taken as input samples of the risk identification model corresponding to that information dimension, the credit rating labels of the training feature word set are taken as output samples of that risk identification model, and the risk identification model is trained.
Specifically, in this embodiment, the training feature words in each information dimension are input into the risk identification model corresponding to that information dimension, and the credit rating test result output by the model for the input training feature words is obtained. The credit rating test result is then compared with the credit rating label, and the model parameters of the risk identification model are adjusted according to the difference indicated by the comparison so that the loss function of the risk identification model decreases. This process is repeated until the loss function converges, at which point the training is complete.
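The compare-adjust-repeat loop described above can be sketched with a deliberately simple model. The sketch substitutes a bag-of-words logistic regression for the deep learning model of this embodiment, uses binary labels (1 for high risk, 0 for low risk), and stops when the change in loss falls below a tolerance; the learning rate, tolerance, epoch limit, and sample data are all assumptions.

```python
import math


def train(samples, vocab, lr=0.5, tol=1e-6, max_epochs=5000):
    """samples: list of (feature words, label); label 1 = high risk, 0 = low risk."""
    weights = {w: 0.0 for w in vocab}
    bias = 0.0
    prev_loss = float("inf")
    for _ in range(max_epochs):
        loss, grad_b = 0.0, 0.0
        grad_w = {w: 0.0 for w in vocab}
        for words, label in samples:
            z = bias + sum(weights[w] for w in words if w in weights)
            pred = 1.0 / (1.0 + math.exp(-z))
            clipped = min(max(pred, 1e-12), 1 - 1e-12)
            # Cross-entropy between the test result and the credit rating label.
            loss -= label * math.log(clipped) + (1 - label) * math.log(1 - clipped)
            diff = pred - label
            for w in words:
                if w in grad_w:
                    grad_w[w] += diff
            grad_b += diff
        loss /= len(samples)
        if abs(prev_loss - loss) < tol:  # loss has converged
            break
        prev_loss = loss
        # Adjust the parameters in the direction that reduces the loss.
        for w in vocab:
            weights[w] -= lr * grad_w[w] / len(samples)
        bias -= lr * grad_b / len(samples)
    return weights, bias


def predict(words, weights, bias):
    z = bias + sum(weights[w] for w in words if w in weights)
    return 1 if z > 0 else 0


samples = [
    (["revoked", "lawsuit"], 1),
    (["active", "profit"], 0),
    (["revoked"], 1),
    (["active"], 0),
]
vocab = {w for words, _ in samples for w in words}
weights, bias = train(samples, vocab)
```

A real implementation would replace the logistic regression with the chosen network and typically use a deep learning framework, but the control flow (forward pass, comparison with the label, parameter update, convergence check) is the same.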
Further, in this embodiment, in order to improve the accuracy of the training samples, difficult samples are screened out before the risk identification model is trained. That is to say, the training corpora that participate in training the risk identification model are sample corpora of higher accuracy: when the risk identification model performs risk identification on the training feature word set corresponding to such a training corpus to obtain a credit rating test result, the difference between the credit rating test result and the credit rating label corresponding to the training corpus is greater than or equal to the preset threshold.
In a specific implementation, the risk identification model may first be trained on a small set of risk samples as a trial; a preset threshold is then derived from the test result and used to screen the training corpora that participate in training; and the final risk identification model is obtained after repeated iterative training on the screened corpora.
Fig. 4 is a schematic structural diagram of a knowledge graph-based credit rating apparatus provided in the second embodiment of the present application. The apparatus is suitable for use in an electronic device capable of data processing, such as a computer or a server. The technical scheme in this embodiment is mainly used for rating the credit of a target object such as a business or an individual.
Specifically, the apparatus in this embodiment may include the following units:
a corpus obtaining unit 401, configured to obtain a target corpus, where the target corpus includes a plurality of sentences;
a feature extraction unit 402, configured to perform word extraction on each sentence in the target corpus by using a pre-constructed knowledge graph to obtain a plurality of feature words corresponding to the target corpus, where the feature words include feature words of at least one target object in at least one risk dimension;
a risk identification unit 403, configured to perform risk identification on the feature words of the target object in each risk dimension by using a risk identification model corresponding to each risk dimension to obtain a credit rating result corresponding to each risk dimension of the target object, where the risk identification model is obtained by using a plurality of training feature word sets with credit rating labels for training, and the credit rating result represents the credit risk level of the corresponding target object in the corresponding risk dimension.
According to the above scheme, in the knowledge graph-based credit rating device provided in the second embodiment of the present application, after the target corpus is obtained, the pre-constructed knowledge graph is used to perform word extraction on each sentence in the target corpus to obtain a plurality of feature words corresponding to the target corpus, where the feature words include feature words of at least one target object in at least one information dimension, and thus, a risk identification model corresponding to each information dimension is used to perform risk identification on the feature words of the target object in each information dimension to obtain a credit rating result corresponding to each information dimension of the target object, so as to represent the credit risk of the corresponding target object in the corresponding information dimension. Therefore, in the embodiment, the feature words in multiple information dimensions are extracted by using the knowledge graph, so that the feature content input into the deep learning model is enriched, and the accuracy of the obtained credit rating result is improved.
In one implementation, the apparatus in this embodiment may further include the following units, as shown in fig. 5:
a first graph construction unit 404, configured to read structured data stored in a relational database, the structured data being related to at least one target object, and to convert the structured data into triple data by using a preset mapping relation between structured data and triples, so as to obtain the knowledge graph;
a second graph construction unit 405, configured to obtain a target page related to at least one target object in an industry website by using a preset word corresponding to the target object, read the page content in the target page, and generate triple data according to the page content to obtain the knowledge graph.
Optionally, the target page at least includes a first page associated with the preset word and a second page obtained by performing in-station acquisition on the first page.
In another implementation, the apparatus in this embodiment may further include the following units, as shown in fig. 6:
a model training unit 406, configured to obtain a plurality of training feature word sets with credit rating labels; the training feature word set is a feature word set obtained by utilizing the knowledge graph to extract words of sentences in the training corpus; the training feature word set comprises training feature words on at least one information dimension; and taking the training feature words on each information dimension as input samples of corresponding risk recognition models, taking the credit rating labels of the training feature word set as output samples of the risk recognition models, and training the risk recognition models.
Optionally, the risk recognition model performs risk recognition on the training feature word set corresponding to the training corpus to obtain a credit rating test result, and a difference between the credit rating test result and the credit rating label corresponding to the training corpus is greater than or equal to a preset threshold.
It should be noted that, for the specific implementation of each unit in the present embodiment, reference may be made to the corresponding content in the foregoing, and details are not described here.
Referring to fig. 7, a schematic structural diagram of an electronic device according to a third embodiment of the present disclosure is provided, where the electronic device may be an electronic device capable of performing data processing, such as a computer or a server. The technical scheme in the embodiment is mainly used for rating the credit of a target object such as a business or an individual.
Specifically, the electronic device in this embodiment may include the following structure:
a memory 701 for storing an application program and data generated by the application program;
a processor 702 for executing the application to implement: obtaining a target corpus, wherein the target corpus comprises a plurality of sentences; extracting words of each sentence in the target corpus by using a pre-constructed knowledge graph to obtain a plurality of characteristic words corresponding to the target corpus, wherein the characteristic words comprise characteristic words of at least one target object in at least one risk dimension; and performing risk identification on the feature words of the target object in each risk dimension by using a risk identification model corresponding to each risk dimension to obtain a credit rating result of the target object in each risk dimension, wherein the risk identification model is obtained by training a plurality of training feature word sets with credit rating labels, and the credit rating result represents the credit risk of the corresponding target object in the corresponding risk dimension.
According to the above scheme, in the electronic device provided in the third embodiment of the present application, after the target corpus is obtained, the pre-constructed knowledge graph is used to perform word extraction on each sentence in the target corpus to obtain a plurality of feature words corresponding to the target corpus, where the feature words include feature words of at least one target object in at least one information dimension, and thus, a risk identification model corresponding to each information dimension is used to perform risk identification on the feature words of the target object in each information dimension to obtain a credit rating result corresponding to each information dimension of the target object, so as to represent the credit risk of the corresponding target object in the corresponding information dimension. Therefore, in the embodiment, the feature words in multiple information dimensions are extracted by using the knowledge graph, so that the feature content input into the deep learning model is enriched, and the accuracy of the obtained credit rating result is improved.
It should be noted that, in the present embodiment, reference may be made to the corresponding contents in the foregoing, and details are not described here.
Taking the enterprise credit rating by using the technical scheme of the application as an example, the technical scheme of the application is exemplified:
First, knowledge graph technology is introduced to solve the problem of semantic representation and understanding of multi-source heterogeneous data and to improve the effectiveness of big-data enterprise credit scoring. Specifically, the implementation of the technical scheme of the application is mainly divided into two parts: construction of an enterprise knowledge graph, and implementation of a credit scoring system based on the enterprise knowledge graph. The details are as follows:
1. Construction of the enterprise knowledge graph
The enterprise knowledge graph is constructed mainly from two kinds of data sources: structured data related to enterprises and their business, and data from various vertical sites on the Internet. These data sources have the following characteristics:
(1) wide industry coverage and considerable industry depth: the data all comes from sources that are strongly related to the enterprise, so the data is closely tied to the enterprise;
(2) high reliability: internal structured data of an enterprise is usually used to support the enterprise's business, so its reliability is very high; the enterprise data is stored in a relational database, and structured triple data can be obtained with only a modest conversion of the relational data, so reliability is preserved;
(3) strong structure: the vast majority of internal structured data is stored in relational databases, and open industry data is mostly edited and published by high-quality websites, so it is well structured.
When the enterprise knowledge graph is constructed, a data mode can be predefined, and a top-down knowledge graph mode is adopted. The data pattern is the most core part in the knowledge graph, and after the data pattern is defined, the data layer can be filled from various data sources. The method comprises the following specific steps:
1) Converting the database to triples
The present application proposes a set of mapping specifications for mapping from a relational database to semantic data, that is, the preset mapping relation described above. The specification may be named D2RML (relational database to RDF mapping language) and is described using XML. Owing to the usability and universality of XML, D2RML can be easily understood and used by ordinary users. Using the language does not require knowledge of the Resource Description Framework (RDF) or related technologies, which lowers the barrier to use. In addition, the application also provides a visual specification configuration tool, with which a user can formulate the mapping rules through some simple configuration.
The main keywords and corresponding description functions in D2RML are as follows:
(a) dbtype: the type of the source database, such as mysql, oracle, or sqlserver, which determines the driver used for the connection;
(b) dburl: the database connection string, which specifies information such as the address and port of the database and the database to be used;
(c) dbuser: a user name of the database;
(d) dbpwd: a password for the database;
(e) table: a source data table;
(f) concept: importing a target concept;
(g) the colname attribute of name: the entity name source column;
(h) the colname attribute of synonym: a synonymous entity source column;
(i) parent's tabename attribute: table names of parent concepts;
(j) the colname of attribute specifies the attribute source column, and attrname specifies the attribute name.
For example, one mapping file is as follows:
When structured data is mapped to knowledge graph triples, the basic structure of the structured data, including the meaning of each table and the associations between tables, is analyzed first, together with the structure of the knowledge graph. The tables in the structured data are then associated with the concepts or entities in the knowledge graph by using the D2RML language, so that the conversion is realized.
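As a purely illustrative sketch (not the mapping file of this embodiment), a D2RML-style document built from the keywords listed above might be parsed as follows. The root element name mapping, the map element that groups one table rule, the element nesting, and all values, including the connection string, are assumptions; only the keyword names dbtype, dburl, dbuser, dbpwd, table, concept, name, synonym, attribute, colname, and attrname come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: keyword names follow the text, the nesting is assumed.
MAPPING_XML = """
<mapping>
  <dbtype>mysql</dbtype>
  <dburl>localhost:3306/example_db</dburl>
  <dbuser>reader</dbuser>
  <dbpwd>secret</dbpwd>
  <map>
    <table>user</table>
    <concept>Person</concept>
    <name colname="user_name"/>
    <synonym colname="nickname"/>
    <attribute colname="phone" attrname="contact information"/>
  </map>
</mapping>
"""


def parse_mapping(xml_text):
    """Return (connection settings, list of per-table mapping rules)."""
    root = ET.fromstring(xml_text.strip())
    connection = {
        key: root.findtext(key) for key in ("dbtype", "dburl", "dbuser", "dbpwd")
    }
    rules = []
    for node in root.findall("map"):
        rules.append({
            "table": node.findtext("table"),
            "concept": node.findtext("concept"),
            "name_col": node.find("name").get("colname"),
            "synonym_cols": [s.get("colname") for s in node.findall("synonym")],
            "attributes": {
                a.get("colname"): a.get("attrname") for a in node.findall("attribute")
            },
        })
    return connection, rules


connection, rules = parse_mapping(MAPPING_XML)
```

The parsed rule says that each row of the user table becomes an entity of the concept Person, named by the user_name column, with the phone column stored as the attribute "contact information".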
2) Structured data knowledge mapping
After the mapping configuration file is defined, the triplets of the knowledge graph can be converted from the database according to the configured mapping relation. In this embodiment, a knowledge transformation engine may connect a target database configured in a configuration file, read structured data in a corresponding table, map data of tables and columns in the database into entities of concepts and attributes of the entities, respectively, and then store knowledge obtained by mapping into a knowledge graph.
3) Internet data collection and mapping
In order to enrich the knowledge graph, the application provides an algorithm, based on search engines and online encyclopedias, for automatically discovering industry knowledge bases and industry websites, so that more triples related to each enterprise can be mined and added to the knowledge graph.
The page acquisition and content acquisition are realized by the following algorithm flow, as shown in fig. 8:
(1) Seed vocabulary that can represent the industry is used to search in search engines and in the search interfaces of online encyclopedias. For the web documents returned by a search engine, a certain number of top-ranked results are selected and added directly to the target webpage list. For the pages returned by an encyclopedia, the corresponding article pages are opened, and two types of links in them, namely ordinary external links and external links of cited references, are collected and added to the target webpage list.
(2) The target web pages in the target webpage list are classified by website; different types of pages, such as list pages, detail pages, and other pages, are acquired with different strategies.
(3) In-site acquisition is performed on the obtained web pages, with the maximum acquisition depth set to 3 layers; that is, starting from the first page, a depth-first acquisition strategy is used and 3 layers are acquired in total.
(4) The content of the websites is analyzed, and the content of the web pages acquired from each website is extracted and stored. If the content of a website contains industry keywords at a high frequency, the content is considered related to the industry and is selected as a target data source; otherwise, if it contains only a few occurrences, it is discarded. Finally, corresponding triples are generated from the stored content and added to the knowledge graph.
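The keyword-frequency check in item (4) can be sketched as follows. The description does not state how frequency is measured or what counts as "high", so the sketch assumes frequency is the share of whitespace-separated tokens that are industry keywords and uses a hypothetical cut-off of 0.1; Chinese text would additionally need word segmentation, which is omitted here. The site names and texts are made-up sample data.

```python
def keyword_frequency(text, keywords):
    """Share of whitespace-separated tokens that are industry keywords."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.strip(".,;:!?") in keywords)
    return hits / len(tokens)


def select_sources(site_contents, keywords, min_frequency=0.1):
    """Keep sites whose stored content mentions industry keywords often enough."""
    return [
        site
        for site, text in site_contents.items()
        if keyword_frequency(text, keywords) >= min_frequency
    ]


KEYWORDS = {"steel", "smelting", "mills"}
SITES = {
    "steel-news.example": "Steel output rose as steel mills expanded smelting capacity.",
    "recipes.example": "Mix flour and sugar, then bake the cake in a steel pan for an hour.",
}
selected = select_sources(SITES, KEYWORDS)
```

The second site mentions a keyword only once in passing, so it falls below the assumed cut-off and is discarded.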
2. Credit scoring system implementation based on enterprise knowledge graph
In the present application, the credit scoring system based on the enterprise knowledge graph, that is, the risk identification model described above, may be constructed based on a convolutional neural network (CNN). A CNN can automatically extract features from input sentences or image data and perform classification tasks, and the extracted features can serve as input to subsequent training. In natural language processing, CNNs are generally used for tasks such as character-level information modeling: a window is slid over the word vectors of the input words so that the current word is connected with the characters before and after it, the influence of the preceding and following words on the current word is calculated, and the resulting representation serves as the word feature. Taking the term "convolutional neural network" as an example input, the CNN layer structure is shown in fig. 9. After the convolution is finished, the context information between the characters has been extracted and representation features of words and sentences have been generated, which are then input into the next layer of the neural network.
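The sliding-window operation can be illustrated with a minimal one-dimensional convolution written in plain Python. The window size of 3, the single filter, the ReLU activation, and the toy two-dimensional vectors are assumptions chosen for readability; the actual layer structure is the one shown in fig. 9.

```python
def conv1d(vectors, kernel, bias=0.0):
    """Valid 1-D convolution over a sequence of vectors, followed by ReLU.

    vectors: list of n embedding vectors, each of length d.
    kernel:  list of k weight vectors, each of length d (window size k).
    Returns n - k + 1 activations, one per window position.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(vectors) - k + 1):
        window = vectors[start:start + k]
        # Combine the current position with its neighbours inside the window.
        total = bias + sum(
            w * x
            for w_vec, x_vec in zip(kernel, window)
            for w, x in zip(w_vec, x_vec)
        )
        outputs.append(max(0.0, total))
    return outputs


# Toy character vectors (d = 2) and one filter with window size k = 3.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
features = conv1d(embeddings, kernel)
```

Each output value summarizes one character together with its left and right neighbours, which is how context information between characters enters the feature representation.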
It should be noted that, when training a machine learning model for risk identification (i.e., the risk identification model), conventional machine learning algorithms often run into a problem they cannot solve: risk sample data is insufficient, and the features that can be extracted are limited. In a normal production environment, harmless data far outnumbers harmful data, and a conventional statistics-based machine learning algorithm yields a satisfactory identification model only when trained on a large amount of high-quality sample data. The idea behind a risk identification model based on the deep belief network (DBN) deep learning algorithm is that training can be carried out with limited harmless data, and multi-dimensional, multi-level learning is performed through iteration over multiple layers of restricted Boltzmann machines (RBMs), so that the number of features obtained through learning increases rapidly.
Specifically, a small set of risk samples is first trained using the DBN algorithm, and accurate samples are obtained by applying a threshold; the DBN is then trained again on these accurate samples, and this process is iterated to obtain the final risk identification model.
In combination with the design architecture diagram of the enterprise credit scoring system shown in fig. 10, the mainstream big data product is fully combined in the application, so that the usability, flexibility and expandability of the product are ensured. The application layer adopts interface development to provide a series of service capabilities, and simultaneously ensures the simplicity and the expandability of deployment.
The whole system supports both stand-alone and distributed deployment, and uses the knowledge graph to implement enterprise risk assessment, credit scoring, and change-event prompts. New enterprise entities are added and risk relationships are established through the unified data interface of the knowledge graph, and information such as the total credit of enterprises in the graph and risk early-warning trends is calculated by a distributed asynchronous algorithm. The details are as follows:
the system can be built on a cloud host, an independent server or a third-party virtual host, and is based on one or more databases such as MSSQL, MySQL and Oracle;
the data layer implements processing such as storage, caching, user-defined functions, transaction processing, and database reads and writes;
the business layer constructs the knowledge graph, that is, the graph is built using mapping rules, and enterprises and enterprise-related events are described based on the graph, for example through credit evaluation, enterprise monitoring, monitoring statistics, event lists and rule configuration; CNN and DBN are the main implementations of business evaluation in the business layer, and computation between the data layer and the business layer is asynchronous;
the display layer handles processing such as template engine rendering and request receiving;
the front-end UI (User Interface) provides the user with an interactive interface in the form of HTML (HyperText Markup Language), CSS (Cascading Style Sheets), jQuery and pictures.
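As a rough illustration of adding an enterprise entity through a unified data interface, establishing a risk relationship, and computing a total credit asynchronously, consider the following single-process sketch. The class EnterpriseGraph, its methods, the enterprise names and, in particular, the penalty formula are hypothetical assumptions; the present application does not disclose a scoring formula, and a distributed deployment would replace the in-process asyncio tasks used here.

```python
import asyncio


class EnterpriseGraph:
    """Hypothetical in-memory stand-in for the knowledge graph data interface."""

    def __init__(self):
        self.base_score = {}   # enterprise -> standalone credit score (0 to 100)
        self.risk_edges = {}   # enterprise -> list of (related enterprise, weight)

    def add_enterprise(self, name, base_score):
        self.base_score[name] = base_score
        self.risk_edges.setdefault(name, [])

    def add_risk_relationship(self, source, target, weight):
        # 'source' is exposed to the risk of 'target' with the given weight.
        self.risk_edges[source].append((target, weight))


async def total_credit(graph, name):
    await asyncio.sleep(0)  # yield control, as a remote or database call would
    # Assumed formula: each related enterprise lowers the score in proportion
    # to how far its own score falls short of 100.
    penalty = sum(weight * (100.0 - graph.base_score[other])
                  for other, weight in graph.risk_edges[name])
    return max(0.0, graph.base_score[name] - penalty)


async def score_all(graph):
    names = list(graph.base_score)
    # Score every enterprise concurrently.
    scores = await asyncio.gather(*(total_credit(graph, n) for n in names))
    return dict(zip(names, scores))


graph = EnterpriseGraph()
graph.add_enterprise("Enterprise A", 80.0)
graph.add_enterprise("Enterprise B", 40.0)
graph.add_risk_relationship("Enterprise A", "Enterprise B", 0.5)
scores = asyncio.run(score_all(graph))
```

Keeping entity creation and relationship creation behind one small interface means the scoring routine depends only on the graph contents, not on which database backs the data layer.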
Therefore, the present application can deeply analyze the business state and public opinion trends of an enterprise, comprehensively depict the information of each enterprise in every dimension through the knowledge graph, and achieve timely and effective credit scoring based on the enterprise's data in all dimensions. Moreover, based on enterprise characteristics and the relevant historical negative samples and complaint samples, the present application applies artificial intelligence technologies such as deep learning, knowledge graphs and natural language processing to perform deep correlation analysis and risk feature extraction on enterprise operation management data, such as engineering project management, marketing management, material management and clean governance construction. This achieves automatic, intelligent identification of an enterprise's internal operation management risks, improves risk identification accuracy, and raises the level of risk control.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.