CN114880486A - Industry chain identification method and system based on NLP and knowledge graph - Google Patents

Publication number
CN114880486A
Authority
CN
China
Prior art keywords
enterprise
chain
keywords
industrial chain
enterprises
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210519609.4A
Other languages
Chinese (zh)
Inventor
刘丽萍
陈康
袁轶慧
齐宁
朱巍
周云松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu United Credit Reference Co ltd
Original Assignee
Jiangsu United Credit Reference Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Abstract

The invention provides an industrial chain identification method and system based on NLP technology and a knowledge graph, comprising: extracting candidate words for a specific industrial chain; constructing features of the candidate words from multiple dimensions, calculating a score for each feature, fusing the scores and outputting the final keywords; clustering the keywords in combination with domain knowledge and the national economic industry classification; screening enterprises whose information contains the keywords, and determining the upstream or downstream position of each enterprise according to the category labels of the keywords it hits; constructing multi-dimensional features for the enterprises in the industrial chain and building an industrial chain evaluation score model; and outputting a list of core enterprises in the upstream, midstream and downstream of the industrial chain based on the enterprise scores and the positions of the enterprises in the chain. NLP technology is used to extract keywords that locate the main business of the upstream and downstream of the industrial chain, and these keywords are then used to screen upstream and downstream enterprises; combined with knowledge graph technology, the core enterprises in the industrial chain can be identified.

Description

Industry chain identification method and system based on NLP and knowledge graph
Technical Field
The invention belongs to the technical field of data processing and relates to NLP and knowledge graph technology, in particular to a method and a system for identifying upstream and downstream enterprises in an industrial chain using NLP and a knowledge graph.
Background
An industrial chain is a chain-like association objectively formed among industrial sectors, based on certain technical and economic links and following specific logical and spatio-temporal layout relationships; in essence it describes a group of enterprises with some internal association. Within an industrial chain there is extensive interaction between upstream and downstream and mutual exchange of value: upstream links deliver products or services to downstream links, and downstream links feed information back to upstream links. If some enterprises run into operating problems, risk can propagate to the enterprises upstream and downstream of them in the chain and have knock-on effects on them. Analysis and identification of industrial chains are therefore needed. However, the prior art offers no scheme capable of accurately identifying an industrial chain.
Disclosure of Invention
In order to solve the problems, the invention provides an industrial chain identification method and system based on NLP technology and knowledge graph, which can identify upstream and downstream enterprises of the industrial chain.
In order to achieve the purpose, the technical scheme of the invention is as follows:
an industrial chain identification method based on NLP technology and knowledge graph comprises the following steps:
step one, for a specific industrial chain, screening from enterprise industrial and commercial registration information a list of enterprises whose "business scope" information contains fields related to the industrial chain; applying word segmentation to the "business scope" information of these enterprises, cleaning the result and removing stop words, and extracting candidate words of the industrial chain;
step two, for the candidate words extracted in step one, constructing features of the candidate words from multiple dimensions and calculating a score for each feature, based on an unsupervised NLP algorithm; fusing the scores by linear weighting, and outputting the K candidate words with the highest total scores (TOP-K) as the final keywords; the dimensions comprise the TF-IDF value of the candidate word, the subject information features of the candidate word, the position information feature of the candidate word, and the similarity between the candidate word and the industrial chain topic;
the TF-IDF value of a candidate word is calculated by:
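TF-IDF = term frequency (TF) × inverse document frequency (IDF)
where
$$\mathrm{TF} = \frac{\text{number of occurrences of the word in the document}}{\text{total number of words in the document}}$$
$$\mathrm{IDF} = \log\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}$$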
the position information characteristics of the candidate words are obtained according to the position indexes of the candidate words appearing in the text for the first time;
the fusion formula is as follows:
$$\text{score} = w_1 \cdot \text{TF-IDF value} + w_2 \cdot \text{name subject information} + w_3 \cdot \text{industry subject information} + w_4 \cdot \text{position information} + w_5 \cdot \text{similarity value}$$
where $w_1$ to $w_5$ are the weights of the respective quantified feature indices;
step three, applying a clustering algorithm, in combination with domain knowledge and the national economic industry classification, to cluster the keywords and identify the keywords of the upstream, midstream and downstream fields of the industrial chain; specifically:
representing the K keywords generated in step two as K high-dimensional vectors $W_1, W_2, \ldots, W_K$ using the word vector table generated by Word2Vec, and then clustering the K word vectors with the KMeans clustering algorithm; finding the optimal number of cluster categories M and obtaining the cluster center of each category:
$$C_1, C_2, \ldots, C_M$$
comparing the main keyword of each category with the national economic industry classification, judging whether the keywords of each category relate to the upstream, midstream or downstream field, and assigning the keywords of each category to the corresponding field, thereby identifying the keywords that describe the main business of the upstream, midstream and downstream fields of the industrial chain;
step four, on the basis of the candidate enterprise list screened in step one, further screening the enterprises whose "business scope" contains any of the K keywords, and determining the upstream or downstream position of each enterprise according to the main category labels of the keywords it hits;
step five, constructing features from multiple dimensions for the enterprises in the industrial chain, and building an industrial chain evaluation score model to evaluate the comprehensive strength of each enterprise; the dimensions include at least: basic enterprise information, loan information, judicial litigation, enterprise operations, enterprise fund transaction features, external public opinion data, and relationship network/graph features;
extracting the graph features of each enterprise in the industrial chain using graph computation and a community discovery algorithm: in-degree, centrality, PageRank value, eccentricity, subgraph size and community size; at the same time outputting the community to which each enterprise belongs, further analyzing the features of the enterprises within each community, and comprehensively evaluating, by means of the knowledge graph, each enterprise's own strength and its influence on associated enterprises;
constructing and calculating the multi-dimensional features for sample enterprises, and training a binary classifier using a logistic regression model to obtain the industrial chain evaluation score model; the evaluation score of each enterprise in the industrial chain is output by calling the industrial chain evaluation score model;
step six, based on the enterprise list of each industrial chain determined in step four, in which each enterprise is located at an upstream, midstream or downstream position, and on the enterprise scores generated in step five, screening out the top-ranked enterprises among the upstream, midstream and downstream enterprises respectively, and outputting them as the list of core enterprises in the upstream, midstream and downstream of the industrial chain.
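As a rough illustration of step six, assuming each enterprise already carries a position label from step four and a score from step five, the per-position ranking might look like the sketch below. The record fields `name`, `position` and `score`, the function `core_enterprises` and the sample values are all hypothetical.

```python
def core_enterprises(enterprises, top_n):
    """Return the top_n highest-scoring enterprise names for each chain position."""
    by_position = {}
    for ent in enterprises:
        by_position.setdefault(ent["position"], []).append(ent)
    return {
        position: [
            ent["name"]
            for ent in sorted(group, key=lambda e: e["score"], reverse=True)[:top_n]
        ]
        for position, group in by_position.items()
    }


# Hypothetical scored enterprises (scores are model outputs in [0, 1]).
scored = [
    {"name": "U1", "position": "upstream", "score": 0.91},
    {"name": "U2", "position": "upstream", "score": 0.40},
    {"name": "M1", "position": "midstream", "score": 0.75},
    {"name": "M2", "position": "midstream", "score": 0.88},
    {"name": "D1", "position": "downstream", "score": 0.63},
]
core_list = core_enterprises(scored, top_n=1)
```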
Further, the similarity between the candidate word in the second step and the topic of the industry chain is calculated in the following way:
converting the candidate words into word vectors by training a Word2Vec word vector model on a corpus, and then quantitatively analyzing the similarity between each candidate word and the industrial chain topic to obtain the similarity.
Further, the third step includes the following steps:
representing the keywords by word vectors generated by Word2Vec, and clustering the word vectors with the KMeans clustering algorithm; finding the optimal number of cluster categories by the elbow method and obtaining the cluster center of each category; thereby dividing the keywords into different categories, and selecting the cluster center of each category as a main keyword of that category; and, according to how close the keywords of each category are to one another and in combination with domain knowledge and the national economic industry classification, merging the category keywords from the perspectives of the upstream, midstream and downstream of the industrial chain, so as to identify the keywords that describe the main business of the upstream, midstream and downstream fields of the industrial chain.
Further, the fourth step includes the following: if the business scope of an enterprise hits some of the K keywords, and more than a given proportion of the hit keywords are contained in the keyword list describing the main business of one position (upstream, midstream or downstream), the enterprise is judged to belong to that position.
Further, in the fifth step, the relationship network/graph features are obtained as follows: an association network is constructed by means of knowledge graph technology, based on data among the enterprises in the industrial chain covering shareholder relationships, fund transactions, group relationships, subsidiaries/branches, outbound investment, external guarantees, key personnel information and litigation relationships.
Further, the community discovery algorithm uses a Louvain algorithm, the optimization goal is to maximize the modularity of the whole data, and the modularity is calculated as follows:
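$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$
where $m$ is the total number of edges in the graph (the total edge weight for a weighted graph), $k_i$ is the sum of the weights of all edges connected to vertex $i$ (and likewise $k_j$ for vertex $j$), $A_{ij}$ is the weight of the edge between vertices $i$ and $j$, $c_i$ is the community to which vertex $i$ is assigned, and $\delta(c_i, c_j)$ is 1 if vertices $i$ and $j$ are in the same community and 0 otherwise.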
Further, the industry chain evaluation score model is as follows:
$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}}$$
where $P$ is the probability that an enterprise is a core enterprise, $x_1, x_2, \ldots, x_k$ are the k feature values constructed from the dimensions, $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients of the k features.
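A minimal sketch of how a fitted model of this form could be evaluated for one enterprise is given below. The function name `core_probability`, the coefficient values and the feature values are illustrative assumptions; in the method itself the coefficients would come from training on labeled sample enterprises.

```python
import math


def core_probability(features, intercept, coefficients):
    """Logistic model: P = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Evaluate the sigmoid in a form that cannot overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical intercept, coefficients and already-scaled feature values.
beta_0 = -1.0
betas = [2.0, 0.5]
strong = core_probability([1.0, 1.0], beta_0, betas)  # linear term z = 1.5
weak = core_probability([0.0, 0.0], beta_0, betas)    # linear term z = -1.0
```

Enterprises can then be ranked by this probability, which is what the later ranking step consumes as the evaluation score.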
The invention also provides an industrial chain identification system based on the NLP technology and the knowledge graph, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the industrial chain identification method based on the NLP technology and the knowledge graph when being loaded to the processor.
The invention has the beneficial effects that:
By means of NLP technology, keywords that locate the main business of the upstream and downstream of the industrial chain are extracted and used to screen upstream and downstream enterprises; knowledge graph technology is then combined with a supervised machine learning algorithm to build an industrial chain evaluation score model, which identifies the core enterprises in the industrial chain. The keyword extraction algorithm adopted is highly adaptable, the feature dimensions constructed are rich and multi-faceted, and the keyword output is accurate. The industrial chain evaluation score model gives each enterprise a quantitative evaluation and, combined with the keyword-based positioning of enterprises as upstream or downstream, allows the upstream and downstream enterprises of the industrial chain to be screened accurately. The method can also construct an association network among the enterprises in an industrial chain and, when an enterprise in a community faces operating risk, predict the direction in which the risk will propagate and the extent of its impact, thereby helping to identify potential risks.
Drawings
Fig. 1 is a flowchart of an industrial chain identification method based on NLP technology and knowledge graph provided by the present invention.
FIG. 2 is a schematic diagram of a platform architecture for implementing the present invention.
Fig. 3 is a schematic diagram of a process for constructing an industry chain evaluation score model according to the present invention.
Fig. 4 is a schematic diagram of keyword cluster analysis.
Fig. 5 is a schematic diagram of upstream and downstream core enterprise identification.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention. Additionally, the steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions and, although a logical order is illustrated in the flow charts, in some cases, the steps illustrated or described may be performed in an order different than here.
The invention provides an industrial chain identification method based on NLP technology and a knowledge graph. The supporting big data platform architecture mainly comprises a general data interface, a computing framework, a general algorithm framework, data applications and the like. The big data platform supports storage of large data volumes and efficient data query and interaction; the general algorithm framework adopts a distributed architecture, supports testing on large data volumes and large-scale cluster deployment, and supports exploration, analysis and mining of massive structured and unstructured data. The flow of the industrial chain identification method is shown in Fig. 1 and comprises the following steps:
Step one: for a specific industrial chain, word segmentation is applied to the "business scope" information of enterprises, the result is cleaned, and candidate words of the industrial chain are extracted. Taking the "Internet of Vehicles" industrial chain as an example, a list of enterprises whose "business scope" information contains the field "Internet of Vehicles" is first screened from enterprise industrial and commercial registration information; the "business scope" information of the selected enterprises is then segmented to generate candidate words, and stop words such as "ground", "yes" and "o", which appear with high frequency but do not help identify the industrial chain, are removed.
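The screening and segmentation in step one can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the stop-word set and the functions `filter_enterprises` and `candidate_words` are hypothetical, the sample text is English, and a regular expression stands in for a real word segmenter (Chinese business-scope text would need a dedicated segmentation tool).

```python
import re

# Hypothetical stop-word list; a real deployment would load a maintained list.
STOP_WORDS = {"of", "and", "the", "is"}


def filter_enterprises(records, chain_term):
    """Keep enterprises whose business scope mentions the chain term."""
    term = chain_term.lower()
    return [r for r in records if term in r["business_scope"].lower()]


def candidate_words(business_scope):
    """Tokenize a business-scope string and drop stop words.

    The regex tokenizer is a stand-in for a proper word segmenter.
    """
    tokens = re.findall(r"[a-z]+", business_scope.lower())
    return [t for t in tokens if t not in STOP_WORDS]


records = [
    {"name": "A", "business_scope": "Internet of Vehicles technology development; "
                                    "manufacturing of instruments and meters"},
    {"name": "B", "business_scope": "Catering services"},
]
selected = filter_enterprises(records, "Internet of Vehicles")
words = [candidate_words(r["business_scope"]) for r in selected]
```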
Step two: for the candidate words extracted in step one, features are constructed from multiple dimensions and scored using an unsupervised NLP algorithm; the scores are fused by linear weighting, and the K candidate words with the highest total scores (TOP-K) are output as the final keywords.
Supervised keyword extraction requires a large amount of labeled data, incurs high labor cost, and needs a word list that is maintained continuously; the invention therefore mainly adopts an unsupervised keyword extraction algorithm, which is more widely applicable.
The features of the candidate words are constructed mainly from the following dimensions:
(1) TF-IDF values of candidate words:
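TF-IDF = term frequency (TF) × inverse document frequency (IDF)
where
$$\mathrm{TF} = \frac{\text{number of occurrences of the word in the document}}{\text{total number of words in the document}}$$
$$\mathrm{IDF} = \log\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}$$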
In general, if a word appears frequently in one document and rarely in other documents, i.e. its TF-IDF value is high, the word is considered to discriminate well between categories.
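The TF-IDF feature can be sketched in a few lines of plain Python. The sketch assumes the common convention TF = count / document length and IDF = log(N / (df + 1)); the smoothing term and the toy documents are assumptions, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf} dict per tokenized document.

    TF  = occurrences of the word in the document / words in the document
    IDF = log(number of documents / (documents containing the word + 1))
    The +1 smoothing is an assumed convention.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            word: (count / total) * math.log(n_docs / (doc_freq[word] + 1))
            for word, count in counts.items()
        })
    return scores


# Toy tokenized business scopes (hypothetical).
docs = [
    ["vehicle", "sensor", "sensor", "sales"],
    ["catering", "sales"],
    ["software", "sales"],
    ["logistics", "sales"],
]
scores = tf_idf(docs)
```

In this toy corpus "sales" occurs in every document and therefore scores lowest, while "sensor", frequent in one document only, scores highest.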
(2) Subject information characteristics of candidate words
Two main indicators are considered: whether the candidate word appears in the "enterprise name", and whether it appears in the "affiliated industry classification".
If a candidate word appears in the "enterprise name" or "affiliated industry classification" field, it can largely pin down the main business of the enterprise; accordingly, these indicators may be given higher weights when the multi-dimensional feature scores are fused. For example, in the raw-material production field upstream of the "Internet of Vehicles" industrial chain, for an enterprise whose name contains "instruments and meters manufacturing", or whose industry classification is "special-purpose instruments and meters manufacturing", the term "instruments and meters" largely identifies the main business of the enterprise.
Specifically, if the candidate word is contained in the "enterprise name", the "name subject information" feature is 1, otherwise 0; if the candidate word is contained in the "affiliated industry classification", the "industry subject information" feature is 1, otherwise 0.
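The two binary subject features reduce to substring checks. The helper below is a hypothetical sketch; the enterprise name and industry classification strings are made-up English stand-ins for the corresponding registration fields.

```python
def subject_features(word, enterprise_name, industry_class):
    """Return (name subject information, industry subject information) as 0/1 flags."""
    word = word.lower()
    return (
        int(word in enterprise_name.lower()),
        int(word in industry_class.lower()),
    )


# Hypothetical registration fields.
name_flag, industry_flag = subject_features(
    "instruments and meters",
    "Example Instruments and Meters Manufacturing Co., Ltd.",
    "Special-purpose instruments and meters manufacturing",
)
other_flags = subject_features("catering", "Example Instruments Co.", "Instrument manufacturing")
```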
(3) Position information characteristic of candidate word
This feature mainly considers the position index at which the candidate word first appears in the text. In general, candidate words at the beginning or end of the text (for example the first 10% or the last 10%) are more representative and more likely to describe the main business of the enterprise, so they should receive a higher position score. Specifically, the position index of the first occurrence of the candidate word in the text is extracted, and:
a) if the position index falls within the first or last 10% of the text, the position information feature is 10;
b) if the position index falls within 10%-30% from the beginning or the end, the position information feature is 5;
c) if the position index falls within 30%-50% from the beginning or the end, the position information feature is 1.
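The three position bands above can be sketched as a small function. How the relative position is measured (token index divided by the last index) and how ties at the 10% and 30% boundaries are resolved are assumptions, since the text does not fix them; `position_feature` is a hypothetical name.

```python
def position_feature(tokens, word):
    """Score a word by where it first appears in the token list (10, 5 or 1)."""
    index = tokens.index(word)                 # first occurrence; ValueError if absent
    relative = index / max(len(tokens) - 1, 1) # 0.0 = first token, 1.0 = last token
    edge = min(relative, 1.0 - relative)       # distance to the nearer end of the text
    if edge <= 0.10:
        return 10
    if edge <= 0.30:
        return 5
    return 1


tokens = [f"w{i}" for i in range(11)]  # w0 .. w10
first = position_feature(tokens, "w0")
last = position_feature(tokens, "w10")
near_start = position_feature(tokens, "w2")
middle = position_feature(tokens, "w5")
```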
(4) Similarity of candidate words to industry chain topics
Taking the "Internet of Vehicles" industrial chain as an example, quantifying the relationship between a candidate word and "Internet of Vehicles" requires converting natural-language words into numerical form: word vectors are introduced so that each word becomes a point in a high-dimensional space, and the similarity between words is then evaluated from the distance between the corresponding points.
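Once words are vectors, the similarity to the chain topic is a single cosine computation. The sketch below uses made-up three-dimensional vectors rather than trained embeddings; the dictionary keys and the function name `cosine_similarity` are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors with hypothetical values, not trained embeddings.
vectors = {
    "internet_of_vehicles": [0.9, 0.1, 0.3],
    "vehicle_terminal": [0.8, 0.2, 0.4],
    "catering": [0.0, 1.0, 0.1],
}
topic = vectors["internet_of_vehicles"]
similarity = {word: cosine_similarity(topic, vec) for word, vec in vectors.items()}
```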
To convert words into word vectors, the Word2Vec word vector model may be used. Word2Vec is a word vector training tool that uses a shallow neural network to learn automatically how words occur in a corpus. After a Word2Vec model is trained on the corpus, the term "Internet of Vehicles" and the candidate words generated in step one can be embedded in a high-dimensional space and represented there as word vectors, so that the similarity between each candidate word and "Internet of Vehicles" can be analyzed quantitatively. Word2Vec similarity is computed as the cosine of the angle between two word vectors (taken here to lie between 0 and 1, although cosine similarity can in general range from -1 to 1); the larger the value, the more closely the candidate word is associated with "Internet of Vehicles".
This completes the calculation of the features in each dimension. All features are then normalized to a common range of 0-100, and the quantified feature indices are fused by linear weighting to obtain the score of each candidate word:
$$\text{score} = w_1 \cdot \text{TF-IDF value} + w_2 \cdot \text{name subject information} + w_3 \cdot \text{industry subject information} + w_4 \cdot \text{position information} + w_5 \cdot \text{similarity value}$$
where $w_1$ to $w_5$ are the weights of the respective quantified feature indices. Finally, the K candidate words with the highest total scores (TOP-K, e.g. TOP50; K can be adjusted as needed and is generally set empirically) are output as the final keywords.
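The normalization and weighted fusion can be sketched as below. Min-max scaling is one possible way to bring features into the 0-100 range and is an assumption here, as are the weight values, the feature values and the names `min_max_0_100` and `top_k_keywords`.

```python
FEATURES = ["tfidf", "name_subject", "industry_subject", "position", "similarity"]

# Hypothetical weights w1..w5; the text leaves their values open.
WEIGHTS = {"tfidf": 0.20, "name_subject": 0.25, "industry_subject": 0.25,
           "position": 0.10, "similarity": 0.20}


def min_max_0_100(values):
    """Scale a list of numbers linearly to the range 0-100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def top_k_keywords(candidates, k):
    """Normalize each feature, fuse by linear weighting, return (top-k words, scores)."""
    words = list(candidates)
    normalized = {f: min_max_0_100([candidates[w][f] for w in words]) for f in FEATURES}
    scores = {
        w: sum(WEIGHTS[f] * normalized[f][i] for f in FEATURES)
        for i, w in enumerate(words)
    }
    ranked = sorted(words, key=lambda w: scores[w], reverse=True)
    return ranked[:k], scores


# Hypothetical raw feature values for three candidate words.
candidates = {
    "instruments": {"tfidf": 0.30, "name_subject": 1, "industry_subject": 1,
                    "position": 10, "similarity": 0.8},
    "sales": {"tfidf": 0.05, "name_subject": 0, "industry_subject": 0,
              "position": 1, "similarity": 0.2},
    "sensor": {"tfidf": 0.20, "name_subject": 0, "industry_subject": 1,
               "position": 5, "similarity": 0.9},
}
keywords, scores = top_k_keywords(candidates, k=2)
```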
And step three, clustering analysis is carried out on the keywords by applying a clustering algorithm and combining domain knowledge and national economy industry classification, and keyword identification in the upstream and downstream domains of the industrial chain is carried out.
Clustering algorithms aim to discover the relationships between data objects, grouping data such that the similarity within a group is as large as possible and the similarity between groups is as small as possible.
Specifically, after the K keywords are generated in step two, they are represented by Word2Vec word vectors, i.e. K high-dimensional vectors $W_1, W_2, \ldots, W_K$; the K word vectors are then clustered with the KMeans clustering algorithm. The optimal number of cluster categories can be found by the elbow method, e.g. M categories (M < K), and the cluster center of each category is obtained, namely:
$$C_1, C_2, \ldots, C_M$$
In this way the keywords are divided into different categories; the keywords within a category are highly similar and mostly concentrated in the same industry field, and the cluster center of each category can be selected as a main keyword of that category.
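The clustering step can be illustrated with a small pure-Python k-means. Everything below is a sketch under assumptions: the two-dimensional points stand in for word vectors, the farthest-point initialization replaces the usual random initialization to keep the run deterministic, and the elbow is picked by a simple threshold rule (stop when one more cluster reduces the inertia by no more than 5% of the single-cluster inertia) rather than by inspecting a plot.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k):
    """Deterministic farthest-point initialization (an assumed simplification)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (centers, labels, inertia)."""
    centers = init_centers(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centers[c])  # keep the center of an empty cluster
        if updated == centers:
            break
        centers = updated
    labels = assign(points, centers)
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, inertia


def elbow_k(points, max_k, min_gain=0.05):
    """Smallest k after which one more cluster gains <= min_gain of the k=1 inertia."""
    inertias = {k: kmeans(points, k)[2] for k in range(1, max_k + 1)}
    for k in range(1, max_k):
        if inertias[k] - inertias[k + 1] <= min_gain * inertias[1]:
            return k, inertias
    return max_k, inertias


# Toy 2-D "word vectors" forming three well-separated groups (hypothetical data).
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [20, 0], [20, 1], [21, 0]]
best_k, inertias = elbow_k(points, max_k=5)
centers, labels, inertia = kmeans(points, best_k)
```

In practice a library implementation of KMeans would normally be used; the point of the sketch is only to show how the inertia curve drives the choice of M.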
Next, the main keywords of each category (i.e. the category centers $C_1, C_2, \ldots, C_M$) are compared with the national economic industry classification and, in combination with domain knowledge, each category of keywords is judged to relate to upstream production materials, midstream manufacturing and research and development, or downstream sales and services. The keywords of each category are then assigned to the corresponding field, thereby identifying the keywords that describe the main business of the upstream, midstream and downstream fields of the industrial chain. That is, the K keywords are subdivided by upstream, midstream and downstream field, yielding n1, n2 and n3 keywords describing the main business of the upstream, midstream and downstream fields respectively, as shown in Fig. 4.
Step four: on the basis of the candidate enterprise list screened in step one, enterprises whose "business scope" contains any of the K keywords (obtained in step two) are further screened, and the upstream or downstream position of each enterprise is determined according to the main category labels of the keywords it hits.
Specifically, suppose the "business scope" of an enterprise hits only some of the K keywords (say L of them, L < K), and most of those L keywords (for example more than 70%; the threshold can be adjusted as needed) are contained in the list of n2 keywords describing the main business of the midstream field; the enterprise is then judged to be at the midstream position of the industrial chain.
Based on the steps, the enterprise list in each industry chain is finally determined, and the upstream and downstream positions of each enterprise in the industry chain are located.
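The hit-proportion rule of step four can be sketched as follows. The keyword lists, the function `locate_enterprise` and the sample business-scope tokens are hypothetical; the 70% threshold is the example value from the text and is applied as a strict "more than".

```python
def locate_enterprise(scope_words, position_keywords, min_share=0.7):
    """Return the chain position whose keyword list holds more than min_share
    of the keywords hit by the enterprise, or None if no position qualifies."""
    all_keywords = set().union(*position_keywords.values())
    hits = set(scope_words) & all_keywords
    if not hits:
        return None
    for position, keywords in position_keywords.items():
        if len(hits & keywords) / len(hits) > min_share:
            return position
    return None


# Hypothetical keyword lists for the three fields (n1, n2, n3 keywords).
position_keywords = {
    "upstream": {"sensor", "chip", "instruments"},
    "midstream": {"terminal", "module", "platform", "assembly"},
    "downstream": {"sales", "maintenance", "leasing"},
}
midstream_case = locate_enterprise(
    ["terminal", "module", "platform", "sales", "catering"], position_keywords)
undecided_case = locate_enterprise(["sensor", "sales"], position_keywords)
```

Here the first enterprise hits four keywords, three of them midstream (75%), so it is placed midstream; the second splits evenly and is left unassigned.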
And fifthly, constructing characteristics from multiple dimensions for enterprises in the industrial chain, and constructing an industrial chain evaluation score model to evaluate the comprehensive strength of the enterprises.
The primary dimension data includes the following aspects:
1. Basic enterprise information: enterprise scale, industry, industry ranking, registered capital, social insurance payment information, etc.
2. Loan information: loan scale, loan trend, repayment behavior, etc.
3. Judicial litigation: type of litigation, amount in dispute, whether the enterprise is subject to enforcement, listed as dishonest or subject to consumption restrictions, etc.
4. Enterprise operations: assets and liabilities, administrative licenses/penalties, tax rating, bidding, software copyrights, patent applications, online recruitment, honors and qualifications, the time-series features of the above, etc.
5. Enterprise fund transaction features: time, region, frequency, number and amount of inflows and outflows, counterparties, etc.
6. External public opinion data: company announcements, news, financial reports, etc.
7. Relational network/graph features.
The relational network/graph features are obtained by constructing an association network with knowledge graph technology, based on data among enterprises in the industry chain such as shareholder relationships, fund transactions, group relationships, subsidiaries/branch companies, outbound investment, external guarantees, key personnel information (such as board directors), and parties to litigation.
By using graph computation and a community discovery algorithm, graph features can be extracted for each enterprise in the industry chain, such as in-degree, centrality, PageRank value, eccentricity, subgraph size, and community size. The community to which each enterprise belongs can also be output, and the characteristics of the enterprises in each community (enterprise scale, registered capital, industry ranking, etc.) can be further analyzed. With the knowledge graph, both the strength of an enterprise itself and its influence on associated enterprises can be comprehensively evaluated.
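As an illustration of two of the graph features named above, the sketch below computes in-degree and PageRank on a small directed enterprise graph using only the Python standard library. The edge list, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example; the text does not specify how these features are computed.

```python
def in_degree(edges):
    """Count incoming edges per node for a list of (source, target) pairs."""
    degree = {}
    for src, dst in edges:
        degree.setdefault(src, 0)
        degree[dst] = degree.get(dst, 0) + 1
    return degree


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; rank held by nodes without outgoing
    edges is redistributed uniformly so the ranks keep summing to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for w in out_links[v]:
                    new_rank[w] += share
        rank = new_rank
    return rank


# Hypothetical relations: an edge (A, C) could mean "A invests in C".
EDGES = [("A", "C"), ("B", "C"), ("C", "D")]
print(in_degree(EDGES))
print(pagerank(EDGES))
```

In this toy graph, enterprise C is pointed to by both A and B, so it receives a higher in-degree and PageRank than either of them.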
The community discovery algorithm mainly uses the Louvain algorithm, a modularity-based community detection algorithm whose optimization goal is to maximize the modularity of the whole graph. The modularity is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
where $m$ is the total number of edges in the graph, $k_i$ is the sum of the weights of all edges incident to vertex $i$ ($k_j$ is defined likewise for vertex $j$), $A_{ij}$ is the weight of the edge between vertices $i$ and $j$, $c_i$ is the community to which vertex $i$ is assigned, and $\delta(c_i, c_j)$ indicates whether vertices $i$ and $j$ are assigned to the same community: it returns 1 if so and 0 otherwise.
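To make the modularity quantity concrete, the following sketch evaluates the standard modularity of a given community assignment on an undirected weighted graph. It only scores a fixed partition and does not implement the Louvain optimization itself. The function name, the edge-list format, and the toy graph are assumptions; self-loops are assumed absent, and for weighted graphs m is taken as the total edge weight.

```python
def modularity(edges, community):
    """Modularity of a partition.

    edges: list of (u, v, weight) for an undirected graph without self-loops.
    community: dict mapping each vertex to its community label.
    """
    m = sum(weight for _, _, weight in edges)  # total edge weight
    adjacency = {}  # A_ij, stored symmetrically
    strength = {}   # k_i, sum of weights of edges incident to i
    for u, v, weight in edges:
        adjacency[(u, v)] = adjacency.get((u, v), 0.0) + weight
        adjacency[(v, u)] = adjacency.get((v, u), 0.0) + weight
        strength[u] = strength.get(u, 0.0) + weight
        strength[v] = strength.get(v, 0.0) + weight
    total = 0.0
    for i in strength:
        for j in strength:
            if community[i] == community[j]:  # delta(c_i, c_j) = 1
                total += adjacency.get((i, j), 0.0) - strength[i] * strength[j] / (2 * m)
    return total / (2 * m)


# Two triangles joined by a single edge (illustrative data).
EDGES = [(1, 2, 1), (2, 3, 1), (1, 3, 1),
         (4, 5, 1), (5, 6, 1), (4, 6, 1),
         (3, 4, 1)]
two_groups = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_group = {v: "a" for v in range(1, 7)}
print(modularity(EDGES, two_groups))  # 5/14, about 0.357
print(modularity(EDGES, one_group))   # 0.0
```

Splitting the graph into its two triangles scores higher than putting every vertex into one community, which is the kind of partition a modularity-maximizing algorithm would prefer.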
The industry chain evaluation score model is a binary classifier. The multi-dimensional features above are constructed and calculated for sample enterprises (enterprises in the industry chain that have been verified and labeled: label = 1 if the enterprise is a core enterprise of the industry chain, and label = 0 otherwise), and the classifier is trained on them with a Logistic regression model:
$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}}$$
where $P$ is the probability that an enterprise is a core enterprise, $x_1, x_2, \ldots, x_k$ are the k feature values constructed from the above dimensions, $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients of the k features.
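A minimal sketch of the Logistic regression scoring described above, written with the standard library only. The training routine (plain batch gradient descent), the learning rate, the epoch count, and the two-feature toy data are assumptions for illustration; the text does not state how the coefficients are fitted.

```python
import math


def predict_proba(x, beta0, betas):
    """P = 1 / (1 + exp(-(beta0 + sum_j beta_j * x_j)))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit the intercept and coefficients by batch gradient descent on log-loss."""
    k, n = len(X[0]), len(X)
    beta0, betas = 0.0, [0.0] * k
    for _ in range(epochs):
        grad0, grads = 0.0, [0.0] * k
        for xi, yi in zip(X, y):
            err = predict_proba(xi, beta0, betas) - yi
            grad0 += err
            for j in range(k):
                grads[j] += err * xi[j]
        beta0 -= lr * grad0 / n
        betas = [b - lr * g / n for b, g in zip(betas, grads)]
    return beta0, betas


# Hypothetical normalized features, e.g. [registered capital, PageRank value].
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
y = [0, 0, 1, 1]  # label = 1 for verified core enterprises

beta0, betas = fit_logistic(X, y)
print(predict_proba([0.9, 0.9], beta0, betas))  # score of a strong enterprise
print(predict_proba([0.1, 0.1], beta0, betas))  # score of a weak enterprise
```

The returned probability plays the role of the evaluation score: the enterprise with larger feature values receives the higher score in this toy setup.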
The whole model training and prediction process mainly comprises the following stages (see fig. 3 for details):
1) Data access: most of the data is stored on a big data platform and can be accessed directly through SQL.
2) Feature processing: the accessed data is processed into feature index data; the processing logic is integrated into stored procedures and wrapped in shell scripts so that it can be called and computed on a regular schedule.
3) Model training: based on the processed features and sample labels, a model is trained with a supervised machine learning algorithm or a graph algorithm in the algorithm framework, then exported (for example, as a PMML file) and deployed online.
4) Batch prediction: the feature data of the enterprises to be predicted is input into the model, which outputs the prediction results.
By calling the industry chain evaluation score model, an evaluation score (a probability value between 0 and 1) can be output for each enterprise in the industry chain. The higher the score, the stronger the comprehensive strength of the enterprise and the more likely it is to be a core enterprise of the industry chain.
Step six: based on step four, the enterprise list of each industry chain can be determined, and each enterprise can be located at an upstream, midstream, or downstream position in the chain. Based on step five, the corresponding enterprise scores are generated. The TOP-N (for example, TOP-10) enterprises by score are then screened out from the upstream, midstream, and downstream enterprises respectively as the core enterprise lists for the upstream, midstream, and downstream of the industry chain, so that typical companies in the industry can be identified, as shown in fig. 5.
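The per-position TOP-N selection of step six can be sketched as below. The function name, the tuple layout of the input, and the sample scores are assumptions for illustration; ties are broken by input order here, which the text does not specify.

```python
def core_enterprises(scored, top_n=10):
    """Group (name, position, score) records by chain position and
    keep the top_n names per position, highest score first."""
    by_position = {"upstream": [], "midstream": [], "downstream": []}
    for name, position, score in scored:
        by_position[position].append((score, name))
    return {
        position: [name for _, name in sorted(items, key=lambda t: -t[0])[:top_n]]
        for position, items in by_position.items()
    }


# Hypothetical enterprises with positions from step four and scores from step five.
SCORED = [
    ("Enterprise A", "upstream", 0.91),
    ("Enterprise B", "upstream", 0.42),
    ("Enterprise C", "midstream", 0.77),
    ("Enterprise D", "midstream", 0.85),
    ("Enterprise E", "downstream", 0.63),
]
print(core_enterprises(SCORED, top_n=1))
```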
The method of the present invention is implemented on the platform architecture shown in fig. 2, which comprises the following layers:
The data layer comprises enterprise industrial and commercial basic information data, loan data, business data, judicial litigation data, and the like.
The cleaning layer cleans and fuses the data.
The index layer processes feature indexes from different dimensions and updates them periodically and iteratively.
The model layer comprises NLP-related algorithms such as keyword extraction and word vector representation, knowledge graph algorithms, and common machine learning algorithms.
The application layer comprises modules such as industry chain identification, typical enterprise identification, and enterprise risk conduction.
Based on the same inventive concept, an embodiment of the present invention further provides an industry chain identification system based on NLP technology and knowledge graph, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When loaded into the processor, the computer program implements the above industry chain identification method based on NLP technology and knowledge graph. The system comprises an industry chain keyword extraction module and an industry chain enterprise identification module. The industry chain keyword extraction module comprises an industry chain candidate word extraction unit, a keyword output unit, and a cluster analysis unit, which implement the functions of step one, step two, and step three respectively. The industry chain enterprise identification module comprises an industry chain enterprise identification unit, an industry chain evaluation scoring unit, and an industry chain enterprise output unit, which implement the functions of step four, step five, and step six respectively.
It should be noted that the above only illustrates the technical idea of the present invention and does not limit its scope of protection. Those skilled in the art can make several modifications and refinements without departing from the principle of the present invention, and such modifications and refinements fall within the protection scope of the claims of the present invention.

Claims (8)

1. An industrial chain identification method based on NLP technology and knowledge graph is characterized by comprising the following steps:
step one: for a specific industrial chain, screening, from enterprise industrial and commercial information, a list of enterprises whose "business scope" information contains fields related to the industrial chain; segmenting and cleaning the enterprises' "business scope" information using word segmentation technology, removing stop words, and extracting candidate words of the industrial chain;
step two: for the candidate words extracted in step one, constructing features of the candidate words from multiple dimensions and calculating their scores based on an NLP unsupervised learning algorithm, fusing the scores by linear weighting, and outputting the candidate words whose total scores rank in the TOP-K as the final keywords; the dimensions comprise the TF-IDF value of a candidate word, the subject information features of the candidate word, the position information feature of the candidate word, and the similarity between the candidate word and the industrial chain topic;
the TF-IDF value of a candidate word is calculated by:
TF-IDF = term frequency (TF) × inverse document frequency (IDF)
where:
Figure FDA0003642740630000011
Figure FDA0003642740630000012
the subject information features of a candidate word are assigned according to whether the candidate word appears in the enterprise name or in the industry to which the enterprise belongs;
the position information feature of a candidate word is obtained from the position index at which the candidate word first appears in the text;
the fusion formula is as follows:
$$\text{score} = w_1 \cdot \text{TF-IDF value} + w_2 \cdot \text{name subject information} + w_3 \cdot \text{industry subject information} + w_4 \cdot \text{position information} + w_5 \cdot \text{similarity value}$$
where $w_1$ to $w_5$ are the weights of the respective feature indexes;
step three: performing cluster analysis on the keywords using a clustering algorithm in combination with domain knowledge and the national economy industry classification, and identifying the keywords of the upstream, midstream, and downstream domains of the industrial chain; specifically comprising:
representing the K keywords generated in step two as K high-dimensional vectors $W_1, W_2, \ldots, W_K$ using word vectors generated by Word2Vec, and then clustering the K word vectors with the KMeans clustering algorithm; finding the optimal number of cluster categories and obtaining the cluster center of each category:
Figure FDA0003642740630000013
comparing the main keywords of each category with the national economy industry classification, judging the correlation between each category of keywords and the upstream, midstream, and downstream domains, and then assigning each category of keywords to the corresponding domain, thereby respectively identifying the keywords describing the main business of the upstream, midstream, and downstream domains of the industrial chain;
step four: on the basis of the candidate enterprise list screened in step one, further screening for enterprises whose "business scope" contains the K keywords, and determining the upstream, midstream, or downstream position of each enterprise according to the main category label of the keywords it hits;
step five: constructing features from multiple dimensions for the enterprises in the industrial chain, and building an industrial chain evaluation score model to evaluate the comprehensive strength of the enterprises; the multiple dimensions comprise at least: enterprise basic information, loan information, judicial litigation, enterprise operations, enterprise fund transaction features, external public opinion data, and relational network/graph features;
extracting graph features of each enterprise in the industrial chain using graph computation and a community discovery algorithm: in-degree, centrality, PageRank value, eccentricity, subgraph size, and community size; meanwhile, outputting the community to which each enterprise belongs, further analyzing the enterprise characteristics within each community, and comprehensively evaluating, by means of the knowledge graph, the strength of each enterprise itself and its influence on associated enterprises;
constructing and calculating the multi-dimensional features for sample enterprises, and training a binary classifier using a Logistic regression model to obtain the industrial chain evaluation score model; outputting the evaluation score of each enterprise in the industrial chain by calling the industrial chain evaluation score model;
step six: determining the enterprise list of each industrial chain based on step four, and locating whether each enterprise is at an upstream, midstream, or downstream position in the industrial chain; and, based on the enterprise scoring results generated in step five, respectively screening out the top-ranked enterprises by score from the upstream, midstream, and downstream enterprises and outputting them as the core enterprise lists for the upstream, midstream, and downstream of the industrial chain.
2. The NLP technology and knowledge-graph based industry chain identification method according to claim 1, wherein the similarity between the candidate words and the industry chain topics in the second step is calculated by:
converting the candidate words into word vectors by training a Word2Vec word vector model on the corpus, and then quantitatively analyzing the similarity between the candidate words and the industrial chain topic to obtain the similarity.
3. The NLP technology and knowledge-graph based industry chain identification method according to claim 1, wherein the third step comprises the following processes:
representing the keywords using word vectors generated by Word2Vec, and clustering the word vectors with the KMeans clustering algorithm; finding the optimal number of cluster categories according to the elbow method, and obtaining the cluster center of each category; thereby dividing the keywords into different categories, and selecting the cluster center of each category as the main keyword of that category; and, according to how close the categories of keywords are to one another, and in combination with domain knowledge and the national economy industry classification, merging the categories of keywords from the perspectives of the upstream, midstream, and downstream of the industrial chain respectively, so as to identify the keywords describing the main business of the upstream, midstream, and downstream domains of the industrial chain.
4. The NLP technology and knowledge-graph based industry chain identification method according to claim 1, wherein step four comprises the following process: if the business scope of an enterprise hits some of the K keywords, and more than a certain proportion of the hit keywords are contained in the keyword list describing the main business of a certain domain position among the upstream, midstream, and downstream, the enterprise is judged to belong to that domain position.
5. The NLP technology and knowledge-graph based industry chain identification method according to claim 1, wherein in the fifth step, the relationship network/graph characteristics are obtained by: and constructing an association relation network by means of knowledge graph technology based on stockholder relation, fund transaction, group relation, subsidiaries/branch companies, external investment, external guarantee, main personnel information and litigation relation subject data information among enterprises in the industry chain.
6. The NLP and knowledge-graph based industry chain identification method of claim 1, wherein the community discovery algorithm uses the Louvain algorithm, the optimization goal is to maximize the modularity of the whole data, and the modularity is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
where $m$ is the total number of edges in the graph, $k_i$ is the sum of the weights of all edges incident to vertex $i$ ($k_j$ is defined likewise for vertex $j$), and $A_{ij}$ is the weight of the edge between vertices $i$ and $j$.
7. The NLP and knowledge-graph based industry chain identification method of claim 1, wherein the industry chain evaluation score model is as follows:
$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}}$$
where $P$ is the probability that an enterprise is a core enterprise, $x_1, x_2, \ldots, x_k$ are the k feature values constructed from each dimension, $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients of the k features.
8. An NLP technology and knowledge-graph based industry chain identification system, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when loaded into the processor implementing the NLP technology and knowledge-graph based industry chain identification method of any one of claims 1 to 7.
CN202210519609.4A 2022-05-13 2022-05-13 Industry chain identification method and system based on NLP and knowledge graph Pending CN114880486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210519609.4A CN114880486A (en) 2022-05-13 2022-05-13 Industry chain identification method and system based on NLP and knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210519609.4A CN114880486A (en) 2022-05-13 2022-05-13 Industry chain identification method and system based on NLP and knowledge graph

Publications (1)

Publication Number Publication Date
CN114880486A true CN114880486A (en) 2022-08-09

Family

ID=82675473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210519609.4A Pending CN114880486A (en) 2022-05-13 2022-05-13 Industry chain identification method and system based on NLP and knowledge graph

Country Status (1)

Country Link
CN (1) CN114880486A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641202A (en) * 2022-10-28 2023-01-24 中山大学 Small loan industry group lending risk measurement method based on knowledge graph and graph calculation
CN115934968A (en) * 2023-01-06 2023-04-07 广州探迹科技有限公司 Industrial chain information construction method and device and storage medium
CN116663751A (en) * 2023-07-31 2023-08-29 北京市科学技术研究院 Three-network industry map construction method and system based on future industry enterprises
CN117291428A (en) * 2023-11-17 2023-12-26 南京雅利恒互联科技有限公司 Enterprise management APP-based data background management system
CN117633518A (en) * 2024-01-25 2024-03-01 北京大学 Industrial chain construction method and system
CN117634991A (en) * 2023-11-09 2024-03-01 北京云泽卓越科技发展有限公司 Industry chain model generation method and system
CN117670145A (en) * 2024-01-26 2024-03-08 中国标准化研究院 Knowledge graph-based industrial chain development quality evaluation method and system
CN117633518B (en) * 2024-01-25 2024-04-26 北京大学 Industrial chain construction method and system

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641202A (en) * 2022-10-28 2023-01-24 中山大学 Small loan industry group lending risk measurement method based on knowledge graph and graph calculation
CN115934968A (en) * 2023-01-06 2023-04-07 广州探迹科技有限公司 Industrial chain information construction method and device and storage medium
CN116663751A (en) * 2023-07-31 2023-08-29 北京市科学技术研究院 Three-network industry map construction method and system based on future industry enterprises
CN117634991A (en) * 2023-11-09 2024-03-01 北京云泽卓越科技发展有限公司 Industry chain model generation method and system
CN117291428A (en) * 2023-11-17 2023-12-26 南京雅利恒互联科技有限公司 Enterprise management APP-based data background management system
CN117291428B (en) * 2023-11-17 2024-03-08 南京雅利恒互联科技有限公司 Enterprise management APP-based data background management system
CN117633518A (en) * 2024-01-25 2024-03-01 北京大学 Industrial chain construction method and system
CN117633518B (en) * 2024-01-25 2024-04-26 北京大学 Industrial chain construction method and system
CN117670145A (en) * 2024-01-26 2024-03-08 中国标准化研究院 Knowledge graph-based industrial chain development quality evaluation method and system

Similar Documents

Publication Publication Date Title
Roy et al. A Machine Learning approach for automation of Resume Recommendation system
US11663254B2 (en) System and engine for seeded clustering of news events
Kim et al. Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis
Liu et al. Assessing product competitive advantages from the perspective of customers by mining user-generated content on social media
CN107066599B (en) Similar listed company enterprise retrieval classification method and system based on knowledge base reasoning
Çalı et al. Improved decisions for marketing, supply and purchasing: Mining big data through an integration of sentiment analysis and intuitionistic fuzzy multi criteria assessment
Noh et al. Keyword selection and processing strategy for applying text mining to patent analysis
Day et al. Deep learning for financial sentiment analysis on finance news providers
CN114880486A (en) Industry chain identification method and system based on NLP and knowledge graph
Su et al. Hidden sentiment association in chinese web opinion mining
Chen et al. Enhancement of stock market forecasting using an improved fundamental analysis-based approach
Liu et al. Riding the tide of sentiment change: sentiment analysis with evolving online reviews
Anoop et al. Aspect-oriented sentiment analysis: a topic modeling-powered approach
CN112100512A (en) Collaborative filtering recommendation method based on user clustering and project association analysis
Roh et al. Technology opportunity discovery by structuring user needs based on natural language processing and machine learning
CA2956627A1 (en) System and engine for seeded clustering of news events
Cai et al. PURA: a product-and-user oriented approach for requirement analysis from online reviews
Wahyudin et al. Cluster analysis for SME risk analysis documents based on Pillar K-Means
Li et al. Mining online reviews for ranking products: A novel method based on multiple classifiers and interval-valued intuitionistic fuzzy TOPSIS
KR102563539B1 (en) System for collecting and managing data of denial list and method thereof
Zola et al. Twitter alloy steel disambiguation and user relevance via one-class and two-class news titles classifiers
Du et al. An iterative reinforcement approach for fine-grained opinion mining
Ma et al. Identifying purchase intention through deep learning: analyzing the Q &D text of an E-Commerce platform
Han et al. An evidence-based credit evaluation ensemble framework for online retail SMEs
Sharma et al. Prediction of Customer Review's Helpfulness Based on Feature Engineering Driven Deep Learning Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination