CN114880486A

CN114880486A - Industry chain identification method and system based on NLP and knowledge graph

Info

Publication number: CN114880486A
Application number: CN202210519609.4A
Authority: CN
Inventors: 刘丽萍; 陈康; 袁轶慧; 齐宁; 朱巍; 周云松
Original assignee: Jiangsu United Credit Reference Co ltd
Current assignee: Jiangsu United Credit Reference Co ltd
Priority date: 2022-05-13
Filing date: 2022-05-13
Publication date: 2022-08-09

Abstract

The invention provides an industrial chain identification method and system based on NLP technology and a knowledge graph, comprising: extracting candidate words for a specific industrial chain; constructing features of the candidate words from multiple dimensions, calculating scores, fusing the scores, and outputting the final keywords; clustering the keywords in combination with domain knowledge and the national economy industry classification; screening enterprises whose information contains the keywords, and determining the upstream or downstream position of each enterprise according to the category labels of the keywords it hits; constructing multi-dimensional features for the enterprises in the industrial chain and building an industrial chain evaluation score model; and outputting core enterprise lists for the upstream, midstream, and downstream of the industrial chain based on the enterprise scoring results and the position of each enterprise in the chain. NLP technology is used to extract keywords that locate the main business of the upstream and downstream of an industrial chain, so that upstream and downstream enterprises can be screened; combined with knowledge graph technology, the core enterprises in the industrial chain can then be identified.

Description

Industry chain identification method and system based on NLP and knowledge graph

Technical Field

The invention belongs to the technical field of data processing, relates to NLP and knowledge graph technology, and particularly relates to a method and a system for identifying upstream and downstream enterprises in an industrial chain by using NLP and knowledge graph.

Background

An industrial chain is a chain-like association that forms objectively among industrial sectors on the basis of technical and economic links, following specific logical relationships and spatio-temporal layouts; in essence it describes a group of enterprises with certain internal associations. Within an industrial chain there are many upstream and downstream relationships and exchanges of value: an upstream link delivers products or services to a downstream link, and the downstream link feeds information back to the upstream link. If some enterprises run into business problems, the risk can propagate to, and have knock-on effects on, the enterprises upstream and downstream of them in the chain. The industrial chain therefore needs to be analyzed and identified. However, the prior art does not provide a scheme capable of accurately identifying an industrial chain.

Disclosure of Invention

In order to solve the problems, the invention provides an industrial chain identification method and system based on NLP technology and knowledge graph, which can identify upstream and downstream enterprises of the industrial chain.

In order to achieve the purpose, the technical scheme of the invention is as follows:

an industrial chain identification method based on NLP technology and knowledge graph comprises the following steps:

step one, for a specific industrial chain, screening from enterprise industrial and commercial registration information a list of enterprises whose "business scope" information contains fields related to the industrial chain; segmenting and cleaning the "business scope" information of these enterprises with word segmentation technology, eliminating stop words, and extracting candidate words of the industrial chain;

step two, for the candidate words extracted in step one, constructing features of the candidate words from multiple dimensions and calculating their scores based on an NLP unsupervised learning algorithm, fusing the scores by linear weighting, and outputting the candidate words whose total scores rank in the TOP-K as the final keywords; the dimensions comprise the TF-IDF values of the candidate words, the subject information features of the candidate words, the position information features of the candidate words, and the similarity between the candidate words and the industrial chain topic;

the TF-IDF value of a candidate word is calculated as:

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

where TF is the term frequency of the candidate word and IDF is its inverse document frequency;

the position information characteristics of the candidate words are obtained according to the position indexes of the candidate words appearing in the text for the first time;

the fusion formula is as follows:

$$\text{score} = w_1 \cdot \text{TFIDF value} + w_2 \cdot \text{name subject information} + w_3 \cdot \text{industry subject information} + w_4 \cdot \text{position information} + w_5 \cdot \text{similarity value}$$

where $w_1$ to $w_5$ are the weights of the respective feature quantization indices;

step three, performing cluster analysis on the keywords by applying a clustering algorithm in combination with domain knowledge and the national economy industry classification, and identifying the keywords of the upstream and downstream fields of the industrial chain; specifically:

representing the K keywords generated in step two by word vectors generated by Word2Vec, giving K high-dimensional vectors $W_1, W_2, \ldots, W_K$, and then clustering the K word vectors with the KMeans clustering algorithm; finding the optimal number of cluster categories and obtaining the cluster center of each category;

comparing the main keywords of each category with the national economy industry classification, judging the correlation between the keywords of each category and the upstream, midstream, and downstream fields, and then mapping the keywords of each category to the corresponding field, thereby respectively identifying the keywords that describe the main business of the upstream, midstream, and downstream fields of the industrial chain;

step four, further screening enterprises with K keywords in the 'operating range' on the basis of the list of the candidate enterprises screened in the step one, and determining the upstream and downstream positions of the enterprises according to the main category labels of the keywords hit by the enterprises;

fifthly, constructing characteristics from multiple dimensions for enterprises in an industrial chain, and constructing an industrial chain evaluation score model to evaluate the comprehensive strength of the enterprises; the plurality of dimensions includes at least: enterprise basic information, loan information, judicial complaints, enterprise operations, enterprise fund transaction characteristics, external public opinion data, and relationship network/graph characteristics;

extracting the graph features of each enterprise in the industrial chain by using graph computation and a community discovery algorithm: in-degree, centrality, PageRank value, eccentricity, subgraph size, and community size; meanwhile, outputting the community to which each enterprise belongs, further analyzing the characteristics of the enterprises in each community, and comprehensively evaluating, by means of the knowledge graph, the strength of each enterprise and its influence on associated enterprises;

constructing and calculating the multi-dimensional features for sample enterprises, and training a binary classifier with a logistic regression model to obtain the industry chain evaluation score model; outputting the evaluation score of each enterprise in the industrial chain by calling the industry chain evaluation score model;

step six, determining the enterprise list of each industrial chain based on step four, and locating whether each enterprise is in an upstream, midstream, or downstream position in the industrial chain; and, based on the enterprise scoring results generated in step five, screening out the top-ranked enterprises from the upstream, midstream, and downstream enterprises respectively and outputting them as the core enterprise lists for the upstream, midstream, and downstream of the industrial chain.

Further, the similarity between the candidate word in the second step and the topic of the industry chain is calculated in the following way:

converting the candidate words into word vectors by training a Word2Vec word vector model on the corpus, and then quantitatively analyzing the similarity between the candidate words and the industrial chain topic to obtain the similarity.

Further, the third step includes the following steps:

representing the keywords by using Word vectors generated by Word2Vec, and clustering the Word vectors by using a KMeans clustering algorithm; finding the optimal clustering category quantity according to an elbow method, and obtaining the clustering center of each category; thereby dividing the keywords into different categories, and selecting the clustering center of each category as a main keyword of the category; and combining domain knowledge and national economy industry classification according to the proximity degree of each category of keywords, and combining the category keywords from the angles of the upstream of the industrial chain, the midstream of the industrial chain and the downstream of the industrial chain respectively so as to identify the keywords for describing the main business of the upstream, the midstream and the downstream fields of the industrial chain.

Further, the fourth step includes the following steps: if the business scope of the enterprise hits a part of the K keywords and the hit keywords exceed a certain proportion and are contained in a business keyword list of a main business of a certain field position in the upstream and downstream, the enterprise is judged to belong to the field position.

Further, in step five, the relational network/graph features are obtained by constructing an association network by means of knowledge graph technology, based on data among enterprises in the industrial chain on shareholder relationships, fund transactions, group relationships, subsidiaries/branches, outbound investment, external guarantees, key personnel information, and parties to litigation.

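Further, the community discovery algorithm uses the Louvain algorithm, whose optimization goal is to maximize the modularity of the whole data; the modularity is calculated as follows:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where $m$ is the total number of edges in the graph, $k_i$ represents the sum of the weights of all edges connected to vertex $i$ (and $k_j$ likewise for vertex $j$), $A_{ij}$ represents the weight of the edge between vertices $i$ and $j$, $c_i$ denotes the community to which vertex $i$ is assigned, and $\delta(c_i, c_j)$ is 1 if vertices $i$ and $j$ are in the same community and 0 otherwise.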

Further, the industry chain evaluation score model is as follows:

$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}}$$

where $P$ is the probability that an enterprise is a core enterprise, $x_1, x_2, \ldots, x_k$ are the k feature values constructed from the respective dimensions, $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients of the k features.

The invention also provides an industrial chain identification system based on the NLP technology and the knowledge graph, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the industrial chain identification method based on the NLP technology and the knowledge graph when being loaded to the processor.

The invention has the beneficial effects that:

NLP technology is used to extract keywords that locate the main business of the upstream and downstream of an industrial chain, so that upstream and downstream enterprises can be screened; knowledge graph technology is then combined with a supervised machine learning algorithm to construct an industry chain evaluation score model, with which the core enterprises in the industrial chain are identified. The keyword extraction algorithm adopted is highly adaptable, the feature dimensions constructed are rich and comprehensive, and the keyword output is accurate. The industry chain evaluation score model gives each enterprise a quantitative evaluation, and, combined with the keyword-based positioning of enterprises as upstream or downstream, allows the upstream and downstream enterprises of the industrial chain to be screened out accurately. The method can also construct an association network among the enterprises in an industrial chain and, when an enterprise in a community faces operating risk, predict the direction in which the risk propagates and the degree of its influence, thereby helping to identify potential risks.

Drawings

Fig. 1 is a flowchart of an industrial chain identification method based on NLP technology and knowledge graph provided by the present invention.

FIG. 2 is a schematic diagram of a platform architecture for implementing the present invention.

Fig. 3 is a schematic diagram of a process for constructing an industry chain evaluation score model according to the present invention.

Fig. 4 is a schematic diagram of keyword cluster analysis.

Fig. 5 is a schematic diagram of upstream and downstream core enterprise identification.

Detailed Description

The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention. Additionally, the steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions and, although a logical order is illustrated in the flow charts, in some cases, the steps illustrated or described may be performed in an order different than here.

The invention provides an industrial chain identification method based on NLP technology and a knowledge graph. The big data platform architecture that supports it mainly comprises a general data interface, a computing framework, a general algorithm framework, and data applications. The big data platform supports storage of large data volumes as well as efficient data query and data interaction; the general algorithm framework adopts a distributed architecture, supports testing on large data volumes and large-scale cluster deployment, and supports data exploration and analytical mining of massive structured and unstructured data. The flow of the industrial chain identification method is shown in Fig. 1 and comprises the following steps:

Step one: for a specific industrial chain, word segmentation technology is applied to segment and clean the "business scope" information of enterprises, and candidate words of the industrial chain are extracted. Taking the "Internet of Vehicles" industrial chain as an example, a list of enterprises whose "business scope" information contains the field "Internet of Vehicles" is first screened from enterprise industrial and commercial registration information. The "business scope" information of the selected enterprises is then segmented to generate candidate words, and stop words such as "ground", "yes" and "o", which appear at high frequency but do not help identify an industrial chain, are eliminated.
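As an illustration only, the screening and candidate-word extraction of step one can be sketched in Python as follows. The sketch makes several assumptions that are not part of the disclosure: enterprise records are plain dictionaries with a `business_scope` field, splitting on list delimiters stands in for a real word segmenter, and the names `STOP_WORDS`, `screen_enterprises` and `extract_candidates` are hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would load a maintained list.
STOP_WORDS = {"and", "the", "of", "etc"}


def screen_enterprises(records, chain_field):
    """Keep enterprises whose business scope mentions the industrial chain field."""
    needle = chain_field.lower()
    return [r for r in records if needle in r["business_scope"].lower()]


def extract_candidates(business_scope, stop_words=STOP_WORDS):
    """Split a business-scope string into candidate words and drop stop words.

    Splitting on list delimiters is only a stand-in for a real word segmenter.
    """
    tokens = [t.strip().lower() for t in re.split(r"[;,]+", business_scope)]
    return [t for t in tokens if t and t not in stop_words]


records = [
    {"name": "A", "business_scope": "Internet of Vehicles terminal R&D; instrument manufacturing; and; auto parts sales"},
    {"name": "B", "business_scope": "catering services"},
]
selected = screen_enterprises(records, "Internet of Vehicles")
candidates = extract_candidates(selected[0]["business_scope"])
```

In practice the segmentation step would be replaced by a dedicated segmenter suited to the language of the registration data.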

And step two, constructing the characteristics of the candidate words from multiple dimensions and calculating scores of the candidate words according to the candidate words extracted in the step one on the basis of an NLP unsupervised learning algorithm, fusing the scores in a linear weighting mode, and outputting the candidate words with total scores of TOP-K as final keywords.

Because extracting keywords with a supervised algorithm requires a large amount of labeled data, incurs high labor costs, and requires the vocabulary to be maintained continuously, the invention mainly adopts an unsupervised keyword extraction algorithm with strong applicability.

Regarding the characteristics of candidate words, the present invention is mainly constructed from the following dimensions:

(1) TF-IDF values of candidate words:

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

where TF is the term frequency of the candidate word and IDF is its inverse document frequency.

In general, if a word appears frequently in one document and rarely in other documents, i.e., the TF-IDF value of the word is high, the word is considered to have good category discrimination capability.
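The TF-IDF feature can be sketched as below. Because the exact TF and IDF expressions are not reproduced in this text, the sketch assumes the common definitions TF = (occurrences of the word in the document) / (number of words in the document) and IDF = log(N / number of documents containing the word), without smoothing; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf} dict per tokenized document.

    Assumed definitions (not taken from the patent text):
    TF = count / document length, IDF = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [["radar", "sensor", "radar"], ["sensor", "sales"]]
scores = tf_idf(docs)
```

With these definitions a word that occurs in every document (here "sensor") receives a score of zero, which matches the intuition stated above that such a word has little discriminating power.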

(2) Subject information characteristics of candidate words

Two main indicators are considered: whether the candidate word appears in the "enterprise name", and whether the candidate word appears in the "affiliated industry classification".

If a candidate word appears in the "enterprise name" or "affiliated industry classification" field, it can essentially locate the main business of the enterprise; accordingly, this indicator may be given a higher weight when the multi-dimensional feature scores are fused. For example, in the raw material production field upstream of the "Internet of Vehicles" industrial chain, for an enterprise whose name contains "instruments and meters manufacturing", or an enterprise whose affiliated industry classification is "dedicated instruments and meters manufacturing", the term "instruments and meters" can essentially locate the main business of the enterprise.

Specifically, if the candidate word is contained in the "business name", the "name subject information" feature is 1, otherwise, the "name subject information" feature is 0; if the candidate word is contained in the "affiliated industry classification", the "industry subject information" feature is 1, otherwise, the "industry subject information" feature is 0.

(3) Position information characteristic of candidate word

Mainly consider the position index where the candidate word first appears in the text. Generally, candidate words at the beginning or the end (such as the first 10% or the last 10%) of the text are more representative, and the probability that the words describe the main business of the enterprise is higher, and a higher score should be given in terms of the position information characteristics. The specific method comprises the following steps of extracting a position index of a candidate word appearing in a text for the first time:

a) if the position index is within the first or last 10% of the text, the position information feature is 10;

b) if the position index is within the first or last 10% to 30% of the text, the position information feature is 5;

c) if the position index is within the first or last 30% to 50% of the text, the position information feature is 1.
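The three position rules can be expressed as a small function, sketched here under stated assumptions: the relative position is computed from the index of the word's first occurrence in the token list, band boundaries are treated as inclusive, and a word that does not occur at all is given 0. None of these details, nor the name `position_feature`, comes from the original text.

```python
def position_feature(tokens, word):
    """Score a word by where it first appears in the token list.

    Returns 10, 5 or 1 according to the distance from the nearer end of
    the text, or 0 if the word is absent (an assumption of this sketch).
    """
    if word not in tokens:
        return 0
    first_index = tokens.index(word)
    relative = first_index / max(len(tokens) - 1, 1)  # 0.0 = start, 1.0 = end
    distance = min(relative, 1.0 - relative)          # 0.0 at either end, 0.5 in the middle
    if distance <= 0.10:
        return 10
    if distance <= 0.30:
        return 5
    return 1


tokens = ["w%d" % i for i in range(11)]  # w0 ... w10
```

For the eleven-token example, `w0` and `w10` fall in the outer band, `w2` in the middle band, and `w5` at the center of the text.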

(4) Similarity of candidate words to industry chain topics

Taking the "internet of vehicles" industry chain as an example, to calculate the quantitative relationship between candidate words and the "internet of vehicles", the natural language of the words needs to be converted into mathematical information, and after high-dimensional space points are formed by introducing word vectors, the similarity between the words is evaluated by calculating the distance relationship between different points.
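One common way to measure the relationship between two word vectors is cosine similarity; the sketch below shows the calculation on toy vectors. The vectors are invented for illustration (in practice they would come from a trained word vector model), and the function name `cosine_similarity` is hypothetical. For general vectors the cosine lies between -1 and 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 2-dimensional "word vectors", purely illustrative.
topic_vector = [1.0, 0.0]
candidate_vector = [1.0, 1.0]
similarity = cosine_similarity(topic_vector, candidate_vector)
```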

To convert words into word vectors, the Word2Vec word vector model may be employed. Word2Vec is a word vector training tool that uses a shallow neural network model to learn automatically how words occur in a corpus. By training a Word2Vec model on the corpus, the term "Internet of Vehicles" and the candidate words generated in step one can be embedded into a high-dimensional space and expressed there as word vectors, after which the similarity between each candidate word and "Internet of Vehicles" can be analyzed quantitatively. The similarity computed by Word2Vec is a cosine value (here taken to lie between 0 and 1); the larger the value, the stronger the association between the candidate word and the "Internet of Vehicles".

Through the above methods, the feature of each dimension is calculated, and all features are normalized (the values are unified to the range 0 to 100). The feature quantization indices are then fused by linear weighting to obtain the score of each candidate word:

$$\text{score} = w_1 \cdot \text{TFIDF value} + w_2 \cdot \text{name subject information} + w_3 \cdot \text{industry subject information} + w_4 \cdot \text{position information} + w_5 \cdot \text{similarity value}$$

where $w_1$ to $w_5$ are the weights of the respective feature quantization indices. Finally, the candidate words whose total scores rank in the TOP-K (for example TOP-50; K can be adjusted as needed and is generally set empirically) are output as the final keywords.
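The normalization, linear fusion and TOP-K selection described above can be sketched as follows. The text does not say how features are normalized to the 0 to 100 range, so min-max scaling is assumed here; the feature keys, the equal weights and the example values are likewise hypothetical.

```python
FEATURES = ["tfidf", "name_subject", "industry_subject", "position", "similarity"]


def min_max_scale(values, upper=100.0):
    """Scale values linearly to [0, upper]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) * upper for v in values]


def fuse_scores(candidates, weights):
    """candidates: {word: {feature: raw value}}; weights: {feature: weight}."""
    words = list(candidates)
    scaled = {f: min_max_scale([candidates[w][f] for w in words]) for f in FEATURES}
    return {
        word: sum(weights[f] * scaled[f][i] for f in FEATURES)
        for i, word in enumerate(words)
    }


def top_k(scores, k):
    """Return the k words with the highest fused score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


candidates = {
    "radar": {"tfidf": 0.4, "name_subject": 1, "industry_subject": 1, "position": 10, "similarity": 0.9},
    "sensor": {"tfidf": 0.2, "name_subject": 0, "industry_subject": 1, "position": 5, "similarity": 0.6},
    "catering": {"tfidf": 0.0, "name_subject": 0, "industry_subject": 0, "position": 1, "similarity": 0.1},
}
weights = {f: 0.2 for f in FEATURES}  # hypothetical equal weights
fused = fuse_scores(candidates, weights)
keywords = top_k(fused, 2)
```

Raising the weights of the two subject information features relative to the others would reflect the remark above that those indicators may deserve a higher weight.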

And step three, clustering analysis is carried out on the keywords by applying a clustering algorithm and combining domain knowledge and national economy industry classification, and keyword identification in the upstream and downstream domains of the industrial chain is carried out.

Clustering algorithms aim to discover the relationships between data objects, grouping data such that the similarity within a group is as large as possible and the similarity between groups is as small as possible.

Specifically, after the K keywords are generated in step two, each keyword is represented by the word vector generated by Word2Vec, giving K high-dimensional vectors $W_1, W_2, \ldots, W_K$. The K word vectors are then clustered with the KMeans clustering algorithm; the optimal number of cluster categories can be found with the elbow method, for example M categories (M < K), and the cluster center of each category is obtained.

By this method, the keywords can be divided into different categories. The keywords within each category are highly similar and are mainly concentrated in the same industry field, and the cluster center of each category can be selected as a main keyword of that category.
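A minimal pure-Python sketch of KMeans clustering with an elbow heuristic is given below. It is illustrative rather than the patented implementation: initial centers are simply the first k points, the elbow is taken where the second difference of the within-cluster sum of squared errors (SSE) is largest, and the two-dimensional points stand in for high-dimensional word vectors. The names `kmeans` and `elbow` are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain k-means; the first k points serve as initial centers (an assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))
    return labels, centers, sse


def elbow(points, k_max):
    """Pick k where the SSE curve bends most sharply (largest second difference)."""
    sse = [kmeans(points, k)[2] for k in range(1, k_max + 1)]
    bends = [sse[i - 1] - 2 * sse[i] + sse[i + 1] for i in range(1, len(sse) - 1)]
    return bends.index(max(bends)) + 2, sse


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_k, sse_curve = elbow(points, k_max=4)
labels, centers, _ = kmeans(points, best_k)
```

A production system would more likely rely on an established clustering library with randomized, repeated initialization; the deterministic initialization here only keeps the example reproducible.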

Next, the main keyword of each category (i.e., the category center) is compared with the national economy industry classification, and domain knowledge is used to judge whether the keywords of each category relate to upstream production materials, midstream manufacturing and research and development, or downstream sales and services. The keywords of each category are then mapped to the corresponding field, thereby identifying the keywords that describe the main business of the upstream, midstream, and downstream fields of the industrial chain. In other words, the K keywords are subdivided according to the upstream, midstream, and downstream fields of the industrial chain, giving n1, n2 and n3 keywords that describe the main business of the upstream, midstream, and downstream fields respectively, as shown in Fig. 4.

And step four, further screening enterprises with K keywords (obtained in the step two) in the 'operating range' on the basis of the list of the candidate enterprises screened in the step one, and determining the upstream and downstream positions of the enterprises according to the main category labels of the keywords hit by the enterprises.

Specifically, suppose the "business scope" of an enterprise hits only L of the K keywords (L < K). If most of these L keywords (for example more than 70%; the threshold may be adjusted as needed) are contained in the list of n2 keywords describing the main business of the midstream field, the enterprise is determined to be in the midstream position of the industrial chain.
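The position rule can be sketched as follows, assuming the keyword lists for the three fields are available as sets and the business scope has already been segmented into words. The 70% threshold comes from the example above; the function name `locate_enterprise`, the returned labels, and the keyword sets are hypothetical.

```python
def locate_enterprise(scope_words, keyword_lists, min_ratio=0.7):
    """Return 'upstream', 'midstream' or 'downstream', or None if no field dominates.

    keyword_lists maps each position to its set of keywords.
    """
    all_keywords = set().union(*keyword_lists.values())
    hits = set(scope_words) & all_keywords  # the L keywords hit by this enterprise
    if not hits:
        return None
    best_position, best_ratio = max(
        ((pos, len(hits & kws) / len(hits)) for pos, kws in keyword_lists.items()),
        key=lambda item: item[1],
    )
    # "More than 70%" is read as a strict inequality.
    return best_position if best_ratio > min_ratio else None


keyword_lists = {
    "upstream": {"chip", "sensor"},
    "midstream": {"terminal", "module", "platform"},
    "downstream": {"sales", "service"},
}
position = locate_enterprise(["terminal", "module", "platform", "sales"], keyword_lists)
```

An enterprise whose hits are spread evenly across fields is left unassigned in this sketch; how such cases are handled in practice is not stated in the text.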

Based on the steps, the enterprise list in each industry chain is finally determined, and the upstream and downstream positions of each enterprise in the industry chain are located.

And fifthly, constructing characteristics from multiple dimensions for enterprises in the industrial chain, and constructing an industrial chain evaluation score model to evaluate the comprehensive strength of the enterprises.

The primary dimension data includes the following aspects:

1. Basic enterprise information: enterprise scale, affiliated industry, industry ranking, registered capital, social insurance payment information, etc.

2. Loan information: loan scale, loan trend, repayment behavior, etc.

3. Judicial litigation: litigation type, amount in dispute, whether the enterprise is subject to enforcement, listed as dishonest, or subject to restrictions, etc.

4. Enterprise operations: assets and liabilities, administrative licenses/penalties, tax ratings, bidding, software copyrights, patent applications, online recruitment, honors and qualifications, and the time series characteristics of the above, etc.

5. Enterprise fund transaction characteristics: the time, region, frequency, number of transactions, and amount of inflows and outflows, transaction counterparties, etc.

6. External public opinion data: company announcements, news, financial reports, etc.

7. Relational network/graph features.

The relational network/graph features are obtained by constructing an association network by means of knowledge graph technology, based on data among enterprises in the industrial chain such as shareholder relationships, fund transactions, group relationships, subsidiaries/branches, outbound investment, external guarantees, key personnel information (such as board directors), and parties to litigation.

By using graph computation and a community discovery algorithm, the graph features of each enterprise in the industrial chain can be extracted: in-degree, centrality, PageRank value, eccentricity, subgraph size, community size, and other features. Meanwhile, the community to which each enterprise belongs can be output, and the characteristics of the enterprises in each community (enterprise scale, registered capital, industry ranking, etc.) can be further analyzed. By means of the knowledge graph, the strength of each enterprise and its influence on associated enterprises can be comprehensively evaluated.
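Two of the listed graph features, in-degree and PageRank, can be sketched in pure Python as below. The directed edges are invented for illustration (for instance, an edge could represent a fund flow or an investment from one enterprise to another), the damping factor 0.85 is a conventional default rather than a value from the text, and the function names are hypothetical.

```python
def in_degree(edges, nodes):
    """Count incoming edges for every node."""
    degree = {n: 0 for n in nodes}
    for _, dst in edges:
        degree[dst] += 1
    return degree


def pagerank(edges, nodes, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank of nodes without out-links is spread evenly."""
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for dst in out_links[v]:
                new_rank[dst] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


nodes = ["A", "B", "C", "D"]
edges = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]  # hypothetical relations
degrees = in_degree(edges, nodes)
ranks = pagerank(edges, nodes)
```

In this toy network enterprise C receives the most incoming relations and also obtains the highest PageRank value.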

The community discovery algorithm mainly uses the Louvain algorithm, a modularity-based community detection algorithm whose optimization goal is to maximize the modularity of the whole data. The modularity is calculated as follows:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where $m$ is the total number of edges in the graph, $k_i$ represents the sum of the weights of all edges connected to vertex $i$ (and $k_j$ likewise for vertex $j$), $A_{ij}$ represents the weight of the edge between vertices $i$ and $j$, $c_i$ denotes the community to which vertex $i$ is assigned, and $\delta(c_i, c_j)$ indicates whether vertices $i$ and $j$ are assigned to the same community, returning 1 if so and 0 otherwise.
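The modularity of a given community assignment can be computed directly from the standard definition, as in the sketch below. It assumes an undirected weighted graph without self-loops, takes m as the total edge weight (which equals the number of edges when all weights are 1), and only evaluates a fixed partition; it does not perform the Louvain optimization itself. The function name `modularity` is hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of a partition.

    edges: list of (i, j, weight) for an undirected graph without self-loops.
    community: {vertex: community label}.
    """
    m = sum(w for _, _, w in edges)
    strength = {v: 0.0 for v in community}  # k_i: total weight of edges at vertex i
    for i, j, w in edges:
        strength[i] += w
        strength[j] += w
    q = 0.0
    # A_ij term: each intra-community edge appears as (i, j) and (j, i) in the double sum.
    for i, j, w in edges:
        if community[i] == community[j]:
            q += 2.0 * w
    # k_i * k_j / (2m) term, summed over all ordered pairs within each community.
    totals = {}
    for v, label in community.items():
        totals[label] = totals.get(label, 0.0) + strength[v]
    q -= sum(t * t for t in totals.values()) / (2.0 * m)
    return q / (2.0 * m)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_split = modularity(edges, split)
```

For this example the partition into the two triangles gives Q = 5/14, whereas placing all vertices in a single community gives Q = 0.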

The industry chain evaluation score model is a binary classifier. For sample enterprises (enterprises in the industrial chain that have been verified and labeled: label = 1 if the enterprise is a core enterprise of the industrial chain, otherwise label = 0), the above multi-dimensional features are constructed and calculated, and the classifier is trained with a logistic regression model:

$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}}$$

where $P$ is the probability that an enterprise is a core enterprise, $x_1, x_2, \ldots, x_k$ are the k feature values constructed from the above dimensions, $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients of the k features.
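A minimal sketch of the logistic scoring function, together with a simple batch gradient descent fit, is shown below. The training routine, learning rate, epoch count and the one-feature toy data are assumptions made for illustration; the text does not specify how the coefficients are estimated, and the function names are hypothetical.

```python
import math


def core_probability(features, intercept, coefficients):
    """P = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(samples, labels, learning_rate=0.1, epochs=2000):
    """Fit intercept and coefficients by batch gradient descent on the log-loss."""
    k = len(samples[0])
    n = len(samples)
    intercept, coefficients = 0.0, [0.0] * k
    for _ in range(epochs):
        grad_intercept, grads = 0.0, [0.0] * k
        for x, y in zip(samples, labels):
            error = core_probability(x, intercept, coefficients) - y
            grad_intercept += error
            for j in range(k):
                grads[j] += error * x[j]
        intercept -= learning_rate * grad_intercept / n
        coefficients = [c - learning_rate * g / n for c, g in zip(coefficients, grads)]
    return intercept, coefficients


# Toy sample: one feature, label 1 = core enterprise.
samples = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
b0, betas = fit_logistic(samples, labels)
score_high = core_probability([3.0], b0, betas)
score_low = core_probability([0.0], b0, betas)
```

The fitted model assigns a higher probability to the enterprise with the larger feature value, which is the behavior the evaluation score relies on.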

The whole model training and predicting process mainly comprises the following aspects (see fig. 3 for details):

1) data access: most data are stored in a big data platform and can be directly accessed in an SQL mode.

2) Feature processing: the accessed data is processed into feature index data; the processing is consolidated into stored procedures and wrapped in a shell script so that it can be invoked and computed on a regular schedule.

3) Model training: based on the processed features and sample labels, model training is performed by adopting a supervised machine learning algorithm or a graph algorithm in an algorithm frame, then the model is exported (for example, a PMML file), and the model is deployed online.

4) Batch prediction: inputting the characteristic data of the enterprise to be predicted into the model, and outputting a prediction result after model processing.

By calling the industry chain evaluation score model, the evaluation score (probability value, value is between 0 and 1) of each enterprise in the industry chain can be output, and the higher the score is, the stronger the comprehensive strength of the enterprise is, and the higher the possibility of becoming the core enterprise of the industry chain is.

Step six: based on step four, the enterprise list of each industrial chain can be determined, and each enterprise can be located as upstream, midstream, or downstream in the industrial chain. Based on the enterprise scoring results generated in step five, the enterprises whose scores rank in the TOP-N (for example TOP-10) are then screened out from the upstream, midstream, and downstream enterprises respectively and output as the core enterprise lists for the upstream, midstream, and downstream of the industrial chain, so that typical companies in the industry can be identified, as shown in Fig. 5.
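The final selection can be sketched as a grouping and ranking step. The sketch assumes that the positions from step four and the scores from step five are available as dictionaries keyed by enterprise name; the function name `core_enterprises` and the example values are hypothetical.

```python
def core_enterprises(positions, scores, top_n=10):
    """Return the top_n highest-scoring enterprises for each chain position."""
    grouped = {"upstream": [], "midstream": [], "downstream": []}
    for name, position in positions.items():
        if position in grouped:  # enterprises without a position are skipped
            grouped[position].append(name)
    return {
        position: sorted(names, key=lambda n: scores[n], reverse=True)[:top_n]
        for position, names in grouped.items()
    }


positions = {"A": "upstream", "B": "midstream", "C": "midstream", "D": "downstream", "E": None}
scores = {"A": 0.91, "B": 0.42, "C": 0.77, "D": 0.65, "E": 0.99}
core = core_enterprises(positions, scores, top_n=1)
```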

The method of the present invention is implemented on the platform architecture shown in Fig. 2, which comprises:

a data layer, comprising basic industrial and commercial information of enterprises, loan data, operating data, judicial litigation data, etc.;

a cleaning layer, which cleans and fuses the data;

an index layer, which processes feature indices from different dimensions and updates them iteratively at regular intervals;

a model layer, comprising NLP algorithms such as keyword extraction and word vector representation, knowledge graph algorithms, and common machine learning algorithms; and

an application layer, comprising modules for industrial chain identification, typical enterprise identification, enterprise risk conduction, etc.

Based on the same inventive concept, an embodiment of the present invention further provides an industrial chain identification system based on NLP technology and a knowledge graph, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when loaded into the processor, the computer program implements the above industrial chain identification method based on NLP technology and a knowledge graph. The system comprises an industrial chain keyword extraction module and an industrial chain enterprise identification module. The industrial chain keyword extraction module comprises an industrial chain candidate word extraction unit, a keyword output unit, and a cluster analysis unit, which implement the functions of step one, step two, and step three respectively. The industrial chain enterprise identification module comprises an industrial chain enterprise identification unit, an industrial chain evaluation scoring unit, and an industrial chain enterprise output unit, which implement the functions of step four, step five, and step six respectively.

It should be noted that the above merely illustrates the technical idea of the present invention and does not limit its protection scope. Those skilled in the art may make modifications and refinements without departing from the principle of the present invention, and such modifications and refinements fall within the protection scope of the claims of the present invention.

Claims

1. An industrial chain identification method based on NLP technology and knowledge graph, characterized by comprising the following steps:

the TF-IDF value of a candidate word is calculated by:

TF-IDF = term frequency (TF) × inverse document frequency (IDF)

wherein,

the main information feature of a candidate word is assigned according to whether the candidate word appears in the enterprise name or in the enterprise's industry;

the fusion formula is as follows:

w₁ to w₅ are the weights of the quantified indicators of the respective features;

for the K keywords generated in step two, using the word vector table generated by Word2Vec to represent them as K high-dimensional vectors W₁, W₂, …, W_K; then clustering the K word vectors using the KMeans clustering algorithm, finding the optimal number of cluster categories, and obtaining the cluster center of each category:

comparing the main keywords of each category with the national economy industry classification, judging the correlation between the keywords of each category and the upstream, midstream and downstream fields, and then mapping the keywords of each category to the corresponding field, so as to identify the keywords describing the main business of the upstream, midstream and downstream fields of the industrial chain respectively;

step five, for enterprises in the industrial chain, constructing features from multiple dimensions and building an industrial chain evaluation score model to evaluate the comprehensive strength of the enterprises; the multiple dimensions include at least: enterprise basic information, loan information, judicial litigation, enterprise operations, enterprise fund transaction characteristics, external public opinion data, and relationship network/graph characteristics;
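
As an illustration only of the TF-IDF calculation and the weighted fusion recited in claim 1, a minimal Python sketch follows. The definitions of TF and IDF and the fusion formula are not reproduced in this text, so the sketch assumes the common smoothed form IDF = log(N / (1 + df)) and a linear weighted sum of five feature scores. The function names `tf_idf` and `fuse_scores`, the sample corpus, and the weight values are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of `term` in one tokenized document, relative to `corpus`.

    Assumption: TF = count / document length, IDF = log(N / (1 + df)).
    """
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)  # documents containing term
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf


def fuse_scores(features, weights):
    """Assumed linear fusion: sum of w_i * feature_i over all features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical tokenized business-scope texts.
corpus = [
    ["battery", "lithium", "cathode"],
    ["battery", "vehicle", "motor"],
    ["retail", "store", "sales"],
    ["logistics", "freight", "warehouse"],
]
lithium_score = tf_idf("lithium", corpus[0], corpus)

# Five hypothetical feature scores for one candidate word, equally weighted.
final_score = fuse_scores([lithium_score, 0.8, 1.0, 0.4, 0.6], [0.2] * 5)
```

With this smoothing, a word that occurs in every document receives a slightly negative IDF; an implementation may prefer to clip such values at zero.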

2. The NLP technology and knowledge-graph based industry chain identification method according to claim 1, wherein the similarity between the candidate words and the industry chain topics in the second step is calculated by:

3. The NLP technology and knowledge-graph based industry chain identification method according to claim 1, wherein the third step comprises the following processes:

4. The NLP technology and knowledge-graph based industry chain identification method according to claim 1, wherein step four comprises the following process: if the business scope of an enterprise hits some of the K keywords, and more than a certain proportion of the hit keywords are contained in the main-business keyword list of a given field position (upstream, midstream or downstream), the enterprise is judged to belong to that field position.
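
By way of a non-limiting sketch, the judgment in claim 4 might be expressed as follows. It assumes that a keyword "hits" when it occurs as a substring of the business scope text, and that the proportion is the share of all hit keywords falling in one position's list. The function name `assign_position`, the threshold `min_ratio=0.5`, and the sample keywords are hypothetical.

```python
def assign_position(business_scope, position_keywords, min_ratio=0.5):
    """Return the chain position whose keyword list holds more than
    `min_ratio` of the keywords hit by `business_scope`, else None."""
    # Keywords of each position that occur in the business scope text.
    hits = {
        position: {kw for kw in keywords if kw in business_scope}
        for position, keywords in position_keywords.items()
    }
    total_hits = len(set().union(*hits.values()))
    if total_hits == 0:
        return None
    best_position, best_hits = max(hits.items(), key=lambda item: len(item[1]))
    ratio = len(best_hits) / total_hits
    return best_position if ratio > min_ratio else None


# Hypothetical keyword lists for the three positions.
position_keywords = {
    "upstream": ["lithium ore", "cathode material"],
    "midstream": ["battery cell", "battery pack"],
    "downstream": ["electric vehicle", "charging station"],
}
scope = "Production and sales of battery cell and battery pack; charging station operation"
position = assign_position(scope, position_keywords)
```

In this example two of the three hit keywords belong to the midstream list, so the enterprise is placed midstream; an enterprise whose business scope hits no keyword is left unassigned.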

5. The NLP technology and knowledge-graph based industry chain identification method according to claim 1, wherein in the fifth step, the relationship network/graph characteristics are obtained by: constructing an association relation network by means of knowledge graph technology, based on subject data among enterprises in the industry chain, including shareholder relations, fund transactions, group relations, subsidiaries/branch companies, external investment, external guarantees, main personnel information, and litigation relations.
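
For illustration, the association relation network of claim 5 can be sketched as an in-memory graph built from relation records. The sketch assumes the records are available as (source, target, relation type) triples and treats every relation as undirected; the function name `build_relation_network`, the enterprise names, and the relation labels are hypothetical, and a production system would more likely use a graph database.

```python
from collections import defaultdict


def build_relation_network(triples):
    """Build an undirected multi-relation graph.

    graph[a][b] is the set of relation types linking enterprises a and b.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for source, target, relation in triples:
        graph[source][target].add(relation)
        graph[target][source].add(relation)  # assumption: undirected edges
    return graph


# Hypothetical relation records among enterprises in one industry chain.
triples = [
    ("A Co", "B Co", "shareholder"),
    ("A Co", "B Co", "guarantee"),
    ("B Co", "C Co", "fund_transaction"),
]
graph = build_relation_network(triples)
neighbor_count = {name: len(neighbors) for name, neighbors in graph.items()}
```

Simple graph features such as the neighbor count computed above can then be supplied to the evaluation score model alongside the other dimensions.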

6. The NLP and knowledge-graph based industry chain identification method of claim 1, wherein the community discovery algorithm uses the Louvain algorithm, the optimization goal is to maximize the modularity of the whole data, and the modularity is calculated as follows:
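
The modularity formula of claim 6 is not reproduced in this text. As an assumption, the sketch below uses the standard modularity of an unweighted, undirected graph without self-loops, Q = Σ_c [L_c / m − (d_c / 2m)²], where m is the number of edges, L_c the number of edges inside community c, and d_c the sum of node degrees in c. It only evaluates Q for a given partition and does not implement the Louvain optimization itself; the function name `modularity` and the sample graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal_edges = defaultdict(int)  # L_c: edges inside each community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal_edges[community[u]] += 1
    degree_sum = defaultdict(int)  # d_c: total degree of each community
    for node, k in degree.items():
        degree_sum[community[node]] += k
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single edge (C, D).
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the graph into its two triangles yields a higher Q than placing all nodes in one community, which is the direction in which the Louvain algorithm moves nodes.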

7. The NLP and knowledge-graph based industry chain identification method of claim 1, wherein the industry chain evaluation score model is as follows:

8. An NLP technology and knowledge-graph based industry chain identification system, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when loaded into the processor implementing the NLP technology and knowledge-graph based industry chain identification method of any one of claims 1 to 7.