CN110597870A - Enterprise relation mining method - Google Patents

Publication number
CN110597870A
Authority
CN
China
Prior art keywords
enterprise
information
data
name
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910716435.9A
Other languages
Chinese (zh)
Inventor
马越
吕东方
梁贝贝
李涛
杨茜
姜涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHANGCHUN WHY-E SCIENCE AND TECHNOLOGY Co Ltd
Original Assignee
CHANGCHUN WHY-E SCIENCE AND TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHANGCHUN WHY-E SCIENCE AND TECHNOLOGY Co Ltd
Priority to CN201910716435.9A
Publication of CN110597870A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26: Visual data mining; Browsing structured data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201: Market modelling; Market analysis; Collecting market data

Abstract

An enterprise relationship mining method, belonging to the field of data mining, comprises the following steps. Relationship definition: the enterprise relationships comprise a legal representative relationship, a shareholder relationship, an employment relationship, a branch relationship, an external investment relationship and a competitive relationship. Data acquisition: the enterprise data comprise business license information, shareholder information, employee information, branch information and operation range (business scope) marking information. Data cleaning: data consistency is checked, and invalid values and missing values are processed. Multi-source data fusion: the information obtained from all sources is integrated and evaluated uniformly. Relationship extraction. Enterprise relationship mining is the core of constructing an enterprise relationship graph, which presents enterprise relationships to users as a structured graph so that they can understand and further explore those relationships. Mining enterprise relationships can reveal enterprise social circles, enterprise investment circles, enterprise equity structures and the actual controllers of enterprises, and can support enterprise risk assessment.

Description

Enterprise relation mining method
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to an enterprise relationship mining method.
Background
In 2012, Google proposed the concept of the knowledge graph to enhance search engine functionality. A knowledge graph is a structured symbolic representation of the objective physical world. It is also a networked knowledge base in which entities with attributes are linked by relationships, and the relationships themselves carry attributes. From the perspective of graph theory, a knowledge graph is essentially a conceptual network whose nodes represent entities in the objective physical world and whose edges represent the semantic relationships between those entities. Many kinds of relationships exist between enterprises and people. From these relationships an enterprise relationship network, that is, an enterprise knowledge graph, can be constructed. With an enterprise knowledge graph, latent associations between enterprises can be mined from large amounts of unorganized data, and enterprise profiles can be generated.
The key to constructing an enterprise knowledge graph is enterprise relationship mining. Common approaches include rule-based methods, methods based on supervised statistical learning, unsupervised open relationship extraction, and the introduction of third-party data. A rule-based method extracts entity relationships from text using relationship extraction templates. A method based on supervised statistical learning converts the relationship extraction task into a classification problem. Unsupervised open relationship extraction treats every verb phrase as a candidate relationship verb, and then uses a relationship discriminator to judge whether each verb phrase actually expresses a relationship. Introducing third-party data means using the structured enterprise relationship data of a third-party data provider.
In one existing scheme, business relationships between enterprises are extracted from the announcement data of listed companies. To mine relationships, the announcement data are segmented into words and the frequency of verbs in the announcements is counted. According to the verb frequencies, the business relationships between companies are defined as five types: holding, investment, transfer, merger and acquisition. A feature template for relationship extraction is defined, and a relationship extractor is built from the template. A relationship classifier is then trained with a maximum entropy model. This scheme has the following problems:
(1) Listed companies are only a small fraction of all companies. Most non-listed companies do not publish announcement data. Semantic rules for enterprise relationships that are defined on announcement data do not apply to data from other sources.
(2) The business relationship classification defined in this scheme is not reasonable. All five relationships can be unified into a shareholder relationship.
(3) The rule-based relationship extractor has high precision but low recall, performs poorly on new data sets, and is hard to extend.
In the technical research on enterprise relationship mining (Harbin Institute of Technology, 2010 master's thesis, Souchai), the data source for information extraction is the web pages of IT enterprises on the Alibaba website. Information such as enterprise products is extracted from each enterprise web page as the text that represents the enterprise. On the assumption that enterprises with similar text descriptions have more similar businesses, the scheme introduces text similarity and uses the similarity value to judge the degree of competition between enterprises. Because the relationship between enterprises is mainly reflected in the connections between their products, the scheme introduces an ontology and performs reasoning and queries over products through a domain ontology, so that enterprise relationships can be judged from product relationships. The scheme simply divides the relationships between enterprises into competitive and cooperative relationships. For enterprises that produce the same kind of product, business model information is also considered: two enterprises with the same kind of product are judged to have a potential cooperative relationship when one operates by production and the other by distribution or wholesale. The scheme uses the Jena toolkit for relationship reasoning. This scheme has the following problems:
(1) The data are concentrated in one industry and are of a single type.
(2) The relationship classification is too simple. Companies in the same industry can be classified as competitors or partners, but it is not appropriate to classify relationships between enterprises in different industries in this way. The scheme considers enterprise relationships within one industry only and does not indicate how to extract competitive or cooperative relationships across industries. Competitive and cooperative relationships are only a small part of enterprise relationships, and the scheme is not suitable for mining the others.
Research and application of enterprise graphs based on big data (university of southern China, 2017 master's thesis, Yuanyun cloud) divides data sources into primary and secondary data sources. The primary data sources are the websites of national institutions, such as the National Enterprise Credit Information Publicity System, the China Enforcement Information Disclosure Network, the national intellectual property office, the trademark office, the China judgment documents network, the copyright office and local administrations for industry and commerce. They are authoritative and up to date. The secondary data sources are enterprise information websites such as Tianyancha, Qichacha and Qixinbao. They are comprehensive, but their data are not updated as promptly as those of government websites. The method first obtains structured data, already processed by the provider, from the secondary data sources and quickly extracts enterprise relationships. It then extracts more recent data from the primary data sources for the established enterprise entities and relationships and updates the knowledge graph. In this method, the relationships between enterprises and people are divided into the legal representative relationship, the shareholder relationship and the employment relationship. The relationships between enterprises are divided into the branch relationship and the external investment relationship. The legal representative relationship and the shareholder relationship are obtained from enterprise business registration information. The employment relationship is obtained from recruitment websites. The branch relationship and the external investment relationship are obtained from the query results of websites such as Qichacha. Because the government websites and the enterprise information websites provide structured data, crawlers can be used directly to obtain the data and generate "entity-relationship-entity" and "entity-attribute-value" triples.
Entity alignment and attribute decisions need to be considered during multi-source data fusion. Multi-source entity alignment includes enterprise name alignment and person name alignment. Enterprise name alignment takes the hash value of the enterprise name as the enterprise ID; two records with the same ID are treated as the same enterprise. Person name alignment treats every occurrence of a person name as a new entity. As attributes and relationships accumulate, the set of candidate entities is narrowed by combining knowledge reasoning with cluster analysis, and the candidates are finally merged into one entity. The multi-source attribute decision resolves inconsistent attribute values across sources: when an inconsistency occurs, the correct result is selected on the basis of Internet verification. This scheme has the following problems:
(1) Third-party data from providers such as Qichacha and Tianyancha are introduced when relationships are extracted. Third-party structured relationship data are relationships already mined by a data service provider, and their authority cannot be guaranteed.
(2) Employment information from recruitment websites is neither timely nor accurate. It is reasonable to use this information to verify a relationship, but not as the basis for generating one.
(3) The approach used to align enterprise name entities is somewhat crude.
(4) The knowledge reasoning and cluster analysis methods used to align person names are not described in detail.
(5) The method used for the attribute decision is not clearly described.
In summary, the existing enterprise relationship mining methods have different emphases, and each has advantages and disadvantages. There is no uniform standard for classifying enterprise relationships, and different classifications have given rise to different extraction methods. Some methods build competitive and cooperative relationships between enterprises and focus on the data of one industry. Some focus on listed companies, and their mining approach does not suit non-listed companies. Some introduce third-party data in order to construct enterprise relationships quickly, so the authority of the relationship data cannot be guaranteed.
Disclosure of Invention
To solve the problems of the existing enterprise relationship mining methods, the invention provides an enterprise relationship mining method.
The technical scheme adopted by the invention to solve the technical problem is as follows:
The enterprise relationship mining method of the invention comprises the following steps:
Step one: relationship definition
The enterprise relationships comprise a legal representative relationship, a shareholder relationship, an employment relationship, a branch relationship, an external investment relationship and a competitive relationship;
Step two: data acquisition
The enterprise data comprise business license information, shareholder information, employee information, branch information and operation range (business scope) marking information;
Step three: data cleaning
Data consistency is checked, and invalid values and missing values are processed;
Step four: multi-source data fusion
The information obtained from all sources is integrated and evaluated uniformly;
Step five: relationship extraction.
Further, step one specifically comprises the following steps:
S101: Legal representative relationship
The legal representative is the person who, by law, is responsible for all affairs of a company established by its promoters or shareholders. The legal representative is closely tied to the company, and a legal representative relationship exists between the legal representative and the company;
S102: Shareholder relationship
A shareholder is a capital contributor to the company. Promoters and investors are collectively called shareholders. Shareholders are either individual shareholders or enterprise shareholders, and both have a shareholder relationship with the company;
S103: Employment relationship
The employees of a company have an employment relationship with the company. Employees include directors, senior management and ordinary employees;
S104: Branch relationship
A branch is a subordinate office of the head company that has no independent legal personality. Branches go by different names in different enterprises or industries. A branch relationship exists between a branch and its head company;
S105: External investment relationship
An enterprise that invests in another enterprise in its own name becomes a shareholder of that enterprise. An external investment relationship exists between the investing enterprise and the invested enterprise;
S106: Competitive relationship
Enterprises in the same industry have a competitive relationship. Enterprises whose business scopes overlap heavily compete strongly, and those whose business scopes overlap little compete weakly. Enterprises that are geographically far apart compete weakly, and those that are geographically close compete strongly.
Further, step two specifically comprises the following steps:
S201: Business license information
The business license information comprises the unified social credit code, enterprise name, legal representative, registration authority, residence and business scope;
Data sources: the Yellow Pages 88 (Huangye88) website, the Yihubaiying (One-Call-All) website and the National Enterprise Credit Information Publicity System website;
The data acquisition method comprises the following steps:
S20101: Establish the enterprise directory
Open the Yellow Pages 88 website and the Yihubaiying website, find the enterprise directory lists, and download the enterprise names to a database table, namely the enterprise directory table;
S20102: Query
Open the National Enterprise Credit Information Publicity System website, enter the first enterprise name in the enterprise directory table in the query box, and download the returned business license information to a database table, namely the enterprise business license information table;
S20103: Repeat the query
Repeat step S20102, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried;
S202: Shareholder information
The shareholder information comprises the shareholder name, shareholder type, certificate type and certificate number;
Data sources: the National Enterprise Credit Information Publicity System website, the Baidu Credit website, the Tianyancha website, the Qichacha website and the Qixinbao website;
The data acquisition method comprises the following steps:
S20201: Query
Open each website in the data sources, enter the first enterprise name in the enterprise directory table in the query box, and download the returned shareholder information to a database table, namely the enterprise shareholder information table;
S20202: Repeat the query
Repeat step S20201, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried;
S203: Employee information
The employee information comprises the name and position of each employee;
Data source: the National Enterprise Credit Information Publicity System website;
The data acquisition method comprises the following steps:
S20301: Query
Open the National Enterprise Credit Information Publicity System website, enter the first enterprise name in the enterprise directory table in the query box, and download the returned information on the enterprise's key personnel to a database table, namely the enterprise employee information table;
S20302: Repeat the query
Repeat step S20301, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried;
S204: Branch information
The branch information comprises the unified social credit code and the name of each branch;
Data sources: the National Enterprise Credit Information Publicity System website, Qixinbao and Tianyancha;
The data acquisition method comprises the following steps:
S20401: Query
Open each website in the data sources, enter the first enterprise name in the enterprise directory table in the query box, and download the returned branch information to a database table, namely the enterprise branch information table;
S20402: Repeat the query
Repeat step S20401, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried;
S205: Operation range marking information
The operation range marking information comprises the enterprise name, the business scope and the industry to which the enterprise belongs;
Data source: the Tianyancha website;
The data acquisition method comprises the following steps:
S20501: Query
Open the Tianyancha website, enter the first enterprise name in the enterprise directory table in the query box, and download the returned business scope and industry information to a database table, namely the enterprise operation range marking table;
S20502: Repeat the query
Repeat step S20501, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried.
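The query-and-repeat pattern shared by S201 to S205 can be sketched as one loop over the enterprise directory. This is a minimal sketch rather than the crawler of the invention: `fetch` is a hypothetical callable standing in for the site-specific query code, the field names are illustrative, and an in-memory dictionary stands in for the database tables.

```python
from typing import Callable, Dict, Iterable, List


def collect(directory: Iterable[str],
            fetch: Callable[[str], List[dict]]) -> Dict[str, List[dict]]:
    """Query every enterprise name in the directory once and keep the rows.

    `fetch` is a hypothetical stand-in for a site-specific query function
    that takes an enterprise name and returns the records found for it.
    """
    table: Dict[str, List[dict]] = {}
    for name in directory:
        if name in table:
            continue  # each enterprise is queried at most once
        try:
            table[name] = fetch(name)
        except Exception:
            # Assumption: a failed query is stored as "no data" so the loop
            # can continue with the next enterprise name.
            table[name] = []
    return table


def fake_shareholder_fetch(name: str) -> List[dict]:
    """Offline replacement for a real website query (illustrative data)."""
    data = {
        "Enterprise A": [{"shareholder_name": "Zhang San",
                          "shareholder_type": "natural person shareholder"}],
    }
    return data.get(name, [])


shareholder_table = collect(
    ["Enterprise A", "Enterprise B", "Enterprise A"],
    fake_shareholder_fetch,
)
```

Passing the query function as a parameter keeps the loop identical for all five information types; only the fetcher and the target table change.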
Further, step three specifically comprises the following steps:
S301: Consistency check
Check whether the data meet the requirements according to the reasonable value range of each variable and the relationships between variables, and find data that fall outside the normal range, are logically unreasonable, or contradict each other;
S302: Processing of invalid and missing values.
Further, step S301 specifically comprises the following steps:
S30101: Unified social credit code check
The unified social credit code consists of 18 characters, each an Arabic numeral or an uppercase English letter. Values that do not conform to the coding rule are reset to null;
S30102: Shareholder type check
The valid shareholder type values are shareholder, natural person shareholder, enterprise shareholder, other investor, domestic-funded partnership enterprise, enterprise legal person and legal person shareholder. Other values and null values are reset to "shareholder";
S30103: Certificate type check
The valid certificate type values in the shareholder information are partnership enterprise business license and enterprise legal person business license. Other values are reset to null.
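The three checks in S30101 to S30103 can be sketched as small normalizing functions. This is a sketch under stated assumptions: the English value strings are illustrative translations of the values stored in the real tables, and the pattern checks only the length and character classes named above, not any further constraint of the national coding rule such as a check character.

```python
import re
from typing import Optional

# 18 characters, each an Arabic numeral or an uppercase English letter.
USCC_PATTERN = re.compile(r"[0-9A-Z]{18}")

# Illustrative English labels; the real tables would hold the original values.
VALID_SHAREHOLDER_TYPES = {
    "shareholder",
    "natural person shareholder",
    "enterprise shareholder",
    "other investor",
    "domestic-funded partnership enterprise",
    "enterprise legal person",
    "legal person shareholder",
}
VALID_CERTIFICATE_TYPES = {
    "partnership enterprise business license",
    "enterprise legal person business license",
}


def check_uscc(code: Optional[str]) -> Optional[str]:
    """S30101: keep a well-formed code, reset anything else to null."""
    if code is not None and USCC_PATTERN.fullmatch(code):
        return code
    return None


def check_shareholder_type(value: Optional[str]) -> str:
    """S30102: unknown or null types fall back to the generic 'shareholder'."""
    return value if value in VALID_SHAREHOLDER_TYPES else "shareholder"


def check_certificate_type(value: Optional[str]) -> Optional[str]:
    """S30103: certificate types outside the valid set are reset to null."""
    return value if value in VALID_CERTIFICATE_TYPES else None
```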
Further, step S302 specifically comprises the following steps:
S30201: Shareholder information processing
If the shareholder name field of a record in the shareholder information table is missing, the record is deleted;
S30202: Employee information processing
If the employee name field of a record in the employee information table is missing, the record is deleted;
S30203: Branch information processing
If the branch name field of a record in the branch information table is missing, the record is deleted.
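The same deletion rule applies to all three tables, so it can be sketched as one function parameterized by the name field. The field names below are hypothetical, and treating a blank string as missing is an assumption of this sketch.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def drop_missing(records: List[Record], name_field: str) -> List[Record]:
    """Keep only the records whose name field is present and not blank."""
    return [r for r in records if (r.get(name_field) or "").strip()]


shareholders = [
    {"enterprise_name": "Enterprise A", "shareholder_name": "Zhang San"},
    {"enterprise_name": "Enterprise A", "shareholder_name": None},
    {"enterprise_name": "Enterprise B", "shareholder_name": "   "},
    {"enterprise_name": "Enterprise B"},
]
cleaned = drop_missing(shareholders, "shareholder_name")
```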
Further, step four specifically comprises the following steps:
S401: Enterprise directory deduplication
The enterprise directory is obtained from two data sources, so enterprise names may be duplicated. Duplicates are removed when the multi-source data are fused;
The primary key of the enterprise directory table is the enterprise name. A primary key constraint is added in the Oracle database, so that when enterprise data are inserted, a record whose enterprise name already exists cannot be inserted into the enterprise directory table;
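Deduplication by primary key can be illustrated with a few lines of SQL driven from Python. The text names an Oracle database; this sketch swaps in the standard-library `sqlite3` module so that it runs without a server, and the table and column names are hypothetical.

```python
import sqlite3
from typing import Iterable

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enterprise_directory (enterprise_name TEXT PRIMARY KEY)"
)


def insert_names(connection: sqlite3.Connection, names: Iterable[str]) -> int:
    """Insert names one by one and return how many were actually stored."""
    inserted = 0
    for name in names:
        try:
            with connection:  # commits on success, rolls back on error
                connection.execute(
                    "INSERT INTO enterprise_directory (enterprise_name) VALUES (?)",
                    (name,),
                )
            inserted += 1
        except sqlite3.IntegrityError:
            pass  # the primary key constraint rejected a duplicate name
    return inserted


from_first_site = insert_names(conn, ["Enterprise A", "Enterprise B"])
from_second_site = insert_names(conn, ["Enterprise B", "Enterprise C"])
total = conn.execute("SELECT COUNT(*) FROM enterprise_directory").fetchone()[0]
```

Letting the database enforce uniqueness means the two source lists never have to be compared in application code.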
S402: Attribute decision
The data come from different websites, and the attribute values on different websites may conflict. A confidence level is set for each piece of data, and when attribute values conflict, the value with the higher confidence is selected. Five confidence levels are defined:
Level 1: extremely low trust;
Level 2: low trust;
Level 3: moderate trust;
Level 4: relatively high trust;
Level 5: high trust;
Confidence levels are assigned according to the data source, as shown in the following table:
Data source | Confidence level
Government websites such as the National Enterprise Credit Information Publicity System | Level 5
Commercial data service provider websites such as Tianyancha and Qichacha | Level 4
Other websites such as Zhilian Zhaopin and Yellow Pages 88 | Level 3
Further, step S402 specifically comprises the following steps:
S40201: Confidence initialization
On the basis of the confidence level, an initial confidence n is set for each website, where n = confidence level × 100;
S40202: Shareholder information table attribute decision
When the shareholder information on several websites conflicts, the confidence of each data source is compared and the attribute value from the source with the higher confidence is selected;
S40203: Branch information table attribute decision
When the branch information on several websites conflicts, the confidence of each data source is compared and the attribute value from the source with the higher confidence is selected;
S403: Entity alignment
Assuming that enterprise names do not change, the enterprise directory information is collected first, and the required information is then collected from each website according to the directory. This ensures that the information obtained belongs to the same entity.
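The attribute decision of S402 can be sketched as picking the value reported by the source with the highest initial confidence. This is an illustrative sketch: the source category keys and site names are hypothetical, and the tie-breaking rule (the first source listed wins) is an assumption, since the text does not say how equal confidences are resolved.

```python
from typing import Dict, Optional

# Confidence levels from the table above, keyed by a hypothetical category name.
CONFIDENCE_LEVEL = {"government": 5, "commercial": 4, "other": 3}


def initial_confidence(category: str) -> int:
    """S40201: n = confidence level x 100."""
    return CONFIDENCE_LEVEL[category] * 100


def decide(values: Dict[str, Optional[str]],
           category_of: Dict[str, str]) -> Optional[str]:
    """Return the value reported by the most trusted source that has one.

    `values` maps a source name to the attribute value found there;
    `category_of` maps a source name to its category.
    """
    best_value, best_conf = None, -1
    for source, value in values.items():
        if not value:
            continue  # a source without a value cannot win the decision
        conf = initial_confidence(category_of[source])
        if conf > best_conf:  # strict comparison: the first source wins ties
            best_value, best_conf = value, conf
    return best_value


category_of = {"gsxt": "government", "site_x": "commercial", "site_y": "other"}
chosen = decide(
    {"site_y": "natural person shareholder",
     "site_x": "enterprise shareholder",
     "gsxt": "legal person shareholder"},
    category_of,
)
fallback = decide({"gsxt": None, "site_x": "enterprise shareholder"}, category_of)
```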
Further, step five specifically comprises the following steps:
S501: Legal representative relationship
The legal representative relationship is the relationship between an enterprise and its legal representative. Each enterprise has one legal representative. The enterprise name and legal representative are extracted from the enterprise business license information table to generate an (enterprise, legal representative, legal representative name) triple;
S502: Shareholder relationship
The promoters, investors, promoting enterprises and investing enterprises of an enterprise are all shareholders. The enterprise name and shareholder information are extracted from the enterprise shareholder information table to generate an (enterprise, shareholder, shareholder name) triple;
S503: Employment relationship
The key personnel, senior management and staff of an enterprise have an employment relationship with the enterprise. The enterprise name and employee name are extracted from the enterprise employee information table to generate an (enterprise, employment, employee name) triple;
S504: Branch relationship
A branch applies for registration with the relevant authority, and the registration is published on the National Enterprise Credit Information Publicity System website. The enterprise name and branch name are extracted from the enterprise branch information table to generate an (enterprise, branch, branch name) triple;
S505: External investment relationship
The National Enterprise Credit Information Publicity System website provides no external investment information. Because investment is the reverse of the shareholder relationship, the enterprise name and shareholder name are extracted from the enterprise shareholder information table, and each shareholder that is itself an enterprise is treated as investing in that enterprise, which generates an (enterprise, investment, enterprise) triple;
S506: Competitive relationship
Assumption one: enterprises in the same industry have a competitive relationship;
Assumption two: enterprises in the same city have a competitive relationship;
Assumption three: enterprises with similar business scopes have a competitive relationship;
A competition value m ranging from 0 to 100 is defined, with an initial value of m = 0;
The rules for changing the competition value are shown in the following table:
Rule | Change
The two enterprises belong to the same industry | m + 20
The two enterprises belong to the same city | m + 5
The two enterprises have similar business scopes | m + (10 to 80)
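The triple generation described in S501 to S505 can be sketched with one generic helper plus a reversed variant for investment. This is a minimal sketch: the field names, relation labels and sample rows are hypothetical, and the set of shareholder types that is taken to denote an enterprise is an assumption of this example.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
Row = Dict[str, str]


def make_triples(rows: Iterable[Row], head: str,
                 relation: str, tail: str) -> List[Triple]:
    """Build (head value, relation, tail value) triples, skipping incomplete rows."""
    return [(r[head], relation, r[tail])
            for r in rows if r.get(head) and r.get(tail)]


# Assumption: these types mark a shareholder that is itself an enterprise.
ENTERPRISE_SHAREHOLDER_TYPES = {
    "enterprise shareholder", "enterprise legal person", "legal person shareholder",
}


def investment_triples(shareholder_rows: Iterable[Row]) -> List[Triple]:
    """S505: an enterprise shareholder of X yields (shareholder, investment, X)."""
    return [
        (r["shareholder_name"], "investment", r["enterprise_name"])
        for r in shareholder_rows
        if r.get("shareholder_type") in ENTERPRISE_SHAREHOLDER_TYPES
    ]


license_rows = [{"enterprise_name": "Enterprise A", "legal_representative": "Li Si"}]
shareholder_rows = [
    {"enterprise_name": "Enterprise A", "shareholder_name": "Zhang San",
     "shareholder_type": "natural person shareholder"},
    {"enterprise_name": "Enterprise A", "shareholder_name": "Enterprise B",
     "shareholder_type": "enterprise shareholder"},
]

legal_rep = make_triples(license_rows, "enterprise_name",
                         "legal representative", "legal_representative")
shareholder = make_triples(shareholder_rows, "enterprise_name",
                           "shareholder", "shareholder_name")
investment = investment_triples(shareholder_rows)
```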
Further, step S506 specifically comprises the following steps:
S50601: Same-industry competitive relationship
Industries are divided into 99 classes according to the International Standard Industrial Classification (ISIC), Revision 4 (2008);
80% of the data in the enterprise operation range marking table are used to train the classification model, and the remaining 20% are used to test it;
(1) read the business scope and industry information from the enterprise operation range marking table;
(2) segment the business scope text with the Jieba word segmentation tool to generate a segmentation result set;
(3) remove punctuation and stop words from the segmentation result set;
(4) convert the Chinese words in the segmentation result set into k-dimensional vectors with the Word2vec tool;
(5) represent each industry by its number in the International Standard Industrial Classification;
(6) select 80% of the data and train a multi-class classification model with the multiclass module of the Python scikit-learn library;
(7) use the remaining 20% of the data to test the model and calculate its accuracy;
The business scope of each enterprise in the enterprise business license information table is then input into the classification model to determine the industry of the enterprise. If two enterprises belong to the same industry, the competition value is changed according to the rule above;
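Steps (5) to (7) can be sketched with scikit-learn as follows. This is a minimal sketch under several assumptions: synthetic vectors stand in for the Word2vec representations of business scopes (no Jieba or Word2vec step is run here), the three industry codes are arbitrary placeholders, and a one-vs-rest logistic regression is one possible choice from `sklearn.multiclass`, not necessarily the model used by the invention.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
k = 8                                   # vector dimensionality (assumed)
industry_codes = np.array([10, 47, 62])  # placeholder two-digit industry numbers
per_class = 40

# Synthetic stand-in for business scope vectors: one noisy cluster per industry.
centres = np.eye(len(industry_codes), k) * 10.0
X = np.vstack([
    centres[i] + rng.normal(scale=0.5, size=(per_class, k))
    for i in range(len(industry_codes))
])
y = np.repeat(industry_codes, per_class)

# Step (6): 80% of the labeled data trains the multi-class model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step (7): the remaining 20% measures accuracy.
accuracy = accuracy_score(y_test, model.predict(X_test))
```

On real data the vectors would come from the segmented business scope text, and the accuracy would depend heavily on how a business scope is turned into a single vector, which the text does not specify.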
S50602: Same-city competitive relationship
The city of an enterprise may appear in the enterprise name, the registration authority and the residence. The order of priority for extracting the city is registration authority, enterprise name, residence;
(1) City extraction from the registration authority
The city is extracted with the regular expression "(.*?)市|区", where 市 means city and 区 means district;
(2) City extraction from the enterprise name or residence
Named entity recognition is performed on the input with pyltp, the natural language processing library of Harbin Institute of Technology;
The entities that pyltp can identify are person names (Nh), organization names (Ni) and place names (Ns). The recognition module labels its results in the O-S-B-I-E scheme, whose tags have the following meanings:
Tag | Meaning
O | The word is not part of an entity
S | The word alone constitutes an entity
B | Beginning of an entity
I | Middle of an entity
E | End of an entity
S50603: Competitive relationship based on similar business scope
calculating the business scope similarity of enterprise A and enterprise B, specifically comprising the following steps:
(1) reading the business scope data of enterprise A and of enterprise B from the enterprise business license information table;
(2) segmenting the business scope information of enterprise A and enterprise B separately by using the Jieba word segmentation tool to generate word segmentation result sets SEGA and SEGB;
(3) removing stop words such as punctuation from the word segmentation result sets SEGA and SEGB;
(4) converting the Chinese words in SEGA and SEGB into k-dimensional space vectors vec(A) and vec(B) by using the Word2vec tool;
(5) calculating the cosine similarity cos(A, B) of the business scope vector vec(A) of enterprise A and the business scope vector vec(B) of enterprise B, wherein the calculation formula is as follows:
$$\cos(A, B) = \frac{\mathrm{vec}(A) \cdot \mathrm{vec}(B)}{\lVert \mathrm{vec}(A) \rVert \, \lVert \mathrm{vec}(B) \rVert}$$
wherein cos(A, B) is the cosine similarity between the business scope vector of enterprise A and the business scope vector of enterprise B; vec(A) is the business scope vector of enterprise A; vec(B) is the business scope vector of enterprise B.
(6) If cos(A, B) reaches 30%, the competition value is changed to m + 10; thereafter, for every further 1% increase in similarity, the competition value increases by 1.
The beneficial effects of the invention are as follows:
The invention provides an enterprise relationship mining method comprising the steps of relation definition, data acquisition, data cleaning, multi-source data fusion, and relationship extraction. Enterprise relationship mining is the core of building an enterprise relationship graph. Such a graph presents enterprise relationships to users in a structured form, so that users can understand them quickly and are guided to explore them further.
The invention defines enterprise relationships as legal representative relationships, shareholder relationships, employment relationships, branch relationships, external investment relationships, and competitive relationships. Mining these relationships can reveal enterprise social circles, enterprise investment circles, enterprise equity structures, and the actual controllers of enterprises, and can support enterprise risk assessment.
Drawings
FIG. 1 is a flow chart of step one of the present invention.
FIG. 2 is a flow chart of step two of the present invention.
FIG. 3 is a flow chart of step three of the present invention.
FIG. 4 is a flowchart of step four of the present invention.
FIG. 5 is a flow chart of step five of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention relates to an enterprise relationship mining method, which mainly comprises the following steps:
Step one, relation definition
As shown in FIG. 1, the present invention defines enterprise relationships as legal representative relationships, shareholder relationships, employment relationships, branch relationships, external investment relationships, and competitive relationships.
S101: Legal representative relationship
The legal representative is the person responsible for all affairs of a company lawfully established by promoters or shareholders. The legal representative is closely tied to the company. A legal representative relationship exists between the legal representative and the company.
S102: Shareholder relationship
A shareholder is a capital contributor of the company. Promoters and investors are collectively referred to as shareholders. Shareholders can be divided into individual shareholders and enterprise shareholders. A shareholder relationship exists between individual or enterprise shareholders and the company.
S103: Employment relationship
An employment relationship exists between the employees of a company and the company. The employees of a company include directors, senior executives, and ordinary employees.
S104: Branch relationship
A branch is an agency dispatched by and subordinate to a head office, and it does not have independent legal personality. After an enterprise grows to a certain scale, it often sets up branches in different cities, or in different districts of the same city, in order to extend its business and widen its product sales area. A branch name is typically the head office name plus a suffix. Branches are named differently across enterprises and industries: some enterprises call them branch companies, some call them branch factories, commercial chains call them branch stores, banks call them branch banks, and so on. A branch relationship exists between the branch and the head office.
S105: External investment relationship
An enterprise can invest in other enterprises in its own name and thereby become a shareholder of those enterprises. An external investment relationship exists between the investing enterprise and the invested enterprise.
S106: Competitive relationship
Enterprises in the same industry have a competitive relationship. Competition is strong between enterprises whose business scopes overlap heavily and weak between enterprises whose business scopes overlap little. Competition is weak between enterprises that are geographically far apart and strong between enterprises that are geographically close.
Step two, data acquisition
As shown in FIG. 2, the data required for enterprise relationship mining includes business license information, shareholder information, employee information, branch information, business scope labeling information, and the like.
S201: Business license information
The business license information includes the unified social credit code, enterprise name, legal representative, registration authority, residence, business scope, and the like.
The data sources of the business license information are the Yellow Pages 88 website, the One-Call-All website, the National Enterprise Credit Information Publicity System website, and the like.
The method for acquiring the business license information specifically comprises the following steps:
S20101: Establish the enterprise directory
Open the Yellow Pages 88 website and the One-Call-All website respectively, find the enterprise directory list, and download the enterprise name data to the database table "enterprise directory table".
S20102: Query by condition
Open the National Enterprise Credit Information Publicity System website. Enter the first enterprise name in the "enterprise directory table" in the query box, and download the returned business license information to the database table "enterprise business license information table".
S20103: Repeat the query
Repeat step S20102, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried.
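The query-and-repeat pattern of steps S20102 and S20103 can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the function `collect_license_info`, the field names, and the stand-in lookup `fake_fetch` are all hypothetical, and no real website is contacted.

```python
from typing import Callable, Dict, Iterable, List, Optional


def collect_license_info(
    enterprise_names: Iterable[str],
    fetch_license: Callable[[str], Optional[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Query every name in the enterprise directory once and keep the hits."""
    license_table: List[Dict[str, str]] = []
    for name in enterprise_names:      # S20103: move on to the next enterprise name
        record = fetch_license(name)   # S20102: query by enterprise name
        if record is not None:
            license_table.append({"enterprise_name": name, **record})
    return license_table


# Hypothetical stand-in for a real lookup: it only reads a fixed dictionary.
_FAKE_SOURCE = {
    "Enterprise A": {
        "legal_representative": "Zhang San",
        "registration_authority": "XX City Market Supervision Administration",
    },
}


def fake_fetch(name: str) -> Optional[Dict[str, str]]:
    return _FAKE_SOURCE.get(name)


rows = collect_license_info(["Enterprise A", "Enterprise B"], fake_fetch)
print(rows)
```

Passing the lookup in as a callable keeps the loop independent of any particular data source, so the same loop could serve the shareholder, employee, and branch tables.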
S202: Shareholder information
The shareholder information includes the shareholder name, shareholder type, certificate type, certificate number, and the like.
The data sources of the shareholder information are the National Enterprise Credit Information Publicity System website, the Baidu Credit website, the Tianyancha website, the Qichacha website, the Qixinbao website, and the like.
The method for acquiring the shareholder information specifically comprises the following steps:
S20201: Query by condition
Open each website in the data sources, enter the first enterprise name in the "enterprise directory table" in the query box, and download the returned shareholder information to the database table "enterprise shareholder information table".
S20202: Repeat the query
Repeat step S20201, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried.
S203: Employee information
The employee information includes the employee name, position, and the like.
The data source of the employee information is the National Enterprise Credit Information Publicity System website.
The method for acquiring the employee information specifically comprises the following steps:
S20301: Query by condition
Open the National Enterprise Credit Information Publicity System website, enter the first enterprise name in the "enterprise directory table" in the query box, and download the returned key personnel information of the enterprise to the database table "enterprise employee information table".
S20302: Repeat the query
Repeat step S20301, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried.
S204: Branch information
The branch information includes the unified social credit code of the branch, the branch name, and the like.
The data sources of the branch information are the National Enterprise Credit Information Publicity System website, Qixinbao, Tianyancha, and the like.
The method for acquiring the branch information specifically comprises the following steps:
S20401: Query by condition
Open each website in the data sources, enter the first enterprise name in the "enterprise directory table" in the query box, and download the returned branch information to the database table "enterprise branch information table".
S20402: Repeat the query
Repeat step S20401, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried.
S205: Business scope labeling information
The business scope labeling information includes the enterprise name, business scope, affiliated industry, and the like.
The data source of the business scope labeling information is the Tianyancha website.
The business license information on the National Enterprise Credit Information Publicity System website contains the business scope but not the affiliated industry, whereas the subsequent steps need each business scope together with its affiliated industry.
The method for acquiring the business scope labeling information specifically comprises the following steps:
S20501: Query by condition
Open the Tianyancha website, enter the first enterprise name in the "enterprise directory table" in the query box, and download the returned business scope and affiliated industry information to the database table "enterprise business scope labeling table".
S20502: Repeat the query
Repeat step S20501, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried.
Step three, data cleaning
As shown in FIG. 3, data cleansing refers to finding and correcting recognizable errors in a data file, including checking data consistency, processing invalid and missing values, and the like.
S301: consistency check
Check whether the data meets the requirements according to the reasonable value range of each variable and the relationships among variables, and find data that is outside the normal range, logically unreasonable, or mutually contradictory.
The consistency check comprises the following specific steps:
S30101: Unified social credit code check
The unified social credit code consists of 18 characters, each an Arabic numeral or an uppercase English letter. Any value that does not comply with the coding rules is reset to null.
S30102: Shareholder type check
The valid shareholder type values are: shareholder, natural person shareholder, enterprise shareholder, other investor, domestically funded partnership enterprise, enterprise legal person, and legal person shareholder. Any other value, or a null value, is reset to "shareholder".
S30103: Certificate type check
The valid certificate type values in the shareholder information are: partnership enterprise business license and enterprise legal person business license. Any other value is reset to null.
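The three consistency checks S30101 to S30103 can be sketched as small normalization functions. This is only an illustration under stated assumptions: the English labels stand in for the actual field values, the function and constant names are hypothetical, and the credit code check implements just the rule given here (18 digits or uppercase letters), not any check-digit validation.

```python
import re
from typing import Optional

# Rule as stated in S30101: 18 characters, each a digit or an uppercase letter.
USCC_PATTERN = re.compile(r"[0-9A-Z]{18}")

# English stand-ins for the allowed values listed in S30102 and S30103.
SHAREHOLDER_TYPES = {
    "shareholder",
    "natural person shareholder",
    "enterprise shareholder",
    "other investor",
    "domestically funded partnership enterprise",
    "enterprise legal person",
    "legal person shareholder",
}
DEFAULT_SHAREHOLDER_TYPE = "shareholder"
CERTIFICATE_TYPES = {
    "partnership enterprise business license",
    "enterprise legal person business license",
}


def check_credit_code(code: Optional[str]) -> Optional[str]:
    """Return the code if it matches the rule, otherwise None (null)."""
    if code is not None and USCC_PATTERN.fullmatch(code):
        return code
    return None


def check_shareholder_type(value: Optional[str]) -> str:
    """Unknown or null shareholder types are reset to the default value."""
    return value if value in SHAREHOLDER_TYPES else DEFAULT_SHAREHOLDER_TYPE


def check_certificate_type(value: Optional[str]) -> Optional[str]:
    """Unknown certificate types are reset to None (null)."""
    return value if value in CERTIFICATE_TYPES else None


print(check_credit_code("A" * 18), check_shareholder_type("unknown"))
```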
S302: invalid and missing value handling
Owing to survey, coding, and data entry errors, the data may contain invalid and missing values that need appropriate handling.
The specific steps of invalid and missing value handling are as follows:
S30201: Shareholder information processing
If the shareholder name field of a record in the shareholder information table is missing, the record is deleted.
S30202: Employee information processing
If the employee name field of a record in the employee information table is missing, the record is deleted.
S30203: Branch information processing
If the branch name field of a record in the branch information table is missing, the record is deleted.
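The deletion rule shared by S30201 to S30203 amounts to filtering out records whose name field is empty. A minimal sketch, assuming records are dictionaries with string values and hypothetical field names; treating whitespace-only names as missing is also an assumption.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def drop_missing(records: List[Record], required_field: str) -> List[Record]:
    """Keep only records whose required field is present and non-blank."""
    return [r for r in records if (r.get(required_field) or "").strip()]


shareholders = [
    {"enterprise_name": "Enterprise A", "shareholder_name": "Li Si"},
    {"enterprise_name": "Enterprise A", "shareholder_name": None},
    {"enterprise_name": "Enterprise B", "shareholder_name": "   "},
    {"enterprise_name": "Enterprise B"},
]
cleaned = drop_missing(shareholders, "shareholder_name")
print(cleaned)
```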
Step four, multi-source data fusion
As shown in FIG. 4, multi-source data fusion means bringing together, by suitable means, all the information obtained through investigation and analysis, and evaluating it in a unified way. The data in the invention comes from several different sources, so fusion processing is needed when the data is integrated into a single data table.
The specific steps of multi-source data fusion are as follows:
S401: Enterprise directory deduplication
The enterprise directory is obtained from two data sources, so some enterprise names appear in both. Deduplication is therefore required when the multi-source data is fused.
The primary key of the enterprise directory table is the enterprise name. A primary key constraint is added in the Oracle database, so that when enterprise data is inserted, a record whose enterprise name already exists cannot be inserted into the enterprise directory table.
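The effect of the primary key constraint can be demonstrated with a small in-memory database. The sketch below uses SQLite from the Python standard library as a stand-in for the Oracle database named in the text; the table and column names are hypothetical, and `INSERT OR IGNORE` is SQLite syntax, so an Oracle deployment would need its own way of skipping duplicate keys.

```python
import sqlite3
from typing import Iterable

conn = sqlite3.connect(":memory:")
# The enterprise name is the primary key, so duplicates cannot be stored.
conn.execute("CREATE TABLE enterprise_directory (enterprise_name TEXT PRIMARY KEY)")


def insert_names(connection: sqlite3.Connection, names: Iterable[str]) -> None:
    """Insert names, silently skipping any name that is already present."""
    connection.executemany(
        "INSERT OR IGNORE INTO enterprise_directory (enterprise_name) VALUES (?)",
        [(name,) for name in names],
    )
    connection.commit()


insert_names(conn, ["Enterprise A", "Enterprise B"])  # names from the first source
insert_names(conn, ["Enterprise B", "Enterprise C"])  # second source overlaps on B
count = conn.execute("SELECT COUNT(*) FROM enterprise_directory").fetchone()[0]
print(count)
```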
S402: Attribute decision
The data originates from different websites, and the attribute values on different websites may conflict. A confidence is therefore set for each piece of data, and when attribute values conflict, the value with the higher confidence is selected. The invention uses five confidence levels.
Level 1: extremely low trust.
Level 2: low trust.
Level 3: moderate trust.
Level 4: fairly high trust.
Level 5: high trust.
The data confidence level is assigned according to the data source. The assigned confidence levels are shown in the table below.
Data source                                                                              Confidence level
Government websites such as the National Enterprise Credit Information Publicity System   Level 5
Commercial data service provider websites such as Tianyancha and Qichacha                 Level 4
Other websites such as Zhilian Zhaopin and Yellow Pages 88                                Level 3
S40201: Confidence initialization
An initial confidence n is set for each website on the basis of its confidence level: n = confidence level × 100.
S40202: Shareholder information table attribute decision
When the shareholder information on several websites conflicts, the confidence of the shareholder information from each data source is compared, and the attribute value from the data source with the highest confidence is selected.
S40203: Branch information table attribute decision
When the branch information on several websites conflicts, the confidence of the branch information from each data source is compared, and the attribute value from the data source with the highest confidence is selected.
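The attribute decision of S402 can be sketched as picking the value reported by the source with the largest initial confidence. The source keys below (`gov_publicity_system`, `tianyancha`, and so on) and the function names are hypothetical labels chosen for illustration; how ties between equally trusted sources are resolved is not specified in the text, and this sketch simply keeps the first such source.

```python
from typing import Dict

# Confidence level per data source, following the level table (assumed keys).
SOURCE_LEVEL: Dict[str, int] = {
    "gov_publicity_system": 5,
    "tianyancha": 4,
    "qichacha": 4,
    "yellow_pages_88": 3,
}


def initial_confidence(source: str) -> int:
    """S40201: n = confidence level x 100."""
    return SOURCE_LEVEL[source] * 100


def decide_attribute(candidates: Dict[str, str]) -> str:
    """Return the value reported by the source with the highest confidence.

    `candidates` maps a source key to the attribute value found on that source.
    """
    best_source = max(candidates, key=initial_confidence)
    return candidates[best_source]


chosen = decide_attribute({"tianyancha": "Li Si", "gov_publicity_system": "Li Si (Ltd.)"})
print(chosen)
```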
S403: Entity alignment
It is assumed that enterprise names do not change. The enterprise directory is collected first, and the required information is then collected from each website by enterprise name, which ensures that all information obtained for a given name belongs to the same entity.
Step five, relationship extraction
As shown in FIG. 5, this step specifically comprises the following steps:
S501: Legal representative relationship
A legal representative relationship is the relationship between an enterprise and its legal representative. Each enterprise has one legal representative. The enterprise name and the legal representative are extracted from the enterprise business license information table to generate an (enterprise, legal representative, legal representative name) triple.
S502: Shareholder relationship
The promoters, investors, promoting enterprises, and investing enterprises of an enterprise are all its shareholders. The enterprise name and shareholder information are extracted from the enterprise shareholder information table to generate an (enterprise, shareholder, shareholder name) triple.
S503: Employment relationship
The key personnel, senior executives, and employees of an enterprise all have an employment relationship with the enterprise. The enterprise name and employee name are extracted from the enterprise employee information table to generate an (enterprise, position, employee name) triple.
S504: Branch relationship
A branch must apply to the relevant authority for registration, and its information is published on the National Enterprise Credit Information Publicity System website. The enterprise name and branch name are extracted from the enterprise branch information table to generate an (enterprise, branch, branch name) triple.
S505: External investment relationship
The National Enterprise Credit Information Publicity System website does not provide external investment information. However, relationships are relative: if a shareholder of enterprise A is enterprise B, then enterprise B has invested in enterprise A. The enterprise name and shareholder name are extracted from the enterprise shareholder information table, and for each shareholder that is itself an enterprise, an (enterprise, investment, enterprise) triple is generated from the shareholder to the invested enterprise.
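Triple generation for S501 and S505 can be sketched as follows. The field names, relation labels, and the set of shareholder types treated as enterprises are assumptions made for this example; the text only states that an investment triple is derived when a shareholder is itself an enterprise.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Assumed: shareholder types that indicate the shareholder is an enterprise.
ENTERPRISE_SHAREHOLDER_TYPES = {
    "enterprise shareholder",
    "enterprise legal person",
    "legal person shareholder",
}


def legal_representative_triples(license_rows: List[Dict[str, str]]) -> List[Triple]:
    """S501: one triple per enterprise from the business license table."""
    return [
        (row["enterprise_name"], "legal representative", row["legal_representative"])
        for row in license_rows
    ]


def investment_triples(shareholder_rows: List[Dict[str, str]]) -> List[Triple]:
    """S505: an enterprise that holds shares in A is an investor in A."""
    return [
        (row["shareholder_name"], "investment", row["enterprise_name"])
        for row in shareholder_rows
        if row["shareholder_type"] in ENTERPRISE_SHAREHOLDER_TYPES
    ]


license_rows = [{"enterprise_name": "Enterprise A", "legal_representative": "Zhang San"}]
shareholder_rows = [
    {"enterprise_name": "Enterprise A", "shareholder_name": "Enterprise B",
     "shareholder_type": "enterprise shareholder"},
    {"enterprise_name": "Enterprise A", "shareholder_name": "Li Si",
     "shareholder_type": "natural person shareholder"},
]
print(legal_representative_triples(license_rows))
print(investment_triples(shareholder_rows))
```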
S506: Competitive relationship
Assumption 1: enterprises belonging to the same industry have a competitive relationship.
Assumption 2: enterprises located in the same city have a competitive relationship.
Assumption 3: enterprises with similar business scopes have a competitive relationship.
The competition value m ranges from 0 to 100, with an initial value of 0.
The rules for changing the competition value are shown in the following table.
Rule                                                   Change
The two enterprises belong to the same industry        m + 20
The two enterprises belong to the same city            m + 5
The two enterprises have similar business scopes       m + (10 to 80)
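The rule table can be read as an additive score. The sketch below is one possible reading: the function name and parameters are hypothetical, and capping the result at 100 is an assumption made because the three increments can sum to 105 while m is stated to range from 0 to 100.

```python
def competition_value(same_industry: bool, same_city: bool, scope_bonus: int = 0) -> int:
    """Accumulate the competition value m, starting from 0.

    scope_bonus is 0 when the business scopes are not similar, otherwise a
    value between 10 and 80 derived from the business scope similarity.
    """
    if scope_bonus != 0 and not 10 <= scope_bonus <= 80:
        raise ValueError("scope_bonus must be 0 or between 10 and 80")
    m = 0
    if same_industry:
        m += 20
    if same_city:
        m += 5
    m += scope_bonus
    return min(m, 100)  # assumed cap so that m stays within 0 to 100


print(competition_value(True, True, 40))
```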
The specific steps for the competitive relationship are as follows:
S50601: Competitive relationship within the same industry
Industries are divided into 99 categories according to the fourth revision (2008) of the International Standard Industrial Classification (ISIC).
80% of the data in the enterprise business scope labeling table is used to train the classification model, and the remaining 20% is used to test it.
(1) Read the business scope information and the affiliated industry information from the enterprise business scope labeling table.
(2) Segment the business scope information using the Jieba word segmentation tool to generate a word segmentation result set.
(3) Remove stop words such as punctuation from the word segmentation result set.
(4) Convert the Chinese words in the word segmentation result set into k-dimensional space vectors using the Word2vec tool.
(5) Express each affiliated industry by its number in the International Standard Industrial Classification.
(6) Select 80% of the data and train a multi-class classification model using the multiclass module of the Python Scikit-learn library.
(7) Use the remaining 20% of the data for model testing and calculate the model accuracy.
The business scope information in the enterprise business license information table is then input into the classification model to determine the industry to which each enterprise belongs. If two enterprises belong to the same industry, the competition value is changed according to the rule.
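Steps (6) and (7) can be sketched with Scikit-learn as follows. Several things here are assumptions rather than details from the text: the toy vectors stand in for Word2vec features of real business scopes, the industry codes are placeholders rather than real ISIC numbers, and the choice of `OneVsRestClassifier` wrapping `LinearSVC` is only one plausible use of the `sklearn.multiclass` module.

```python
import random

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

K = 8                       # dimensionality of the toy business scope vectors
INDUSTRY_CODES = [1, 2, 3]  # placeholder industry numbers, not real ISIC codes


def make_toy_data(samples_per_class=50, seed=0):
    """Generate one cluster of k-dimensional vectors per industry code."""
    rng = random.Random(seed)
    features, labels = [], []
    for index, code in enumerate(INDUSTRY_CODES):
        for _ in range(samples_per_class):
            # Each class is centred on a different coordinate axis.
            features.append(
                [rng.gauss(5.0 if d == index else 0.0, 1.0) for d in range(K)]
            )
            labels.append(code)
    return features, labels


features, labels = make_toy_data()

# (6) 80% of the data for training, (7) the remaining 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

model = OneVsRestClassifier(LinearSVC())
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy on the held-out 20%: {accuracy:.2f}")
```

With real data, each feature vector would be derived from the segmented business scope text, and each label would be the industry number recorded in the labeling table.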
S50602: Competitive relationship within the same city
The city in which an enterprise is located may be contained in the enterprise name, the registration authority, and the residence information. City information is extracted in the priority order registration authority, enterprise name, residence information.
(1) Registration authority: city information extraction
Registration authority names generally take forms such as "XX City Administration for Industry and Commerce, XX Branch", "XX City Market Supervision Administration", "XX New District Market Supervision Bureau", and "XX District Market Supervision Bureau". The city information is extracted with the regular expression "(.*?)市|区" (市: city; 区: district).
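Extracting the city or district prefix from a registration authority name can be sketched with Python's `re` module. The expression used here, `^(.+?[市区])`, is an illustrative variant chosen for this example (anchored at the start, stopping at the first 市 or 区), not necessarily the exact expression of the method; the function name and sample strings are likewise hypothetical.

```python
import re
from typing import Optional

# Shortest prefix that ends in 市 (city) or 区 (district).
CITY_OR_DISTRICT = re.compile(r"^(.+?[市区])")


def extract_city(registration_authority: str) -> Optional[str]:
    """Return the leading city or district name, or None if there is none."""
    match = CITY_OR_DISTRICT.search(registration_authority)
    return match.group(1) if match else None


for authority in ("长春市工商行政管理局朝阳分局", "浦东新区市场监督管理局"):
    print(authority, "->", extract_city(authority))
```

Because the match is non-greedy, a name such as 朝阳区市场监督管理局 stops at 区 rather than running on to the 市 in 市场 (market).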
(2) Enterprise name / residence information: city information extraction
Named entity recognition is performed on the input information using pyltp, the natural language processing library of Harbin Institute of Technology.
The entities that pyltp can recognize include person names (Nh), organization names (Ni), and place names (Ns). The recognition module labels its results in the O-S-B-I-E scheme, whose meanings are shown in the table below.
Tag    Meaning
O      The word is not part of an entity
S      The word alone constitutes an entity
B      Beginning of an entity
I      Middle of an entity
E      End of an entity
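Once each word carries an O-S-B-I-E tag, place names can be recovered by joining the words of each B...I...E run or taking an S word on its own. The sketch below does not call pyltp; it assumes the tags arrive as strings such as "S-Ns" or "B-Ni" (position, hyphen, entity type), and the function name and sample words are hypothetical.

```python
from typing import List, Sequence


def extract_places(words: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Collect place-name entities (type Ns) from parallel word and tag lists."""
    places: List[str] = []
    buffer: List[str] = []
    for word, tag in zip(words, tags):
        if tag == "S-Ns":                 # the word alone is a place name
            places.append(word)
            buffer = []
        elif tag == "B-Ns":               # a multi-word place name begins
            buffer = [word]
        elif tag == "I-Ns" and buffer:    # middle of the place name
            buffer.append(word)
        elif tag == "E-Ns" and buffer:    # end of the place name
            buffer.append(word)
            places.append("".join(buffer))
            buffer = []
        else:                             # O, or a tag of another entity type
            buffer = []
    return places


words = ["吉林省", "长春市", "朝阳区", "某", "科技", "公司"]
tags = ["B-Ns", "I-Ns", "E-Ns", "B-Ni", "I-Ni", "E-Ni"]
print(extract_places(words, tags))
```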
S50603: Competitive relationship based on similar business scope
The business scope similarity of enterprise A and enterprise B is calculated in the following steps:
(1) Read the business scope data of enterprise A and of enterprise B from the enterprise business license information table.
(2) Segment the business scope information of enterprise A and enterprise B separately using the Jieba word segmentation tool to generate word segmentation result sets SEGA and SEGB.
(3) Remove stop words such as punctuation from the word segmentation result sets SEGA and SEGB.
(4) Convert the Chinese words in SEGA and SEGB into k-dimensional space vectors vec(A) and vec(B) using the Word2vec tool.
(5) Calculate the cosine similarity cos(A, B) of the business scope vector vec(A) of enterprise A and the business scope vector vec(B) of enterprise B, using the following formula:
$$\cos(A, B) = \frac{\mathrm{vec}(A) \cdot \mathrm{vec}(B)}{\lVert \mathrm{vec}(A) \rVert \, \lVert \mathrm{vec}(B) \rVert}$$
where cos(A, B) is the cosine similarity between the business scope vector of enterprise A and the business scope vector of enterprise B; vec(A) is the business scope vector of enterprise A; vec(B) is the business scope vector of enterprise B.
(6) If cos(A, B) reaches 30%, the competition value is changed to m + 10; thereafter, for every further 1% increase in similarity, the competition value increases by 1.
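Steps (3) to (6) can be sketched in plain Python. Several details below are assumptions: the tiny embedding dictionary stands in for a trained Word2vec model, the word vectors are averaged into one vector per enterprise (the text does not say how word vectors are combined), the stop word list is illustrative, and the increment is computed on whole percentage points and capped at 80.

```python
import math
from typing import Dict, List, Sequence

STOP_WORDS = {"、", "，", "；", "。", "及", "的"}  # illustrative stop word list


def scope_vector(tokens: Sequence[str], embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known, non-stop-word tokens (an assumption)."""
    kept = [embeddings[t] for t in tokens if t not in STOP_WORDS and t in embeddings]
    if not kept:
        raise ValueError("no token has a known vector")
    k = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(k)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_bonus(cos_sim: float) -> int:
    """Step (6): +10 at 30% similarity, +1 per further 1%, at most +80."""
    percent = math.floor(round(cos_sim * 100, 9))  # whole percentage points
    if percent < 30:
        return 0
    return min(10 + (percent - 30), 80)


# Toy 2-dimensional embeddings standing in for Word2vec output.
toy_embeddings = {"软件": [1.0, 0.0], "开发": [0.0, 1.0], "销售": [1.0, 1.0]}
vec_a = scope_vector(["软件", "、", "开发"], toy_embeddings)  # already segmented
vec_b = scope_vector(["软件", "销售"], toy_embeddings)
similarity = cosine(vec_a, vec_b)
print(round(similarity, 4), similarity_bonus(similarity))
```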
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. An enterprise relationship mining method, characterized by comprising the following steps:
step one, relation definition
The enterprise relationships comprise a legal representative relationship, a shareholder relationship, an employment relationship, a branch relationship, an external investment relationship, and a competitive relationship;
step two, data acquisition
The enterprise data comprises business license information, shareholder information, employee information, branch information, and business scope labeling information;
step three, data cleaning
Checking data consistency, and processing invalid values and missing values;
step four, multi-source data fusion
Integrating all the information obtained by investigation and analysis, and evaluating all the information in a unified way;
step five, relationship extraction.
2. The method for mining enterprise relationships according to claim 1, wherein step one specifically comprises the following steps:
S101: Legal representative relationship
The legal representative is the person responsible for all affairs of a company lawfully established by promoters or shareholders, the legal representative is closely tied to the company, and a legal representative relationship exists between the legal representative and the company;
S102: Shareholder relationship
A shareholder is a capital contributor of the company, promoters and investors are collectively referred to as shareholders, shareholders can be divided into individual shareholders and enterprise shareholders, and a shareholder relationship exists between individual or enterprise shareholders and the company;
S103: Employment relationship
An employment relationship exists between the employees of a company and the company, and the employees of a company comprise directors, senior executives, and ordinary employees;
S104: Branch relationship
A branch is an agency dispatched by and subordinate to a head office and does not have independent legal personality, branches are named differently in different enterprises or industries, and a branch relationship exists between the branch and the head office;
S105: External investment relationship
An enterprise invests in other enterprises in its own name and becomes a shareholder of those enterprises, and an external investment relationship exists between the investing enterprise and the invested enterprise;
S106: Competitive relationship
Enterprises in the same industry have a competitive relationship; competition is strong between enterprises whose business scopes overlap heavily and weak between enterprises whose business scopes overlap little; competition is weak between enterprises that are geographically far apart and strong between enterprises that are geographically close.
3. The method for mining enterprise relationships according to claim 2, wherein step two specifically comprises the following steps:
S201: Business license information
The business license information comprises a unified social credit code, an enterprise name, a legal representative, a registration authority, a residence, and a business scope;
the data sources are: the Yellow Pages 88 website, the One-Call-All website, and the National Enterprise Credit Information Publicity System website;
the data acquisition method comprises the following steps:
S20101: Establish the enterprise directory
Opening the Yellow Pages 88 website and the One-Call-All website respectively, finding the enterprise directory list, and downloading the enterprise name data to a database table, namely the enterprise directory table;
S20102: Query by condition
Opening the National Enterprise Credit Information Publicity System website, entering the first enterprise name in the enterprise directory table in the query box, and downloading the returned business license information to a database table, namely the enterprise business license information table;
S20103: Repeat the query
Repeating step S20102, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried;
S202: Shareholder information
The shareholder information comprises a shareholder name, a shareholder type, a certificate type, and a certificate number;
the data sources are: the National Enterprise Credit Information Publicity System website, the Baidu Credit website, the Tianyancha website, the Qichacha website, and the Qixinbao website;
the data acquisition method comprises the following steps:
S20201: Query by condition
Opening each website in the data sources, entering the first enterprise name in the enterprise directory table in the query box, and downloading the returned shareholder information to a database table, namely the enterprise shareholder information table;
S20202: Repeat the query
Repeating step S20201, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried;
S203: Employee information
The employee information comprises the employee name and position;
the data source is: the National Enterprise Credit Information Publicity System website;
the data acquisition method comprises the following steps:
S20301: Query by condition
Opening the National Enterprise Credit Information Publicity System website, entering the first enterprise name in the enterprise directory table in the query box, and downloading the returned key personnel information of the enterprise to a database table, namely the enterprise employee information table;
S20302: Repeat the query
Repeating step S20301, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried;
S204: Branch information
The branch information comprises the unified social credit code of the branch and the branch name;
the data sources are: the National Enterprise Credit Information Publicity System website, Qixinbao, and Tianyancha;
the data acquisition method comprises the following steps:
S20401: Query by condition
Opening each website in the data sources, entering the first enterprise name in the enterprise directory table in the query box, and downloading the returned branch information to a database table, namely the enterprise branch information table;
S20402: Repeat the query
Repeating step S20401, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried;
S205: Business scope labeling information
The business scope labeling information comprises the enterprise name, the business scope, and the affiliated industry;
the data source is: the Tianyancha website;
the data acquisition method comprises the following steps:
S20501: Query by condition
Opening the Tianyancha website, entering the first enterprise name in the enterprise directory table in the query box, and downloading the returned business scope and affiliated industry information to a database table, namely the enterprise business scope labeling table;
S20502: Repeat the query
Repeating step S20501, entering the next enterprise name in the enterprise directory table each time, until all enterprises have been queried.
4. The method for mining enterprise relationships according to claim 3, wherein step three specifically comprises the following steps:
S301: Consistency check
Checking whether the data meets the requirements according to the reasonable value range of each variable and the relationships among variables, and finding data that is outside the normal range, logically unreasonable, or mutually contradictory;
S302: Invalid and missing value handling.
5. The method of claim 4, wherein step S301 specifically comprises the following steps:
S30101: Unified social credit code check
The unified social credit code consists of 18 characters, each an Arabic numeral or an uppercase English letter, and any value that does not comply with the coding rules is reset to null;
S30102: Shareholder type check
The valid shareholder type values comprise shareholder, natural person shareholder, enterprise shareholder, other investor, domestically funded partnership enterprise, enterprise legal person, and legal person shareholder, and any other value or a null value is reset to shareholder;
S30103: Certificate type check
The valid certificate type values in the shareholder information comprise partnership enterprise business license and enterprise legal person business license, and any other value is reset to null.
6. The method of claim 5, wherein step S302 specifically comprises the following steps:
S30201: Shareholder information processing
If the shareholder name field in the shareholder information table is missing, the shareholder information record is deleted;
S30202: Employee information processing
If the employee name field in the employee information table is missing, the employee information record is deleted;
S30203: Branch information processing
If the branch name field in the branch information table is missing, the branch information record is deleted.
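Steps S30201 to S30203 apply the same rule to three tables, so one helper can cover all of them. The sketch below assumes records are held as dictionaries and treats a blank string as missing; both choices, and the field names, are illustrative assumptions.

```python
def drop_records_missing_name(records, name_field):
    """Keep only the records whose name field is present and non-blank."""
    kept = []
    for record in records:
        name = record.get(name_field)
        if name is not None and str(name).strip():
            kept.append(record)
    return kept


# Hypothetical shareholder rows: the second and third lack a usable name.
shareholders = [
    {"enterprise_name": "A Co", "shareholder_name": "Zhang San"},
    {"enterprise_name": "A Co", "shareholder_name": None},
    {"enterprise_name": "B Co", "shareholder_name": "  "},
]
cleaned = drop_records_missing_name(shareholders, "shareholder_name")
```

The same call with `"employee_name"` or `"branch_name"` would handle the employee and branch tables.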
7. The method for mining enterprise relationships according to claim 6, wherein step four specifically comprises the following steps:
S401: Enterprise directory deduplication
The enterprise directory is obtained from two data sources, so enterprise names overlap; deduplication is carried out when the multi-source data is fused;
The primary key of the enterprise directory table is the enterprise name. A primary key constraint is added in the Oracle database so that, when enterprise data is inserted, a record whose enterprise name is already present cannot be inserted into the enterprise directory table;
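The primary-key approach to deduplication can be illustrated with a small sketch. SQLite from the Python standard library is used here purely as a stand-in for the Oracle database named in the claim, and the table and column names are hypothetical. `INSERT OR IGNORE` is SQLite-specific syntax that silently skips rows violating the key.

```python
import sqlite3


def build_enterprise_directory(names):
    """Insert names into a table keyed on enterprise name; duplicates are skipped."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE enterprise_directory (enterprise_name TEXT PRIMARY KEY)"
    )
    for name in names:
        # The primary key constraint rejects a repeated name; OR IGNORE
        # turns that rejection into a no-op instead of an error.
        conn.execute(
            "INSERT OR IGNORE INTO enterprise_directory (enterprise_name) VALUES (?)",
            (name,),
        )
    conn.commit()
    rows = conn.execute(
        "SELECT enterprise_name FROM enterprise_directory ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Feeding the names from both sources through the same table leaves one row per enterprise name, in first-seen order.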
S402: Attribute decision
The data come from different websites, and attribute values on different websites may conflict. A confidence is set for each piece of data; when attribute values conflict, the attribute value with the higher confidence is selected. The confidence level has five grades, as follows:
Level 1: extremely low trust;
Level 2: low trust;
Level 3: average trust;
Level 4: relatively high trust;
Level 5: high trust;
Data confidence levels are assigned according to data source, and the specified confidence levels are shown in the following table:
8. The method of claim 7, wherein step S402 specifically comprises the following steps:
S40201: Confidence initialization
On the basis of the confidence level, an initial confidence n is set for each website, where n = confidence level × 100;
S40202: Shareholder information table attribute decision
When the shareholder information on several websites conflicts, the confidence of the shareholder information from each data source is compared, and the attribute value from the data source with the highest confidence is selected;
S40203: Branch information table attribute decision
When the branch information on several websites conflicts, the confidence of the branch information from each data source is compared, and the attribute value from the data source with the highest confidence is selected;
S403: Entity alignment
Assuming that enterprise names do not change, the enterprise directory information is collected first, and the required information is then collected from each website according to the enterprise directory, which ensures that the information obtained belongs to the same entity.
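The confidence rule in steps S40201 to S40203 can be sketched as below. The per-source level table is not reproduced in this text, so the source names and levels used here are hypothetical placeholders; only the relation n = level × 100 and the "pick the highest confidence" rule come from the claim. Ties are resolved in favor of the first candidate, which is an assumption.

```python
# Hypothetical confidence levels (1 to 5) per data source.
SOURCE_LEVELS = {"site_a": 5, "site_b": 3}


def initial_confidence(level):
    """S40201: the initial confidence n is the confidence level times 100."""
    return level * 100


def decide_attribute(candidates, source_levels):
    """Pick the value reported by the source with the highest confidence.

    candidates is a list of (source, value) pairs for one conflicting attribute.
    """
    _, best_value = max(
        candidates,
        key=lambda pair: initial_confidence(source_levels[pair[0]]),
    )
    return best_value
```

With these placeholder levels, `decide_attribute([("site_b", "30%"), ("site_a", "35%")], SOURCE_LEVELS)` selects the value from `site_a`.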
9. The method of claim 8, wherein step five specifically comprises the following steps:
S501: Legal representative relationship
The legal representative relationship is the relationship between an enterprise and its legal representative; each enterprise has one legal representative. The enterprise name and legal representative information are extracted from the enterprise business license information table to generate an enterprise-legal representative-name triple;
S502: Shareholder relationship
The initiators, investors, initiating enterprises and investing enterprises of an enterprise are all shareholders. The enterprise name and shareholder information are extracted from the enterprise shareholder information table to generate an enterprise-shareholder-shareholder name triple;
S503: Employment relationship
The key personnel, senior management and staff of an enterprise have an employment relationship with the enterprise. The enterprise name and staff name information are extracted from the enterprise staff information table to generate an enterprise-employment-staff name triple;
S504: Branch relationship
A branch applies to the relevant authorities for registration, and the registration is published on the National Enterprise Credit Information Publicity System website. The enterprise name and branch name information are extracted from the enterprise branch information table to generate an enterprise-branch-branch name triple;
S505: Outbound investment relationship
The National Enterprise Credit Information Publicity System website does not publish outbound investment information. Because the shareholder relationship is mutual, the enterprise name and shareholder name information are extracted from the enterprise shareholder information table, and an enterprise-investment-enterprise triple is generated for each enterprise in which a shareholder invests;
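Steps S501 to S505 all follow the same pattern of reading two columns from a table and emitting a triple. A minimal sketch follows; the field names and relation labels are hypothetical, and treating a shareholder as an enterprise when its name appears in the enterprise directory is an assumption made only for illustration.

```python
def make_triples(rows, subject_field, relation, object_field):
    """Build (subject, relation, object) triples, skipping incomplete rows."""
    triples = []
    for row in rows:
        subject, obj = row.get(subject_field), row.get(object_field)
        if subject and obj:
            triples.append((subject, relation, obj))
    return triples


def investment_triples(shareholder_rows, enterprise_names):
    """S505: invert shareholder rows whose shareholder is itself a known enterprise."""
    known = set(enterprise_names)
    return [
        (row["shareholder_name"], "investment", row["enterprise_name"])
        for row in shareholder_rows
        if row.get("shareholder_name") in known and row.get("enterprise_name")
    ]


# Hypothetical shareholder table rows.
rows = [
    {"enterprise_name": "A Co", "shareholder_name": "B Co"},
    {"enterprise_name": "A Co", "shareholder_name": "Zhang San"},
]
shareholder_triples = make_triples(rows, "enterprise_name", "shareholder", "shareholder_name")
outbound = investment_triples(rows, ["A Co", "B Co"])
```

Here `B Co` holds shares in `A Co`, so the inverted row yields the outbound investment triple, while the natural person shareholder produces none.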
S506: Competitive relationship
Assumption 1: enterprises belonging to the same industry have a competitive relationship;
Assumption 2: enterprises located in the same city have a competitive relationship;
Assumption 3: enterprises with similar business scopes have a competitive relationship;
A competition value m in the range 0 to 100 is set, with initial value m = 0;
The competition value change rules are shown in the following table:
Rule | Change
Two enterprises belong to the same industry | m+20
Two enterprises belong to the same city | m+5
The two enterprises have similar business scopes | m+(10 to 80)
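The rule table can be read as a simple additive score. The sketch below is one possible reading: the function name is hypothetical, and capping the result at 100 is an assumption made because the three increments can sum to 105 while m is stated to lie between 0 and 100.

```python
def competition_value(same_industry, same_city, scope_increment=0):
    """Accumulate the competition value m from the three rules.

    scope_increment is 0 when the business scopes are not similar,
    otherwise a value from 10 to 80 taken from the similarity rule.
    """
    if scope_increment != 0 and not 10 <= scope_increment <= 80:
        raise ValueError("scope_increment must be 0 or between 10 and 80")
    m = 0  # initial value
    if same_industry:
        m += 20
    if same_city:
        m += 5
    m += scope_increment
    # Assumption: clamp to the stated upper bound of 100.
    return min(m, 100)
```

Two enterprises in the same industry and city with no scope similarity would score 25 under this reading.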
10. The method of claim 9, wherein step S506 specifically comprises the following steps:
S50601: Same-industry competitive relationship
Industries are classified into 99 classes according to the fourth revision (2008) of the International Standard Industrial Classification (ISIC);
80% of the data in the enterprise operation range marking table is used to train the classification model, and the remaining 20% is used to test it;
(1) reading the business scope information and the industry information from the enterprise operation range marking table;
(2) segmenting the business scope information with the Jieba word segmentation tool to generate a segmentation result set;
(3) removing punctuation and stop words from the segmentation result set;
(4) converting the Chinese words in the segmentation result set into k-dimensional space vectors with the Word2vec tool;
(5) representing each industry by its number in the International Standard Industrial Classification;
(6) selecting 80% of the data and training a multi-class classification model with the multiclass module of the Python Scikit-learn library;
(7) using the remaining 20% of the data to test the model and calculating the accuracy of the model;
The business scope information in the enterprise business license information table is input into the classification model to determine the industry of each enterprise; if two enterprises belong to the same industry, the competition value is changed according to the rule;
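The shape of the training and testing procedure in steps (1) to (7) can be sketched with the standard library alone. This is a deliberately simplified stand-in, not the pipeline named in the claim: the input is assumed to be already segmented and filtered (in place of Jieba), a small hand-written embedding dictionary replaces Word2vec, and a nearest-centroid classifier replaces the Scikit-learn multi-class model. All names and industry codes below are illustrative.

```python
import random


def doc_vector(words, embeddings, k):
    """Average the k-dimensional vectors of known words (zero vector if none)."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * k
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def split_80_20(samples, seed=0):
    """Shuffle reproducibly and split into 80% training and 20% test data."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def train_centroids(train):
    """train is a list of (vector, industry_code); returns one centroid per code."""
    grouped = {}
    for vec, label in train:
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Return the industry code whose centroid is closest to vec."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


def accuracy(centroids, test):
    """Fraction of test samples whose predicted code equals the labelled code."""
    if not test:
        return 0.0
    hits = sum(1 for vec, label in test if predict(centroids, vec) == label)
    return hits / len(test)
```

In a real run, each business scope would be turned into a vector with `doc_vector`, the labelled vectors split with `split_80_20`, and `accuracy` reported on the held-out 20%.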
S50602: Same-city competitive relationship
City information for an enterprise may be contained in the enterprise name, the registration authority and the residence information; the extraction priority is registration authority, then enterprise name, then residence information;
(1) Registration authority: city information extraction
The city information is extracted with the regular expression "(.*?)市|区" (市 = city, 区 = district);
(2) Enterprise name / residence information: city information extraction
Named entity recognition is performed on the input information with pyltp, the natural language processing library from Harbin Institute of Technology;
The entities that pyltp can recognize comprise person names (Nh), organization names (Ni) and place names (Ns). The recognition module labels its results in the O-S-B-I-E form, with the meanings shown in the following table:
Tag | Meaning
O | The word is not part of an entity
S | The word alone constitutes an entity
B | Beginning of an entity
I | Middle of an entity
E | End of an entity
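Once words have been tagged in the O-S-B-I-E form, place names can be recovered by joining consecutive B, I and E words. The sketch below does not call pyltp; it assumes the tagger output has already been obtained as a list of strings such as "B-Ns", "E-Ns" and "O", which is the assumed tag spelling here. The function name is hypothetical.

```python
def extract_entities(words, tags, entity_type="Ns"):
    """Collect entities of one type from parallel word and tag lists."""
    entities, buffer = [], []
    for word, tag in zip(words, tags):
        position, _, kind = tag.partition("-")
        if tag == "O" or kind != entity_type:
            buffer = []  # not part of the wanted entity type
        elif position == "S":
            entities.append(word)  # a single word forms the whole entity
            buffer = []
        elif position == "B":
            buffer = [word]  # start a new multi-word entity
        elif position == "I" and buffer:
            buffer.append(word)
        elif position == "E" and buffer:
            buffer.append(word)
            entities.append("".join(buffer))  # Chinese words join without spaces
            buffer = []
        else:
            buffer = []  # malformed sequence, for example I or E without B
    return entities


words = ["长春", "市", "某", "公司"]
tags = ["B-Ns", "E-Ns", "O", "O"]
places = extract_entities(words, tags)
```

Passing `entity_type="Ni"` or `"Nh"` would collect organization or person names in the same way.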
S50603: Business scope competitive relationship
The business scope similarity of enterprise A and enterprise B is calculated with the following steps:
(1) reading the business scope data of enterprise A and enterprise B from the enterprise business license information table;
(2) segmenting the business scope information of enterprise A and enterprise B separately with the Jieba word segmentation tool to generate segmentation result sets SEGA and SEGB;
(3) removing punctuation and stop words from the segmentation result sets SEGA and SEGB;
(4) converting the Chinese words in SEGA and SEGB into k-dimensional space vectors vec(A) and vec(B) with the Word2vec tool;
(5) calculating the cosine similarity cos(A, B) of the business scope vector vec(A) of enterprise A and the business scope vector vec(B) of enterprise B, with the following formula:
cos(A, B) = (vec(A) · vec(B)) / (|vec(A)| × |vec(B)|)
where cos(A, B) is the cosine similarity between the business scope vector of enterprise A and the business scope vector of enterprise B; vec(A) is the business scope vector of enterprise A; vec(B) is the business scope vector of enterprise B.
When cos(A, B) reaches 30%, the competition value is changed to m+10; for each further 1% increase in similarity, the competition value increases by a further 1.
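The cosine similarity and its mapping to a competition increment can be sketched as follows. The function names are hypothetical. Treating similarities below 30% as contributing nothing, counting whole percentage points only, and capping the increment at 80 are assumptions chosen to agree with the m+(10 to 80) range in the rule table.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def scope_increment(similarity):
    """Map a similarity in [0, 1] to the competition value increment."""
    # Whole percentage points; the small epsilon guards against float rounding.
    percent = math.floor(similarity * 100 + 1e-9)
    if percent < 30:
        return 0  # assumption: below the 30% threshold nothing is added
    return min(10 + (percent - 30), 80)
```

A similarity of 30% gives an increment of 10, and a similarity of 100% gives the maximum of 80.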
CN201910716435.9A 2019-08-05 2019-08-05 Enterprise relation mining method Pending CN110597870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910716435.9A CN110597870A (en) 2019-08-05 2019-08-05 Enterprise relation mining method

Publications (1)

Publication Number Publication Date
CN110597870A true CN110597870A (en) 2019-12-20

Family

ID=68853498

Country Status (1)

Country Link
CN (1) CN110597870A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination