WO2023093116A1 - Method, device, terminal and storage medium for determining an enterprise's industrial chain node - Google Patents

Method, device, terminal and storage medium for determining an enterprise's industrial chain node

Info

Publication number
WO2023093116A1
WO2023093116A1 PCT/CN2022/109615 CN2022109615W WO2023093116A1 WO 2023093116 A1 WO2023093116 A1 WO 2023093116A1 CN 2022109615 W CN2022109615 W CN 2022109615W WO 2023093116 A1 WO2023093116 A1 WO 2023093116A1
Authority
WO
WIPO (PCT)
Prior art keywords
enterprise
information
vector
entity information
entity
Prior art date
Application number
PCT/CN2022/109615
Other languages
English (en)
French (fr)
Inventor
沈浩
吴优
Original Assignee
上海帜讯信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海帜讯信息技术股份有限公司
Publication of WO2023093116A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the technical field of data processing, and specifically relates to a method, device, terminal and storage medium for determining an enterprise's industrial chain node.
  • the existing technology provides a web crawler program, which can automatically grab the information of the enterprise from the free Internet platform.
  • with it, the latest information of the enterprise can be obtained in time, and the industrial chain node to which the enterprise belongs can then be determined through classification.
  • the above method determines the industrial chain node to which the enterprise belongs through the single-dimensional information of the enterprise, resulting in low accuracy in classifying the industrial chain nodes of the enterprise.
  • the main purpose of this application is to provide a method, device, terminal, and storage medium for determining an enterprise's industrial chain node, so as to solve the problem of low accuracy in determining the industrial chain node to which an enterprise belongs in related technologies.
  • this application provides a method for determining the industrial chain node of an enterprise, including:
  • At least one industry chain node corresponding to the enterprise is determined.
  • an entity recognition algorithm is used to identify and process enterprise information to determine enterprise entity information, including:
  • clustering algorithm is used to cluster enterprise entity information to determine enterprise core entity information, including:
  • entity statistics are performed on the clustering results to determine the core entity information of the enterprise, including:
  • the clusters with the top preset number of entity counts are selected as the core entity clusters, and the entities in the core entity clusters are used as the core entity information of the enterprise.
  • At least one industrial chain node corresponding to the enterprise is determined based on the enterprise core entity information, industrial chain information and similarity algorithm, including:
  • the similarity calculation is performed on the enterprise core entity information vector and the industrial chain information vector, and at least one industrial chain node corresponding to the enterprise is determined.
  • the enterprise core entity information and industry chain information are vectorized to obtain the enterprise core entity information vector and industry chain information vector, including:
  • the first text vector is used as the enterprise core entity information vector
  • the second text vector is used as the industry chain information vector.
  • the similarity calculation is performed on the enterprise core entity information vector and the industrial chain information vector, and at least one industrial chain node corresponding to the enterprise is determined, including:
  • an embodiment of the present invention provides a device for determining an enterprise's industrial chain node, including:
  • the identification module is used to identify and process enterprise information by using an entity identification algorithm to determine enterprise entity information
  • the clustering module is used to cluster the enterprise entity information by using a clustering algorithm to determine the core entity information of the enterprise;
  • the node determination module is used to determine at least one industrial chain node corresponding to the enterprise based on the enterprise core entity information, industrial chain information and similarity algorithm.
  • the embodiment of the present invention provides a terminal, including a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • when the processor executes the computer program, the steps of any one of the above methods for determining an enterprise's industrial chain node are realized.
  • the embodiment of the present invention provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of any one of the methods for determining the industrial chain node of an enterprise are realized.
  • the embodiment of the present invention provides a method, device, terminal, and storage medium for determining an enterprise's industry chain node, including: first, using an entity recognition algorithm to identify and process enterprise information to determine the enterprise entity information; then using a clustering algorithm to cluster the enterprise entity information to determine the core entity information of the enterprise; and then, based on the core entity information of the enterprise, the industrial chain information and a similarity algorithm, determining at least one industrial chain node corresponding to the enterprise.
  • the present invention sequentially identifies and clusters multi-dimensional enterprise information, which effectively removes noise entities and improves the processing efficiency of enterprise information; finally, it classifies the obtained enterprise core entity information into industrial chain nodes through the similarity algorithm, which not only improves the accuracy of classification but also greatly improves the interpretability of the classification results.
  • Fig. 1 is the implementation flow diagram of a method for determining an industrial chain node of an enterprise provided by an embodiment of the present invention
  • FIG. 2 is a flowchart of the implementation of entity recognition provided by the embodiment of the present invention.
  • Fig. 3 is the realization flowchart of enterprise entity information clustering provided by the embodiment of the present invention.
  • Fig. 4 is a schematic diagram of a clustering result provided by an embodiment of the present invention.
  • Fig. 5 is the realization flowchart of the node classification of enterprise industrial chain provided by the embodiment of the present invention.
  • Fig. 6 is a schematic structural diagram of an enterprise's industry chain node determination device provided by an embodiment of the present invention.
  • Fig. 7 is a schematic diagram of a terminal provided by an embodiment of the present invention.
  • the sequence numbers of the processes do not imply the order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
  • “plurality” means two or more.
  • “And/or” is just an association relationship describing associated objects, which means that there can be three kinds of relationships; for example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone.
  • the character “/” generally indicates that the contextual objects are an “or” relationship.
  • “Includes A, B and C” means that A, B, and C are all included; “includes A, B, or C” means that one of A, B, and C is included; “includes A, B and/or C” means that any one, any two, or all three of A, B and C are included.
  • “B corresponding to A” means that B is associated with A, and B can be determined according to A. Determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.
  • the matching between A and B means that the similarity between A and B is greater than or equal to a preset threshold.
  • a method for determining an industry chain node of an enterprise comprising the following steps:
  • Step S101: Use an entity recognition algorithm to identify and process the enterprise information and determine the enterprise entity information;
  • Step S102: Use a clustering algorithm to cluster the enterprise entity information and determine the enterprise core entity information;
  • Step S103: Based on the enterprise core entity information, the industry chain information and a similarity algorithm, determine at least one industry chain node corresponding to the enterprise.
  • the present invention does not use a traditional classification algorithm; instead, it uses an entity recognition algorithm to identify enterprise information and determine enterprise entity information, avoiding a large amount of manual classification and labeling work.
  • the enterprise information involved in this patent includes the following five information dimensions: enterprise business information, enterprise patent information, enterprise bidding information, enterprise recruitment information and enterprise news information.
  • the business information of enterprises belongs to the public information of enterprises, which is the public information owned by all enterprises, and is also the main information that can be used in the matching of industrial nodes of small and micro enterprises and start-ups.
  • the business information of enterprises to be collected in this patent includes: company name, business scope, registration time, registered address, etc.
  • Enterprise patent information belongs to the public information of the enterprise, including the core products and technologies of the enterprise, and can describe the core technical capabilities of the enterprise in detail and accurately.
  • the enterprise patent information that needs to be collected in this patent includes: patent name, applicant (enterprise), patent abstract, patent text, and patent application date.
  • Enterprise bidding information belongs to the public information of the enterprise, including the enterprise's demand for the bidding product or the downstream technology industry.
  • the enterprise bidding information that needs to be collected in this patent includes: bidding title, bidding unit, bidding text, bidding time, etc.
  • the information on winning the bid of an enterprise belongs to the public information of the enterprise, including the products or technical capabilities of the enterprise in the upstream of the bid winning product or technology industry. amount etc.
  • the recruitment information of the enterprise on the recruitment platform belongs to the public information of the enterprise, including the technical requirements of the enterprise in a specific position, so as to reflect that the enterprise's business belongs to the related technology or product field.
  • the enterprise recruitment information to be collected in this patent includes: job title, job description, and recruitment time.
  • the news information of the enterprise on the open network platform belongs to the public information of the enterprise, including the relevant industry information of the enterprise.
  • the enterprise news information to be collected in this patent includes: news title, news text, news time, and news source.
  • Enterprise entity information refers to the entity information used to describe the enterprise industry, field, technology, and product.
  • the traditional enterprise classification method will classify according to all the information of the enterprise, there is no effective method to remove information noise.
  • a clustering algorithm is used to screen out, according to the number of entities in each cluster, the core entities that can effectively describe the business of the enterprise, while removing non-core entities and noise entities to obtain the core entities of the enterprise, which effectively improves the accuracy of the final enterprise industry classification result.
  • the enterprise core entity information refers to the information defined and described in the enterprise entity information to describe the core business of the enterprise.
  • the present invention uses a similarity algorithm to compare the core entity information of the enterprise with the industrial chain information, which not only classifies the enterprise's industry accurately but also directly outputs the entity information on which the classification is based, greatly improving the interpretability of the classification results.
  • the industrial chain information includes industrial chain definition information, industrial chain node and relationship information, and industrial chain node keyword information.
  • the industrial chain definition information includes four dimensions of value chain, enterprise chain, supply and demand chain and space chain.
  • the value chain information needs to be described in detail in terms of products, production, sales and after-sales service in the industry, including: the product definition, which describes the known product names and descriptions in the industry; the production definition, which describes the known production technology in the industry; the sales definition, which describes the known sales models in the industry; and the after-sales service definition, which describes the known after-sales service models in the industry.
  • Enterprise chain information needs to be sorted out in detail in terms of leading companies and listed companies in the industry, including: leading companies, which describes the known leading companies in various fields in the industry, including company name, main business, main products, etc.; and listed companies, which describes the known listed companies in the industry, including the company name, annual report information of the listed company, etc.
  • Supply and demand chain information needs to be sorted out in detail for the three aspects of procurement, sales, and warehousing in the industry, including: the definition of procurement, which describes the main procurement methods and channels in the industry; the definition of sales, which describes the main sales models and channels in the industry; and the definition of warehousing, which describes the main warehousing locations and warehousing costs in the industry.
  • Space chain information needs to sort out in detail the geographical distribution of industrial production and sales, including: the definition of production regions, which describes the main production regions and production indicators in the industry; and the definition of sales regions, which describes the main sales regions and sales indicators in the industry.
  • the node relationship of the industrial chain includes three types: superordinate relationship, subordinate relationship, and parallel relationship, and the node relationship information needs to establish one-to-one, one-to-many, and many-to-many node relationships among all nodes in the industrial chain.
  • Industry chain node keyword information refers to the need to obtain similar product, technology, and field keywords based on the industry chain node information, so as to facilitate the matching between enterprises and industry nodes in the later stage.
  • the node keywords of "local area communication" include "transmission technology, network topology, basic network, broadband radio, narrow-band (or single-frequency) radio” and so on.
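  • As an illustration only, the industrial chain node, relationship and keyword information described above could be held in a small in-memory structure. The sketch below is an assumption about one possible representation, not the patent's data model: the class names `ChainNode` and `IndustrialChain`, and the nodes "communication" and "wide area communication", are hypothetical; only the "local area communication" keywords come from the text. Parallel nodes are taken here to be nodes that share a superordinate node.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class ChainNode:
    name: str
    keywords: List[str] = field(default_factory=list)
    parents: List[str] = field(default_factory=list)    # superordinate nodes
    children: List[str] = field(default_factory=list)   # subordinate nodes


class IndustrialChain:
    """Hypothetical container; lists on both sides allow many-to-many links."""

    def __init__(self) -> None:
        self.nodes: Dict[str, ChainNode] = {}

    def add_node(self, name: str, keywords: Iterable[str] = ()) -> None:
        self.nodes[name] = ChainNode(name, list(keywords))

    def link(self, parent: str, child: str) -> None:
        # Record the superordinate/subordinate relationship in both directions.
        self.nodes[parent].children.append(child)
        self.nodes[child].parents.append(parent)

    def parallel_nodes(self, name: str) -> List[str]:
        # Assumption: "parallel" means sharing at least one superordinate node.
        result: List[str] = []
        for parent in self.nodes[name].parents:
            for child in self.nodes[parent].children:
                if child != name and child not in result:
                    result.append(child)
        return result


chain = IndustrialChain()
chain.add_node("communication")            # hypothetical parent node
chain.add_node(
    "local area communication",
    ["transmission technology", "network topology", "basic network",
     "broadband radio", "narrow-band (or single-frequency) radio"],
)
chain.add_node("wide area communication")  # hypothetical sibling node
chain.link("communication", "local area communication")
chain.link("communication", "wide area communication")
```

  • With this layout, `chain.parallel_nodes("local area communication")` lists the sibling nodes, and each node's keyword list is available for the later matching step.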
  • An embodiment of the present invention provides a method for determining an enterprise's industrial chain node, including: first, using an entity recognition algorithm to identify and process enterprise information to determine the enterprise entity information; then using a clustering algorithm to cluster the enterprise entity information to determine the enterprise core entity information; and then, based on the enterprise core entity information, the industry chain information and a similarity algorithm, determining at least one industry chain node corresponding to the enterprise.
  • the present invention sequentially identifies and clusters multi-dimensional enterprise information, which effectively removes noise entities and improves the processing efficiency of enterprise information; finally, it classifies the obtained enterprise core entity information into industrial chain nodes through the similarity algorithm, which not only improves the accuracy of classification but also greatly improves the interpretability of the classification results.
  • step S101 includes:
  • Step S201: Perform text preprocessing on the enterprise information to obtain preprocessed enterprise information;
  • Step S202: Select training samples from the preprocessed enterprise information, and use the training samples to train the initial deep neural network model to obtain the target deep neural network model;
  • Step S203: Select prediction samples from the preprocessed enterprise information, input the prediction samples into the target deep neural network model, and output the enterprise entity information.
  • the previous embodiment described the multi-dimensional public information used to portray the portrait of the enterprise, which contains important information about the industry and industry nodes in which the enterprise is located.
  • the above enterprise public information is multi-source heterogeneous data that also contains a lot of noise, which greatly affects the accuracy of matching between enterprises and industrial chain nodes. Therefore, it is necessary to identify high-value entities that can describe the company's industry, technology, products, and fields from the multi-dimensional public information of the company.
  • entity extraction refers to automatically extracting the position and type of high-value entities from a piece of natural language text. For example, from a piece of corporate news, the company name, product name, technology name, field name, industry node name, etc. involved in the news are automatically identified.
  • In step S101, entity recognition is mainly divided into three processes: data preprocessing (i.e., text preprocessing), model training, and entity prediction, as follows:
  • (1) Data preprocessing: First, text preprocessing is performed on the acquired enterprise information, including text segmentation and sentence segmentation, and entities are manually annotated in the segmented sentences to provide samples for model training. Because entities are sparse and many sentences contain no entity at all, negative sampling is performed on the non-entity samples, after which the sample data is divided into a training set, a validation set and a test set.
  • (2) Model training: The Transformer model, a current deep neural network architecture, is used to construct the model encoder and decoder. The text is encoded with BERT word vectors and a pre-trained language model and then fed into the constructed neural network model (this application uses a deep neural network model) for training; the model is optimized by minimizing the label training error, yielding the target deep neural network model.
  • (3) Entity prediction: A CRF or fully connected layer is used to predict labels. The entities in each sentence are restored according to the predicted labels, and the key entities in the entire text are extracted, that is, the enterprise entity information is extracted.
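  • The step of restoring entities from predicted labels can be sketched as follows. The patent does not name a tagging scheme, so this minimal example assumes the common BIO convention ("B-" begins an entity, "I-" continues it, "O" is outside any entity); the function name `decode_bio`, the label types and the sample sentence are all hypothetical.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Rebuild (entity_text, entity_type) pairs from per-token BIO labels."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = ""
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            current.append(token)
        else:
            # "O" or an inconsistent "I-" tag closes any open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], ""
    if current:
        entities.append((" ".join(current), current_type))
    return entities


# Hypothetical model output for one sentence.
tokens = ["Acme", "builds", "narrow", "band", "radio", "modules"]
labels = ["B-COMPANY", "O", "B-TECH", "I-TECH", "I-TECH", "O"]
entities = decode_bio(tokens, labels)
```

  • Running the decoder over every sentence of an enterprise's texts and collecting the results gives the list of candidate enterprise entities that the clustering step then filters.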
  • step S102 includes:
  • Step S301: Vectorize the enterprise entity information to obtain the enterprise entity information vector;
  • Step S302: Use the k-means algorithm to perform unsupervised clustering on the enterprise entity information vector and determine the clustering result;
  • Step S303: Perform entity statistics on the clustering result to determine the core entity information of the enterprise.
  • the obtained enterprise entity information still has the following two problems. First, there is noise in the entity information: since the enterprise information texts used in this patent are very diverse and complex in type and format, the final result of the entity recognition algorithm contains considerable noise, which affects the final enterprise industry classification result. Second, the entity vectors are scattered: since the business of an enterprise often covers multiple industries, fields, technologies, and products, the identified entities often differ greatly from one another, making it impossible to judge the main business field of an enterprise by relying on entity information alone. Therefore, to improve the final enterprise industry classification, this patent performs a clustering operation on the identified enterprise entity information. The basic idea of entity clustering is to group entities into clusters according to the distance or similarity between their vectors.
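  • To make the distance-based grouping concrete, the following is a minimal pure-Python sketch of k-means (Lloyd's algorithm) over entity vectors. It is illustrative rather than the patent's implementation: the initialization from the first k points, the fixed iteration count and the toy two-dimensional vectors are assumptions, and a production system would more likely rely on a library implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster index for each point."""
    # Assumption: naive initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Toy entity vectors forming two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.2), (9.8, 10.1), (0.2, 0.1)]
cluster_ids = kmeans(points, k=2)
```

  • The returned cluster indices are what the subsequent entity statistics operate on: entities sharing an index belong to the same cluster.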
  • to determine the core entity information of the enterprise, it is first necessary to count the number of entities in each cluster in the clustering result to obtain multiple entity counts, and then sort the entity counts in descending order to obtain the sorted result. Finally, the clusters with the top preset number of entity counts in the sorted result are selected as the core entity clusters, and the entities in the core entity clusters are used as the core entity information of the enterprise.
  • The enterprise entity information clustering in step S102 is described below with reference to Fig. 3 and Fig. 4, specifically as follows:
  • The technical process of enterprise entity information clustering in this patent is shown in Fig. 3 and is mainly divided into three processes: enterprise entity information vectorization, enterprise entity vector clustering, and core entity determination (i.e., eliminating non-core entities and noise entities).
  • the number of entities in each cluster can be calculated, and the 3 clusters with the largest numbers of entities can be defined as core entity clusters, in which the industry, field, technology, and product entities are identified as information describing the core business of the enterprise.
  • The clusters in the circled part are defined as non-core entity clusters, indicating that the entity information in these clusters is not the most important business information of the enterprise, and the clusters in the square part, that is, isolated entities, are defined as noise, indicating that this entity information is not a description of the real business of the enterprise.
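  • The entity statistics described above (count the entities per cluster, sort in descending order, keep the top clusters) can be sketched as below. This is a hedged illustration: the function names, the role strings and the sample entities are hypothetical, and treating single-entity clusters as noise is an interpretation of the "isolated entities" wording rather than a rule stated by the patent.

```python
from collections import Counter
from typing import Dict, List


def label_clusters(cluster_ids: List[int], top_n: int = 3) -> Dict[int, str]:
    """Assign each cluster a role: 'core', 'non-core' or 'noise'."""
    counts = Counter(cluster_ids)
    roles: Dict[int, str] = {}
    # most_common() yields clusters sorted by entity count, largest first.
    for rank, (cluster_id, size) in enumerate(counts.most_common()):
        if size == 1:
            roles[cluster_id] = "noise"      # assumption: isolated entity
        elif rank < top_n:
            roles[cluster_id] = "core"
        else:
            roles[cluster_id] = "non-core"
    return roles


def core_entities(entities: List[str], cluster_ids: List[int], top_n: int = 3) -> List[str]:
    roles = label_clusters(cluster_ids, top_n)
    return [e for e, cid in zip(entities, cluster_ids) if roles[cid] == "core"]


# Hypothetical clustering result: five clusters of sizes 5, 4, 3, 2 and 1.
cluster_ids = [0] * 5 + [1] * 4 + [2] * 3 + [3] * 2 + [4]
entities = [f"entity_{i}" for i in range(len(cluster_ids))]
roles = label_clusters(cluster_ids)
core = core_entities(entities, cluster_ids)
```

  • Only the entities returned by `core_entities` would be carried forward to the node matching step; the rest are discarded as non-core or noise.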
  • step S103 includes:
  • Step S401: Vectorize the enterprise core entity information and the industry chain information respectively to obtain the enterprise core entity information vector and the industry chain information vector;
  • Step S402: Calculate the similarity between the enterprise core entity information vector and the industry chain information vector, and determine at least one industry chain node corresponding to the enterprise.
  • to calculate the similarity between the enterprise core entity information vector and the industry chain information vector, it is first necessary to calculate the cosine distance between the two vectors to obtain the cosine distance value, and then determine the similarity between the enterprise core entity information vector and the industry chain information vector based on the cosine distance value. If the similarity is greater than the preset similarity, the enterprise core entity information vector is associated with the industry chain information vector to obtain at least one industry chain node corresponding to the enterprise.
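  • A minimal sketch of this thresholded matching follows. It assumes the usual convention that cosine distance is one minus the cosine of the angle between the vectors, so that similarity is recovered as one minus the distance; the patent does not spell out the conversion. The function names, the node names and the preset similarity of 0.8 are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def match_nodes(
    enterprise_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],
    preset_similarity: float = 0.8,  # hypothetical threshold
) -> List[str]:
    """Return the industry chain nodes whose similarity exceeds the threshold."""
    matched: List[str] = []
    for name, vec in node_vecs.items():
        similarity = 1.0 - cosine_distance(enterprise_vec, vec)
        if similarity > preset_similarity:
            matched.append(name)
    return matched


# Toy vectors; real ones would come from the vectorization step.
node_vecs = {"node_a": [1.0, 0.1], "node_b": [0.0, 1.0], "node_c": [2.0, 0.0]}
matched = match_nodes([1.0, 0.0], node_vecs)
```

  • Because every match is tied to a specific core entity vector and node vector, the entities behind each classification can be reported alongside the result.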
  • The following takes Fig. 5 as an example to illustrate the classification of enterprise industry chain nodes in step S103, which is mainly divided into three processes: vectorization of industry chain information, vector similarity calculation, and output of the enterprise industry chain node classification results, as follows:
  • Industrial chain information vectorization: Use the word vector database to calculate the text vectors of the industrial chain definition information, the industrial chain node and relationship information, and the industrial chain node keyword information, forming text vector representations of the industrial chains and industrial chain nodes.
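  • The patent does not state how a text vector is derived from the word vector database. One common choice, shown here purely as an assumption, is to average the vectors of the words found in the database; the toy two-dimensional `word_vectors` table and the function name `text_vector` are hypothetical.

```python
from typing import Dict, List, Sequence


def text_vector(
    words: List[str],
    word_vectors: Dict[str, Sequence[float]],
    dim: int,
) -> List[float]:
    """Average the vectors of known words; unknown words are skipped."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical two-dimensional word vector database.
word_vectors = {"broadband": [1.0, 0.0], "radio": [0.0, 1.0]}
node_vec = text_vector(["broadband", "radio", "unknown"], word_vectors, dim=2)
```

  • Applying the same function to the enterprise core entities and to each node's keywords places both in the same vector space, which is what the similarity comparison requires.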
  • the determination method of the second external structure model is similar to the determination method of the first external structure model, which will not be repeated here.
  • Fig. 6 shows a schematic structural diagram of an enterprise's industrial chain node determination device provided by an embodiment of the present invention.
  • an enterprise's industrial chain node determination device includes an identification module 61, a clustering module 62 and a node determination module 63, specifically as follows:
  • the identification module 61 is used to identify and process the enterprise information by using the entity identification algorithm, and determine the enterprise entity information;
  • the clustering module 62 is used to cluster the enterprise entity information by using a clustering algorithm to determine the core entity information of the enterprise;
  • the node determination module 63 is configured to determine at least one industrial chain node corresponding to the enterprise based on the enterprise core entity information, industrial chain information and similarity algorithm.
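  • The three-module structure can be pictured as a simple composition in which each module is a pluggable callable. The sketch below only shows the wiring; the class name `NodeDeterminationDevice` and the lambda stand-ins are hypothetical placeholders and do not implement entity recognition, clustering or similarity matching.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NodeDeterminationDevice:
    identify: Callable[[List[str]], List[str]]         # identification module 61
    cluster: Callable[[List[str]], List[str]]          # clustering module 62
    determine_nodes: Callable[[List[str]], List[str]]  # node determination module 63

    def run(self, enterprise_texts: List[str]) -> List[str]:
        entities = self.identify(enterprise_texts)
        core_entities = self.cluster(entities)
        return self.determine_nodes(core_entities)


# Placeholder modules, for wiring demonstration only.
device = NodeDeterminationDevice(
    identify=lambda texts: [word for text in texts for word in text.split()],
    cluster=lambda entities: [e for e in entities if entities.count(e) > 1],
    determine_nodes=lambda core: sorted(set(core)),
)
nodes = device.run(["radio antenna radio", "antenna battery"])
```

  • Swapping any placeholder for a real implementation leaves `run` unchanged, mirroring how the device delegates each method step to its own module.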
  • the identification module 61 includes:
  • the preprocessing sub-module is used to perform text preprocessing on enterprise information to obtain preprocessed enterprise information
  • the model training sub-module is used to select training samples from the preprocessed enterprise information, and use the training samples to train the initial deep neural network model to obtain the target deep neural network model;
  • the entity information determination sub-module is used to select prediction samples from the preprocessed enterprise information, input the prediction samples into the target deep neural network model, and output enterprise entity information.
  • the clustering module 62 includes:
  • the first vectorization sub-module is used to vectorize the enterprise entity information to obtain the enterprise entity information vector;
  • the clustering sub-module is used to perform unsupervised clustering on the enterprise entity information vector by using the k-means algorithm, and determine the clustering result;
  • the entity statistics sub-module is used to perform entity statistics on the clustering results and determine the core entity information of the enterprise.
  • the entity statistics submodule includes:
  • the entity number statistics unit is used to count the entity number of each cluster in the clustering result to obtain multiple entity numbers
  • the sorting unit is used to sort the numbers of multiple entities in descending order to obtain the sorting result
  • the core entity information determination unit is configured to select, from the sorted result, the clusters with the top preset number of entity counts as core entity clusters, and use the entities in the core entity clusters as enterprise core entity information.
  • the node determination module 63 includes:
  • the second vectorization sub-module is used to vectorize the enterprise core entity information and the industrial chain information respectively to obtain the enterprise core entity information vector and the industrial chain information vector;
  • the similarity calculation sub-module is used to calculate the similarity between the enterprise core entity information vector and the industrial chain information vector, and determine at least one industrial chain node corresponding to the enterprise.
  • the second vectorization sub-module includes:
  • the text determination unit is used to calculate the first text vector corresponding to the enterprise core entity information and the second text vector corresponding to the industry chain information by using the word vector database;
  • a vector determination unit is configured to use the first text vector as an enterprise core entity information vector, and use the second text vector as an industry chain information vector.
  • the similarity calculation submodule includes:
  • the distance calculation unit is used to calculate the cosine distance between the enterprise core entity information vector and the industry chain information vector to obtain the cosine distance value
  • the similarity calculation unit is used to determine the similarity between the enterprise core entity information vector and the industrial chain information vector based on the cosine distance value;
  • the enterprise classification result determination unit is used to associate the enterprise core entity information vector with the industry chain information vector to obtain at least one industry chain node corresponding to the enterprise if the similarity is greater than the preset similarity.
  • Fig. 7 is a schematic diagram of a terminal provided by an embodiment of the present invention.
  • the terminal 7 of this embodiment includes: a processor 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processor 70.
  • when the processor 70 executes the computer program 72, it implements the steps in each of the above embodiments of the method for determining the industry chain nodes of an enterprise, for example, steps 101 to 103 shown in FIG. 1.
  • alternatively, when the processor 70 executes the computer program 72, it implements the functions of the modules/units in the above device embodiments, such as the functions of the modules/units 61 to 63 shown in FIG. 6.
  • the present invention also provides a readable storage medium, wherein a computer program is stored in the readable storage medium, and when the computer program is executed by a processor, it is used to implement the methods provided by the above-mentioned various embodiments.
  • the readable storage medium may be a computer storage medium, or a communication medium.
  • Communication media include any medium that facilitates transfer of a computer program from one place to another.
  • Computer storage media can be any available media that can be accessed by a general purpose or special purpose computer.
  • a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium.
  • the readable storage medium can also be a component of the processor.
  • the processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC). Additionally, the ASIC may be located in the user equipment.
  • the processor and the readable storage medium can also exist in the communication device as discrete components.
  • the readable storage medium may be read only memory (ROM), random access memory (RAM), CD-ROM, magnetic tape, floppy disk, and optical data storage devices, among others.
  • the present invention also provides a program product, which includes execution instructions, and the execution instructions are stored in a readable storage medium.
  • At least one processor of the device may read the execution instruction from the readable storage medium, and the at least one processor executes the execution instruction so that the device implements the methods provided in the foregoing various implementation manners.
  • the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. The steps of the method disclosed in conjunction with the present invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Economics (AREA)
  • Molecular Biology (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Tourism & Hospitality (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

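This application discloses a method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise. The method includes: performing recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information; clustering the enterprise entity information using a clustering algorithm to determine enterprise core entity information; and determining at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm. The invention recognizes and then clusters multi-dimensional enterprise information, which effectively removes noise entities and improves the efficiency of processing enterprise information. Finally, it classifies the resulting enterprise core entity information into industry chain nodes using a similarity algorithm, which not only improves classification accuracy but also greatly improves the interpretability of the classification results.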

Description

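Method, Apparatus, Terminal, and Storage Medium for Determining the Industry Chain Nodes of an Enterprise
Technical Field
This application relates to the field of data processing technology, and in particular to a method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise.
Background
With the rapid development of the market economy, numerous industries have emerged quickly, greatly enriching the market. However, because many of these industries have existed for only a short time, they suffer from unclear industry definitions, blurred industry boundaries, and mixed industry participants, which poses new challenges for industrial market analysis and regulation. How to effectively determine the industry chain nodes of an enterprise has therefore become an urgent problem.
At present, the prior art provides web crawler programs that automatically collect enterprise information from free platforms on the Internet. In this way the latest information about an enterprise can be obtained promptly, and the industry chain node to which the enterprise belongs is then determined through classification.
However, the above method determines the industry chain node to which an enterprise belongs from a single dimension of enterprise information, which results in low accuracy when classifying enterprises into industry chain nodes.
Summary of the Invention
The main purpose of this application is to provide a method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise, so as to solve the problem in the related art that the industry chain node to which an enterprise belongs is determined with low accuracy.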
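To achieve the above purpose, in a first aspect, this application provides a method for determining the industry chain nodes of an enterprise, including:
performing recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information;
clustering the enterprise entity information using a clustering algorithm to determine enterprise core entity information;
determining at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm.
In a possible implementation, performing recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information includes:
performing text preprocessing on the enterprise information to obtain preprocessed enterprise information;
selecting training samples from the preprocessed enterprise information, and training an initial deep neural network model with the training samples to obtain a target deep neural network model;
selecting prediction samples from the preprocessed enterprise information, and inputting the prediction samples into the target deep neural network model to output the enterprise entity information.
In a possible implementation, clustering the enterprise entity information using a clustering algorithm to determine enterprise core entity information includes:
vectorizing the enterprise entity information to obtain enterprise entity information vectors;
performing unsupervised clustering on the enterprise entity information vectors using the k-means algorithm to determine a clustering result;
performing entity statistics on the clustering result to determine the enterprise core entity information.
In a possible implementation, performing entity statistics on the clustering result to determine the enterprise core entity information includes:
counting the number of entities in each cluster of the clustering result to obtain multiple entity counts;
sorting the multiple entity counts in descending order to obtain a sorted result;
selecting the clusters corresponding to the first preset number of entity counts in the sorted result as core entity clusters, and taking the entities in the core entity clusters as the enterprise core entity information.
In a possible implementation, determining at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm includes:
vectorizing the enterprise core entity information and the industry chain information respectively to obtain an enterprise core entity information vector and an industry chain information vector;
performing a similarity calculation on the enterprise core entity information vector and the industry chain information vector to determine at least one industry chain node corresponding to the enterprise.
In a possible implementation, vectorizing the enterprise core entity information and the industry chain information respectively to obtain an enterprise core entity information vector and an industry chain information vector includes:
using a word vector database to calculate a first text vector corresponding to the enterprise core entity information and a second text vector corresponding to the industry chain information;
taking the first text vector as the enterprise core entity information vector, and taking the second text vector as the industry chain information vector.
In a possible implementation, performing a similarity calculation on the enterprise core entity information vector and the industry chain information vector to determine at least one industry chain node corresponding to the enterprise includes:
calculating the cosine distance between the enterprise core entity information vector and the industry chain information vector to obtain a cosine distance value;
determining the similarity between the enterprise core entity information vector and the industry chain information vector based on the cosine distance value;
if the similarity is greater than a preset similarity, associating the enterprise core entity information vector with the industry chain information vector to obtain at least one industry chain node corresponding to the enterprise.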
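In a second aspect, an embodiment of the present invention provides an apparatus for determining the industry chain nodes of an enterprise, including:
a recognition module, configured to perform recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information;
a clustering module, configured to cluster the enterprise entity information using a clustering algorithm to determine enterprise core entity information;
a node determination module, configured to determine at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm.
In a third aspect, an embodiment of the present invention provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of any of the above methods for determining the industry chain nodes of an enterprise.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of any of the above methods for determining the industry chain nodes of an enterprise.
Embodiments of the present invention provide a method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise. The method first performs recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information, then clusters the enterprise entity information using a clustering algorithm to determine enterprise core entity information, and then determines at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm. The invention recognizes and then clusters multi-dimensional enterprise information, which effectively removes noise entities and improves the efficiency of processing enterprise information. Finally, it classifies the resulting enterprise core entity information into industry chain nodes using a similarity algorithm, which not only improves classification accuracy but also greatly improves the interpretability of the classification results.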
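Brief Description of the Drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of this application and make its other features, purposes, and advantages more apparent. The drawings of the illustrative embodiments of this application and their descriptions are used to explain this application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for determining the industry chain nodes of an enterprise provided by an embodiment of the present invention;
Fig. 2 is a flowchart of entity recognition provided by an embodiment of the present invention;
Fig. 3 is a flowchart of enterprise entity information clustering provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a clustering result provided by an embodiment of the present invention;
Fig. 5 is a flowchart of enterprise industry chain node classification provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an apparatus for determining the industry chain nodes of an enterprise provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of a terminal provided by an embodiment of the present invention.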
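Detailed Description
To make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The terms "first", "second", "third", "fourth", and the like (if present) in the description, claims, and drawings of the present invention are used to distinguish similar objects and not necessarily to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described here.
It should be understood that, in the various embodiments of the present invention, the magnitude of the sequence number of each process does not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present invention in any way.
It should be understood that, in the present invention, "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
It should be understood that, in the present invention, "multiple" means two or more. "And/or" merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean three cases: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it. "Comprising A, B and C" and "comprising A, B, C" mean that all three of A, B, and C are included; "comprising A, B or C" means that one of A, B, and C is included; "comprising A, B and/or C" means that any one, any two, or all three of A, B, and C are included.
It should be understood that, in the present invention, "B corresponding to A", "B that corresponds to A", "A corresponds to B", or "B corresponds to A" means that B is associated with A and that B can be determined from A. Determining B from A does not mean that B is determined from A alone; B may also be determined from A and/or other information. A matches B when the similarity between A and B is greater than or equal to a preset threshold.
Depending on the context, "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting".
The technical solutions of the present invention are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments.
To make the purposes, technical solutions, and advantages of the present invention clearer, specific embodiments are described below with reference to the accompanying drawings.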
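In one embodiment, as shown in Fig. 1, a method for determining the industry chain nodes of an enterprise is provided, including the following steps:
Step S101: perform recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information;
Step S102: cluster the enterprise entity information using a clustering algorithm to determine enterprise core entity information;
Step S103: determine at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm.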
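Traditional classification algorithms require a large amount of manual labeling of positive and negative classification samples, and when classification boundaries are blurred, manual classification becomes harder and less accurate. The present invention therefore does not use a traditional classification algorithm. Instead, it uses an entity recognition algorithm to recognize enterprise information and determine enterprise entity information, which avoids the heavy workload of manual classification labeling. Because enterprise information is very sparsely distributed across public channels, it must be collected from multiple dimensions so that industry node matching can draw on comprehensive information about the enterprise. The enterprise information involved in this patent covers the following five dimensions: enterprise business registration information, enterprise patent information, enterprise bidding information (tender notices and bid awards), enterprise recruitment information, and enterprise news information.
Enterprise business registration information is public enterprise information that every enterprise has, and it is the main information available when matching small and micro enterprises and start-ups to industry nodes. The business registration information to be collected in this patent includes: enterprise name, business scope, registration date, registered address, and so on.
Enterprise patent information is public enterprise information. It covers the enterprise's core products and technologies and can describe the enterprise's core technical capabilities in detail and accurately. The patent information to be collected in this patent includes: patent title, applicant (enterprise), patent abstract, patent body text, and patent application date.
Enterprise tender information is public enterprise information. It indicates that the enterprise, downstream in the industry, has demand for the product or technology being tendered. The tender information to be collected in this patent includes: tender title, tendering organization, tender body text, tender date, and so on.
Enterprise bid award information is public enterprise information. It indicates that the enterprise, upstream in the industry, has product or technical capability for the product or technology awarded. The bid award information to be collected in this patent includes: award title, tendering organization, winning organization, award body text, award date, award amount, and so on.
Recruitment information published by the enterprise on recruitment platforms is public enterprise information. It contains the enterprise's technical requirements for specific positions and thus reflects that the enterprise's business belongs to the related technology or product field. The recruitment information to be collected in this patent includes: position title, position description, and recruitment date.
News information about the enterprise on public web platforms is public enterprise information and contains industry information related to the enterprise. The news information to be collected in this patent includes: news title, news body text, news date, and news source.
Enterprise entity information refers to entity information that characterizes the enterprise's industry, field, technology, and products.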
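The five information dimensions listed above could be held in a single record before entity recognition. The following is only an illustrative sketch: the class name `EnterpriseInfo`, its field names, and the `documents` helper are hypothetical and do not appear in this patent, and each dimension is reduced to plain text for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class EnterpriseInfo:
    """Hypothetical container for the five public-information dimensions."""
    name: str                       # business registration: enterprise name
    business_scope: str = ""        # business registration: business scope
    patents: list[str] = field(default_factory=list)       # titles/abstracts
    bids: list[str] = field(default_factory=list)          # tenders and awards
    job_postings: list[str] = field(default_factory=list)  # position descriptions
    news: list[str] = field(default_factory=list)          # news titles/bodies

    def documents(self) -> list[str]:
        """Flatten every dimension into one list of texts for recognition."""
        docs = [self.business_scope] if self.business_scope else []
        return docs + self.patents + self.bids + self.job_postings + self.news


info = EnterpriseInfo(
    name="Example Co.",
    business_scope="IoT modules",
    patents=["A narrowband radio module"],
)
print(info.documents())
```

Flattening the dimensions this way is one possible design choice. It lets a single recognition step run over all sources, at the cost of discarding which dimension each text came from.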
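In addition, traditional enterprise classification methods classify on the basis of all of an enterprise's information and have no effective way of removing information noise. The present invention uses a clustering algorithm and, based on the number of entities in each cluster, selects the core entities that effectively describe the enterprise's business while removing non-core entities and noise entities. This yields the enterprise core entities and effectively improves the accuracy of the final enterprise industry classification. Enterprise core entity information refers to the information, within the enterprise entity information, that defines and describes the enterprise's core business.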
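Further, traditional deep learning or machine learning algorithms can often only output a classification result and cannot give an intuitive explanation of why that result was reached. The present invention applies a similarity algorithm to the enterprise core entity information and the industry chain information, which not only classifies the enterprise accurately but also directly outputs the entity information on which the classification is based, greatly improving the interpretability of the classification results. The industry chain information includes industry chain definition information, industry chain node and relationship information, and industry chain node keyword information. The industry chain definition information covers four dimensions: value chain, enterprise chain, supply-demand chain, and spatial chain.
Value chain information describes in detail four aspects of the industry (product, production, sales, and after-sales service): the product definition, which describes the names and descriptions of known products in the industry; the production definition, which describes known production technologies in the industry; the sales definition, which describes known sales models in the industry; and the after-sales service definition, which describes known after-sales service models in the industry.
Enterprise chain information covers in detail two aspects of the industry, leading enterprises and listed companies: leading enterprises, which describes the known leading enterprises in each field of the industry, including enterprise name, main business, main products, and so on; and listed companies, which describes the known listed companies in the industry, including enterprise name, published annual report information, and so on.
Supply-demand chain information covers in detail three aspects of the industry (procurement, sales, and warehousing): the procurement definition, which describes the main procurement methods and channels in the industry; the sales definition, which describes the main sales models and channels in the industry; and the warehousing definition, which describes the main warehousing locations and costs in the industry.
Spatial chain information covers in detail the geographical distribution of production and sales in the industry: the production region definition, which describes the main production regions, production indicators, and so on; and the sales region definition, which describes the main sales regions, sales indicators, and so on.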
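Because an industry chain consists of industry nodes and the relationships between them, the industry chain node and relationship information requires that the core technologies and products of the industry be identified and extracted, and that technology and product names be professionally corrected, to ensure that the industry chain nodes are professional, objective, and scientific. Industry chain node relationships are of three types: superordinate, subordinate, and parallel. The node relationship information must establish one-to-one, one-to-many, and many-to-many relationships among all nodes of the industry chain.
Industry chain node keyword information means that similar product, technology, and field keywords are derived from the industry chain node information, to facilitate later matching of enterprises to industry nodes. For example, the node keywords of "local area communication" include "transmission technology, network topology, basic network, broadband radio, narrowband (or single-frequency) radio", and so on.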
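As a rough illustration of how nodes, the three relationship types, and node keywords might be stored together, the sketch below uses a small in-memory structure. The names `Relation`, `ChainNode`, and `link`, and the parent node "communication", are assumptions made for this example; the patent does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    """The three node relationship types named in the description."""
    SUPERORDINATE = "superordinate"  # the other node sits above this one
    SUBORDINATE = "subordinate"      # the other node sits below this one
    PARALLEL = "parallel"            # the other node is at the same level


@dataclass
class ChainNode:
    name: str
    keywords: list[str] = field(default_factory=list)
    # Maps another node's name to that node's position relative to this node.
    relations: dict[str, Relation] = field(default_factory=dict)


def link(parent: ChainNode, child: ChainNode) -> None:
    """Record a parent/child pair in both directions (hypothetical helper)."""
    parent.relations[child.name] = Relation.SUBORDINATE
    child.relations[parent.name] = Relation.SUPERORDINATE


comm = ChainNode("communication")  # hypothetical parent node
lan = ChainNode(
    "local area communication",
    keywords=["transmission technology", "network topology", "basic network",
              "broadband radio", "narrowband (or single-frequency) radio"],
)
link(comm, lan)
print(lan.relations["communication"])  # Relation.SUPERORDINATE
```

Because each node keeps its own relation map, one node can point to many others, which covers the one-to-many and many-to-many cases mentioned above.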
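The method for determining the industry chain nodes of an enterprise provided by this embodiment of the present invention first performs recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information, then clusters the enterprise entity information using a clustering algorithm to determine enterprise core entity information, and then determines at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm. The invention recognizes and then clusters multi-dimensional enterprise information, which effectively removes noise entities and improves the efficiency of processing enterprise information. Finally, it classifies the resulting enterprise core entity information into industry chain nodes using a similarity algorithm, which not only improves classification accuracy but also greatly improves the interpretability of the classification results.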
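In one embodiment, step S101 includes:
Step S201: perform text preprocessing on the enterprise information to obtain preprocessed enterprise information;
Step S202: select training samples from the preprocessed enterprise information, and train an initial deep neural network model with the training samples to obtain a target deep neural network model;
Step S203: select prediction samples from the preprocessed enterprise information, and input the prediction samples into the target deep neural network model to output the enterprise entity information.
The previous embodiment described the multi-dimensional public information used to build an enterprise profile. This information contains important details about the industry and industry nodes to which the enterprise belongs. However, because this public enterprise information is multi-source heterogeneous data, it also contains a large amount of noise, which greatly reduces the accuracy of matching enterprises to industry chain nodes. It is therefore necessary to identify, from the multi-dimensional public information, high-value entities that characterize the enterprise's industry, technology, products, and field.
Because many entity categories must be extracted from the multi-dimensional enterprise text and the entities depend strongly on context, relying only on traditional template-based or rule-based methods yields low recall. This patent therefore adopts an entity extraction technique that combines deep learning with templates, making full use of the semantic understanding and adaptability of deep learning together with the flexible configuration and high precision of templates, to improve the overall precision and recall of the model. Entity extraction, that is, named entity recognition (NER), means automatically extracting the positions and types of high-value entities from a passage of natural language text. For example, from an enterprise news article, it automatically identifies the enterprise names, product names, technology names, field names, industry node names, and so on, mentioned in the article.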
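The patent does not specify how the template results and the model results are fused. One simple possibility, sketched below purely as an assumption, is to keep every template match (high precision) and add only those model-predicted spans that do not overlap a template match (added recall). The `TEMPLATES` table, the `Span` type, and the sample text are hypothetical, and the model output is supplied as a plain list rather than produced by a real network.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str
    text: str


# Hypothetical templates: one regular expression per entity type.
TEMPLATES = {
    "TECH": re.compile(r"\b(?:5G|NB-IoT|LoRa)\b"),
}


def template_spans(text: str) -> list[Span]:
    """Collect every template match in the text as a labeled span."""
    spans = []
    for label, pattern in TEMPLATES.items():
        for m in pattern.finditer(text):
            spans.append(Span(m.start(), m.end(), label, m.group()))
    return spans


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def merge(template: list[Span], model: list[Span]) -> list[Span]:
    """Keep all template spans; add model spans that overlap none of them."""
    kept = list(template)
    for s in model:
        if not any(overlaps(s, t) for t in template):
            kept.append(s)
    return sorted(kept)  # ordered by start offset


text = "Acme builds NB-IoT modules"
# Stand-in for model output: one good span and one truncated span.
model_out = [Span(0, 4, "ORG", "Acme"), Span(12, 14, "TECH", "NB")]
print(merge(template_spans(text), model_out))
```

In this example the truncated model span "NB" is discarded because the template already covers "NB-IoT", while the model-only span "Acme" is retained.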
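The entity recognition of step S101 is described below with reference to Fig. 2. Entity recognition consists of three main processes: data preprocessing (that is, text preprocessing), model training, and entity prediction, as follows:
(1) Data preprocessing: the acquired enterprise information first undergoes text preprocessing, including splitting the text into paragraphs and sentences. The split sentences are then manually annotated with entities to provide samples for model training. Because many sentences contain few or no entities, negative sampling is applied to the samples without entities, and the sample data are then divided into a training set, a validation set, and a test set.
(2) Model training: the Transformer model from current deep neural networks is used to build the model's encoder and decoder. The text is encoded using BERT word vectors and a pre-trained language model, and is then fed into the constructed neural network model (this application uses a deep neural network model) for training. The model is tuned by minimizing the label training error, yielding the target deep neural network model.
(3) Entity prediction: a CRF or a fully connected layer is used to predict the labels. The entities in each sentence are restored from the predicted labels, and the key entities in the whole text are extracted, that is, the enterprise entity information is extracted.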
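Two of the steps above lend themselves to a short sketch: downsampling entity-free sentences before splitting the data, and restoring entities from predicted per-token labels. The code below is a minimal illustration under stated assumptions: the BIO tagging scheme, the character-level tokens, the 80/10/10 split, and all function names are choices made for this example and are not specified by the patent.

```python
import random


def downsample_negatives(samples, keep_ratio, seed=0):
    """Keep all sentences with entities and a random share of those without.

    Each sample is (tokens, tags); it is negative when every tag is "O".
    """
    rng = random.Random(seed)
    kept = []
    for tokens, tags in samples:
        has_entity = any(t != "O" for t in tags)
        if has_entity or rng.random() < keep_ratio:
            kept.append((tokens, tags))
    return kept


def split(samples, train=0.8, val=0.1, seed=0):
    """Shuffle, then cut into training, validation, and test lists."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def decode_bio(tokens, tags):
    """Rebuild (entity_text, label) pairs from per-token BIO tags."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the entity that was still open
                entities.append(("".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or an inconsistent "I-" tag ends the open entity
            if current:
                entities.append(("".join(current), label))
            current, label = [], None
    if current:
        entities.append(("".join(current), label))
    return entities


tokens = list("局域通信设备")
tags = ["B-TECH", "I-TECH", "I-TECH", "I-TECH", "O", "O"]
print(decode_bio(tokens, tags))
```

Tokens are joined without a separator because the example assumes character-level Chinese input; word-level English tokens would need a space when joining.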
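In one embodiment, step S102 includes:
Step S301: vectorize the enterprise entity information to obtain enterprise entity information vectors;
Step S302: perform unsupervised clustering on the enterprise entity information vectors using the k-means algorithm to determine a clustering result;
Step S303: perform entity statistics on the clustering result to determine the enterprise core entity information.
The multi-dimensional enterprise information entity recognition algorithm yields a large amount of entity information characterizing the enterprise's industry, field, technology, and products. However, because of the limited accuracy of the entity recognition algorithm itself, the resulting enterprise entity information still has two problems. First, the entity information contains noise. The enterprise information texts used in this patent are highly varied and complex in type and format, so the final output of the entity recognition algorithm contains considerable noise, which would affect the final enterprise industry classification. Second, the entity vectors are scattered. An enterprise's business often spans multiple industries, fields, technologies, and products, so the recognized entities often differ considerably from one another, and the enterprise's main business field cannot be judged from the entity information alone. To improve the final enterprise industry classification, this patent therefore performs a clustering operation on the recognized enterprise entity information. The basic idea of entity clustering is to vectorize the entities and then group them into clusters according to the distance or similarity between the vectors.
To determine the enterprise core entity information, the number of entities in each cluster of the clustering result is first counted to obtain multiple entity counts. The entity counts are then sorted in descending order to obtain a sorted result. Finally, the clusters corresponding to the first preset number of entity counts in the sorted result are selected as core entity clusters, and the entities in the core entity clusters are taken as the enterprise core entity information.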
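The enterprise entity information clustering of step S102 is described below with reference to Fig. 3 and Fig. 4, as follows:
The technical flow of enterprise entity information clustering in this patent is shown in Fig. 3. It consists of three main processes: enterprise entity information vectorization, enterprise entity vector clustering, and core entity determination (that is, removing non-core entities and noise entities).
(1) Enterprise entity information vectorization. An open-source, large-scale, high-quality Chinese word vector database is obtained, and the word2vec vector representations of the characters and words in the enterprise entities are looked up.
(2) Enterprise entity vector clustering. The K-Means algorithm is used to perform unsupervised clustering of the enterprise entity information, and the number of entities in each cluster of the clustering result is counted.
(3) Core entity determination. Based on the clustering result shown in Fig. 4, the number of entities in each cluster is calculated, and the three clusters with the most entities are defined as core entity clusters, whose industry, field, technology, and product entities are regarded as the information describing the enterprise's core business. The clusters marked with circles are defined as non-core entity clusters, meaning that the entity information in them is not the enterprise's most important business information. The clusters marked with squares, that is, isolated entities, are defined as noise, meaning that this entity information does not describe the enterprise's real business.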
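The clustering and core-entity selection above can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation: the k-means routine is a minimal textbook version with random initial centroids, the 2-D vectors stand in for real word2vec vectors, and the names `kmeans` and `core_entities` are hypothetical. The default `top_n=3` mirrors the "top 3 clusters" rule stated in the text.

```python
import random
from collections import Counter


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means on lists of floats; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def core_entities(entities, labels, top_n=3):
    """Keep the entities that fall in the top_n clusters by entity count."""
    sizes = Counter(labels)
    core = {cluster for cluster, _ in sizes.most_common(top_n)}
    return [e for e, lab in zip(entities, labels) if lab in core]


# Toy 2-D vectors standing in for word2vec vectors of recognized entities.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = kmeans(vectors, k=2)
print(labels)

# Selecting core entities from an already-computed labeling.
names = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(core_entities(names, [0, 0, 0, 1, 1, 2, 2, 3]))  # drops "h"
```

In the second call, cluster 3 holds a single isolated entity and falls outside the three largest clusters, so "h" is treated as non-core. This sketch does not separately label non-core clusters and noise as Fig. 4 does; it only returns the core set.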
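In one embodiment, step S103 includes:
Step S401: vectorize the enterprise core entity information and the industry chain information respectively to obtain an enterprise core entity information vector and an industry chain information vector;
Specifically, a word vector database is first used to calculate a first text vector corresponding to the enterprise core entity information and a second text vector corresponding to the industry chain information. The first text vector is then taken as the enterprise core entity information vector, and the second text vector as the industry chain information vector.
Step S402: perform a similarity calculation on the enterprise core entity information vector and the industry chain information vector to determine at least one industry chain node corresponding to the enterprise.
Specifically, the cosine distance between the enterprise core entity information vector and the industry chain information vector is first calculated to obtain a cosine distance value. The similarity between the two vectors is then determined based on the cosine distance value. If the similarity is greater than a preset similarity, the enterprise core entity information vector is associated with the industry chain information vector to obtain at least one industry chain node corresponding to the enterprise.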
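The patent does not write out the formulas. Under the common convention (an assumption here), with $\mathbf{e}$ denoting an enterprise core entity information vector, $\mathbf{n}$ an industry chain information vector, and $\tau$ the preset similarity, the quantities in step S402 can be written as:

```latex
\operatorname{sim}(\mathbf{e}, \mathbf{n})
  = \frac{\mathbf{e} \cdot \mathbf{n}}{\lVert \mathbf{e} \rVert \, \lVert \mathbf{n} \rVert},
\qquad
d_{\cos}(\mathbf{e}, \mathbf{n}) = 1 - \operatorname{sim}(\mathbf{e}, \mathbf{n}),
\qquad
\text{associate } \mathbf{e} \text{ with } \mathbf{n}
  \iff \operatorname{sim}(\mathbf{e}, \mathbf{n}) > \tau .
```

With this convention, a smaller cosine distance means a higher similarity, so the condition $\operatorname{sim} > \tau$ is equivalent to $d_{\cos} < 1 - \tau$. The symbols $\mathbf{e}$, $\mathbf{n}$, and $\tau$ are introduced here for illustration only.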
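The enterprise industry chain node classification of step S103 is described below with reference to Fig. 5. It consists of three main processes: industry chain information vectorization, vector similarity calculation, and output of the enterprise industry chain node classification result, as follows:
(1) Industry chain information vectorization. The word vector database is used to calculate the text vectors of the industry chain definition information, the industry chain node and relationship information, and the industry chain node keyword information, forming text vector representations of the industry chain and of the industry chain nodes.
(2) Vector similarity calculation. The cosine distance between the enterprise core entity vectors and the industry chain node vectors is calculated to judge whether an enterprise core entity and an industry chain node are similar.
(3) Output of the enterprise industry chain node classification result. Based on the magnitude of the cosine distance, the enterprise is associated with the industry chain nodes that are close to its core entities, industry chain node labels are added to the enterprise, and the enterprise industry chain node classification is complete.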
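The three processes above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: text vectors are formed by averaging word vectors (the patent only says the word vector database is used to calculate them), the threshold of 0.8 is arbitrary, and the toy 2-D vectors and all function names are hypothetical.

```python
import math


def text_vector(words, word_vectors):
    """Average the vectors of the words that exist in the lookup table."""
    found = [word_vectors[w] for w in words if w in word_vectors]
    if not found:
        return None  # no known word, so no vector can be formed
    return [sum(col) / len(found) for col in zip(*found)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_nodes(core_vectors, node_vectors, threshold=0.8):
    """Return {node: [(entity, similarity), ...]} for pairs above the threshold."""
    matches = {}
    for node, nv in node_vectors.items():
        hits = [(e, cosine_similarity(ev, nv)) for e, ev in core_vectors.items()]
        hits = [(e, s) for e, s in hits if s > threshold]
        if hits:
            # Most similar entity first, so the explanation reads top-down.
            matches[node] = sorted(hits, key=lambda h: -h[1])
    return matches


# Toy vectors standing in for word2vec-based text vectors.
core_vectors = {"router": [1.0, 0.0], "catering": [0.0, 1.0]}
node_vectors = {"local area communication": [0.9, 0.1]}
result = match_nodes(core_vectors, node_vectors)
print(result)
```

Returning the matched entities together with their similarity scores, rather than only the node label, is what makes the result explainable in the sense described earlier: each assigned node comes with the core entities that triggered it.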
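It should be noted that the second outer structure model is determined in a manner similar to the first outer structure model, which is not repeated here.
It should be understood that the magnitude of the sequence number of each step in the above embodiments does not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present invention in any way.
The following are apparatus embodiments of the present invention. For details not described exhaustively here, refer to the corresponding method embodiments above.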
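Fig. 6 shows a schematic structural diagram of an apparatus for determining the industry chain nodes of an enterprise provided by an embodiment of the present invention. For ease of description, only the parts related to this embodiment are shown. The apparatus includes a recognition module 61, a clustering module 62, and a node determination module 63, as follows:
the recognition module 61 is configured to perform recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information;
the clustering module 62 is configured to cluster the enterprise entity information using a clustering algorithm to determine enterprise core entity information;
the node determination module 63 is configured to determine at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm.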
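In a possible implementation, the recognition module 61 includes:
a preprocessing sub-module, configured to perform text preprocessing on the enterprise information to obtain preprocessed enterprise information;
a model training sub-module, configured to select training samples from the preprocessed enterprise information and train an initial deep neural network model with the training samples to obtain a target deep neural network model;
an entity information determination sub-module, configured to select prediction samples from the preprocessed enterprise information and input the prediction samples into the target deep neural network model to output the enterprise entity information.
In a possible implementation, the clustering module 62 includes:
a first vectorization sub-module, configured to vectorize the enterprise entity information to obtain enterprise entity information vectors;
a clustering sub-module, configured to perform unsupervised clustering on the enterprise entity information vectors using the k-means algorithm to determine a clustering result;
an entity statistics sub-module, configured to perform entity statistics on the clustering result to determine the enterprise core entity information.
In a possible implementation, the entity statistics sub-module includes:
an entity counting unit, configured to count the number of entities in each cluster of the clustering result to obtain multiple entity counts;
a sorting unit, configured to sort the multiple entity counts in descending order to obtain a sorted result;
a core entity information determination unit, configured to select the clusters corresponding to the first preset number of entity counts in the sorted result as core entity clusters, and take the entities in the core entity clusters as the enterprise core entity information.
In a possible implementation, the node determination module 63 includes:
a second vectorization sub-module, configured to vectorize the enterprise core entity information and the industry chain information respectively to obtain an enterprise core entity information vector and an industry chain information vector;
a similarity calculation sub-module, configured to perform a similarity calculation on the enterprise core entity information vector and the industry chain information vector to determine at least one industry chain node corresponding to the enterprise.
In a possible implementation, the second vectorization sub-module includes:
a text determination unit, configured to use a word vector database to calculate a first text vector corresponding to the enterprise core entity information and a second text vector corresponding to the industry chain information;
a vector determination unit, configured to take the first text vector as the enterprise core entity information vector and the second text vector as the industry chain information vector.
In a possible implementation, the similarity calculation sub-module includes:
a distance calculation unit, configured to calculate the cosine distance between the enterprise core entity information vector and the industry chain information vector to obtain a cosine distance value;
a similarity calculation unit, configured to determine the similarity between the enterprise core entity information vector and the industry chain information vector based on the cosine distance value;
an enterprise classification result determination unit, configured to, if the similarity is greater than a preset similarity, associate the enterprise core entity information vector with the industry chain information vector to obtain at least one industry chain node corresponding to the enterprise.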
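Fig. 7 is a schematic diagram of a terminal provided by an embodiment of the present invention. As shown in Fig. 7, the terminal 7 of this embodiment includes: a processor 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processor 70. When the processor 70 executes the computer program 72, it implements the steps in each of the above embodiments of the method for determining the industry chain nodes of an enterprise, for example steps 101 to 103 shown in Fig. 1. Alternatively, when the processor 70 executes the computer program 72, it implements the functions of the modules/units in the above apparatus embodiments, for example the functions of modules/units 61 to 63 shown in Fig. 6.
The present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the methods provided by the various implementations described above.
The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates transfer of a computer program from one place to another. A computer storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, a readable storage medium is coupled to the processor so that the processor can read information from, and write information to, the readable storage medium. The readable storage medium may also be a component of the processor. The processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC). In addition, the ASIC may be located in user equipment. The processor and the readable storage medium may also exist as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and so on.
The present invention also provides a program product that includes execution instructions stored in a readable storage medium. At least one processor of a device can read the execution instructions from the readable storage medium, and execution of the instructions by the at least one processor causes the device to implement the methods provided by the various implementations described above.
In the above device embodiments, it should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), and so on. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the present invention may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and all of them shall fall within the scope of protection of the present invention.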

Claims (10)

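  1. A method for determining the industry chain nodes of an enterprise, characterized by comprising:
    performing recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information;
    clustering the enterprise entity information using a clustering algorithm to determine enterprise core entity information;
    determining at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm.
  2. The method for determining the industry chain nodes of an enterprise according to claim 1, characterized in that performing recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information comprises:
    performing text preprocessing on the enterprise information to obtain preprocessed enterprise information;
    selecting training samples from the preprocessed enterprise information, and training an initial deep neural network model with the training samples to obtain a target deep neural network model;
    selecting prediction samples from the preprocessed enterprise information, and inputting the prediction samples into the target deep neural network model to output the enterprise entity information.
  3. The method for determining the industry chain nodes of an enterprise according to claim 2, characterized in that clustering the enterprise entity information using a clustering algorithm to determine enterprise core entity information comprises:
    vectorizing the enterprise entity information to obtain enterprise entity information vectors;
    performing unsupervised clustering on the enterprise entity information vectors using the k-means algorithm to determine a clustering result;
    performing entity statistics on the clustering result to determine the enterprise core entity information.
  4. The method for determining the industry chain nodes of an enterprise according to claim 3, characterized in that performing entity statistics on the clustering result to determine the enterprise core entity information comprises:
    counting the number of entities in each cluster of the clustering result to obtain multiple entity counts;
    sorting the multiple entity counts in descending order to obtain a sorted result;
    selecting the clusters corresponding to the first preset number of entity counts in the sorted result as core entity clusters, and taking the entities in the core entity clusters as the enterprise core entity information.
  5. The method for determining the industry chain nodes of an enterprise according to claim 4, characterized in that determining at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm comprises:
    vectorizing the enterprise core entity information and the industry chain information respectively to obtain an enterprise core entity information vector and an industry chain information vector;
    performing a similarity calculation on the enterprise core entity information vector and the industry chain information vector to determine at least one industry chain node corresponding to the enterprise.
  6. The method for determining the industry chain nodes of an enterprise according to claim 5, characterized in that vectorizing the enterprise core entity information and the industry chain information respectively to obtain an enterprise core entity information vector and an industry chain information vector comprises:
    using a word vector database to calculate a first text vector corresponding to the enterprise core entity information and a second text vector corresponding to the industry chain information;
    taking the first text vector as the enterprise core entity information vector, and taking the second text vector as the industry chain information vector.
  7. The method for determining the industry chain nodes of an enterprise according to claim 6, characterized in that performing a similarity calculation on the enterprise core entity information vector and the industry chain information vector to determine at least one industry chain node corresponding to the enterprise comprises:
    calculating the cosine distance between the enterprise core entity information vector and the industry chain information vector to obtain a cosine distance value;
    determining the similarity between the enterprise core entity information vector and the industry chain information vector based on the cosine distance value;
    if the similarity is greater than a preset similarity, associating the enterprise core entity information vector with the industry chain information vector to obtain the at least one industry chain node corresponding to the enterprise.
  8. An apparatus for determining the industry chain nodes of an enterprise, characterized by comprising:
    a recognition module, configured to perform recognition on enterprise information using an entity recognition algorithm to determine enterprise entity information;
    a clustering module, configured to cluster the enterprise entity information using a clustering algorithm to determine enterprise core entity information;
    a node determination module, configured to determine at least one industry chain node corresponding to the enterprise based on the enterprise core entity information, industry chain information, and a similarity algorithm.
  9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method for determining the industry chain nodes of an enterprise according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method for determining the industry chain nodes of an enterprise according to any one of claims 1 to 7.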
PCT/CN2022/109615 2021-11-25 2022-08-02 Method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise WO2023093116A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111418591.0 2021-11-25
CN202111418591.0A CN114154829A (zh) 2021-11-25 2021-11-25 Method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise

Publications (1)

Publication Number Publication Date
WO2023093116A1 true WO2023093116A1 (zh) 2023-06-01

Family

ID=80457994

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/109615 WO2023093116A1 (zh) 2021-11-25 2022-08-02 Method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise

Country Status (2)

Country Link
CN (1) CN114154829A (zh)
WO (1) WO2023093116A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154829A (zh) * 2021-11-25 2022-03-08 上海帜讯信息技术股份有限公司 Method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180082183A1 (en) * 2011-02-22 2018-03-22 Thomson Reuters Global Resources Machine learning-based relationship association and related discovery and search engines
US20190005078A1 (en) * 2017-07-03 2019-01-03 Leadcrunch, Inc. Method and system for creating and updating entity vectors
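CN111445903A (zh) * 2020-03-27 2020-07-24 中国工商银行股份有限公司 Enterprise name recognition method and device
CN112395501A (zh) * 2020-11-17 2021-02-23 航天信息股份有限公司 Enterprise recommendation method, device, storage medium, and electronic equipment
CN113505242A (zh) * 2021-07-16 2021-10-15 珍岛信息技术(上海)股份有限公司 Method and system for automatic knowledge graph embedding
CN113553400A (zh) * 2021-07-26 2021-10-26 杭州叙简科技股份有限公司 Method and device for constructing an enterprise knowledge graph entity linking model
CN114154829A (zh) * 2021-11-25 2022-03-08 上海帜讯信息技术股份有限公司 Method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise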

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7840407B2 (en) * 2006-10-13 2010-11-23 Google Inc. Business listing search
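CN107342976B (zh) * 2017-05-18 2018-12-21 南京樯图数据科技有限公司 Mobile application platform and method for enterprise industry chain analysis
CN109255034A (zh) * 2018-08-08 2019-01-22 数据地平线(广州)科技有限公司 Industry knowledge graph construction method based on the industry chain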

Also Published As

Publication number Publication date
CN114154829A (zh) 2022-03-08

Similar Documents

Publication Publication Date Title
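CN110516067B (zh) Public opinion monitoring method, system, and storage medium based on topic detection
CN109165294B (zh) Short text classification method based on Bayesian classification
CN112184525B (zh) System and method for intelligent matching and recommendation through natural semantic analysis
Snyder et al. Interactive learning for identifying relevant tweets to support real-time situational awareness
CN107273295B (zh) Software problem report classification method based on text chaos degree
CN102123172B (zh) Method for implementing Web service discovery based on neural network clustering optimization
CN110619051B (zh) Question sentence classification method, device, electronic equipment, and storage medium
CN110347840B (zh) Method, system, device, and storage medium for predicting complaint text categories
CN112163424A (zh) Data labeling method, device, equipment, and medium
WO2017091985A1 (zh) Stop word identification method and device
CN111563071A (zh) Data cleaning method, device, terminal equipment, and computer-readable storage medium
TWI828928B (zh) Highly scalable, multi-label text classification method and device
WO2023065642A1 (zh) Corpus screening method, intent recognition model optimization method, device, and storage medium
WO2023040493A1 (zh) Event detection
CN113641833B (zh) Service demand matching method and device
CN113360582B (zh) Relation classification method and system based on a BERT model fusing multiple entity information
WO2023093116A1 (zh) Method, apparatus, terminal, and storage medium for determining the industry chain nodes of an enterprise
CN116451114A (zh) Internet of Things enterprise classification system and method based on enterprise multi-source entity feature information
CN115146062A (zh) Intelligent event analysis method and system integrating expert recommendation and text clustering
CN115269870A (zh) Method for classification and early warning of data link faults in a data middle platform based on a knowledge graph
WO2021128721A1 (zh) Text classification processing method and device
CN114186022A (zh) Dispatching instruction quality inspection method and system based on speech transcription and a knowledge graph
CN110674288A (zh) User profiling method applied to the field of network security
CN111339258B (zh) Knowledge-graph-based method for recommending university computer fundamentals exercises
WO2023207566A1 (zh) Voice room quality evaluation method and its device, equipment, medium, and product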

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897211

Country of ref document: EP

Kind code of ref document: A1