CN111708774B

CN111708774B - An industrial analysis system based on big data

Info

Publication number: CN111708774B
Application number: CN202010298191.XA
Authority: CN
Inventors: 崔晓君; 陈俊琰; 王怡宁
Original assignee: EAST CHINA INSTITUTE OF TELECOMMUNICATIONS
Current assignee: EAST CHINA INSTITUTE OF TELECOMMUNICATIONS
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2023-03-10
Anticipated expiration: 2040-04-16
Also published as: CN111708774A

Abstract

The invention relates to an industrial analysis system based on big data, comprising an industrial development-related database module, a data analysis model module, a data foundation platform module, and a user interface module. The industrial development-related database module stores data resources related to the development of a given industry. The data analysis model module stores data analysis models. The data foundation platform module is connected to both the industrial development-related database module and the data analysis model module; according to the analysis target, it retrieves the relevant data from the database module, retrieves the relevant model from the data analysis model module, and inputs the data into the model. The data analysis model module also outputs the analysis result once the relevant data has been input into the relevant model. The user interface module is connected to the data analysis model module and displays the analysis result. The system provides data support for industrial development so that decisions can be made on the basis of data analysis.

Description

An industrial analysis system based on big data

Technical field

The invention belongs to the technical field of industrial analysis systems and relates to an industrial analysis system based on big data.

Background

As an important factor of production, intangible asset, and form of social wealth in the information society, data resources have become a basic strategic resource of the country. In-depth analysis of data makes it possible to summarize experience, discover patterns, predict trends, and support decision-making. As economic development enters a new normal, many new situations and problems will arise that call for sound analysis and timely response, and the policy toolbox must be well and fully stocked. This places higher demands on traditional methods of industrial monitoring and coordination.

Government decision-making urgently needs information on industrial development at every level from macro to micro, in order to provide a quantitative basis for formulating and adjusting policies on industrial transformation and upgrading, investment promotion, and enterprise development support, so that supporting work can be carried out more precisely and effectively, guiding and promoting the rapid, healthy, and orderly development of Shanghai's big data industry. Enterprises, for their part, urgently need to keep track of industry and market developments.

At present, research on industrial development is carried out in a rather fragmented way: it mainly studies individual factors separately or relies on expert experience, which makes it hard to meet the needs of an industry that is developing rapidly. Establishing an industrial big data analysis platform can solve this problem, effectively helping government departments to grasp the current state of industrial development and make decisions, and thereby promoting industrial development.

Summary of the invention

To overcome the deficiencies of the above prior art, the object of the present invention is to provide an industrial analysis system based on big data that can analyze the factors affecting the development of an industry, summarize experience, discover patterns, predict trends, and support decision-making. It provides data support for industrial development, so that decisions are based on data analysis rather than on expert experience alone.

To achieve the above object, the present invention adopts the following technical solution:

An industrial analysis system based on big data, comprising an industrial development-related database module, a data analysis model module, a data foundation platform module, and a user interface module;

The industrial development-related database module is used to store data resources related to the development of a given industry; through data governance it forms highly available data assets, supports the data foundation platform module and the data analysis model module, and serves the needs of data query and business analysis;

The technical implementation of the industrial development-related database module relies mainly on the following technologies:

(1) A new domestically developed database that supports real-time big data processing, with the following functions:

(1.1) Massively parallel processing: storage is built on HDFS, and intermediate computation results are kept in memory;

(1.2) Columnar storage 2.0: enhanced metadata, dictionary-encoded data, and automatic data sorting;

(1.3) Dynamic data distribution: broadcasting of small tables, dynamic hash redistribution of large tables, localized joins, and pipelining;

(1.4) In-memory computing: vectorized processing and dynamic compilation based on LLVM (Low Level Virtual Machine);

(1.5) OLTP/OLAP dual analysis engine: supports both OLTP and OLAP data analysis and processing workloads;

(2) A data interface system and a human-machine interface system, comprising:

(2.1) A machine data interface and collection and processing system, comprising:

(2.1.1) An infrastructure operating-environment data engine processing subsystem, which can adapt to and be configured with MIB information and the like;

(2.1.2) A data engine processing subsystem for SYSLOG-type log data;

(2.1.3) An information collection and processing subsystem for the big data industry and related enterprises;

(2.2) A business data interface system. The integrated business data interface should support multiple interface formats and a mixed mode with human-computer interaction, so as to handle the ingestion of cold and hot data from multiple sources. It should support extension and update management of business data interface formats; for big data business data from different sources it supports interfaces in formats such as REST API, SNMP, SYSLOG, and files; and it supports configurable update modes such as real-time, scheduled, and condition-triggered. The relevant interface protocol specifications and customized standards are listed below:

(2.2.1) SYSLOG interface, the standard industrial interface specification for Internet of Things and machine equipment data. The system logs of server hosts, general network devices, and dedicated security devices should be sent to the security management and control system via the SYSLOG protocol; the security management and control system parses the log information according to the agreed SYSLOG message format and then normalizes, analyzes, and correlates it;

(2.2.2) SNMP interface. IoT devices, server hosts, general network devices, and dedicated security devices should implement the basic SNMP MIB library; the security management and control system uses SNMP to collect basic device information and device operating performance data, and to receive SNMP Trap event notifications from devices;

(2.2.3) REST interface. The REST interface protocol is one way of implementing a Web service and is mainly used for interfaces between systems. In the mobile police platform, apart from the general SNMP and SYSLOG protocols used at the device management level, scenarios such as communication between software systems and delivery of policy configuration should be implemented with the REST interface protocol. In general, a request-response mode should be used for reporting monitoring information and delivering policies and instructions; the communication protocol should be HTTP/1.1 over SSL/TLS, with HTTP as the interface implementation protocol, JSON objects for interface parameters and return results, and a data packet size of no more than 5 MB. Where the network is unreachable in the reverse direction, a message push mode may be used to deliver policy instructions; in that case the communication protocol should be HTTP 2.0 WebSocket over SSL/TLS, again with HTTP as the interface implementation protocol, JSON objects for interface parameters and return results, and a data packet size of no more than 5 MB;

(2.2.4) Application system log interface. The application system log interface uses a standard SDK interface to send terminal application and business-domain log data. To ensure collection performance, the application interface supports cluster deployment, and the number of cluster nodes can be chosen according to the volume of logs to be collected and the level of concurrency;

(2.3) Human-machine interface system services, comprising:

(2.3.1) A front-end human-machine interface system, which provides a proactive, interactive data-collection human-machine interface for the tens of thousands of enterprises in the industry;

(2.3.2) A front-end service platform for secure data exchange, which establishes a secure data communication platform for the industrial analysis platform;

(2.3.3) Enterprise data inventory and sorting services, and interface negotiation and customization services;
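As an illustration of the REST rules in item (2.2.3), the following minimal Python sketch serializes interface parameters as a JSON object and rejects any packet above the 5 MB limit. The function names, the sample field names, and the reading of 5 MB as 5 × 1024 × 1024 bytes are assumptions made for this example; the transport itself (HTTP over SSL/TLS) is left out.

```python
import json

# Assumption: "5 MB" is read as 5 * 1024 * 1024 bytes.
MAX_PACKET_BYTES = 5 * 1024 * 1024


def build_report_body(payload):
    """Serialize interface parameters as a JSON object and enforce the size cap."""
    if not isinstance(payload, dict):
        raise TypeError("interface parameters must form a JSON object")
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    if len(body) > MAX_PACKET_BYTES:
        raise ValueError(f"packet is {len(body)} bytes, above the 5 MB limit")
    return body


def parse_response_body(body):
    """Decode a return result and check that it is a JSON object."""
    result = json.loads(body.decode("utf-8"))
    if not isinstance(result, dict):
        raise ValueError("return result must be a JSON object")
    return result


# Hypothetical monitoring report; the field names are illustrative only.
report = build_report_body({"device_id": "host-01", "cpu_load": 0.42})
```

The same size check would apply in the message push mode, since the text sets the 5 MB limit for both modes.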

The data analysis model module is used to store data analysis models;

The data foundation platform module is connected to both the industrial development-related database module and the data analysis model module. According to the analysis target, it retrieves the relevant data from the industrial development-related database module (matching the analysis target to the relevant data so that retrieval succeeds is a known technique: for example, if the analysis target is enterprises established in the past five years, the data of enterprises whose year of establishment is later than 2015 is retrieved from the database; if the analysis target is "core enterprises", the data of enterprises labeled "core enterprise" is retrieved), retrieves the relevant model from the data analysis model module (matching the analysis target to the relevant model is likewise a known technique: for example, if the analysis target is "classification by industrial chain", the "industrial chain model" is selected manually by model name in the analysis module), and inputs the relevant data into the relevant model. The data foundation platform module provides solid platform support for system construction, shortening the development cycle, reducing construction risk, and improving performance and stability. It performs data maintenance, including data completion and updating. It performs data query, offering wizard-style search and custom reports; query results support conditional filtering, sorting, roll-up, drill-down, simple calculations, and conditional formatting, and can be exported and printed. It performs visual presentation, using a variety of built-in intelligent visualization algorithms to present massive data in multiple dimensions and on multiple terminals;
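The two retrieval examples given above (enterprises established after 2015, and enterprises labeled "core enterprise") could be sketched as simple filters. The `Enterprise` record, its field names, and the sample rows are hypothetical; a real deployment would presumably issue equivalent queries against the database module.

```python
from dataclasses import dataclass, field


@dataclass
class Enterprise:
    # Hypothetical record layout, used only for this illustration.
    name: str
    founded_year: int
    labels: set = field(default_factory=set)


def founded_after(enterprises, year):
    """Enterprises whose year of establishment is later than `year`."""
    return [e for e in enterprises if e.founded_year > year]


def with_label(enterprises, label):
    """Enterprises that carry the given label, e.g. "core enterprise"."""
    return [e for e in enterprises if label in e.labels]


# Illustrative sample data, not taken from the source.
sample = [
    Enterprise("A", 2012, {"core enterprise"}),
    Enterprise("B", 2017),
    Enterprise("C", 2019, {"core enterprise"}),
]
recent_names = [e.name for e in founded_after(sample, 2015)]
core_names = [e.name for e in with_label(sample, "core enterprise")]
```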

The technical implementation of the data foundation platform module relies on a visual platform for big data management, analysis, and presentation, with the following functions:

(a) Metadata-based data management: builds a standardized, unified, and general-purpose set of big data resources for users;

(b) Conventional multidimensional analysis plus open modeling: provides users with multidimensional analysis and modeling analysis based on the R language;

(c) Compatibility with traditional and big data storage: compatible with storage types such as relational databases, Hadoop, and NoSQL;

(d) Visual dynamic chart analysis: provides users with drag-and-drop chart presentation suited to the operating habits of Chinese users;

The data analysis model module is also used to output the analysis result after the relevant data has been input into the relevant model;

The user interface module is connected to the data analysis model module and is used to display the analysis result. The user interface module covers key applications such as industrial economic operation, industrial management, and productivity layout, and each business application consists of three parts: status monitoring, thematic analysis, and trend assessment;

The technical implementation of the user interface module relies mainly on the following technologies:

1) HTML

HTML is the core technology needed to build a Web interface; it is a tag-based language for describing the structure of documents displayed by a browser;

2) CSS (Cascading Style Sheets)

In a Web application, CSS specifies how HTML content is rendered on screen;

3) JavaScript

JavaScript supports object-oriented, imperative, and declarative programming styles;

4) Hyperlinks

Hyperlinks in a Web application usually contain preset request parameters. These data items are not entered by the user; instead the server inserts them into the target URL of the hyperlink that the user clicks. In the big data industry analysis system, the project reserve library, the partner links, and similar features are developed in this way;
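A server-side helper that inserts preset request parameters into a hyperlink's target URL might look like the sketch below. The base URL and the parameter names are placeholders chosen for illustration; only the standard-library `urllib.parse` functions are used.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_link(base_url, **params):
    """Append preset request parameters to a hyperlink target as a query string."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical link into a project reserve library page.
link = build_link("https://example.invalid/projects", category="reserve", page=1)

# The receiving handler can recover the preset values from the query string.
received = parse_qs(urlsplit(link).query)
```

`urlencode` percent-encodes the values, so parameters containing spaces or non-ASCII text remain valid in the URL.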

5) Forms

Forms receive user input;

6) Document Object Model (DOM)

7) Ajax (Asynchronous JavaScript and XML)

Ajax is the technique of exchanging data with a server and updating parts of a Web page without reloading the whole page;

The big data industry system uses Ajax: some user actions are handled by client-side script code without reloading the whole page. Instead, the script performs the request "in the background" and usually receives a smaller response, which is used to dynamically update part of the user interface;

The core technology behind Ajax is XMLHttpRequest. After a degree of standardization, it is now a native JavaScript object through which client-side scripts can make "background" requests without a window-level navigation event;

8) JSON (JavaScript Object Notation)

The big data industry system uses JSON, a lightweight text data interchange format, to store and exchange text information;

Ajax applications often use JSON in place of the XML format originally used for data transfer; JSON is smaller, faster, and easier to parse than XML;

Typically, when the user performs an action, client-side JavaScript uses XMLHttpRequest to send the action to the server, the server returns a lightweight response containing JSON-formatted data, and the client-side script then processes the data and updates the user interface accordingly;

9) Same-origin policy

This is central to the big data industry map system: it prevents content from different origins from interfering with each other and allows only content from the same origin to interact;

Content received from one website may read and modify other content received from that site, but may not access content received from other sites. Without the same-origin policy, when an unsuspecting user browsed to a malicious website, script code running on that site would be able to access the data and functionality of any other website the user was visiting at the same time;
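Browsers decide whether two pieces of content share an origin by comparing scheme, host, and port. The following sketch reproduces that comparison in Python for illustration only; the function names and sample URLs are assumptions, and the browser's real enforcement covers many more cases than this check.

```python
from urllib.parse import urlsplit

# Default ports used when a URL does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def same_origin(url_a, url_b):
    """True when both URLs have the same scheme, host, and port."""
    return origin(url_a) == origin(url_b)
```

For example, `same_origin("https://a.example/x", "https://a.example:443/y")` holds because the explicit port matches the HTTPS default, while a change of scheme or host yields a different origin.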

The user interface module lets users view results: the status monitoring, thematic analysis, and trend assessment results for industrial economic operation, industrial management, productivity layout, and so on.

As preferred solutions:

In the industrial analysis system based on big data described above, the data resources related to the development of a given industry include industry data, enterprise data, regional data, and technical data related to the development of that industry.

In the industrial analysis system based on big data described above, the data analysis models include an industry classification model, an industrial chain model, and an enterprise performance evaluation model. The industry classification model determines an enterprise's industry classification from the business scope of the enterprise to be classified. The industrial chain model uses a weighted scoring method: from the data of the enterprise to be classified, it scores the enterprise in the four categories "resources", "technology", "application", and "industry support" and thereby determines the enterprise's position, that is, its category, in the industrial chain. The enterprise performance evaluation model scores the enterprise's comprehensive capability, R&D capability, industry influence, sustained operation capability, and data application capability from the related data (such as registered capital, listing status, business revenue, and R&D investment), and then computes a weighted total score to obtain the enterprise's performance.

In the industrial analysis system based on big data described above, the industry classification model has a relatively simple structure and uses a decision tree algorithm. Decision tree classification closely matches the way people classify: it asks a series of questions, each testing one feature of the sample, and then combines all the answers to give the sample's category. The industry classification model is built as follows:

(1) Start;

(2) Create a tree with the enterprise data set (the full sample set) as the root node;

(3) Create a node;

(4) Determine whether the enterprise data set (the full sample set) is empty; if so, return to the previous node and end; otherwise, go to the next step;

(5) Determine whether the samples in the current node's data set (sample subset) all belong to the same class; if so, mark the node as a leaf node labeled with class C and end; otherwise, go to the next step;

(6) Determine whether the candidate attribute set is empty; if so, label the node with the class C that has the most samples in S and end; otherwise, go to the next step;

(7) Compute the information gain ratio of each enterprise condition attribute in the set;

(8) Select the attribute with the largest information gain in the candidate set as the split attribute of the current node;

(9) Determine the enterprise data sets (sample subsets) according to the values of the split attribute and create the corresponding branches;

(10) Run the function recursively on each enterprise data set (sample subset), returning to step (2);
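The construction steps above can be sketched as a small recursive function. This is a minimal illustration that assumes categorical attributes and uses the gain ratio from step (7) as the selection criterion in step (8); the attribute names, class names, and sample rows are invented for the example and do not come from the source.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, attr):
    """Information gain of splitting on `attr`, divided by the split information."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attrs):
    if not rows:                                   # step (4): empty data set
        return None
    if len(set(labels)) == 1:                      # step (5): a single class remains
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs:                                  # step (6): no candidate attributes
        return majority
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))  # steps (7)-(8)
    node = {"attr": best, "default": majority, "branches": {}}
    remaining = [a for a in attrs if a != best]
    for value in {row[best] for row in rows}:      # step (9): one branch per value
        pairs = [(r, c) for r, c in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(       # step (10): recurse on the subset
            [r for r, _ in pairs], [c for _, c in pairs], remaining)
    return node


def classify(tree, row):
    """Walk from the root to a leaf; unseen attribute values fall back to the majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree


# Hypothetical features derived from an enterprise's business scope.
rows = [
    {"stores_data": "yes", "builds_software": "no"},
    {"stores_data": "yes", "builds_software": "yes"},
    {"stores_data": "no", "builds_software": "yes"},
    {"stores_data": "no", "builds_software": "no"},
]
labels = ["data services", "data services", "software", "other"]
tree = build_tree(rows, labels, ["stores_data", "builds_software"])
```

Dividing the gain by the split information penalizes attributes with many distinct values, which is the usual reason for preferring the gain ratio over the raw information gain.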

The industry classification model works as follows:

(1) Start;

(2) Input the business scope data of the enterprise to be classified;

(3) Classify the enterprise by industry using the decision tree algorithm;

(4) End.

In the industrial analysis system based on big data described above, the criteria for dividing enterprises along the industrial chain are relatively vague, and one enterprise may occupy several positions in the industrial chain at once, so a single division method cannot meet practical needs. To assign enterprises to industrial chain positions accurately, the present invention adopts a weighted scoring approach in which the classification algorithm is a random forest model and the scoring method is an exponentially weighted average. The data sets used in studying the big data industrial chain model have many dimensions and a large volume, and the random forest algorithm handles both well. Its main advantages are:

(1) It performs well on many current data sets and has a considerable advantage over other algorithms;

(2) It can handle very high-dimensional data without feature selection;

(3) It trains quickly and is easy to parallelize (the trees are independent of one another during training);

(4) During training it can detect interactions between different data dimensions;

(5) It can balance errors when handling imbalanced data sets;

(6) It can maintain accuracy even when a large proportion of features is missing;

The industrial chain model works as follows:

(1) Start;

(2) Data set acquisition: obtain enterprises' business registration information, intellectual property data, and historical operating data as the original data set, and construct the big data industrial chain classification indicators, namely "resources", "technology", "application", and "industry support";

(3) Data labeling: label each enterprise in the original data set with its category in the industrial chain according to the big data industrial chain classification indicators;

(4) Data preprocessing: perform data matching and outlier removal on the data in the original data set;

(5) Data set division: divide the data in the original data set into a training set and a test set at a ratio of 3:1;

(6) Random forest construction: apply the conventional random forest algorithm to the training set to build a random forest for predicting an enterprise's position in the industrial chain;

(7) Random forest model training: use the data in the training set to train a random forest model of N decision trees, where N is an integer greater than 1. Each decision tree is trained on enterprise data drawn at random from the training set and uses entropy gain to select suitable attribute nodes; each tree draws samples and attribute features at random from the training set to generate its own nodes, until every decision tree has classified all of the samples it drew;

(8) Model evaluation and correction: input the test set into the trained random forest model for classification, compare the classification results with the actual results, and compute the prediction accuracy; both the classification results and the actual results are enterprises' categories in the industrial chain. When the prediction accuracy is below a set value (which can be chosen according to actual needs), compute the classification results of each decision tree and its AUC value, extract a set of relatively high-accuracy decision trees from the current random forest model on the basis of the AUC values, cluster these trees by similarity into different clusters, and finally select high-accuracy decision trees from the different clusters to iteratively update the existing random forest model. When the prediction accuracy is greater than or equal to the set value, the random forest model is not updated;

(9) Obtain the data of the enterprise to be classified;

(10) Perform data matching and outlier removal on the data;

(11) Input the data into the trained random forest model, which outputs the classification result;

(12) End.
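Parts of this workflow can be sketched in a few lines: the 3:1 split of step (5), the majority vote by which a random forest combines its trees, and the accuracy check of step (8) that decides whether the forest needs correcting. In this illustration each tree is a stand-in callable rather than a trained decision tree, and the function names, the fixed shuffle seed, and the threshold values are assumptions; the AUC-based tree selection and clustering of step (8) are not shown.

```python
import random
from collections import Counter


def split_3_to_1(samples, seed=0):
    """Shuffle a copy of the samples and split it into training and test sets at 3:1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 3 // 4
    return shuffled[:cut], shuffled[cut:]


def forest_predict(trees, features):
    """Majority vote; each tree is any callable mapping features to a category."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


def accuracy(trees, test_set):
    """Share of (features, actual category) pairs that the forest predicts correctly."""
    hits = sum(forest_predict(trees, x) == y for x, y in test_set)
    return hits / len(test_set)


def needs_update(trees, test_set, threshold):
    """Step (8): the forest is revised only when accuracy falls below the set value."""
    return accuracy(trees, test_set) < threshold


# Stand-in "trees" that always answer with a fixed industrial chain category.
trees = [lambda x: "technology", lambda x: "technology", lambda x: "application"]
test_set = [({}, "technology"), ({}, "resources")]
train, held_out = split_3_to_1(list(range(8)))
```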

In the industrial analysis system based on big data described above, the enterprise performance evaluation model proceeds as follows:

(1) Start;

(2) Score the enterprise's comprehensive capability, R&D capability, industry influence, sustained operation capability, and data application capability separately;

Comprehensive capability score a = a1 + a2 + a3. When registered capital > 50000, a1 = 100; when registered capital < 100, a1 = 45; when 100 ≤ registered capital ≤ 50000, a1 = (registered capital)^(1/8) × 25. When the enterprise is a listed company, a2 takes a random value in the interval [90, 100]; when the enterprise is unlisted, a2 takes a random value in the interval [50, 60]. When main business revenue > 50000, a3 = 100; when main business revenue < 100, a3 = 45; when 100 ≤ main business revenue ≤ 50000, a3 = (main business revenue)^(1/8) × 25;
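The comprehensive capability score follows a banded pattern: a fixed score above an upper threshold, a fixed score below a lower threshold, and a scaled eighth root in between. A minimal sketch, in which the function and parameter names are assumptions and the `rng` argument exists only so that the random component a2 can be made reproducible:

```python
import random


def root_score(value, low, high, floor_score, coeff, root=8):
    """100 above `high`, `floor_score` below `low`, otherwise coeff * value^(1/root)."""
    if value > high:
        return 100.0
    if value < low:
        return float(floor_score)
    return coeff * value ** (1.0 / root)


def comprehensive_score(registered_capital, is_listed, main_revenue, rng=random):
    """a = a1 + a2 + a3; monetary inputs are in units of RMB 10,000."""
    a1 = root_score(registered_capital, 100, 50000, 45, 25)
    a2 = rng.uniform(90, 100) if is_listed else rng.uniform(50, 60)
    a3 = root_score(main_revenue, 100, 50000, 45, 25)
    return a1 + a2 + a3
```

At the band edges the root formula gives roughly 44.5 for a value of 100 and roughly 96.7 for a value of 50000, so it joins the fixed scores of 45 and 100 with only small jumps.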

R&D capability score b = b1 + b2 + b3 + b4. When R&D funding > 10000, b1 = 100; when R&D funding < 10, b1 = 45; when 10 ≤ R&D funding ≤ 10000, b1 = (R&D funding)^(1/8) × 31. When the number of R&D staff > 1000, b2 = 100; when the number of R&D staff < 5, b2 = 50; when 5 ≤ number of R&D staff ≤ 1000, b2 = (number of R&D staff)^(1/8) × 42. When the number of patents > 50, b3 = 100; when the number of patents < 2, b3 = 45; when 2 ≤ number of patents ≤ 50, b3 = (number of patents)^(1/4) × 38. When the number of software copyrights ≥ 5, b4 = 100; when it is 4, b4 = 90; when it is 3, b4 = 80; when it is 2, b4 = 70; when it is 1, b4 = 60; when it is 0, b4 = 50;
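Because b1 to b3 differ only in their thresholds, coefficients, and root, they lend themselves to a table-driven sketch, while b4 is a step function that adds 10 points per software copyright up to five. The dictionary keys and function names below are illustrative assumptions, and the software copyright count is assumed to be non-negative.

```python
# (low, high, floor score, coefficient, root) for each sub-score, as stated in the text.
B_RULES = {
    "rd_funding": (10, 10000, 45, 31, 8),
    "rd_staff": (5, 1000, 50, 42, 8),
    "patents": (2, 50, 45, 38, 4),
}


def banded(value, low, high, floor_score, coeff, root):
    """100 above `high`, `floor_score` below `low`, otherwise coeff * value^(1/root)."""
    if value > high:
        return 100.0
    if value < low:
        return float(floor_score)
    return coeff * value ** (1.0 / root)


def software_copyright_score(count):
    """b4: 50 for none, 10 more per software copyright, capped at 100 from five upward."""
    return 50 + 10 * min(count, 5)


def rd_score(rd_funding, rd_staff, patents, software_copyrights):
    """b = b1 + b2 + b3 + b4; R&D funding is in units of RMB 10,000."""
    inputs = {"rd_funding": rd_funding, "rd_staff": rd_staff, "patents": patents}
    b123 = sum(banded(inputs[name], *rule) for name, rule in B_RULES.items())
    return b123 + software_copyright_score(software_copyrights)
```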

行业影响力得分c＝c1+c2，当企业为上市公司时，c1＝60+主营业务收入^1/8×10；当企业为非上市公司时，c1＝40+主营业务收入^1/8×10；当大数据业务收入>20000时，c2＝100；当大数据业务收入<50时，c2＝45；当50≤大数据业务收入≤20000时，c2＝大数据业务收入^1/8×29；Industry influence score c = c1 + c2, when the company is a listed company, c1 = 60 + main business income ^1/8 × 10; when the company is a non-listed company, c1 = 40 + main business income ^1/8 ×10; when big data business revenue >20,000, c2=100; when big data business revenue <50, c2=45; when 50≤big data business revenue≤20,000, c2=big data business revenue ^1/8 × 29;

The sustainable operation ability score is d = d1 + d2 + d3, where d1 = market share × 0.5 + economic benefit × 0.25 + main business income × 0.25. When the sum of venture capital and utilized foreign capital > 10000, d2 = 100; when the sum < 1, d2 = 30; when 1 ≤ sum ≤ 10000, d2 = (venture capital + utilized foreign capital)^(1/8) × 32. When operating profit > 10000, d3 = 100; when operating profit < 1, d3 = 30; when 1 ≤ operating profit ≤ 10000, d3 = (operating profit)^(1/8) × 32;

The data application ability score is e = e1 + e2 + e3. When data soft assets > 2000, e1 = 100; when data soft assets < 10, e1 = 45; when 10 ≤ data soft assets ≤ 2000, e1 = (data soft assets)^(1/8) × 38. When data hard assets > 2000, e2 = 100; when data hard assets < 10, e2 = 45; when 10 ≤ data hard assets ≤ 2000, e2 = (data hard assets)^(1/8) × 38. When data product turnover > 10000, e3 = 100; when data product turnover < 10, e3 = 45; when 10 ≤ data product turnover ≤ 10000, e3 = (data product turnover)^(1/8) × 32;

Here, registered capital, main business income, R&D funds, big data business income, economic benefit, venture capital, utilized foreign capital, operating profit, data soft assets, data hard assets and data product turnover are expressed in units of RMB 10,000; the number of R&D personnel is counted in persons; the numbers of patents and software copyrights are counted in items; and market share is expressed in % (calculated against the total of similar enterprises in the current data set);

(3) Calculate the weighted total score g to obtain the enterprise performance: g = w1×a + w2×b + w3×c + w4×d + w5×e, where w1 = 0.3, w2 = 0.25, w3 = 0.1, w4 = 0.15 and w5 = 0.2;

(4) End.
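The scoring rules above share one pattern: a fixed floor below a lower threshold, a cap of 100 above an upper threshold, and a scaled root in between. The following Python sketch illustrates that pattern together with the software copyright score b4 and the weighted total g. The function names and the dictionary layout are illustrative assumptions and do not appear in the patent.

```python
def root_score(value, low, high, exponent, factor, floor_score, cap_score=100.0):
    """Piecewise sub-score: floor below `low`, cap above `high`, scaled root in between."""
    if value > high:
        return cap_score
    if value < low:
        return floor_score
    return value ** exponent * factor


def software_copyright_score(count):
    """b4: 50 for zero software copyrights, plus 10 per item, 100 from five items on."""
    return 100 if count >= 5 else 50 + 10 * count


# Weights w1..w5 as stated in step (3); the keys a..e name the five ability scores.
WEIGHTS = {"a": 0.3, "b": 0.25, "c": 0.1, "d": 0.15, "e": 0.2}


def weighted_total(scores):
    """g = w1*a + w2*b + w3*c + w4*d + w5*e."""
    return sum(WEIGHTS[key] * scores[key] for key in WEIGHTS)


# Example: a1 for a registered capital of 256 (in units of RMB 10,000).
a1 = root_score(256, low=100, high=50000, exponent=1 / 8, factor=25, floor_score=45)
```

Each sub-score such as a1, b1 or e3 would be one call to `root_score` with the thresholds, exponent and factor given in the corresponding rule.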

In the big data-based industrial analysis system described above, the data foundation platform module provides support for big data storage, computing, analysis, visualization and daily operation and maintenance services; the data foundation platform includes a data exchange platform, a big data platform, an information release platform, a GIS geographic information platform, a basic component platform, a report analysis platform and a system management platform.

Beneficial effects:

By building a big data-based industrial analysis system, the present invention brings together the economic data related to a given industry and, on that basis, performs timely monitoring and scheduling so that problems are discovered, warned of and responded to promptly, keeping the economy running smoothly. Building on this, it combines changes in massive historical data and adjusts the relevant factor variables to predict the development trend of the industry, ultimately providing a quantitative basis for the government to formulate industrial development support policies scientifically.

Brief description of the drawings

Figure 1 is a framework diagram of the big data-based industrial analysis system;

Figure 2 is a workflow diagram of the request-response mode;

Figure 3 is a workflow diagram of the message push mode;

Figure 4 is a schematic structural diagram of the industry classification model;

Figure 5 is a flowchart of the establishment of the industry classification model;

Figures 6 and 7 are workflow diagrams of the industrial chain model;

Figure 8 is a schematic structural diagram of the enterprise performance evaluation model.

Detailed description of the embodiments

The present invention is further described below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the present invention and not to limit its scope. It should also be understood that, after reading the teachings of the present invention, those skilled in the art may make various changes or modifications to it, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.

A big data-based industrial analysis system, as shown in Figure 1, consists of an industrial development-related database module, a data analysis model module, a data foundation platform module and a user interface module.

The industrial development-related database module is used to store data resources related to the development of a certain industry, including the industry data, enterprise data, regional data and technical data related to that development. Through data governance, the module forms highly available data assets, supports the data foundation platform module and the data analysis model module, and meets the needs of data query and business analysis;

(2.2.3) REST interface. The REST interface protocol is one way of implementing a WebService (Web service) and is mainly used to implement interfaces between systems. In the mobile police platform, apart from the general SNMP and SYSLOG protocols used at the device management level, scenarios such as communication between software systems and delivery of policy configurations should be implemented with the REST interface protocol. In general, the request-response mode should be used for reporting monitoring information and delivering policies and instructions; its workflow is shown in Figure 2. The request and response communication protocol should be HTTP 1.1 over SSL/TLS, HTTP should be used as the interface implementation protocol, interface parameters and return results should both be JSON objects, and the data packet size should not exceed 5 MB. When the network is unreachable in the reverse direction, the message push mode can be used to deliver policy instructions; the specific process is shown in Figure 3. In this mode the communication protocol should be HTTP 2.0 WEBSOCKET over SSL/TLS, HTTP should be used as the interface implementation protocol, interface parameters and return results should both be JSON objects, and the data packet size should not exceed 5 MB;
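The two constraints that apply to both modes, namely JSON objects for parameters and results and a packet size of at most 5 MB, can be checked before a message is sent or after it is received. The Python sketch below shows one way to do so. The function names are hypothetical, and reading "5MB" as 5 × 1024 × 1024 bytes is an assumption; the transport itself (HTTP over SSL/TLS) is not shown.

```python
import json

# Assumption: "5MB" is read as binary megabytes.
MAX_PACKET_BYTES = 5 * 1024 * 1024


def encode_packet(payload):
    """Serialize a JSON object to UTF-8 bytes and enforce the packet size limit."""
    if not isinstance(payload, dict):
        raise TypeError("interface parameters must be a JSON object")
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    if len(body) > MAX_PACKET_BYTES:
        raise ValueError("data packet exceeds the 5 MB limit")
    return body


def decode_packet(body):
    """Parse received bytes and check that the result is a JSON object within the limit."""
    if len(body) > MAX_PACKET_BYTES:
        raise ValueError("data packet exceeds the 5 MB limit")
    result = json.loads(body.decode("utf-8"))
    if not isinstance(result, dict):
        raise TypeError("return result must be a JSON object")
    return result
```

The same checks would apply whether the packet travels in the request-response mode or in the message push mode.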

(2.3.3) Enterprise-related data inventory and sorting services, and interface negotiation and customization services.

The data analysis model module is used to store data analysis models; the data analysis models include an industry classification model, an industrial chain model and an enterprise performance evaluation model;

The industry classification model is used to determine the industry classification of an enterprise from the business scope of the enterprise to be classified. The structure of the model is relatively simple and uses the decision tree algorithm, which closely matches the way people classify: the tree asks a series of questions, each testing one feature of the sample, and then combines all the answers to give the sample's category. The structure of the industry classification model is shown in Figure 4, and the process of establishing it (see Figure 5) is as follows:

(1) Start;

(3) Create a node;

(1) Start;

(4) End;
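The establishment process set out in the claims calculates the information gain rate of each candidate attribute and then chooses a splitting attribute for the current node. The Python sketch below illustrates how such a gain rate can be computed, in the style of C4.5. The attribute names and sample rows in the usage note are invented for illustration, and the patent does not state that this exact formulation is used.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attribute):
    """Information gain of splitting on `attribute`, divided by the split information."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / total * log2(len(g) / total) for g in groups.values())
    if split_info == 0:
        return 0.0  # the attribute has a single value and cannot split the data
    return (entropy(labels) - conditional) / split_info


def best_attribute(rows, labels, candidates):
    """Candidate attribute with the largest gain ratio."""
    return max(candidates, key=lambda attribute: gain_ratio(rows, labels, attribute))
```

For example, with four enterprises whose industry label depends only on a hypothetical `scope` attribute and not on `region`, `best_attribute` selects `scope`, since its gain ratio is 1 while that of `region` is 0.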

The industrial chain model is used to determine the position of an enterprise in the industrial chain, that is, its category in the chain, by a weighted scoring method that scores the data of the enterprise to be classified in the four categories "resources", "technology", "application" and "industrial support". The criteria for dividing enterprises along the industrial chain are relatively vague, and one enterprise may occupy several positions in the chain at the same time, so a single division method cannot meet practical needs. To position enterprises in the industrial chain accurately, the present invention adopts weighted scoring, in which the classification algorithm is a random forest model and the scoring method is the exponentially weighted average. The data sets used in studying the big data industrial chain model have many dimensions and a large volume of data; the random forest algorithm handles these problems effectively, and its advantages are mainly reflected in:

(1) Start;

(2) Data set acquisition: obtain the enterprises' business registration information, intellectual property and operating history data as the original data set, and construct the big data industrial chain classification indicators, namely "resources", "technology", "application" and "industrial support";

(12) End;

Steps (1) to (8) correspond to Figure 6, and steps (9) to (12) correspond to Figure 7;
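As a rough illustration of the data set division and model evaluation steps of this workflow, the Python sketch below shows a 3:1 training/test split, the combination of per-tree votes into one class, and the prediction accuracy. The function names, the fixed random seed and the use of simple majority voting to combine trees are assumptions; the patent's AUC-based tree selection and clustering are not reproduced here.

```python
import random
from collections import Counter


def split_3_to_1(records, seed=0):
    """Shuffle the records and split them into training and test sets at a 3:1 ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 3 // 4
    return shuffled[:cut], shuffled[cut:]


def majority_vote(votes):
    """Class predicted by the most trees; ties go to the class seen first."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of enterprises whose predicted category matches the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

A set accuracy threshold could then be compared against the value returned by `accuracy` to decide whether the model needs to be updated.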

The enterprise performance evaluation model is used to score an enterprise's comprehensive ability, R&D capability, industry influence, sustainable operation ability and data application ability from the data related to those abilities (that is, data such as the enterprise's registered capital, listing status, business income and R&D investment), and then to calculate a weighted total score that gives the enterprise performance. The structure of the enterprise performance evaluation model is shown in Figure 8, and its process is as follows:

(1) Start;

(4) End.

The data foundation platform module provides support for big data storage, computing, analysis, visualization and daily operation and maintenance services; the data foundation platform includes a data exchange platform, a big data platform, an information release platform, a GIS geographic information platform, a basic component platform, a report analysis platform and a system management platform;

(d) Visual dynamic chart analysis: provides users with a drag-and-drop chart display that matches the operating habits of Chinese users.

The data analysis model module is also used to output the analysis result after the relevant data has been input into the relevant model.
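The data flow between the modules, in which the data foundation platform calls the relevant data for an analysis target, calls the relevant model, feeds the data into it and passes the result on for display, can be sketched as follows. This is a minimal Python illustration only: the class name, the dictionary-based lookup by analysis target and the `render` function are hypothetical stand-ins for the modules, not the patent's implementation.

```python
class DataFoundationPlatform:
    """Hypothetical stand-in for the data foundation platform module."""

    def __init__(self, database, models):
        self.database = database  # stands in for the database module: target -> data
        self.models = models      # stands in for the model module: target -> callable

    def analyze(self, target):
        data = self.database[target]  # call the relevant data for the analysis target
        model = self.models[target]   # call the relevant model
        return model(data)            # the model outputs the analysis result


def render(result):
    """Stand-in for the user interface module: format the result for display."""
    return f"Analysis result: {result}"


platform = DataFoundationPlatform(
    database={"enterprise_performance": {"a": 100, "b": 80}},
    models={"enterprise_performance": lambda scores: scores["a"] + scores["b"]},
)
```

Calling `render(platform.analyze("enterprise_performance"))` would then run the whole chain for one analysis target.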

1) HTML

2) CSS (Cascading Style Sheets)

3) JavaScript

4) Hyperlinks

5) Forms

Receive user input;

6) Document Object Model (DOM)

9) Same-origin policy

Claims

1. An industrial analysis system based on big data, characterized in that it includes an industrial development-related database module, a data analysis model module, a data foundation platform module and a user interface module;

The industrial development-related database module is used to store data resources related to the development of a certain industry;

The data analysis model module is used to store data analysis models; the data analysis models include an industry classification model, an industrial chain model and an enterprise performance evaluation model; the industry classification model is used to determine the industry classification of an enterprise according to the business scope of the enterprise to be classified; the industrial chain model is used to determine the position of the enterprise in the industrial chain, that is, its category in the industrial chain, by a weighted scoring method according to the scores of the data of the enterprise to be classified in the four categories "resources", "technology", "application" and "industrial support"; the enterprise performance evaluation model is used to score the enterprise's comprehensive ability, R&D capability, industry influence, sustainable operation ability and data application ability according to the data related to those abilities, and then to calculate a weighted total score to obtain the enterprise performance;

The establishment process of the industry classification model is as follows:

(1) start;

(2) Create a tree with the enterprise dataset as the root node;

(3) Create a node;

(4) Determine whether the enterprise data set is empty, if yes, return to the previous node, and end; otherwise, enter the next step;

(5) Determine whether the current node data set belongs to the same attribute, if so, record it as a leaf node and mark it as class C, and end; otherwise, go to the next step;

(6) Determine whether the candidate attribute set is empty; if so, mark the node as the class C that has the largest number of samples in S, and end; otherwise, go to the next step;

(7) Calculate the information gain rate of each enterprise condition attribute in the set;

(8) Select the largest information gain in the candidate set as the segmentation attribute of the current node;

(9) Determine the enterprise data set according to the value of the segmentation attribute, and establish the corresponding branch;

(10) Recursively run the function on the enterprise data set, returning to step (2);

The workflow of the industry classification model is as follows:

(1) start;

(2) Input the business scope data of the enterprise to be classified;

(3) Industry classification of enterprises through decision tree algorithm;

(4) end;

The workflow of the industrial chain model is as follows:

(1) start;

(2) Data set acquisition: obtain the enterprises' business registration information, intellectual property and operating history data as the original data set, and construct the big data industrial chain classification indicators, namely "resources", "technology", "application" and "industrial support";

(3) Data labeling: mark the original data set according to the classification index of the big data industry chain to mark the category of the enterprise in the industry chain;

(4) Data preprocessing: perform data matching and outlier removal operations on the data in the original data set;

(5) Data set division: divide the data in the original data set into training set and test set according to the ratio of 3:1;

(6) Build a random forest: apply the traditional random forest algorithm on the training set to build a random forest for predicting the position of the enterprise in the industrial chain;

(7) Random forest model training: use the data in the training set to train a random forest model of N decision trees, where N is an integer greater than 1; each decision tree is trained on enterprise data randomly drawn from the training set, gain entropy is used to select suitable attribute nodes, and each tree randomly selects samples and attribute features from the training set to generate its own nodes, until every decision tree has classified the samples it drew;

(8) Model evaluation and correction: input the test set into the trained random forest model for classification, compare the classification results with the actual results, and calculate the prediction accuracy, where both the classification results and the actual results are the categories of enterprises in the industrial chain; when the prediction accuracy is less than the set value, calculate the classification result obtained by each decision tree and its AUC value, extract a set of relatively high-precision decision trees from the current random forest model based on the AUC values, cluster them by similarity into different clusters, and finally select a high-precision decision tree set from the different clusters to iteratively update the existing random forest model; when the prediction accuracy is greater than or equal to the set value, the random forest model is not updated;

(9) Obtain the data of the enterprises to be classified;

(10) Perform data matching and outlier removal operations on the data;

(11) Input the data into the trained random forest model, and output the classification result;

(12) end;

The process of enterprise performance evaluation model is as follows:

(1) start;

(2) Score the enterprise's comprehensive ability, research and development ability, industry influence, sustainable operation ability and data application ability;

Comprehensive ability score a = a1 + a2 + a3; when registered capital > 50000, a1 = 100; when registered capital < 100, a1 = 45; when 100 ≤ registered capital ≤ 50000, a1 = (registered capital)^(1/8) × 25; when the enterprise is a listed company, a2 takes a random value in the interval [90, 100]; when the enterprise is an unlisted company, a2 takes a random value in the interval [50, 60]; when main business income > 50000, a3 = 100; when main business income < 100, a3 = 45; when 100 ≤ main business income ≤ 50000, a3 = (main business income)^(1/8) × 25;

R&D capability score b = b1 + b2 + b3 + b4; when R&D funds > 10000, b1 = 100; when R&D funds < 10, b1 = 45; when 10 ≤ R&D funds ≤ 10000, b1 = (R&D funds)^(1/8) × 31; when the number of R&D personnel > 1000, b2 = 100; when the number of R&D personnel < 5, b2 = 50; when 5 ≤ number of R&D personnel ≤ 1000, b2 = (number of R&D personnel)^(1/8) × 42; when the number of patents > 50, b3 = 100; when the number of patents < 2, b3 = 45; when 2 ≤ number of patents ≤ 50, b3 = (number of patents)^(1/4) × 38; when the number of software copyrights ≥ 5, b4 = 100; when the number of software copyrights = 4, b4 = 90; when it = 3, b4 = 80; when it = 2, b4 = 70; when it = 1, b4 = 60; when it = 0, b4 = 50;

Industry influence score c = c1 + c2; when the enterprise is a listed company, c1 = 60 + (main business income)^(1/8) × 10; when the enterprise is an unlisted company, c1 = 40 + (main business income)^(1/8) × 10; when big data business income > 20000, c2 = 100; when big data business income < 50, c2 = 45; when 50 ≤ big data business income ≤ 20000, c2 = (big data business income)^(1/8) × 29;

Sustainable operation ability score d = d1 + d2 + d3; d1 = market share × 0.5 + economic benefit × 0.25 + main business income × 0.25; when the sum of venture capital and utilized foreign capital > 10000, d2 = 100; when the sum of venture capital and utilized foreign capital < 1, d2 = 30; when 1 ≤ sum of venture capital and utilized foreign capital ≤ 10000, d2 = (venture capital + utilized foreign capital)^(1/8) × 32; when operating profit > 10000, d3 = 100; when operating profit < 1, d3 = 30; when 1 ≤ operating profit ≤ 10000, d3 = (operating profit)^(1/8) × 32;

Data application ability score e = e1 + e2 + e3; when data soft assets > 2000, e1 = 100; when data soft assets < 10, e1 = 45; when 10 ≤ data soft assets ≤ 2000, e1 = (data soft assets)^(1/8) × 38; when data hard assets > 2000, e2 = 100; when data hard assets < 10, e2 = 45; when 10 ≤ data hard assets ≤ 2000, e2 = (data hard assets)^(1/8) × 38; when data product turnover > 10000, e3 = 100; when data product turnover < 10, e3 = 45; when 10 ≤ data product turnover ≤ 10000, e3 = (data product turnover)^(1/8) × 32;

Here, registered capital, main business income, R&D funds, big data business income, economic benefit, venture capital, utilized foreign capital, operating profit, data soft assets, data hard assets and data product turnover are expressed in units of RMB 10,000; the number of R&D personnel is counted in persons; the numbers of patents and software copyrights are counted in items; and market share is expressed in %;

(3) Calculate the weighted total score g to get the enterprise performance, g=w1×a+w2×b+w3×c+w4×d+w5×e, w1=0.3, w2=0.25, w3=0.1, w4=0.15, w5=0.2;

(4) end;

The data foundation platform module is connected to both the industrial development-related database module and the data analysis model module, and is used to call the relevant data from the industrial development-related database module according to the analysis target, then call the relevant model from the data analysis model module, and input the relevant data into the relevant model;

The data analysis model module is also used to output the analysis results after the relevant data is input into the relevant model;

The user interface module is connected to the data analysis model module and is used to display the analysis results.

2. An industry analysis system based on big data according to claim 1, wherein the data resources related to the development of a certain industry include industry data, enterprise data, regional data and technical data related to the development of a certain industry.

3. An industrial analysis system based on big data according to claim 1, characterized in that the data foundation platform includes a data exchange platform, a big data platform, an information release platform, a GIS geographic information platform, a basic component platform, a report analysis platform and a system management platform.