CN108229810B - Industry analysis system and method based on network information resources - Google Patents

Industry analysis system and method based on network information resources

Publication number
CN108229810B
CN108229810B CN201711475066.6A CN201711475066A CN108229810B CN 108229810 B CN108229810 B CN 108229810B CN 201711475066 A CN201711475066 A CN 201711475066A CN 108229810 B CN108229810 B CN 108229810B
Authority
CN
China
Prior art keywords
data
module
industry
network information
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711475066.6A
Other languages
Chinese (zh)
Other versions
CN108229810A (en)
Inventor
张海东
倪晚成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Luoyang Robot And Intelligent Equipment Research Institute
Institute of Automation of Chinese Academy of Science
Original Assignee
Innovation Institute For Robot And Intelligent Equipment (luoyang) Casia
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Institute For Robot And Intelligent Equipment (luoyang) Casia, Institute of Automation of Chinese Academy of Science
Priority to CN201711475066.6A
Publication of CN108229810A
Application granted
Publication of CN108229810B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Educational Administration (AREA)
  • Tourism & Hospitality (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Marketing (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of information analysis and provides an industry analysis system based on network information resources. It aims to solve the problem that industry information analysis consumes a large amount of manpower and material resources and cannot be performed in real time. The system comprises a data acquisition module, a data preprocessing module, a data analysis module and a foreground interaction module, wherein the data acquisition module is configured to acquire network information related to the industry; the data preprocessing module is configured to structure the network information, fuse it with platform data and construct an industrial structure tree; the data analysis module is configured to analyze the platform data through a natural language processing technology and a data mining algorithm and extract data related to the keywords as interactive data; and the foreground interaction module is configured to interact with the user terminal through the interaction data. The invention mines valuable data from massive network information and presents industry analysis results to users in real time.

Description

Industry analysis system and method based on network information resources
Technical Field
The invention relates to the field of computer network information applications, in particular to data mining of network information resources, and specifically to an industry analysis system and method based on network information resources.
Background
With the rapid development of information technology, information data in every field is growing explosively, which brings huge challenges and pressure to practitioners in each industry. Mining valuable industry information from this massive data, tracking changes in industry information in real time, and understanding the development trends of upstream and downstream participants and of competitors are of important reference value, because they help industry managers and decision makers respond quickly and effectively to market changes.
Industry analysis is a systematic integration and analysis of industry information, and has important reference value for enterprises seeking business opportunities, gauging the market, evaluating investment risk and so on. Industry analysis reports are usually produced by staff within an enterprise or by specialized market research companies, who collect relevant data and combine it with relevant working experience. Because such a report must be researched and compiled, it consumes a large amount of manpower and material resources and cannot be produced in real time, which is at odds with the rapidly changing information era.
Disclosure of Invention
To solve the above problems in the prior art, namely that compiling an industry analysis report after investigation consumes a large amount of manpower and material resources and cannot achieve real-time performance, the present application adopts the following technical solutions:
In a first aspect, the present application provides an industry analysis system based on network information resources. The system comprises a data acquisition module, a data preprocessing module, a data analysis module and a foreground interaction module, wherein the data acquisition module is configured to acquire network information related to industries concerned by users; the data preprocessing module is configured to structure the network information, fuse it with preset platform data, and construct a domain knowledge tree of the industrial structure together with the association relationships among the nodes of that tree; the data analysis module is configured to analyze the platform data and the domain knowledge tree through a natural language processing method and a data mining algorithm, and extract data related to the industry as interactive data; the foreground interaction module is configured to interact with the user terminal through the interaction data.
In some examples, the data collection module includes a vertical web crawler and an academic web crawler, the vertical web crawler configured to crawl web page information from an industry vertical website by analyzing uniform resource locators according to a preset first initial seed node; the academic web crawler is configured to capture academic articles from the academic website according to a preset second initial seed node.
In some examples, the data preprocessing module includes a data structuring sub-module, a platform data sub-module, a domain term extraction sub-module, and a domain knowledge tree sub-module. The data structuring sub-module is configured to perform structured analysis of the vertical web page information collected by the vertical web crawler; the platform data sub-module is configured to store platform user data and the collected network information data and to provide data for the data analysis module; the domain term extraction sub-module is configured to extract domain-related terms from the academic articles crawled by the academic web crawler; the domain knowledge tree sub-module is configured to combine domain expert knowledge, organize the extracted domain terms into a structure, construct a domain knowledge tree of the industrial structure, and analyze the industrial association relationships among the nodes of the domain knowledge tree.
In some examples, the domain term extraction sub-module is further configured to analyze the academic articles obtained by the academic web crawler, analyze word frequencies in titles, keywords and abstracts of the articles by using a text analysis method, and extract domain professional terms.
In some examples, the data analysis module includes an entity identification submodule and a data mining submodule. The entity identification submodule is configured to construct entity identification features through text word segmentation, part-of-speech tagging, and syntactic analysis, and to combine a conditional random field with a rule-based method to identify the region entities, organization name entities, and domain term entities contained in the platform data; the data mining submodule is configured to associate the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and to statistically analyze the association relationships among news data, company data and the domain knowledge tree, so as to analyze the distribution and variation trend of network information data across regions and across the nodes of the industrial chain; and to infer the industrial nodes the user is concerned with from the user's operation data on the platform, and recommend personalized news, companies and products to the user by using a content-based recommendation algorithm.
In some examples, the foreground interaction module includes a visualization sub-module and a map sub-module, where the visualization sub-module is configured to interact the result data analyzed by the data analysis module with a user in a manner of synthesizing a domain knowledge tree, a map, a line graph, a bar graph, and a list; the map sub-module is configured to present an area map of the selected area to the user.
In a second aspect, the present application provides an industry analysis method based on network information resources, including: collecting network information related to industries concerned by users; carrying out structuralization processing on the network information, fusing the network information with preset platform data, and constructing an industrial structure tree; analyzing the platform data through a natural language processing technology and a data mining algorithm, and extracting data related to the industry as interactive data; and interacting with the user terminal through the interaction data.
In some examples, collecting network information related to industries concerned by the user includes: according to a preset first initial seed node, capturing web page information from an industry vertical website with a vertical web crawler by analyzing the uniform resource locator contained in the first initial seed node; and capturing academic articles from an academic website with an academic web crawler according to a preset second initial seed node.
In some examples, structuring the network information, fusing it with preset platform data, and constructing a domain knowledge tree of the industrial structure includes: performing structured analysis of the vertical web page information collected by the vertical web crawler; extracting domain-related terms from the academic articles crawled by the academic web crawler; and, in combination with domain expert knowledge, organizing the extracted domain terms and key technologies into a structure, constructing the industrial structure tree, and analyzing the industrial association relationships among the nodes of the structure tree.
In some examples, extracting domain-related terms from the academic articles crawled by the academic web crawler includes: analyzing the academic articles acquired by the academic web crawler, using a text analysis algorithm to analyze word frequencies in the titles, keywords and abstracts of the articles, and extracting domain professional terms.
In some examples, analyzing the platform data through a natural language processing method and a data mining algorithm to extract data related to the industry as the interactive data includes: constructing entity identification features through text word segmentation, part-of-speech tagging and syntactic analysis, and combining a conditional random field with a rule-based method to identify the region entities, organization name entities and domain term entities contained in the platform data; associating the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and statistically analyzing the association relationships among the news data, the company data and the domain knowledge tree, so as to analyze the distribution and variation trend of the network information data across regions and across the nodes of the industrial chain; and inferring the industrial nodes the user is concerned with from the user's data on the platform, and recommending personalized news, companies and products to the user by using a content-based recommendation algorithm.
In some examples, the interacting with the user terminal through the interaction data includes: interacting the interactive data with a user in a comprehensive mode of a domain knowledge tree, a map, a line graph, a bar graph and a list; the user is presented with a map of the selected area.
In the industry analysis system and method based on network information resources provided by the present application, the data acquisition module acquires information related to the industry in which the user operates; the data preprocessing module structures this information and constructs a domain knowledge tree of the industry; the data analysis module analyzes and mines the preprocessed information to obtain an analysis result of the industry information; and the foreground interaction module interacts with the user. In this way, valuable industry information is mined from massive data, changes in industry information are tracked in real time, information about upstream and downstream participants and competitors in the industry becomes available, and industry managers and decision makers are assisted in responding quickly and effectively to market changes.
Drawings
FIG. 1 is a schematic block diagram of an embodiment of an industry analytics system based on network information resources in accordance with the present application;
FIG. 2 is a basic framework diagram of a vertical web crawler crawling web page information flow in an embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary robot industry chain knowledge tree built by the domain knowledge tree sub-module in an embodiment of the present application;
FIG. 4a is a schematic diagram of the upstream and downstream node relationships of an industry node constructed in an industry chain;
FIG. 4b is a schematic diagram of the upstream and downstream node relationships of the system integration industry node constructed in a robot industry chain;
FIG. 5 is a diagram illustrating exemplary results of performing text segmentation, part-of-speech tagging, and syntactic analysis using a text analysis algorithm in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of the industry analysis method based on network information resources according to the present application.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The industry analysis system based on the network information resources can comprise a data acquisition module, a data preprocessing module, a data analysis module and a foreground interaction module, wherein the data acquisition module is configured to acquire network information related to industries concerned by users; the data preprocessing module is configured to perform structural processing on the network information, fuse the network information with preset platform data, and construct a domain knowledge tree of an industrial structure; the data analysis module is configured to analyze the platform data and the domain knowledge tree through a natural language processing method and a data mining algorithm, and extract data related to the keywords as interactive data; the foreground interaction module is configured to interact with the user terminal through the interaction data.
In this embodiment, the data acquisition module collects industry-related network information according to keywords or key information provided by the user. It may collect information about the industry the user is concerned with, such as information on enterprises in the same industry as the user and on upstream and downstream enterprises; it may also collect information related to the development of that industry, such as its technical and academic frontiers.
The data preprocessing module preprocesses the network information. The preprocessing may structure the network information related to the industry the user is concerned with and extract classified information from it, such as the companies related to the industry, their products, their regional distribution, and their purchasing needs. It may also build classified frontier information on industry development, from which information such as industry development trends and technical development monitoring can be obtained.
The data analysis module analyzes and mines the structured data and the classified information established by the data preprocessing module and, in combination with the user's operation information on the platform, recommends to the user the products, companies, news and other information that the user may be interested in.
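The recommendation step can be illustrated with a minimal content-based sketch. It assumes that the user's operation information is reduced to the texts of items the user has viewed, and that items are compared by cosine similarity over word counts; the function names and sample texts are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words vector; Chinese text would need a word segmenter instead of split().
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(viewed, candidates, k=2):
    """Rank candidate items by similarity to a profile built from viewed items."""
    profile = Counter()
    for text in viewed:
        profile.update(vectorize(text))
    ranked = sorted(candidates, key=lambda c: cosine(profile, vectorize(c)), reverse=True)
    return ranked[:k]

# Hypothetical user history and candidate items.
viewed = ["welding robot controller", "industrial robot controller upgrade"]
candidates = ["new robot controller released", "hotel booking discounts", "robot gripper catalog"]
top = recommend(viewed, candidates, k=2)
print(top)
```

Items that share no vocabulary with the user's history score zero and fall to the end of the ranking.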
The foreground interaction module can interact with a user through an interaction interface, and the interaction interface can display information such as industry change trend, industry regional distribution, upstream and downstream analysis, competitors, potential buyers and the like through various chart forms; the user can intuitively obtain the industry information at any time and any place.
The system provided by the embodiment of the application analyzes and mines the information related to the industry concerned by the user, and displays the industry information obtained by analyzing and mining for the user.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture of a specific embodiment of an industry analysis system based on network information resources to which the present application may be applied.
Specifically, as shown in fig. 1, a data acquisition module, a data preprocessing module, a data analysis module, and a foreground interaction module of the industry analysis system based on network information resources respectively implement data acquisition, data preprocessing, data analysis, and foreground interaction functions.
The data collection module includes a vertical web crawler 101 and an academic web crawler 102. The vertical web crawler 101 is configured to capture web page information from an industry vertical website by analyzing a Uniform Resource Locator (URL) according to a preset first initial seed node. Specifically, a representative website can be selected according to the industry as the first initial seed node of the vertical web crawler. The vertical web crawler 101 crawls web page information related to the industry concerned by the user by analyzing the URL of the website. The web page information includes news of related enterprises in the industry, organization information, products, purchase information and the like in the industry.
The academic web crawler 102 is configured to capture academic articles from an academic website according to a preset second initial seed node. Here, academic conferences and academic journals can be used as the second initial seed node of the academic web crawler. The academic web crawler 102 crawls related academic articles from academic websites, academic journals or academic conferences according to the second initial seed node, and obtains information on the development frontier of the industry.
The vertical web crawler 101 and the academic web crawler 102 can run at fixed intervals chosen according to actual requirements, and provide data support for real-time industry analysis. For example, because industry information is updated frequently, the vertical web crawler 101 may run once every hour so that the obtained information is as close to real time as possible; because the knowledge or academic domain may be updated less frequently, the academic web crawler 102 may run daily or monthly.
As an example, FIG. 2 illustrates the flow of the above-described vertical web crawler 101 crawling for industry-related data.
Step 2.1: select a representative website according to the industry the user is concerned with and use it as the seed URL of module 201, which stores the initial seeds of the web crawler;
Step 2.2: push the seed URL of module 201 into the to-be-crawled URL queue of module 202;
Step 2.3: module 203 reads a URL from the to-be-crawled URL queue and filters it using the URL filter 204; specifically, the URL that was read is analyzed, and only web page URLs related to news and companies are retained;
Step 2.4: the downloader of module 205 crawls the web pages of the filtered URLs from the vertical website, and module 206 saves the web page content;
Step 2.5: save the web page of module 206 in the web page database 209, and at the same time push the successfully crawled web page URL into the crawled URL queue 208 of module 207;
Step 2.6: use module 206 to extract URLs from the web page, filter out those that have already been crawled, and push the URLs that have not been crawled to module 202;
Step 2.7: determine whether the queue of module 202 still contains web page URLs that have not been crawled; if so, return to step 2.3; otherwise, the crawling of industry-related data by the web crawler is finished.
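The loop in steps 2.1 to 2.7 can be sketched as a queue-driven crawl. This is a minimal illustration, not the patented implementation: the in-memory `FAKE_SITE`, the `keep_url` rule, and the `example.test` URLs are hypothetical stand-ins for a real downloader and a real vertical website.

```python
import re
from collections import deque

# Hypothetical in-memory website: URL -> (page text, outgoing links).
FAKE_SITE = {
    "http://example.test/": ("home", ["http://example.test/news/1", "http://example.test/about"]),
    "http://example.test/news/1": ("news one", ["http://example.test/company/acme", "http://example.test/"]),
    "http://example.test/company/acme": ("acme profile", ["http://example.test/news/1"]),
    "http://example.test/about": ("about us", []),
}

def keep_url(url, seeds):
    """URL filter (204): keep seed URLs plus news and company pages."""
    return url in seeds or re.search(r"/(news|company)/", url) is not None

def crawl(seeds, fetch):
    to_crawl = deque(seeds)        # steps 2.1-2.2: to-be-crawled queue (202)
    crawled = set()                # crawled URL queue (208)
    page_db = {}                   # web page database (209)
    while to_crawl:                # step 2.7: stop when the queue is empty
        url = to_crawl.popleft()   # step 2.3: read and filter one URL
        if url in crawled or not keep_url(url, seeds):
            continue
        result = fetch(url)        # step 2.4: download the page
        if result is None:
            continue
        text, links = result
        page_db[url] = text        # step 2.5: store page, mark URL as crawled
        crawled.add(url)
        for link in links:         # step 2.6: enqueue links not yet crawled
            if link not in crawled:
                to_crawl.append(link)
    return page_db

pages = crawl(["http://example.test/"], FAKE_SITE.get)
print(sorted(pages))
```

The `about` page is reachable but is dropped by the filter, which mirrors the rule in step 2.3 that only news and company URLs are retained.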
In this embodiment, the data preprocessing module includes a data structuring sub-module 103, a platform data sub-module 105, a domain term extraction sub-module 104, and a domain knowledge tree sub-module 106. The data structuring sub-module 103 is configured to parse and structure the web page content crawled by the vertical web crawler 101, merge it with the platform data in the platform data sub-module 105, and provide basic data for further analysis and processing. For the news web pages crawled by the vertical web crawler 101, the data structuring sub-module 103 extracts the news title, release time, web page content, and the like by using web page parsing tools such as BeautifulSoup and lxml. As an example, Table 1 shows a news data table obtained by parsing the news web pages crawled by the vertical web crawler 101 with a web page parsing tool.
Table 1. News data table
[Table provided as image: Figure GDA0002868393860000071]
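Extracting the news title, release time, and content from a crawled page might look like the following sketch. It uses Python's standard `html.parser` in place of the parsing tools named in the text so that it has no dependencies, and it assumes a hypothetical page layout in which the title is in `<h1>`, the release time in `<time>`, and the body in `<p>` elements.

```python
from html.parser import HTMLParser

class NewsParser(HTMLParser):
    """Collects text from <h1>, <time>, and <p> elements (assumed page layout)."""
    FIELDS = {"h1": "title", "time": "published", "p": "content"}

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "published": "", "content": ""}
        self._field = None  # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self._field = self.FIELDS[tag]

    def handle_endtag(self, tag):
        if tag in self.FIELDS:
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            sep = " " if self.record[self._field] else ""
            self.record[self._field] += sep + text

def parse_news(html):
    parser = NewsParser()
    parser.feed(html)
    parser.close()
    return parser.record

# Hypothetical news page.
sample = """<html><body>
<h1>Robot maker opens new plant</h1>
<time>2017-12-01</time>
<p>First paragraph.</p><p>Second paragraph.</p>
</body></html>"""
row = parse_news(sample)
print(row)
```

Each returned record corresponds to one row of a news data table with title, release time, and content columns.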
The enterprise organization information acquired by the vertical web crawler 101 includes information about related companies. After the company web page content is acquired, it is parsed with a web page parsing tool to extract the company's name, address, products, purchase information, company introduction, and the like. As an example, Table 2 shows a company data table obtained by parsing the company web pages crawled by the vertical web crawler 101 with a web page parsing tool.
Table 2. Company data table
[Table provided as images: Figure GDA0002868393860000072, Figure GDA0002868393860000081]
The domain term extraction sub-module 104 parses the academic articles acquired by the academic web crawler 102 with a web page parsing tool and then extracts domain terms from them. Since the title, keywords, and abstract of an academic article summarize its core content, the analysis may first examine the title, keywords, and abstract, and then analyze the body of the article as needed. Various text analysis algorithms are embedded in the domain term extraction sub-module 104 and are used to analyze the academic articles. Specifically, keywords of the article text may be extracted using the term frequency-inverse document frequency (TF-IDF) algorithm and the latent Dirichlet allocation (LDA) algorithm; term frequencies in the title, keywords, and abstract may be analyzed using a clustering method; and terms whose number of occurrences exceeds a set threshold are extracted as the domain terms of child nodes in the domain knowledge tree. The term set formed by the domain terms of each child node in the domain knowledge tree can be used to analyze the relationship between the network information data and the domain knowledge tree. As an example, Table 3 shows the data structure of an academic article; the text analysis algorithm analyzes the article based on the fields in this data structure.
Table 3. Academic article data table
[Table provided as images: Figure GDA0002868393860000082, Figure GDA0002868393860000091]
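The frequency-based part of this extraction can be sketched as follows. This is a simplified illustration that covers only TF-IDF scoring and the occurrence threshold; the LDA and clustering steps are omitted. The tokenizer, stop-word list, and sample abstracts are assumptions, and Chinese text would need a word segmenter in place of the regular expression.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"for", "with"}  # hypothetical minimal stop-word list

def tokenize(text):
    # Lowercased alphabetic tokens with stop words removed.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()})
    return scores

def domain_terms(docs, min_count):
    """Terms whose total occurrence count across all documents exceeds min_count."""
    total = Counter(t for d in docs for t in tokenize(d))
    return sorted(t for t, c in total.items() if c > min_count)

# Hypothetical article titles or abstracts.
articles = [
    "servo motor control for industrial robot arms",
    "industrial robot path planning with servo feedback",
    "vision sensor calibration for robot grasping",
]
scores = tf_idf(articles)
terms = domain_terms(articles, min_count=1)
print(terms)
```

The two measures are complementary: a term that appears in every article, such as `robot` here, passes the occurrence threshold but receives a TF-IDF score of zero, so TF-IDF highlights what distinguishes one article while the threshold captures what the whole domain has in common.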
The platform data sub-module 105 is configured to provide basic data and preprocessed data for analysis by the data analysis module. The platform data sub-module 105 stores various information in the platform, including user operation behaviors, company products, purchase demands, company news, company information, region information, and the like. The user operation behaviors are the user's operations in the system platform, such as browsing news, clicking products, and publishing demands; they are used to track and record the user's behavior and to provide data support for algorithms that analyze user interests. The company products may be product information published in the platform by company users, such as product name, product profile, product function, and product parameters. The purchase demands may be purchase information published in the platform by users, such as product name, parameters, price, and restricted region. The company news may be news published in the platform by company users, including news title, author, content, and the like. The company information may be the registration information of company users in the system platform, such as company name, registered address, and main business. The region information may be Chinese geographic information built into the system platform, including the full names and abbreviations of provinces and cities, their longitude and latitude coordinates, and their areas; it is used for analyzing network information and locating companies.
The domain knowledge tree sub-module 106 is configured to construct an association between the domain knowledge tree and the domain knowledge tree nodes of the industrial structure in combination with the expert knowledge and the extracted domain expertise. The domain knowledge tree sub-module 106 may construct a domain knowledge tree of an industry according to the extracted data information of the industry in which the company user is located. Firstly, building industrial chain nodes which are respectively an industrial chain upstream node, an industrial chain midstream node and an industrial chain downstream node; then, respectively constructing an upstream node of the industrial chain, an intermediate node of the industrial chain and a child node of a downstream node of the industrial chain according to the webpage information and expert knowledge crawled by the web crawler; and finally, continuously taking each child node as an intermediate node, and constructing the child nodes of each intermediate node, thereby constructing a domain knowledge tree of the industry and industry chain where the company user is located. By way of example, FIG. 3 illustrates a domain knowledge tree of the robotic industry chain constructed by the domain knowledge tree sub-module 106 described above. In a robot industrial chain, the robot industrial chain is divided into an industrial chain upstream node, an industrial chain midstream node and an industrial chain downstream node. 
The upstream node of the industrial chain represents suppliers and comprises child nodes such as raw materials and parts. The downstream node of the industrial chain represents after-sales services and applications and comprises child nodes such as partners, agents, third-party services and solutions. The midstream node of the industrial chain represents the main business of the industry and, as the trunk of the domain tree, comprises a robot body node and a robot integration node. The robot integration node comprises multiple layers of child nodes: for example, its child nodes include an intelligent robot node, whose child nodes include an industrial robot node, whose child nodes in turn include a carrying robot node and the like.
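The layered construction described above can be sketched as a small tree structure. This is only an illustrative sketch: the class name `IndustryNode`, its methods and the helper `build_robot_chain` are hypothetical, and the node names are taken from the robot industrial chain example in the text rather than from FIG. 3 itself.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class IndustryNode:
    """One node of a domain knowledge tree (hypothetical structure)."""
    name: str
    children: List["IndustryNode"] = field(default_factory=list)

    def add(self, *names: str) -> List["IndustryNode"]:
        """Attach child nodes by name and return them for further nesting."""
        created = [IndustryNode(n) for n in names]
        self.children.extend(created)
        return created

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "IndustryNode"]]:
        """Yield (depth, node) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

    def path_to(self, target: str) -> Optional[List[str]]:
        """Return the node names from this node down to `target`, if present."""
        if self.name == target:
            return [self.name]
        for child in self.children:
            sub = child.path_to(target)
            if sub is not None:
                return [self.name] + sub
        return None


def build_robot_chain() -> IndustryNode:
    """Build the example chain: chain nodes first, then children, layer by layer."""
    root = IndustryNode("robot industrial chain")
    upstream, midstream, downstream = root.add("upstream", "midstream", "downstream")
    upstream.add("raw materials", "parts")
    downstream.add("partner", "agent", "third-party service", "solution")
    _body, integration = midstream.add("robot body", "robot integration")
    (intelligent,) = integration.add("intelligent robot")
    (industrial,) = intelligent.add("industrial robot")
    industrial.add("carrying robot")
    return root


tree = build_robot_chain()
for depth, node in tree.walk():
    print("  " * depth + node.name)
```

Calling `tree.path_to("carrying robot")` returns the chain of ancestors of that node, which is the kind of lookup a classifier or a tree view in the user interface would need.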
Fig. 4 is a schematic diagram of the upstream and downstream industrial nodes in the robot industrial chain constructed by the domain knowledge tree sub-module 106. Fig. 4a shows the upstream and downstream relationship of each node, and fig. 4b shows a specific example of an industrial chain node in the robot industrial chain: when the industrial node is "system integration", its upstream industrial nodes include sensors, controllers and the like, and its downstream industrial nodes include third parties, agents and the like.
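The upstream and downstream relationships between industrial nodes can be sketched as a pair of adjacency maps. The class `ChainRelations` and its method names are hypothetical; the sample links reproduce only the "system integration" example given for fig. 4b.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ChainRelations:
    """Directed supplier -> consumer links between industrial nodes (illustrative)."""

    def __init__(self) -> None:
        self._downstream: Dict[str, Set[str]] = defaultdict(set)
        self._upstream: Dict[str, Set[str]] = defaultdict(set)

    def link(self, supplier: str, consumer: str) -> None:
        """Record that `supplier` is upstream of `consumer`."""
        self._downstream[supplier].add(consumer)
        self._upstream[consumer].add(supplier)

    def upstream_of(self, node: str) -> List[str]:
        """Return the sorted names of nodes directly upstream of `node`."""
        return sorted(self._upstream.get(node, ()))

    def downstream_of(self, node: str) -> List[str]:
        """Return the sorted names of nodes directly downstream of `node`."""
        return sorted(self._downstream.get(node, ()))


relations = ChainRelations()
for supplier in ("sensor", "controller"):
    relations.link(supplier, "system integration")
for consumer in ("third party", "agent"):
    relations.link("system integration", consumer)

print(relations.upstream_of("system integration"))
print(relations.downstream_of("system integration"))
```

Keeping both directions indexed makes the "upstream companies" and "downstream companies" lists equally cheap to look up for any selected node.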
In this embodiment, the data analysis module includes an entity identification submodule 107 and a data mining submodule 108. The entity identification submodule 107 is configured to construct entity identification features through text word segmentation, part-of-speech tagging and syntactic analysis, and to identify the region entities, organization name entities and domain term entities contained in the platform data by fusing a conditional random field with a rule-based method. The data mining submodule 108 is configured to associate the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and to statistically analyze the association relationships among the news data, the company data and the domain knowledge tree, so as to analyze the distribution and variation trend of the network information data across regions and across the nodes of the industrial chain; it is further configured to infer the industrial nodes that the user is concerned with from the user's data on the platform, and to recommend personalized news, companies and products to the user by using a content-based recommendation algorithm.
The entity identification submodule 107 includes six units: text segmentation, part-of-speech tagging, syntactic analysis, region identification, organization name identification and domain term identification. The text segmentation, part-of-speech tagging and syntactic analysis units are used to construct the entity identification features. Taking the sentence "the robot is a machine device that automatically executes work" as an example, the result of text segmentation, part-of-speech tagging and syntactic analysis is shown in fig. 5, and the entity identification features extracted from it are shown in table 4. A conditional random field (CRF) is then fused with a rule-based method to find and identify the region entities, organization name entities and domain term entities contained in each piece of information.
Table 4 entity identification features
Figure GDA0002868393860000101
Figure GDA0002868393860000111
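As a rough illustration of how per-token features might be assembled and how rule-based tags might be fused with statistical tags, consider the sketch below. Table 4 is available only as an image here, so the feature names are assumptions based on a common CRF feature template, not the features actually listed in the table. The gazetteer, the tag names and the fusion policy (a rule overrides the model where the rule fires) are likewise hypothetical, and the CRF model itself is replaced by a hand-written list of tags.

```python
from typing import Dict, List, Sequence, Tuple

# (word, part-of-speech tag, dependency label); the tag sets are illustrative.
Token = Tuple[str, str, str]


def token_features(sent: Sequence[Token], i: int) -> Dict[str, str]:
    """Build a feature dict for token i from the token itself and its neighbours."""
    word, pos, dep = sent[i]
    feats = {"word": word, "pos": pos, "dep": dep, "suffix2": word[-2:]}
    if i > 0:
        feats["prev_word"], feats["prev_pos"] = sent[i - 1][0], sent[i - 1][1]
    else:
        feats["BOS"] = "1"  # beginning of sentence
    if i < len(sent) - 1:
        feats["next_word"], feats["next_pos"] = sent[i + 1][0], sent[i + 1][1]
    else:
        feats["EOS"] = "1"  # end of sentence
    return feats


# Hypothetical region gazetteer used by the rule-based side.
REGION_GAZETTEER = {"Beijing", "Luoyang", "Henan"}


def rule_tags(sent: Sequence[Token]) -> List[str]:
    """Tag a token as REGION when its word is in the gazetteer, else O."""
    return ["REGION" if word in REGION_GAZETTEER else "O" for word, _, _ in sent]


def fuse(model_tags: Sequence[str], rules: Sequence[str]) -> List[str]:
    """Prefer the rule tag where a rule fired; otherwise keep the model tag."""
    return [r if r != "O" else m for m, r in zip(model_tags, rules)]


sentence: List[Token] = [
    ("Luoyang", "NR", "nmod"),
    ("robot", "NN", "compound"),
    ("institute", "NN", "root"),
]
features = [token_features(sentence, i) for i in range(len(sentence))]
model_output = ["O", "TERM", "O"]  # stand-in for the output of a trained CRF
fused = fuse(model_output, rule_tags(sentence))
print(fused)
```

In a real pipeline the `features` list would be passed to a trained sequence labeller, and the fusion step could just as well be a vote or a confidence threshold; the override policy is used here only because it is the simplest to read.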
The data mining sub-module 108 uses a supervised learning algorithm to build the association between the identified entities and the industry nodes; analyzes the association among the news data, the company data and the domain knowledge tree; and compiles statistics on industry trend changes, the regional distribution of the industry, and upstream and downstream relationships. By analyzing the data of a user or company in the platform, such as published products and purchasing information, it infers the industry nodes that the user or company is concerned with and recommends products, companies and news of interest to that user or company.
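One simple form of content-based recommendation is to compare a user's industry-node profile with the nodes attached to each item. The sketch below assumes that both users and items are represented as bags of industry-node names and that similarity is cosine similarity; these choices, the function names and the sample data are illustrative assumptions, not details stated in the text.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(profile: Counter, items: Dict[str, Counter],
              top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank items by similarity to the profile, dropping items with no overlap."""
    scored = [(name, cosine(profile, vec)) for name, vec in items.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_k]


# Industry nodes inferred from a user's actions on the platform (made-up data).
profile = Counter(["industrial robot", "industrial robot", "sensor"])
items = {
    "news A": Counter({"industrial robot": 1}),
    "news B": Counter({"sensor": 1, "controller": 1}),
    "news C": Counter({"agent": 1}),
}
recommendations = recommend(profile, items)
print(recommendations)
```

The same function can rank companies or products, provided each is tagged with the industry nodes it relates to.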
In this embodiment, the foreground interaction module includes a visualization sub-module and a map sub-module. The visualization sub-module is configured to present the result data produced by the data analysis module to the user, and to interact with the user, through a combination of the domain knowledge tree 109, the line graph 111, the bar graph 112 and the list 113; the map sub-module is configured to present a regional map of the selected area.
The visualization sub-module presents the various analysis results to the user by means of the domain knowledge tree 109, the line graph 111, the bar graph 112 and the list 113.
The domain knowledge tree 109 presents the domain knowledge tree structure of the industry the user is concerned with, so that the user can select an industry node to view.
The map sub-module presents the regions of the provinces and cities of China to the user; when a province or city is selected, the view automatically jumps to the map of that province or city.
The line graph 111 shows the user how the news popularity of an industrial node in a given region changes over time.
The bar graph 112 shows the user the distribution of the news popularity of an industrial node in a given region.
The list 113 presents upstream and downstream companies, competitors, potential buyers and recommended news to the user in list form.
The system provided by the embodiment of the application extracts information related to the user's industry from mass data through the data acquisition module; the data preprocessing module structures the extracted information and constructs the domain knowledge tree. The data analysis module analyzes and mines the processed information and, in combination with expert knowledge, analyzes the industry development trend so as to provide an industry analysis report for the user; the foreground interaction module exchanges information with the user and provides the user with industry-related information. The user can thus keep track of real-time changes at each node of the industry and learn about the upstream and downstream division of labor and the competitors in the industry, which assists the management or decision-making level of the industry in making quick and effective strategies in response to market changes.
Referring to fig. 6, the present application provides an industry analysis method based on network information resources, which includes the following steps:
Step 601, collecting network information related to the industry the user is concerned with.
In this embodiment, an electronic device to which the present application is applied (which may be a server or an application platform) uses a web crawler to acquire industry-related network information from industry-related websites. Here, a website related to the industry the user is concerned with may be the website of a company in the industry in which the user works, or in its upstream and downstream industries, and may also be a technical or academic forum or website related to the industry. The web crawler may be a vertical web crawler or an academic web crawler. The vertical web crawler collects news, organization, product and purchasing information from websites related to the field. The academic web crawler captures relevant academic articles from the websites of academic conferences and academic journals related to the field. The network information may be news, organization, product and purchasing information, and may also be academic articles.
In some preferred embodiments, the network information related to the industry the user is concerned with includes web page information and academic articles, and collecting that network information includes: using a vertical web crawler, according to a preset first initial seed node, to capture web page information from industry vertical websites by parsing the uniform resource locator (URL) of the first initial seed node; and using an academic web crawler, according to a preset second initial seed node, to capture academic articles from academic websites. Here, the first initial seed node is a representative website of the industry, selected as the initial seed node of the web crawler. The second initial seed node may be the website of an academic conference or academic journal. The web crawler crawls the relevant web page information or academic articles by parsing URLs.
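A seed-driven crawl of this kind can be sketched as a breadth-first traversal that starts from the seed URLs and follows only links on the seed hosts. The function `crawl`, the host restriction, the injected `fetch` callable and the example site are all assumptions made for illustration; a real crawler would also need politeness delays, robots.txt handling and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl(seeds: Iterable[str], fetch: Callable[[str], str],
          max_pages: int = 50) -> Dict[str, str]:
    """Breadth-first crawl from the seed URLs, staying on the seed hosts."""
    seeds = list(seeds)
    allowed_hosts = {urlparse(s).netloc for s in seeds}
    frontier = deque(seeds)
    seen = set(seeds)
    pages: Dict[str, str] = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if not html:
            continue  # skip pages that could not be retrieved
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragments
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# A made-up two-page site stands in for real HTTP requests.
FAKE_SITE = {
    "http://robots.example/": '<a href="/news/1">news</a>'
                              '<a href="http://other.example/x">external</a>',
    "http://robots.example/news/1": '<a href="/">home</a><a href="/products">p</a>',
}
pages = crawl(["http://robots.example/"], lambda url: FAKE_SITE.get(url, ""))
print(sorted(pages))
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access; in production it could wrap any HTTP client.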
Step 602, performing structured processing on the network information, fusing it with preset platform data, and constructing a domain knowledge tree of the industrial structure.
In this embodiment, the server or the application platform preprocesses the network information to construct an industrial structure tree. Here, the data preprocessing may be structured parsing of the vertical web page information collected by the vertical web crawler. It may also be the extraction of domain-related terms and key technologies from the academic articles crawled by the academic web crawler, followed by structured organization of the extracted terms and key technologies in combination with domain expert knowledge, so as to construct the industrial structure tree and analyze the industrial association relationships among its nodes. Further, to extract industry-related terms and key technical information from the academic articles crawled by the academic web crawler, the articles may be analyzed with a text analysis algorithm that examines word frequency in article titles, keywords and abstracts, thereby extracting domain professional terms. The text analysis algorithm may be TF-IDF, LDA, clustering or another algorithm.
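Of the text analysis algorithms named above, TF-IDF is the simplest to sketch. The example below treats each article as one string (title, keywords and abstract concatenated), scores every word by its summed TF-IDF across the corpus, and uses an unsmoothed inverse document frequency so that words appearing in every article score zero. The tokenizer, the scoring variant and the sample articles are illustrative assumptions.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_terms(docs: List[str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank terms by their TF-IDF score summed over all documents."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    scores: Counter = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            idf = math.log(n_docs / doc_freq[term])  # 0 for ubiquitous terms
            scores[term] += (count / len(tokens)) * idf
    ranked = [(t, s) for t, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))[:top_k]


# Made-up article texts (title + keywords + abstract merged into one string).
articles = [
    "robot servo motor control",
    "robot path planning",
    "robot servo motor diagnosis",
]
terms = tfidf_terms(articles, top_k=6)
for term, score in terms:
    print(f"{term}: {score:.3f}")
```

Note that plain TF-IDF favours rare words; a production system would likely add stop-word filtering, multi-word phrase detection and expert review before admitting a term to the knowledge tree.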
Step 603, analyzing the platform data and the domain knowledge tree through a natural language processing method and a data mining algorithm, and extracting data related to the industry as interactive data.
In this embodiment, a natural language processing method may be used to identify region entities, domain term entities and organization name entities in network information such as news, company, product and purchasing information. The data mining algorithm may be used to classify the news, company, product and purchasing information by knowledge node according to the relationship between the identified domain term entities and the domain knowledge tree nodes, to compile statistics by the region and release time of the information, and to track industry trend changes based on changes in the news popularity of each knowledge node.
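Once each news item has been assigned to a knowledge node and a region, the popularity statistics reduce to grouped counts. The sketch below assumes a record layout of (node, region, date) and monthly buckets; the function names and sample records are made up for illustration.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

# (industry node, region, publication date as "YYYY-MM-DD"); layout is assumed.
NewsRecord = Tuple[str, str, str]


def heat_by_month(news: Sequence[NewsRecord], node: str, region: str) -> Dict[str, int]:
    """Count news items per month for one node in one region (trend over time)."""
    counts = Counter(date[:7] for n, r, date in news if n == node and r == region)
    return dict(sorted(counts.items()))


def heat_by_region(news: Sequence[NewsRecord], node: str) -> Dict[str, int]:
    """Count news items per region for one node (regional distribution)."""
    counts = Counter(r for n, r, _ in news if n == node)
    return dict(counts.most_common())


news = [
    ("industrial robot", "Henan", "2017-10-03"),
    ("industrial robot", "Henan", "2017-10-21"),
    ("industrial robot", "Henan", "2017-11-05"),
    ("industrial robot", "Beijing", "2017-11-09"),
    ("sensor", "Henan", "2017-11-12"),
]
trend = heat_by_month(news, "industrial robot", "Henan")
distribution = heat_by_region(news, "industrial robot")
print(trend)
print(distribution)
```

The first result has the shape a trend chart needs (time on one axis, count on the other), and the second has the shape a regional comparison chart needs.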
Further, analyzing the platform data through a natural language processing method and a data mining algorithm and extracting data related to the industry as interactive data includes: constructing entity identification features through text word segmentation, part-of-speech tagging and syntactic analysis, and identifying the region entities, organization name entities and domain term entities contained in the platform data by fusing a conditional random field with a rule-based method; associating the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and statistically analyzing the association relationships among the news data, the company data and the domain knowledge tree, so as to analyze the distribution and variation trend of the network information data across regions and across the nodes of the industrial chain; and inferring the industrial nodes that the user is concerned with from the user's data on the platform, and recommending personalized news, companies and products to the user by using a content-based recommendation algorithm.
Step 604, interacting with the user terminal through the interactive data.
In this embodiment, information is exchanged with the user through an interactive application provided by the application platform. Here, the interactive application may be a visualization application that displays the analysis results as, for example, a line graph, a bar graph or a list. Specifically:
Using the line graph, displaying to the user the industry trend change of the selected domain knowledge tree node within the selected region.
Using the bar graph, displaying to the user the regional distribution of the selected domain knowledge tree node within the selected region.
Using the list, displaying to the user the upstream and downstream enterprises of the selected domain knowledge tree node within the selected region; using the list, recommending companies of interest to the user; using the list, recommending products of interest to the user; using the list, recommending news of interest to the user.
The method provided by the embodiment of the application can extract effective information from mass data, present the real-time changes at each node of the industry to users, and reveal the upstream and downstream division of labor and the competitors in the industry, thereby assisting the management and decision-making levels of the industry in making quick and effective strategies in response to market changes.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. An industry analysis system based on network information resources, the system comprising: a data acquisition module, a data preprocessing module, a data analysis module and a foreground interaction module, wherein,
the data acquisition module is configured to acquire network information related to industries concerned by the user;
the data preprocessing module is configured to perform structured processing on the network information, fuse the network information with preset platform data, and construct a domain knowledge tree of an industrial structure and the association relationships among the nodes of the domain knowledge tree;
the data analysis module is configured to analyze the platform data and the domain knowledge tree through a natural language processing method and a data mining algorithm, and extract data related to the industry as interactive data; the data analysis module comprises an entity identification submodule and a data mining submodule, wherein the entity identification submodule is configured to construct entity identification features through text word segmentation, part-of-speech tagging and syntactic analysis, and identify a region entity, an organization name entity and a domain term entity contained in the platform data by fusing a conditional random field and a rule-based method; the data mining submodule is configured to associate the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and statistically analyze the association relationship among news data, company data and the domain knowledge tree, so as to analyze the distribution and variation trend of network information data across regions and across the nodes of an industrial chain; and to infer the industrial nodes concerned by the user according to the operation data of the user on the platform, and recommend personalized news, companies and products for the user by using a content-based recommendation algorithm;
and the foreground interaction module is configured to interact with the user terminal through the interaction data.
2. The industry analytics system based on network information resources of claim 1, wherein the data collection module comprises a vertical web crawler and an academic web crawler,
the vertical web crawler is configured to capture webpage information from an industry vertical website by analyzing a uniform resource locator according to a preset first initial seed node;
the academic web crawler is configured to capture academic articles from the academic website according to a preset second initial seed node.
3. The industry analytics system based on network information resources of claim 2, wherein the data pre-processing module comprises a data structuring sub-module, a platform data sub-module, a domain term extraction sub-module and a domain knowledge tree sub-module,
the data structuring sub-module is configured to perform structured analysis on vertical webpage information crawled by the vertical web crawler;
the platform data submodule is configured to store platform user data and collected network information data, and provide data for the data analysis module;
the domain term extraction submodule is configured to extract domain-related terms from academic articles crawled by the academic web crawler;
the domain knowledge tree submodule is configured to combine domain expert knowledge, perform structured organization on the extracted domain terms, construct a domain knowledge tree of an industrial structure, and analyze industrial association relations among domain knowledge tree nodes of the industrial structure.
4. The industry analysis system based on network information resources of claim 3, wherein the domain term extraction sub-module is further configured to analyze the academic articles obtained by the academic web crawler, analyze word frequencies in titles, keywords and abstracts of the articles by using a text analysis method, and extract domain professional terms.
5. The industry analytics system based on network information resources of claim 1, wherein the foreground interaction module comprises a visualization sub-module and a map sub-module,
the visualization submodule is configured to present the result data analyzed by the data analysis module to the user, and interact with the user, through a combination of a domain knowledge tree, a map, a line graph, a bar graph and a list;
the map sub-module is configured to present a map of the selected area to the user.
6. An industry analysis method based on network information resources, characterized in that the method comprises:
collecting network information related to industries concerned by users;
carrying out structuralization processing on the network information, fusing the network information with preset platform data, and constructing a domain knowledge tree of an industrial structure;
analyzing the platform data and the domain knowledge tree through a natural language processing method and a data mining algorithm, and extracting data related to the industry as interactive data, which includes: constructing entity identification features through text word segmentation, part-of-speech tagging and syntactic analysis, and identifying a region entity, an organization name entity and a domain term entity contained in the platform data by fusing a conditional random field and a rule-based method; associating the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and statistically analyzing the association relationship among news data, company data and the domain knowledge tree, so as to analyze the distribution and variation trend of network information data across regions and across the nodes of an industrial chain; inferring the industrial nodes concerned by the user according to data of the user on the platform, and recommending personalized news, companies and products for the user by using a content-based recommendation algorithm;
and interacting with the user terminal through the interaction data.
7. The industry analysis method based on network information resources as claimed in claim 6, wherein the network information related to the industry concerned by the user includes web page information and academic articles, and the collecting of network information related to the industry concerned by the user comprises:
according to a preset first initial seed node, capturing webpage information from an industry vertical website by analyzing a uniform resource locator contained in the first initial seed node by using a vertical web crawler;
and capturing an academic article from the academic website by using the academic web crawler according to a preset second initial seed node.
8. The industry analysis method based on network information resources as claimed in claim 7, wherein the structuring of the network information, fusing with preset platform data, and constructing a domain knowledge tree of an industry structure, comprises:
performing structured analysis on the vertical webpage information acquired by the vertical web crawler;
extracting domain-related terms from academic articles crawled by the academic web crawler;
and structuring the extracted domain terms by combining domain expert knowledge, constructing a domain knowledge tree of an industrial structure, and analyzing the industrial association relationship among nodes of the domain knowledge tree.
9. The industry analytics method based on network information resources of claim 8, wherein the extracting domain-related terms from the academic articles crawled by the academic web crawler comprises:
analyzing the academic articles acquired by the academic web crawler, analyzing the word frequency in the titles, keywords and abstracts of the articles by using a text analysis algorithm, and extracting domain professional terms.
10. The industry analysis method based on network information resources as claimed in claim 6, wherein the interacting with the user terminal through the interaction data comprises:
presenting the interaction data to the user, and interacting with the user, through a combination of a domain knowledge tree, a map, a line graph, a bar graph and a list;
presenting a map of the selected area to the user.
CN201711475066.6A 2017-12-29 2017-12-29 Industry analysis system and method based on network information resources Active CN108229810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711475066.6A CN108229810B (en) 2017-12-29 2017-12-29 Industry analysis system and method based on network information resources

Publications (2)

Publication Number Publication Date
CN108229810A CN108229810A (en) 2018-06-29
CN108229810B true CN108229810B (en) 2021-02-05

Family

ID=62646986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711475066.6A Active CN108229810B (en) 2017-12-29 2017-12-29 Industry analysis system and method based on network information resources

Country Status (1)

Country Link
CN (1) CN108229810B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255034A (en) * 2018-08-08 2019-01-22 数据地平线(广州)科技有限公司 A kind of domain knowledge map construction method based on industrial chain
CN110020226B (en) * 2018-08-20 2023-07-21 中国平安人寿保险股份有限公司 Big data-based data display method, user equipment, storage medium and device
CN109299362B (en) * 2018-09-21 2023-04-14 平安科技(深圳)有限公司 Similar enterprise recommendation method and device, computer equipment and storage medium
CN109543045A (en) * 2018-11-15 2019-03-29 厦门笨鸟电子商务有限公司 A kind of methods of exhibiting of whole world industrial chain
CN110020092A (en) * 2018-11-20 2019-07-16 皮商云集(厦门)科技有限公司 Leather industry data center systems based on web crawlers
CN110175239A (en) * 2019-04-23 2019-08-27 成都数联铭品科技有限公司 A kind of construction method and system of knowledge mapping
CN110263233B (en) * 2019-05-06 2023-04-07 平安科技(深圳)有限公司 Enterprise public opinion library construction method and device, computer equipment and storage medium
CN111275364A (en) * 2020-03-28 2020-06-12 苏州中灏文化科技有限公司 Regional collaborative manufacturing management service platform based on industrial map
CN112464668A (en) * 2020-11-26 2021-03-09 南京数脉动力信息技术有限公司 Method and system for extracting dynamic information of smart home industry
CN113326870B (en) * 2021-05-11 2023-08-04 中科迅(深圳)科技有限公司 Multi-platform travel data fusion system based on big data
CN113987146B (en) * 2021-10-22 2023-01-31 国网江苏省电力有限公司镇江供电分公司 Dedicated intelligent question-answering system of electric power intranet

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446065B1 (en) * 1996-07-05 2002-09-03 Hitachi, Ltd. Document retrieval assisting method and system for the same and document retrieval service using the same
CN103455636A (en) * 2013-09-27 2013-12-18 浪潮齐鲁软件产业有限公司 Automatic capturing and intelligent analyzing method based on Internet tax data
CN104376406A (en) * 2014-11-05 2015-02-25 上海计算机软件技术开发中心 Enterprise innovation resource management and analysis system and method based on big data
CN104573016A (en) * 2015-01-12 2015-04-29 武汉泰迪智慧科技有限公司 System and method for analyzing vertical public opinions based on industry

Also Published As

Publication number Publication date
CN108229810A (en) 2018-06-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100190 No. 95 East Zhongguancun Road, Beijing, Haidian District

Patentee after: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES

Patentee after: Zhongke (Luoyang) robot and intelligent equipment Research Institute

Address before: 100190 No. 95 East Zhongguancun Road, Beijing, Haidian District

Patentee before: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES

Patentee before: INNOVATION INSTITUTE FOR ROBOT AND INTELLIGENT EQUIPMENT (LUOYANG), CASIA