CN108229810B - Industry analysis system and method based on network information resources - Google Patents
Industry analysis system and method based on network information resources
- Publication number
- CN108229810B (application number CN201711475066.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0637—Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention relates to the field of information analysis and provides an industry analysis system based on network information resources. It aims to solve the problems that industry information analysis consumes a large amount of manpower and material resources and cannot be performed in real time. The system comprises a data acquisition module, a data preprocessing module, a data analysis module and a foreground interaction module. The data acquisition module is configured to acquire network information related to the industry; the data preprocessing module is configured to structure the network information, fuse it with platform data and construct an industrial structure tree; the data analysis module is configured to analyze the platform data through natural language processing technology and data mining algorithms and to extract data related to the keywords as interaction data; and the foreground interaction module is configured to interact with the user terminal through the interaction data. The invention mines valuable data from massive network information and presents industry analysis results to users in real time.
Description
Technical Field
The invention relates to the field of computer network information applications, in particular to data mining applications for network information resources, and specifically to an industry analysis system and method based on network information resources.
Background
With the rapid development of information technology, information data in every field is growing explosively, which brings great challenges and pressure to practitioners in each industry. Mining valuable industry information from this massive data, tracking changes in that information in real time, and understanding the development trends of upstream and downstream participants and competitors in the industry helps management and decision makers respond quickly and effectively to market changes, and therefore has important reference value.
Industry analysis is the result of systematically integrating and analyzing industry information, and it has important reference value for enterprises seeking business opportunities, gauging the market, evaluating investment risk and the like. Industry analysis reports are usually produced by an enterprise's internal staff or by specialized market research companies, who collect relevant data and combine it with their working experience. Because such a report must be researched and compiled, it consumes a large amount of manpower and material resources and cannot be produced in real time, which contrasts sharply with the rapid change of the information era.
Disclosure of Invention
To solve the above problems in the prior art, namely that compiling an industry analysis report after investigation consumes a large amount of manpower and material resources and cannot be done in real time, the present application adopts the following technical solutions:
In a first aspect, the present application provides an industry analysis system based on network information resources. The system comprises a data acquisition module, a data preprocessing module, a data analysis module and a foreground interaction module. The data acquisition module is configured to acquire network information related to the industries the user is concerned with; the data preprocessing module is configured to structure the network information, fuse it with preset platform data, and construct a domain knowledge tree of the industrial structure together with the association relationships among the nodes of that tree; the data analysis module is configured to analyze the platform data and the domain knowledge tree through natural language processing methods and data mining algorithms, and to extract data related to the industry as interaction data; the foreground interaction module is configured to interact with the user terminal through the interaction data.
In some examples, the data collection module includes a vertical web crawler and an academic web crawler, the vertical web crawler configured to crawl web page information from an industry vertical website by analyzing uniform resource locators according to a preset first initial seed node; the academic web crawler is configured to capture academic articles from the academic website according to a preset second initial seed node.
In some examples, the data preprocessing module includes a data structuring sub-module, a platform data sub-module, a domain term extraction sub-module, and a domain knowledge tree sub-module. The data structuring sub-module is configured to perform structured parsing of the vertical web page information collected by the vertical web crawler; the platform data sub-module is configured to store platform user data and the collected network information data and to provide data for the data analysis module; the domain term extraction sub-module is configured to extract domain-related terms from the academic articles crawled by the academic web crawler; the domain knowledge tree sub-module is configured to combine domain expert knowledge, organize the extracted domain terms into a structure, construct a domain knowledge tree of the industrial structure, and analyze the industrial association relationships among the nodes of the domain knowledge tree.
In some examples, the domain term extraction sub-module is further configured to analyze the academic articles obtained by the academic web crawler, analyze word frequencies in titles, keywords and abstracts of the articles by using a text analysis method, and extract domain professional terms.
In some examples, the data analysis module includes an entity identification sub-module and a data mining sub-module. The entity identification sub-module is configured to construct entity identification features through text segmentation, part-of-speech tagging, and syntactic analysis, and to combine conditional random fields with rule-based methods to identify the region entities, organization name entities, and domain term entities contained in the platform data. The data mining sub-module is configured to associate the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and to statistically analyze the association relationships among news data, company data and the domain knowledge tree, so as to analyze the distribution and variation trend of network information data across regions and across the nodes of the industrial chain; and to infer the industrial nodes the user is concerned with from the user's operation data on the platform, and recommend personalized news, companies and products to the user by using a content-based recommendation algorithm.
In some examples, the foreground interaction module includes a visualization sub-module and a map sub-module. The visualization sub-module is configured to present the result data produced by the data analysis module to the user through a combination of the domain knowledge tree, maps, line graphs, bar graphs, and lists; the map sub-module is configured to present a map of the selected area to the user.
In a second aspect, the present application provides an industry analysis method based on network information resources, including: collecting network information related to the industries the user is concerned with; structuring the network information, fusing it with preset platform data, and constructing an industrial structure tree; analyzing the platform data through natural language processing technology and data mining algorithms, and extracting data related to the industry as interaction data; and interacting with the user terminal through the interaction data.
In some examples, collecting network information related to the industries the user is concerned with includes: according to a preset first initial seed node, using a vertical web crawler to capture web page information from an industry vertical website by analyzing the uniform resource locators contained in the first initial seed node; and, according to a preset second initial seed node, using an academic web crawler to capture academic articles from an academic website.
In some examples, structuring the network information, fusing it with preset platform data, and constructing a domain knowledge tree of the industrial structure includes: performing structured parsing of the vertical web page information collected by the vertical web crawler; extracting domain-related terms from the academic articles crawled by the academic web crawler; and, in combination with domain expert knowledge, organizing the extracted domain terms and key technologies into a structure, constructing the industrial structure tree, and analyzing the industrial association relationships among the nodes of the structure tree.
In some examples, extracting domain-related terms from the academic articles crawled by the academic web crawler includes: analyzing the academic articles acquired by the academic web crawler, using a text analysis algorithm to analyze the word frequencies in the titles, keywords and abstracts of the articles, and extracting domain professional terms.
In some examples, analyzing the platform data through natural language processing methods and data mining algorithms to extract data related to the industry as the interaction data includes: constructing entity identification features through text segmentation, part-of-speech tagging and syntactic analysis, and combining conditional random fields with rule-based methods to identify the region entities, organization name entities and domain term entities contained in the platform data; associating the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and statistically analyzing the association relationships among news data, company data and the domain knowledge tree, so as to analyze the distribution and variation trend of network information data across regions and across the nodes of the industrial chain; and inferring the industrial nodes the user is concerned with from the user's data on the platform, and recommending personalized news, companies and products to the user by using a content-based recommendation algorithm.
In some examples, interacting with the user terminal through the interaction data includes: presenting the interaction data to the user through a combination of the domain knowledge tree, maps, line graphs, bar graphs and lists; and presenting a map of the selected area to the user.
In the industry analysis system and method based on network information resources provided by the present application, the data acquisition module acquires information related to the user's industry; the data preprocessing module structures this information and constructs a domain knowledge tree of the industry; the data analysis module analyzes and mines the preprocessed information to obtain analysis results for the industry information; and the foreground interaction module interacts with the user. In this way, valuable industry information is mined from massive data, changes in industry information are tracked in real time, information about upstream and downstream participants and competitors in the industry is obtained, and management and decision makers are assisted in responding quickly and effectively to market changes.
Drawings
FIG. 1 is a schematic block diagram of an embodiment of an industry analytics system based on network information resources in accordance with the present application;
FIG. 2 is a basic framework diagram of a vertical web crawler crawling web page information flow in an embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary robot industry chain knowledge tree built by the domain knowledge tree sub-module in an embodiment of the present application;
FIG. 4a is a schematic diagram of the upstream and downstream node relationships of an industry node constructed in an industry chain;
FIG. 4b is a schematic diagram of the upstream and downstream node relationships of the system integration industry node in the robot industry chain;
FIG. 5 is a diagram illustrating exemplary results of performing text segmentation, part-of-speech tagging, and syntactic analysis using a text analysis algorithm in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of the industry analysis method based on network information resources according to the present application.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The industry analysis system based on the network information resources can comprise a data acquisition module, a data preprocessing module, a data analysis module and a foreground interaction module, wherein the data acquisition module is configured to acquire network information related to industries concerned by users; the data preprocessing module is configured to perform structural processing on the network information, fuse the network information with preset platform data, and construct a domain knowledge tree of an industrial structure; the data analysis module is configured to analyze the platform data and the domain knowledge tree through a natural language processing method and a data mining algorithm, and extract data related to the keywords as interactive data; the foreground interaction module is configured to interact with the user terminal through the interaction data.
In this embodiment, the data acquisition module collects industry-related network information according to keywords or key information provided by the user. It may collect information about the industry the user is concerned with, such as information on enterprises in the same industry as the user and on upstream and downstream enterprises; it may also collect information related to the development of that industry, such as its technical and academic frontiers.
The data preprocessing module preprocesses the network information. The preprocessing may structure the network information related to the industry the user is concerned with, so that classified information such as the companies in the industry, their products, their regional distribution, and their purchasing needs can be extracted; it may also build classified frontier information on industry development, from which industry development trends, technical development monitoring and similar information can be obtained.
The data analysis module analyzes and mines the structured data and the classified information produced by the data preprocessing module and, in combination with the user's operation information on the platform, recommends products, companies, news and other information that may interest the user.
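As an illustration of how operation information could drive recommendation, the following is a minimal sketch of a content-based approach, not the patented implementation. It assumes each news item, company or product has already been reduced to a list of terms; the function names `cosine` and `recommend` and the choice of cosine similarity over raw term counts are assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(viewed, candidates, top_n=3):
    """Rank candidate items by similarity to what the user has viewed.

    viewed: list of term lists, one per item the user browsed or clicked.
    candidates: dict mapping an item id to that item's term list.
    """
    # The user profile is the sum of term counts over all viewed items.
    profile = Counter(term for terms in viewed for term in terms)
    scored = [(cosine(profile, Counter(terms)), item_id) for item_id, terms in candidates.items()]
    # Drop items with no overlap, then sort by descending score (ties by id).
    scored = [(score, item_id) for score, item_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_n]]
```

In practice the term lists would more likely be weighted (for example by TF-IDF) and tied to knowledge tree nodes; raw counts are used here only to keep the sketch short.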
The foreground interaction module interacts with the user through an interactive interface, which can display information such as industry trends, regional distribution of the industry, upstream and downstream analysis, competitors, and potential buyers in various chart forms, so that the user can obtain industry information intuitively at any time and in any place.
The system provided by the embodiment of the application analyzes and mines the information related to the industry concerned by the user, and displays the industry information obtained by analyzing and mining for the user.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture of a specific embodiment of an industry analysis system based on network information resources to which the present application may be applied.
Specifically, as shown in fig. 1, a data acquisition module, a data preprocessing module, a data analysis module, and a foreground interaction module of the industry analysis system based on network information resources respectively implement data acquisition, data preprocessing, data analysis, and foreground interaction functions.
The data collection module includes a vertical web crawler 101 and an academic web crawler 102. The vertical web crawler 101 is configured to capture web page information from an industry vertical website by analyzing a Uniform Resource Locator (URL) according to a preset first initial seed node. Specifically, a representative website can be selected according to the industry as the first initial seed node of the vertical web crawler. The vertical web crawler 101 crawls web page information related to the industry concerned by the user by analyzing the URL of the website. The web page information includes news of related enterprises in the industry, organization information, products, purchase information and the like in the industry.
The academic web crawler 102 is configured to capture academic articles from academic websites according to a preset second initial seed node. Here, academic conferences and academic journals can be used as the second initial seed nodes of the academic web crawler. The academic web crawler 102 crawls related academic articles from academic websites, academic journals or academic conferences according to the second initial seed node, and thereby obtains information on the development frontier of the industry.
The vertical web crawler 101 and the academic web crawler 102 can run periodically, at intervals set according to actual requirements, and provide data support for real-time industry analysis. For example, because industry information is updated frequently, the vertical web crawler 101 may run once every hour so that the obtained information is as close to real time as possible; because knowledge in the academic field is updated less frequently, the academic web crawler 102 may run daily or monthly.
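The periodic operation described above can be sketched as a simple check of which crawlers are due. This is only an illustrative sketch: the names `CRAWL_INTERVALS` and `due_crawlers` are hypothetical, and the intervals are taken from the examples in the text (hourly for the vertical crawler, daily for the academic crawler).

```python
from datetime import datetime, timedelta

# Hypothetical run intervals, taken from the examples in the description.
CRAWL_INTERVALS = {
    "vertical": timedelta(hours=1),
    "academic": timedelta(days=1),
}


def due_crawlers(last_run, now):
    """Return the names of crawlers whose interval has elapsed since their last run.

    last_run maps a crawler name to the datetime of its last run;
    a crawler that has never run is always due.
    """
    due = []
    for name, interval in CRAWL_INTERVALS.items():
        previous = last_run.get(name)
        if previous is None or now - previous >= interval:
            due.append(name)
    return due
```

A scheduler loop or a cron job could call `due_crawlers` and start whichever crawlers it returns.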
As an example, FIG. 2 illustrates the flow of the above-described vertical web crawler 101 crawling for industry-related data.
Step 2.1: according to the industry the user is concerned with, select representative websites as seed URLs and store them in module 201, which holds the initial seeds of the web crawler;
Step 2.2: push the seed URLs of module 201 into the to-be-crawled URL queue of module 202;
Step 2.3: module 203 reads a URL from the to-be-crawled URL queue and filters it using the URL filter 204; specifically, the read URL is analyzed, and only web page URLs related to news and companies are retained;
Step 2.4: the downloader of module 205 fetches the web pages at the filtered URLs from the vertical website, and module 206 holds the web page content;
Step 2.5: save the web page of module 206 in the web page database 209, and push the URL of each successfully crawled web page into the crawled URL queue 208 of module 207;
Step 2.6: use module 206 to extract URLs from the web page, filter out those already crawled, and push the URLs not yet crawled to module 202;
Step 2.7: determine whether the queue of module 202 still contains uncrawled web page URLs; if so, return to step 2.3; otherwise, the crawler's crawl of industry-related data ends.
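Steps 2.1 to 2.7 amount to a breadth-first crawl with a URL filter and a record of visited URLs. The following is a minimal sketch under stated assumptions, not the patented implementation: `fetch` and `keep_url` are hypothetical callables supplied by the caller (a downloader and the URL filter), links are found with a simple regular expression, and relative-link resolution is omitted for brevity.

```python
import re
from collections import deque

# Simplified link extraction; a real crawler would use an HTML parser.
LINK_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, keep_url):
    """Breadth-first crawl following steps 2.1 to 2.7.

    fetch(url) returns the page HTML, or None if the download failed.
    keep_url(url) is the URL filter (e.g. keep only news and company pages).
    """
    to_crawl = deque(seed_urls)   # to-be-crawled URL queue (steps 2.1, 2.2)
    queued = set(seed_urls)       # every URL ever queued, to avoid repeats
    crawled = []                  # crawled URL queue (step 2.5)
    pages = {}                    # stands in for the web page database
    while to_crawl:               # step 2.7: stop when the queue is empty
        url = to_crawl.popleft()  # step 2.3: read and filter a URL
        if not keep_url(url):
            continue
        html = fetch(url)         # step 2.4: download the page
        if html is None:
            continue
        pages[url] = html         # step 2.5: store page, record the URL
        crawled.append(url)
        for link in LINK_PATTERN.findall(html):  # step 2.6: queue new URLs
            if link not in queued:
                queued.add(link)
                to_crawl.append(link)
    return crawled, pages
```

Passing the downloader in as `fetch` keeps the queue logic testable without network access; a real deployment would also need politeness delays and robots.txt handling, which are outside the steps described here.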
In this embodiment, the data preprocessing module includes a data structuring sub-module 103, a platform data sub-module 105, a domain term extraction sub-module 104, and a domain knowledge tree sub-module 106. The data structuring sub-module 103 is configured to parse and structure the web page content crawled by the vertical web crawler 101, merge it with the platform data in the platform data sub-module 105, and provide basic data for further analysis and processing. For the news web pages crawled by the vertical web crawler 101, the data structuring sub-module 103 extracts the news title, release time, web page content and the like by using web page parsing tools such as BeautifulSoup and lxml. As an example, Table 1 shows a news data table obtained by parsing the news web pages crawled by the vertical web crawler 101 with a web page parsing tool.
Table 1. News data table
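The extraction of a news record can be sketched as follows. The description names BeautifulSoup and lxml; to stay self-contained, this sketch substitutes Python's standard-library `html.parser`. The CSS class names (`title`, `time`, `content`) and the field names are hypothetical, since every vertical website needs its own extraction rules.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class to news field.
FIELD_BY_CLASS = {"title": "title", "time": "release_time", "content": "content"}
# Elements without a closing tag must not affect the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NewsParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = {field: [] for field in FIELD_BY_CLASS.values()}
        self._field = None  # field currently being filled, if any
        self._depth = 0     # nesting depth inside that field's element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1
            return
        css_class = dict(attrs).get("class")
        if css_class in FIELD_BY_CLASS:
            self._field = FIELD_BY_CLASS[css_class]
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.parts[self._field].append(data)


def parse_news(html):
    """Return a dict with the title, release time and content of a news page."""
    parser = NewsParser()
    parser.feed(html)
    parser.close()
    # Join text chunks and collapse runs of whitespace.
    return {field: " ".join(" ".join(chunks).split()) for field, chunks in parser.parts.items()}
```

Each returned dictionary corresponds to one row of a news data table; company pages could be handled the same way with a different class-to-field mapping.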
The enterprise organization information acquired by the vertical web crawler 101 includes information about companies in the industry. After the company web page content is acquired, it is parsed with a web page parsing tool, and the company's name, address, products, purchasing information, company introduction and the like are extracted. As an example, Table 2 shows a company data table obtained by parsing the company web pages crawled by the vertical web crawler 101 with a web page parsing tool.
Table 2. Company data table
The domain term extraction sub-module 104 parses the academic articles acquired by the academic web crawler 102 by using a web page parsing tool, and then extracts the domain terms of the academic articles. Since the title, keywords, and abstract of an academic article summarize its core content, the analysis may first cover the title, keywords, and abstract, and then the body of the article as needed. Various text analysis algorithms are embedded in the domain term extraction sub-module 104 and are used to analyze the academic articles. Specifically, keywords of the article text may be extracted by using the term frequency-inverse document frequency (TF-IDF) algorithm and the latent Dirichlet allocation (LDA) algorithm, the word frequencies in the title, keywords and abstract may be analyzed by using a clustering method, and terms whose number of occurrences exceeds a set threshold may be extracted as the domain terms of child nodes in the domain knowledge tree. The term set formed by the domain terms of each child node in the domain knowledge tree can be used to analyze the relationship between network information data and the domain knowledge tree. As an example, Table 3 shows the data structure of an academic article; the text analysis algorithm analyzes the article based on the contents of this data structure.
Table 3. Academic article data table
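The frequency threshold and the TF-IDF weighting described above can be sketched as follows. This is an illustrative sketch only: the record fields `title`, `keywords` and `abstract` are assumed from the description, the whitespace tokenizer is a placeholder (Chinese text would first need a word segmenter), and the LDA and clustering steps are not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Placeholder tokenizer: lowercase, split on whitespace, strip punctuation."""
    tokens = (tok.strip(".,;:()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def article_tokens(article):
    """Tokens from the title, keywords and abstract of one article record."""
    text = " ".join([article["title"], " ".join(article["keywords"]), article["abstract"]])
    return tokenize(text)


def frequent_terms(articles, min_count, stopwords=frozenset()):
    """Terms whose total number of occurrences exceeds min_count."""
    counts = Counter()
    for article in articles:
        counts.update(t for t in article_tokens(article) if t not in stopwords)
    return {term for term, n in counts.items() if n > min_count}


def tf_idf(articles):
    """Per-article weights with tf = count / article length and idf = log(N / df)."""
    docs = [Counter(article_tokens(a)) for a in articles]
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    return [
        {t: (c / sum(doc.values())) * math.log(n / df[t]) for t, c in doc.items()}
        for doc in docs
    ]
```

With this unsmoothed idf, a term that appears in every article gets weight zero, which is why the raw frequency threshold and TF-IDF are complementary: the former finds terms common across the domain, the latter finds terms that distinguish individual articles.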
The platform data sub-module 105 is configured to provide basic data and preprocessed data for the analysis performed by the data analysis module. The platform data sub-module 105 stores various kinds of information in the platform, including user operation behaviors, company products, purchase requests, company news, company information, and region information. User operation behaviors are the operations of the user in the system platform, such as browsing news, clicking products, and publishing requirements; they are tracked and recorded to provide data support for the algorithms that analyze user interests. Company products are the product information published by company users in the platform, such as product name, product profile, product functions, and product parameters. Purchase requests are the purchasing information that users publish in the platform, such as product name, parameters, price, and restricted region. Company news is the news published in the platform by company users, including news titles, authors, and contents. Company information is the registration information of a company user in the system platform, such as company name, registered address, and main business. Region information is the geographic information of China built into the system platform, including the full names, short names, longitude and latitude coordinates, and areas of provinces and cities; it is used for analyzing network information and locating companies.
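As one illustration of how the region information could be used to locate a company, the sketch below matches region names against an address string. The `Region` record and the `locate` function are hypothetical, and the two sample entries use approximate coordinates of the provincial capitals purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    full_name: str
    short_name: str
    longitude: float
    latitude: float


# Illustrative entries only; coordinates are approximate (provincial capitals).
REGIONS = [
    Region("Guangdong Province", "Guangdong", 113.27, 23.13),
    Region("Jiangsu Province", "Jiangsu", 118.78, 32.06),
]


def locate(address, regions=REGIONS):
    """Return the first region whose full or short name occurs in the address, else None."""
    for region in regions:
        if region.full_name in address or region.short_name in address:
            return region
    return None
```

A production version would also match city names and resolve ambiguous short names, which simple substring matching does not handle.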
The domain knowledge tree sub-module 106 is configured to construct the domain knowledge tree of the industrial structure, together with the association relationships among its nodes, by combining expert knowledge with the extracted domain terms. The domain knowledge tree sub-module 106 may construct the domain knowledge tree of an industry from the extracted data of the industry in which the company user operates. First, the industrial chain nodes are built: an industrial chain upstream node, an industrial chain midstream node, and an industrial chain downstream node. Then, the child nodes of the upstream, midstream, and downstream nodes are constructed from the web page information crawled by the web crawlers and from expert knowledge. Finally, each child node is in turn treated as an intermediate node and its own child nodes are constructed, which yields the domain knowledge tree of the industry and industrial chain in which the company user operates. By way of example, FIG. 3 illustrates a domain knowledge tree of the robot industrial chain constructed by the domain knowledge tree sub-module 106. The robot industrial chain is divided into an industrial chain upstream node, an industrial chain midstream node, and an industrial chain downstream node.
The industrial chain upstream node represents suppliers and includes child nodes such as raw materials and parts. The industrial chain downstream node represents after-sales services and applications and includes child nodes such as partners, agents, third-party services, and solutions. The industrial chain midstream node represents the main business of the industry and forms the trunk of the domain tree; it includes a robot body node and a robot integration node. The robot integration node has several layers of child nodes: for example, its child nodes include an intelligent robot node, whose child nodes include an industrial robot node, which in turn includes a carrying robot node, and so on.
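The layered construction of the robot industrial chain tree can be sketched as follows. The `KnowledgeNode` class and its methods are hypothetical names chosen for illustration; only the node names come from the example in the description.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeNode:
    name: str
    terms: set[str] = field(default_factory=set)          # domain terms of this node
    children: list["KnowledgeNode"] = field(default_factory=list)

    def add(self, name, terms=()):
        """Attach a child node and return it so deeper levels can be chained."""
        child = KnowledgeNode(name, set(terms))
        self.children.append(child)
        return child

    def find_path(self, name, prefix=()):
        """Depth-first search; returns the tuple of node names from the root."""
        path = prefix + (self.name,)
        if self.name == name:
            return path
        for child in self.children:
            found = child.find_path(name, path)
            if found:
                return found
        return None


def build_robot_tree():
    root = KnowledgeNode("robot industrial chain")
    upstream = root.add("upstream")
    midstream = root.add("midstream")
    downstream = root.add("downstream")
    for name in ("raw materials", "parts"):
        upstream.add(name)
    midstream.add("robot body")
    integration = midstream.add("robot integration")
    intelligent = integration.add("intelligent robot")
    industrial = intelligent.add("industrial robot")
    industrial.add("carrying robot")
    for name in ("partner", "agent", "third-party service", "solution"):
        downstream.add(name)
    return root


tree = build_robot_tree()
path = tree.find_path("carrying robot")
```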
Fig. 4 is a schematic diagram of the upstream and downstream industrial nodes of the robot industrial chain constructed by the domain knowledge tree sub-module 106. Fig. 4a shows the upstream and downstream relationships of each node, and Fig. 4b shows a specific example of an industrial chain node in the robot industrial chain: when the industrial node is "system integration", its upstream industrial nodes include sensors, controllers, and the like, and its downstream industrial nodes include third parties, agents, and the like.
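One simple way to hold these upstream and downstream relationships is a pair of adjacency maps, sketched below. The `IndustryChain` class is a hypothetical illustration, populated only with the "system integration" example from the text.

```python
from collections import defaultdict


class IndustryChain:
    """Directed supplier-to-customer links between industrial nodes."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_link(self, upstream, downstream):
        self._downstream[upstream].add(downstream)
        self._upstream[downstream].add(upstream)

    def upstream_of(self, node):
        return sorted(self._upstream.get(node, ()))

    def downstream_of(self, node):
        return sorted(self._downstream.get(node, ()))


chain = IndustryChain()
for supplier in ("sensor", "controller"):
    chain.add_link(supplier, "system integration")
for customer in ("third party", "agent"):
    chain.add_link("system integration", customer)
```

Storing both directions makes either lookup a single dictionary access, at the cost of recording each link twice.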
In this embodiment, the data analysis module includes an entity identification sub-module 107 and a data mining sub-module 108. The entity identification sub-module 107 is configured to construct entity identification features through text word segmentation, part-of-speech tagging, and syntactic analysis, and to identify the region entities, organization name entities, and domain term entities contained in the platform data by fusing a conditional random field with a rule-based method. The data mining sub-module 108 is configured to associate the identified entities with the domain knowledge tree by using a supervised machine learning algorithm and to statistically analyze the association between the news data, the company data, and the domain knowledge tree, so as to analyze the distribution and trend of the network information data across regions and across the nodes of the industrial chain. It also infers the industrial nodes the user is concerned with from the user's data on the platform and recommends personalized news, companies, and products to the user by using a content-based recommendation algorithm.
The entity identification sub-module 107 includes six units: text segmentation, part-of-speech tagging, syntactic analysis, region identification, organization name identification, and domain term identification. The first three are used to construct entity identification features. Taking the sentence "the robot is a machine device that automatically executes work" as an example, the result of text segmentation, part-of-speech tagging, and syntactic analysis is shown in Fig. 5, and the entity identification features extracted from it are shown in Table 4. A conditional random field (CRF) is then fused with a rule-based method to find and identify the region entities, organization name entities, and domain term entities contained in each piece of information.
Table 4 entity identification features
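The feature construction and the rule-based half of the recognizer can be sketched as below. This is an illustration under stated assumptions: the part-of-speech tags are hand-written placeholders, the feature names are hypothetical and are not the contents of Table 4, the term list is invented, and the CRF itself is omitted because it would require a trained model.

```python
def token_features(tokens, pos_tags, i):
    """Context features for token i, of the kind a sequence labeller could consume."""
    last = len(tokens) - 1
    return {
        "word": tokens[i],
        "pos": pos_tags[i],
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "prev_pos": pos_tags[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < last else "<EOS>",
        "next_pos": pos_tags[i + 1] if i < last else "<EOS>",
    }


def match_gazetteer(tokens, gazetteer, label):
    """Rule-based step: greedy longest match of token spans against a name list.

    `gazetteer` is a set of token tuples; returns (start, end, label) spans.
    """
    entities = []
    max_len = max((len(name) for name in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + length]) in gazetteer:
                entities.append((i, i + length, label))
                i += length
                break
        else:
            i += 1  # no entry starts at this token
    return entities


tokens = "the robot is a machine device that automatically executes work".split()
# Placeholder tags; a real tagger would supply these.
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN", "PRON", "ADV", "VERB", "NOUN"]
domain_terms = {("robot",), ("machine", "device")}  # hypothetical term list

features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
spans = match_gazetteer(tokens, domain_terms, "TERM")
```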
The data mining sub-module 108 uses a supervised learning algorithm to associate the identified entities with industry nodes and analyzes the association between news data, company data, and the domain knowledge tree, producing statistics on industry trend changes, the regional distribution of the industry, and upstream and downstream relationships. By analyzing the data of a user or company in the platform, such as published products and purchase requests, it infers the industry nodes that the user or company is concerned with and recommends products, companies, and news of interest.
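A content-based recommendation of this kind can be sketched with bag-of-words profiles and cosine similarity. This is one common formulation, chosen here as an assumption; the patent does not state which similarity measure is used, and the item identifiers and texts are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_profile(user_texts):
    """Aggregate the terms of everything the user published or viewed."""
    profile = Counter()
    for text in user_texts:
        profile.update(text.lower().split())
    return profile


def recommend(user_texts, candidates, top_k=2):
    """Rank candidate items by similarity to the user profile; drop zero scores."""
    profile = build_profile(user_texts)
    scored = [
        (cosine(profile, Counter(text.lower().split())), item_id)
        for item_id, text in candidates.items()
    ]
    scored = [(score, item_id) for score, item_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_k]]


user_texts = ["welding robot controller", "robot servo motor"]
candidates = {
    "news-1": "new servo motor for welding robot",
    "news-2": "stock market update",
    "product-7": "robot controller board",
}
picks = recommend(user_texts, candidates)
```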
In this embodiment, the foreground interaction module includes a visualization sub-module and a map sub-module. The visualization sub-module is configured to present the result data produced by the data analysis module to the user through a combination of the domain knowledge tree 109, the line graph 111, the bar graph 112, and the list 113; the map sub-module is configured to present a map of the region selected by the user.
The visualization sub-module interacting with the user presents various analysis result information for the user in the modes of the domain knowledge tree 109, the line graph 111, the bar graph 112 and the list 113.
The domain knowledge tree 109 presents the domain knowledge tree structure of the industry concerned by the user for the user to select the industry node to view.
The map sub-module presents the provinces and cities of China to the user; when a province or city is selected, the view automatically jumps to the map of that province.
The line graph 111 shows the trend of the news popularity of an industrial node in a certain area along with the time for the user.
The bar graph 112 presents to the user the distribution of news popularity for an industrial node within a given area.
The list 113 presents the information of the upstream and downstream companies, competitors and potential buyers, and recommended news to the user in a list.
In the system provided by this embodiment of the application, the data acquisition module extracts information related to the user's industry from massive data; the data preprocessing module structures the extracted information and constructs the domain knowledge tree; the data analysis module analyzes and mines the processed information and, in combination with expert knowledge, analyzes industry development trends so as to provide the user with an industry analysis report; and the foreground interaction module exchanges information with the user and provides industry-related information. The user can thus keep track of real-time changes at each node of the industry and learn about the upstream and downstream division of labor and the competitors in the industry, which helps management and decision makers respond quickly and effectively to market changes.
Referring to fig. 6, the present application provides an industry analysis method based on network information resources, which includes the following steps:
Step 601, collecting network information related to the industry of interest to the user.
In this embodiment, an electronic device to which the present application is applied (which may be a server or an application platform) acquires industry-related network information from industry-related websites by using web crawlers. Here, a website related to the industry of interest to the user may be the website of a company in the industry in which the user operates or in its upstream and downstream industries, or a technical or academic forum or website related to the industry. The web crawler may be a vertical web crawler or an academic web crawler. The vertical web crawler collects news, organization, product, and purchasing information from websites in the field. The academic web crawler captures related academic articles from the websites of academic conferences and academic journals in the field. The network information may therefore be news, organization, product, and purchasing information, or academic articles.
In some preferred embodiments, the network information related to the industry of interest to the user includes web page information and academic articles, and collecting that network information includes: starting from a preset first initial seed node, using a vertical web crawler to capture web page information from industry vertical websites by parsing the uniform resource locator (URL) of the first initial seed node; and starting from a preset second initial seed node, using the academic web crawler to capture academic articles from academic websites. Here, the first initial seed node is a representative website of the industry, selected as the starting point of the web crawler. The second initial seed node may be selected from academic conference and academic journal websites. The web crawler crawls the relevant web page information or academic articles by parsing URLs.
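A seed-driven crawl of this kind can be sketched as a breadth-first traversal that stays on the hosts of the seed URLs. This is a minimal sketch under assumptions: the `fetch` callable is injected so that the example runs without network access, the host restriction is one possible reading of "vertical", and a real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=50):
    """Breadth-first crawl from the seed URLs, restricted to the seeds' hosts.

    `fetch(url)` returns the page HTML as a string, or None if unavailable.
    """
    allowed_hosts = {urlparse(seed).netloc for seed in seeds}
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragments
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical two-page site used in place of real HTTP requests.
site = {
    "https://example.com/": '<a href="/news/1">news</a><a href="https://other.org/x">ext</a>',
    "https://example.com/news/1": '<a href="/">home</a>',
}
pages = crawl(["https://example.com/"], site.get)
```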
Step 602, performing structured processing on the network information, fusing it with preset platform data, and constructing a domain knowledge tree of the industrial structure.
In this embodiment, the server or the application platform preprocesses the network information to construct an industrial structure tree. Here, the data preprocessing may be structured parsing of the vertical web page information collected by the vertical web crawler. It may also be the extraction of domain-related terms and key technologies from the academic articles crawled by the academic web crawler, followed by the structured organization of those terms and technologies in combination with domain expert knowledge, so as to construct the industrial structure tree and analyze the industrial association relationships among its nodes. Further, to extract industry-related terms and key technical information from the academic articles crawled by the academic web crawler, the articles may be parsed and the word frequencies in their titles, keywords, and abstracts analyzed with a text analysis algorithm, yielding the domain professional terms. The text analysis algorithm may be TF-IDF, LDA, clustering, or another algorithm.
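Structured parsing of a crawled page can be sketched with the standard-library HTML parser. The mapping of `<h1>`, `<time>`, and `<p>` tags to record fields is an assumption about a hypothetical page layout; real vertical sites would each need their own extraction rules, and this simple version does not handle tags nested inside paragraphs.

```python
from html.parser import HTMLParser


class NewsPageParser(HTMLParser):
    """Turns a simple news page into a {title, published, body} record."""

    FIELDS = {"h1": "title", "time": "published", "p": "body"}

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "published": "", "body": []}
        self._field = None  # record field the current text belongs to

    def handle_starttag(self, tag, attrs):
        self._field = self.FIELDS.get(tag)

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "body":
            self.record["body"].append(text)
        else:
            self.record[self._field] += text


def parse_news_page(html):
    parser = NewsPageParser()
    parser.feed(html)
    parser.close()
    return parser.record


# Hypothetical page content.
html = (
    "<html><body><h1>Robot maker opens plant</h1>"
    "<time>2017-12-29</time>"
    "<p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)
record = parse_news_page(html)
```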
Step 603, analyzing the platform data and the domain knowledge tree through a natural language processing method and a data mining algorithm, and extracting industry-related data as interactive data.
In this embodiment, a natural language processing method may be used to identify region entities, domain term entities, and organization name entities in network information such as news, companies, products, and purchase requests. A data mining algorithm may then classify the news, company, product, purchase, and other information by knowledge node according to the relationship between the identified domain term entities and the nodes of the domain knowledge tree, compile statistics by the region and release time of the information, and track industry trend changes based on changes in the news popularity of each knowledge node.
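The statistics by region and release time can be sketched as simple counting over items that have already been assigned a knowledge node. The record fields (`node`, `region`, `date`) and all sample records are hypothetical, and counting items per month is only one possible definition of news popularity.

```python
from collections import Counter


def heat_by_month(items, node, region):
    """Monthly item counts for one knowledge node in one region (line-graph data)."""
    counts = Counter(
        item["date"][:7]  # "YYYY-MM" prefix of an ISO date string
        for item in items
        if item["node"] == node and item["region"] == region
    )
    return sorted(counts.items())


def heat_by_region(items, node):
    """Item counts per region for one knowledge node (bar-graph data)."""
    counts = Counter(item["region"] for item in items if item["node"] == node)
    return counts.most_common()


# Hypothetical classified news records.
items = [
    {"node": "industrial robot", "region": "Guangdong", "date": "2017-10-03"},
    {"node": "industrial robot", "region": "Guangdong", "date": "2017-11-12"},
    {"node": "industrial robot", "region": "Guangdong", "date": "2017-11-20"},
    {"node": "industrial robot", "region": "Jiangsu", "date": "2017-11-05"},
    {"node": "sensor", "region": "Guangdong", "date": "2017-11-08"},
]
trend = heat_by_month(items, "industrial robot", "Guangdong")
distribution = heat_by_region(items, "industrial robot")
```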
Further, analyzing the platform data through a natural language processing method and a data mining algorithm and extracting industry-related data as interactive data includes: constructing entity identification features through text word segmentation, part-of-speech tagging, and syntactic analysis, and identifying the region entities, organization name entities, and domain term entities contained in the platform data by fusing a conditional random field with a rule-based method; associating the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and statistically analyzing the association between the news data, the company data, and the domain knowledge tree, so as to analyze the distribution and trend of the network information data across regions and across the nodes of the industrial chain; and inferring the industrial nodes the user is concerned with from the user's data on the platform, and recommending personalized news, companies, and products to the user by using a content-based recommendation algorithm.
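The text does not name the supervised algorithm used to associate items with knowledge tree nodes. As one possible stand-in, the sketch below uses a small multinomial naive Bayes classifier with add-one smoothing; the class name, the training texts, and the node labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesNodeClassifier:
    """Assigns a text to a knowledge tree node from labelled example texts."""

    def fit(self, texts, labels):
        self.term_counts = defaultdict(Counter)  # node label -> term counts
        self.label_counts = Counter(labels)      # node label -> number of texts
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled examples: text -> knowledge tree node.
train_texts = [
    "servo motor supplier",
    "reducer and servo parts",
    "after-sales agent network",
    "regional agent service",
]
train_labels = ["parts", "parts", "agent", "agent"]
clf = NaiveBayesNodeClassifier().fit(train_texts, train_labels)
```

Once items carry a node label, the per-node statistics and recommendations described above can be computed over those labels.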
Step 604, interacting with the user terminal through the interactive data.
In this embodiment, information is exchanged with the user through an interactive application provided by the application platform. Here, the interactive application may be a visualization application that displays the analysis results as, for example, a line graph, a bar graph, or a list. Specifically:
The line graph shows the user the industry trend of the selected domain knowledge tree node within the selected region.
The bar graph shows the user the regional distribution of the selected domain knowledge tree node within the selected region.
The list shows the user the upstream and downstream enterprises of the selected domain knowledge tree node within the selected region, and recommends companies, products, and news of interest to the user.
The method provided by this embodiment of the application can extract useful information from massive data, present the user with real-time changes at each node of the industry, and reveal the upstream and downstream division of labor and the competitors in the industry, thereby helping management and decision makers respond quickly and effectively to market changes.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
Claims (10)
1. An industry analysis system based on network information resources, the system comprising: a data acquisition module, a data preprocessing module, a data analysis module and a foreground interaction module, wherein,
the data acquisition module is configured to acquire network information related to industries concerned by the user;
the data preprocessing module is configured to perform structured processing on the network information, fuse the network information with preset platform data, and construct an association relationship between a domain knowledge tree of an industrial structure and a domain knowledge tree node of the industrial structure;
the data analysis module is configured to analyze the platform data and the domain knowledge tree through a natural language processing method and a data mining algorithm, and extract data related to the industry as interactive data; the data analysis module comprises an entity identification submodule and a data mining submodule, wherein the entity identification submodule is configured to construct entity identification characteristics through text word segmentation, part of speech tagging and syntactic analysis, and identify a region entity, an organization name entity and a domain term entity contained in the platform data by fusing a conditional random field and a rule-based method; the data mining submodule is configured to associate the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and statistically analyze the association relationship between news data, company data and the domain knowledge tree, so as to analyze the distribution condition and the variation trend of network information data at each node of a region and an industrial chain; reasoning about the industrial nodes concerned by the user according to the operation data of the user on the platform, and recommending personalized news, companies and products for the user by using a content-based recommendation algorithm;
and the foreground interaction module is configured to interact with the user terminal through the interaction data.
2. The industry analytics system based on network information resources of claim 1, wherein the data collection module comprises a vertical web crawler and an academic web crawler,
the vertical web crawler is configured to capture webpage information from an industry vertical website by analyzing a uniform resource locator according to a preset first initial seed node;
the academic web crawler is configured to capture academic articles from the academic website according to a preset second initial seed node.
3. The industry analytics system based on network information resources of claim 2, wherein the data pre-processing module comprises a data structuring sub-module, a platform data sub-module, a domain term extraction sub-module and a domain knowledge tree sub-module,
the data structuring sub-module is configured to perform structured analysis on vertical webpage information crawled by the vertical web crawler;
the platform data submodule is configured to store platform users and collected network information data and provide data for the analysis module;
the domain term extraction submodule is configured to extract domain-related terms from academic articles crawled by the academic web crawler;
the domain knowledge tree submodule is configured to combine domain expert knowledge, perform structured organization on the extracted domain terms, construct a domain knowledge tree of an industrial structure, and analyze industrial association relations among domain knowledge tree nodes of the industrial structure.
4. The industry analysis system based on network information resources of claim 3, wherein the domain term extraction sub-module is further configured to analyze the academic articles obtained by the academic web crawler, analyze word frequencies in titles, keywords and abstracts of the articles by using a text analysis method, and extract domain professional terms.
5. The industry analytics system based on network information resources of claim 1, wherein the foreground interaction module comprises a visualization sub-module and a map sub-module,
the visualization submodule is configured to interact the result data analyzed by the data analysis module with a user in a comprehensive mode of a domain knowledge tree, a map, a line graph, a bar graph and a list;
the map sub-module is configured to present a map of the selected area to the user.
6. An industry analysis method based on network information resources, characterized in that the method comprises:
collecting network information related to industries concerned by users;
carrying out structuralization processing on the network information, fusing the network information with preset platform data, and constructing a domain knowledge tree of an industrial structure;
analyzing the platform data and the field knowledge tree through a natural language processing method and a data mining algorithm, and extracting data related to the industry as interactive data; it includes: constructing entity recognition characteristics through text word segmentation, part-of-speech tagging and syntactic analysis, fusing a conditional random field and a rule-based method, and recognizing a region entity, an organization name entity and a field term entity contained in the platform data; associating the identified entities with the domain knowledge tree by using a supervised machine learning algorithm, and carrying out statistical analysis on the association relationship among news data, company data and the domain knowledge tree so as to analyze the distribution condition and the variation trend of network information data at each node of a region and an industrial chain; deducing industrial nodes concerned by the user according to data of the user on the platform, and recommending personalized news, companies and products for the user by using a content-based recommendation algorithm;
and interacting with the user terminal through the interaction data.
7. The industry analysis method based on network information resources as claimed in claim 6, wherein the industry-related network information includes web page information and academic articles, and the collecting of the industry-related network information related to the user interest comprises:
according to a preset first initial seed node, capturing webpage information from an industry vertical website by analyzing a uniform resource locator contained in the first initial seed node by using a vertical web crawler;
and capturing an academic article from the academic website by using the academic web crawler according to a preset second initial seed node.
8. The industry analysis method based on network information resources as claimed in claim 7, wherein the structuring of the network information, fusing with preset platform data, and constructing a domain knowledge tree of an industry structure, comprises:
performing structured analysis on the vertical webpage information acquired by the vertical web crawler;
extracting domain-related terms from academic articles crawled by the academic web crawler;
and structuring the extracted domain terms by combining domain expert knowledge, constructing a domain knowledge tree of an industrial structure, and analyzing the industrial association relationship among nodes of the domain knowledge tree.
9. The industry analytics method based on network information resources of claim 8, wherein the extracting domain-related terms from the academic articles crawled by the academic web crawler comprises:
analyzing the academic articles acquired by the academic web crawler, analyzing the word frequency in the titles, keywords and abstracts of the articles by using a text analysis algorithm, and extracting domain professional terms.
10. The industry analysis method based on network information resources as claimed in claim 6, wherein the interacting with the user terminal through the interaction data comprises:
interacting the interaction data with a user in a comprehensive mode of a domain knowledge tree, a map, a line graph, a bar graph and a list;
the user is presented with a map of the selected area.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711475066.6A CN108229810B (en) | 2017-12-29 | 2017-12-29 | Industry analysis system and method based on network information resources |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108229810A CN108229810A (en) | 2018-06-29 |
CN108229810B true CN108229810B (en) | 2021-02-05 |
Family
ID=62646986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711475066.6A Active CN108229810B (en) | 2017-12-29 | 2017-12-29 | Industry analysis system and method based on network information resources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108229810B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109255034A (en) * | 2018-08-08 | 2019-01-22 | 数据地平线(广州)科技有限公司 | A kind of domain knowledge map construction method based on industrial chain |
CN110020226B (en) * | 2018-08-20 | 2023-07-21 | 中国平安人寿保险股份有限公司 | Big data-based data display method, user equipment, storage medium and device |
CN109299362B (en) * | 2018-09-21 | 2023-04-14 | 平安科技(深圳)有限公司 | Similar enterprise recommendation method and device, computer equipment and storage medium |
CN109543045A (en) * | 2018-11-15 | 2019-03-29 | 厦门笨鸟电子商务有限公司 | A kind of methods of exhibiting of whole world industrial chain |
CN110020092A (en) * | 2018-11-20 | 2019-07-16 | 皮商云集(厦门)科技有限公司 | Leather industry data center systems based on web crawlers |
CN110175239A (en) * | 2019-04-23 | 2019-08-27 | 成都数联铭品科技有限公司 | A kind of construction method and system of knowledge mapping |
CN110263233B (en) * | 2019-05-06 | 2023-04-07 | 平安科技(深圳)有限公司 | Enterprise public opinion library construction method and device, computer equipment and storage medium |
CN111275364A (en) * | 2020-03-28 | 2020-06-12 | 苏州中灏文化科技有限公司 | Regional collaborative manufacturing management service platform based on industrial map |
CN112464668A (en) * | 2020-11-26 | 2021-03-09 | 南京数脉动力信息技术有限公司 | Method and system for extracting dynamic information of smart home industry |
CN113326870B (en) * | 2021-05-11 | 2023-08-04 | 中科迅(深圳)科技有限公司 | Multi-platform travel data fusion system based on big data |
CN113987146B (en) * | 2021-10-22 | 2023-01-31 | 国网江苏省电力有限公司镇江供电分公司 | Dedicated intelligent question-answering system of electric power intranet |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6446065B1 (en) * | 1996-07-05 | 2002-09-03 | Hitachi, Ltd. | Document retrieval assisting method and system for the same and document retrieval service using the same |
CN103455636A (en) * | 2013-09-27 | 2013-12-18 | 浪潮齐鲁软件产业有限公司 | Automatic capturing and intelligent analyzing method based on Internet tax data |
CN104376406A (en) * | 2014-11-05 | 2015-02-25 | 上海计算机软件技术开发中心 | Enterprise innovation resource management and analysis system and method based on big data |
CN104573016A (en) * | 2015-01-12 | 2015-04-29 | 武汉泰迪智慧科技有限公司 | System and method for analyzing vertical public opinions based on industry |
Also Published As
Publication number | Publication date |
---|---|
CN108229810A (en) | 2018-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108229810B (en) | Industry analysis system and method based on network information resources | |
Johnson et al. | Web content mining techniques: a survey | |
Shollo et al. | Towards an understanding of business intelligence | |
WO2020037917A1 (en) | User behavior data recommendation method, server and computer readable medium | |
CN105183727A (en) | Method and system for recommending book | |
CN102542061B (en) | Intelligent product classification method | |
CN104281607A (en) | Microblog hot topic analyzing method | |
CN103886074A (en) | Commodity recommendation system based on social media | |
CN102270331A (en) | Network shopping navigating method based on visual search | |
CN106991175B (en) | Customer information mining method, device, equipment and storage medium | |
JP2006309515A (en) | Information delivery method and information delivery server | |
CN103177036A (en) | Method and system for label automatic extraction | |
CN108021651A (en) | Network public opinion risk assessment method and device | |
Vijiyarani et al. | Research issues in web mining | |
CN112685564A (en) | Intelligent science and technology policy classification and pushing method and system | |
Al-Najran et al. | A requirements specification framework for big data collection and capture | |
KR20170115109A (en) | Text-Mining Application Technique for Productive Construction Document Management | |
Zhang | Application of data mining technology in digital library. | |
KR20190048781A (en) | System for crawling and analyzing online reviews about merchandise or service | |
US9165053B2 (en) | Multi-source contextual information item grouping for document analysis | |
Talakokkula | A survey on web usage mining, applications and tools | |
TW201421265A (en) | Intellectual news analyzing system | |
Jian-guo et al. | Web mining for electronic business application | |
Khobreh et al. | Clarifying the Effect of Porter Greening the Competitive Advantage in the Marketing Process by Emphasizing the Marketing Information System and Information Behavior (Case Study: Oil Industry) | |
JP2006227925A (en) | Method and apparatus for providing information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder |
Address after: 100190 No. 95 East Zhongguancun Road, Beijing, Haidian District Patentee after: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES Patentee after: Zhongke (Luoyang) robot and intelligent equipment Research Institute Address before: 100190 No. 95 East Zhongguancun Road, Beijing, Haidian District Patentee before: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES Patentee before: INNOVATION INSTITUTE FOR ROBOT AND INTELLIGENT EQUIPMENT (LUOYANG), CASIA |