CN108334591A

CN108334591A - Industry analysis method and system based on focused crawler technology

Info

Publication number: CN108334591A
Application number: CN201810088951.7A
Authority: CN
Inventors: 薛文芳; 韩艳超; 张德馨; 郑浩楠; 薛金鸽
Original assignee: Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co Ltd
Current assignee: Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co Ltd
Priority date: 2018-01-30
Filing date: 2018-01-30
Publication date: 2018-07-27

Abstract

The present invention discloses an industry analysis method and system based on focused crawler technology. The method uses focused web crawler technology to crawl target websites and obtain structured and unstructured data about a target industry; performs page parsing, data cleaning and content extraction on the crawled data, removes duplicate information, and performs text word segmentation, feature extraction and keyword extraction to isolate useful information; uses text classification and clustering algorithms to extract target information from the useful information and form topic datasets for the industry; and presents the content of each topic dataset as a multidimensional information visualization in the form of documents and/or charts, thereby forming an industry analysis report. The invention can help industry analysts monitor the policy and economic environment of an industry, track competitors' activities, discover fast-growing enterprises, and analyze industry concentration, market size, growth rate and development trends.

Description

Industry analysis method and system based on focused crawler technology

Technical field

The present invention relates to the technical field of industry analysis, and in particular to an industry analysis method and system based on focused crawler technology.

Background art

Industry research is an important tool for all kinds of investment decisions, and in-depth industry research is a prerequisite for successful investment. At present, most industry researchers obtain industry data mainly through internet searches, on-site investigations and face-to-face interviews, and then process and analyze the data on the basis of economic theory and industry experience. The internet is the most important source of data for most industry researchers; taking the leading international consulting firm McKinsey as an example, more than 50% of its market research data is obtained from open-source online channels.

Owing to the explosive growth of information in the internet era, the bottleneck of traditional data acquisition methods has become increasingly prominent, and it is increasingly difficult for industry analysts to find the information they need in massive network data in one pass or with simple queries. Against this background, "big data", as an emerging data processing technology and way of thinking, has received great attention from the scientific community, industry and government departments worldwide. It is regarded as a powerful tool for collecting, mining and analyzing massive data, and has become a research frontier and strategic planning focus in countries around the world. Introducing big data technology into industry analysis is of great significance for innovation and breakthroughs in this field.

Summary of the invention

In view of the technical deficiencies of the prior art, the object of the present invention is to provide an industry analysis method and system based on focused crawler technology.

The technical solution adopted to achieve the object of the present invention is as follows:

An industry analysis method based on focused crawler technology includes the following steps:

Crawling target websites using focused web crawler technology to obtain structured and unstructured data about a target industry;

Performing page parsing, data cleaning and content extraction on the crawled data, removing duplicate information, and performing text word segmentation, feature extraction and keyword extraction to isolate useful information;

Using text classification and clustering algorithms to extract target information from the useful information, and forming topic datasets that respectively cover the following topics: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competition landscape, and industry capacity distribution map;

Presenting the content of each topic dataset as a multidimensional information visualization in the form of documents and/or charts, thereby forming an industry analysis report.
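The preprocessing step listed above (page parsing, content extraction, deduplication and keyword extraction) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the helper names `extract_text`, `deduplicate` and `top_keywords` do not come from the patent, exact-match hashing stands in for whatever deduplication the system uses, and a simple regular expression stands in for a real word segmenter.

```python
import hashlib
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def extract_text(html):
    """Page parsing and content extraction: strip markup, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def deduplicate(texts):
    """Drop exact duplicates (case-insensitive) using a content hash."""
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def top_keywords(text, k=3):
    """Hypothetical keyword extraction: the k most frequent Latin-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(k)]
```

For Chinese pages the regular expression would have to be replaced by a proper word segmentation step; the structure of the pipeline stays the same.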

Wherein, the steps by which the focused web crawler technology crawls the target websites are:

Setting a target website as the seed website, extracting the first-generation page content crawled from the seed website, and collecting all child links contained therein;

Determining, according to the topic feature vector, the relevance between the content of the web page behind each child link and the topic, and screening the links to be visited by relevance: links whose relevance reaches a set threshold are added to the to-visit URL list, and links whose relevance is below the set threshold are filtered out;

Starting the topic crawler and crawling the links in the to-visit URL list in turn.

Wherein, the relevance between the content of the web page behind a child link and the topic is measured by the cosine of the angle between the topic feature vector and the feature vector of the web page to be judged.

Wherein, the topic feature vector is calculated by the TF-IDF algorithm, with the following steps:

First, a sample web page relevant to the topic is selected. Each entry in a predetermined keyword set is treated as a feature term, i.e. a basic unit of the sample web page, and the weight of each feature term in the sample web page is calculated by the TF-IDF algorithm. The feature terms of the web page are thereby converted into an n-dimensional vector in which each dimension corresponds to one feature term and represents the weight of that term in the page. The n feature terms thus yield the vector W(d) = (w₁, w₂, w₃, ..., w_n), where w_i is the weight of the i-th feature term in the sample web page d; this vector is the topic feature vector.

Wherein, the text classification and clustering algorithms use a Gaussian mixture model.

A further object of the present invention is to provide an industry analysis system based on focused crawler technology, including:

A data acquisition module, configured to crawl target websites using focused web crawler technology to obtain structured and unstructured data about a target industry;

A data preprocessing module, configured to perform page parsing, data cleaning and content extraction on the crawled data, remove duplicate information, and perform text word segmentation, feature extraction and keyword extraction to isolate useful information;

A data analysis module, configured to use text classification and clustering methods to extract target information from the useful information and form topic datasets that respectively cover the following topics: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competition landscape, and industry capacity distribution map;

A data application module, configured to present the content of each topic dataset as a multidimensional information visualization in the form of documents and/or charts, thereby forming an industry analysis report.

The data application module further includes an information retrieval and query unit through which a user inputs retrieval and query information.

In the industry analysis method based on focused crawler technology proposed by the present invention, structured and unstructured data about a specific industry are crawled from target websites and other sources by focused crawler technology. Text classification and clustering are then performed on the crawled content to form multiple topic datasets, including a laws and policies topic, an economic environment topic, a market capacity topic, a technology level topic, an enterprise production and sales volume topic, and a competition landscape topic. The content of each topic dataset is then presented in the form of documents and charts, and an industry report is generated automatically. The method can thus help industry analysts monitor the policy and economic environment of an industry, track competitors' activities and discover fast-growing enterprises, and analyze industry concentration, market size, growth rate and development trends.

Description of the drawings

Fig. 1 is a schematic workflow diagram of the industry analysis method based on focused crawler technology;

Fig. 2 is a flow chart of the focused web crawler crawling information from a target website;

Fig. 3 is a schematic diagram of the industry analysis system based on focused crawler technology.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.

As shown in Figs. 1-2, the industry analysis method based on focused crawler technology includes the following steps:

Presenting the content of each topic dataset as a multidimensional information visualization in the form of documents and/or charts, thereby forming a preliminary industry analysis report.

Wherein, when extracting the target information, topic detection can be performed with keywords specified by corresponding algorithms and by manual means, so as to refine and discover target information such as the industry life cycle, industry development trends, industry distribution maps, capacity utilization, industrial policy and market capacity, and thereby form the corresponding topic datasets.

Wherein, the crawled information comes from open-source data, including open-source internet data, think-tank commercial data, and research data on energy, environment, policy, law and the economy. Collecting these data yields structured and unstructured data, which can be stored in a distributed structured/unstructured database for subsequent big data analysis and processing.

In the present invention, the collected data are stored and processed in a distributed structured/unstructured database. A big data cloud storage platform with high reliability and good scalability can be built on Hadoop to support distributed storage and parallel computing of structured, semi-structured and unstructured data, and to realize batch and stream processing of PB-scale multi-source heterogeneous big data. Data from different sources are cleaned and format-checked by a Hadoop interaction module and then uploaded to HDFS for storage.

The present invention combines big data technologies, including focused crawler technology, distributed storage technology and cloud computing technology, with industry research. By building a data collection, processing and mining analysis platform for the field of industry analysis, it uses technical means to collect, organize and store open-source internet data, think-tank commercial data and manufacturing enterprise data, and refines valuable information through mining analyses such as clustering, association and regression. It performs intelligent analysis of questions such as the state of the art of the target industry at home and abroad, the industry life cycle, industry development trends, industry distribution maps, capacity utilization, industrial policy, market capacity, the difficulty of market entry and exit, and the feasibility of specific projects, so that a comprehensive, objective and quantitative evaluation of the overall development of the industry can be made.

In the present invention, focused web crawler technology is used together with the PEST model from industry analysis to build a topic reference model for a specific industry and to determine the range of links that the topic crawler needs to crawl. A search strategy guides the crawler to keep discovering new topic-relevant pages, and the set of crawled pages is finally subjected to web page analysis, so that structured and unstructured data such as text and pictures about the specific industry are collected from the target websites purposefully and selectively.

Determining, according to the topic feature vector, the relevance between the content of the web page behind each child link and the topic, and screening the links to be visited by relevance: the set of links that have undergone relevance calculation is classified by marker value, links whose relevance reaches the set threshold are added to the to-visit URL list, and links whose relevance is below the set threshold are filtered out;

Starting the topic crawler and crawling the links in the to-visit URL list in turn; other links are not processed further. The crawler terminates when the to-visit URL list is empty or when the crawler has crawled enough topic-relevant pages.
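The crawl loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: `fetch`, `relevance` and `max_pages` are hypothetical names, `fetch(url)` is assumed to return a `(text, child_links)` pair, and the sketch assumes that child links found on crawled pages are screened in the same way as those of the seed page.

```python
from collections import deque


def focused_crawl(seed_url, fetch, relevance, threshold, max_pages):
    """Crawl topic-relevant pages reachable from seed_url.

    fetch(url) -> (text, child_links)  # hypothetical page loader
    relevance(text) -> float           # e.g. cosine similarity to the topic vector
    """
    seen = {seed_url}
    to_visit = deque()   # the to-visit URL list
    collected = []       # (url, text) of topic-relevant pages

    def screen(links):
        # Keep links whose page relevance reaches the threshold; drop the rest.
        for link in links:
            if link in seen:
                continue
            seen.add(link)
            text, _ = fetch(link)
            if relevance(text) >= threshold:
                to_visit.append(link)

    # First-generation page of the seed site: collect and screen its child links.
    _, seed_links = fetch(seed_url)
    screen(seed_links)

    # Stop when the list is empty or enough relevant pages have been crawled.
    while to_visit and len(collected) < max_pages:
        url = to_visit.popleft()
        text, links = fetch(url)
        collected.append((url, text))
        screen(links)
    return collected
```

Each page is fetched twice here (once for screening, once for crawling) to keep the sketch short; a real crawler would cache the first response.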

Wherein, the keyword set can determine in the following manner, and artificial or intellectual analysis method is taken to collect The higher vocabulary of the frequency of occurrences in industry analysis field, such as " XX industries ", " yield ", " sales volume ", " disappear and sell volume ", " production capacity utilizes Rate ", " industrial concentration ", " occupation rate of market " etc. form initial key word set；By the entry in initial key word set according to net Stand, manufacturer, product, the columns such as area are classified, form the keyword set or lists of keywords of the specific industry for needing to monitor, Then according to each entry of the keyword set, theme feature vector is calculated by above-mentioned TF-IDF algorithms.

The TF-IDF algorithm is a weighting technique commonly used in information retrieval and text mining to assess how important a keyword is to a web page. The importance of a keyword increases in proportion to the number of times it appears in the web page, but decreases with the frequency with which it appears across the corpus. Its principle is as follows:

In a given web page, the term frequency (TF) is the number of times a given keyword appears in that page. This count is usually normalized to prevent a bias towards long pages, since the same keyword is likely to have a higher count in a long page than in a short one regardless of whether it is important.

The inverse document frequency (IDF) is a measure of the general importance of a keyword. The IDF of a particular keyword is obtained by dividing the total number of web pages by the number of pages containing the keyword and taking the logarithm of the quotient. A high frequency of a keyword within a particular page, combined with a low page frequency of that keyword across the whole collection of pages, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common keywords and retain important ones.

Formally, in a given web page, the term frequency (TF) is the frequency with which a given keyword appears in that page. This value is the term count normalized by page length, to prevent a bias towards long pages. For the i-th keyword $t_i$ in a particular web page, its importance is expressed as:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

In the formula, $n_{i,j}$ is the number of occurrences of the i-th keyword $t_i$ in the j-th web page $d_j$, and the denominator is the sum of the occurrences of all words in the web page $d_j$.

The inverse document frequency (IDF) is a measure of the general importance of a keyword. The IDF of a particular keyword is obtained by dividing the total number of web pages by the number of pages containing the keyword and taking the logarithm of the quotient:

$$\mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $|D|$ is the total number of web pages in the collection and $|\{j : t_i \in d_j\}|$ is the number of pages containing the i-th keyword $t_i$. If no page contains the keyword, the divisor would be zero, so $|\{j : t_i \in d_j\}| + 1$ is generally used as the denominator instead.

Finally, the TF-IDF weight is calculated as:

$$\mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$$
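As a rough illustration of the weighting described above, the following Python sketch computes one TF-IDF vector per page over a fixed keyword set. The function name `tf_idf` and the representation of pages as token lists are assumptions made here; the smoothed denominator (page count plus one) follows the text.

```python
import math
from collections import Counter


def tf_idf(docs, keywords):
    """Return one TF-IDF weight vector per document.

    docs: list of token lists (one per web page)
    keywords: ordered list of feature terms
    """
    total = len(docs)
    # Number of pages that contain each keyword.
    doc_freq = {t: sum(1 for d in docs if t in d) for t in keywords}
    # IDF with the +1 smoothing of the denominator described in the text.
    idf = {t: math.log(total / (1 + doc_freq[t])) for t in keywords}
    vectors = []
    for d in docs:
        counts = Counter(d)
        length = len(d) or 1  # avoid division by zero for empty pages
        vectors.append([counts[t] / length * idf[t] for t in keywords])
    return vectors
```

Note that with this smoothing a keyword that occurs in all pages, or in all but one, receives a weight of zero or below, which is consistent with TF-IDF suppressing very common terms.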

The feature vector X(d) corresponding to the web page d to be judged is first calculated by the TF-IDF algorithm and is denoted:

$$X(d) = (x_1(d), x_2(d), x_3(d), \ldots, x_n(d))$$

where $x_n(d)$ is the weight of the n-th feature term in the web page d to be judged. Let sim be the relevance between the page to be judged and the topic; then

$$\mathrm{sim} = \cos\theta = \frac{\sum_{k=1}^{n} x_k w_k}{\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} w_k^2}}$$

In the formula, $x_k$ is the k-th component of the feature vector of the page currently being processed and $w_k$ is the k-th component of the topic feature vector. Only when sim is greater than or equal to the specified relevance threshold is the current page deemed a topic-relevant page.
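The relevance test can be sketched in a few lines of Python. The function names and the default threshold of 0.5 are illustrative assumptions; the patent does not state a threshold value. Treating an all-zero vector as irrelevant is also an assumption of this sketch.

```python
import math


def cosine_similarity(x, w):
    """Cosine of the angle between page vector x and topic vector w."""
    dot = sum(a * b for a, b in zip(x, w))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_x == 0 or norm_w == 0:
        return 0.0  # assumption: a page with no feature terms is not relevant
    return dot / (norm_x * norm_w)


def is_topic_related(x, w, threshold=0.5):
    """True when the relevance reaches the (hypothetical) threshold."""
    return cosine_similarity(x, w) >= threshold
```

Because TF-IDF weights computed with the smoothed IDF can be negative, the cosine may fall below zero; such pages are simply rejected by the threshold test.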

Wherein, the text classification and clustering algorithms use a Gaussian mixture model (GMM).

A GMM is generated by superimposing multiple Gaussian distribution models and is a common description of a probability density. It is given by the following formula:

$$p(x) = \sum_{g=1}^{N} \omega_g\, N_g(x; \mu_g, \Sigma_g)$$

where:

$$N_g(x; \mu_g, \Sigma_g) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_g|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_g)^{\mathsf T}\,\Sigma_g^{-1}\,(x-\mu_g)\right)$$

In the above formulas, $\mu_g$ is the mean of the g-th Gaussian distribution, $\Sigma_g$ is its covariance, exp is the exponential function, and d is the dimension of x. $\omega_g$ is the weight of the g-th Gaussian distribution and must satisfy $0 \le \omega_g \le 1$ and $\sum_{g=1}^{N} \omega_g = 1$. $N_g$ is the g-th Gaussian distribution in the model, and N is the number of Gaussian distributions in the model. The steps of the GMM algorithm are as follows:

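1) Train the model on sufficient training data and find initial values of the mean $\mu_k$ and covariance $\Sigma_k$ of each Gaussian model;

2) Estimate the probability that the data point $x_i$ was generated by the k-th Gaussian model:

$$\gamma(i,k) = \frac{\omega_k\, N_k(x_i; \mu_k, \Sigma_k)}{\sum_{j=1}^{N} \omega_j\, N_j(x_i; \mu_j, \Sigma_j)}$$

$\gamma(i,k)$, $\mu_k$ and $\Sigma_k$ can be obtained by a variety of methods; the most common is to find the solution that maximizes the likelihood function.

3) Estimate the parameters by applying maximum likelihood:

$$\mu_k = \frac{1}{M_k}\sum_{i=1}^{M} \gamma(i,k)\, x_i$$

$$\Sigma_k = \frac{1}{M_k}\sum_{i=1}^{M} \gamma(i,k)\,(x_i-\mu_k)(x_i-\mu_k)^{\mathsf T}$$

In the above formulas, $M_k = \sum_{i=1}^{M} \gamma(i,k)$, where M is the number of data points; the weight $\omega_k$ used in the formula of step 2) is obtained accordingly as $\omega_k = M_k / M$.

Finally, steps 2) and 3) are repeated until the positions of all feature points no longer change, that is, until the estimates converge.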
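The iteration of steps 2) and 3) can be illustrated with a small expectation-maximization sketch in plain Python. To stay short it makes several simplifying assumptions that are not in the patent: the data are one-dimensional, so each covariance reduces to a scalar variance; the number of iterations is fixed instead of testing for convergence; and a small variance floor guards against degenerate components. The names `gaussian` and `fit_gmm_1d` are hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture by repeating steps 2) and 3)."""
    means, variances, weights = list(means), list(variances), list(weights)
    components = range(len(means))
    gamma = []
    for _ in range(iterations):
        # Step 2: responsibility gamma(i, k) of component k for point x_i.
        gamma = []
        for x in data:
            joint = [weights[k] * gaussian(x, means[k], variances[k])
                     for k in components]
            total = sum(joint)
            gamma.append([p / total for p in joint])
        # Step 3: re-estimate mean, variance and weight of each component.
        for k in components:
            m_k = sum(g[k] for g in gamma)
            means[k] = sum(g[k] * x for g, x in zip(gamma, data)) / m_k
            spread = sum(g[k] * (x - means[k]) ** 2 for g, x in zip(gamma, data))
            variances[k] = max(spread / m_k, 1e-6)  # assumed variance floor
            weights[k] = m_k / len(data)
    return means, variances, weights, gamma
```

A cluster label for each point can then be read off as the component with the largest responsibility. For multidimensional feature vectors, the scalar variance would be replaced by a covariance matrix as in the formulas above.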

Wherein, the data application module further includes an information retrieval and query unit through which a user inputs retrieval and query information. The system then outputs the content of the corresponding topic dataset according to the retrieval information, and the result can be presented to the user in the form of pictures or documents.
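A very small Python sketch of such a query unit is given below. It assumes, purely for illustration, that the topic datasets are held as a dictionary mapping a topic name to a list of text documents and that retrieval is a case-insensitive substring match; the patent does not specify the storage layout or the matching method, and `search_topics` is a hypothetical name.

```python
def search_topics(topic_datasets, query):
    """Return {topic: [matching documents]} for a case-insensitive query."""
    needle = query.strip().lower()
    if not needle:
        return {}
    hits = {}
    for topic, documents in topic_datasets.items():
        matched = [doc for doc in documents if needle in doc.lower()]
        if matched:
            hits[topic] = matched
    return hits
```

The returned mapping could then be rendered as a document or chart by the data application module.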

The industry analysis system based on focused crawler technology further includes a distributed structured/unstructured database. A big data cloud storage platform with high reliability and good scalability can be built on Hadoop to support distributed storage and parallel computing of structured, semi-structured and unstructured data, and to realize batch and stream processing of PB-scale multi-source heterogeneous big data. Data from different sources are cleaned and format-checked by a Hadoop interaction module and then uploaded to HDFS for storage.

For the working principle and procedure of the industry analysis system based on focused crawler technology described above, please refer to the workflow of the industry analysis method based on focused crawler technology described earlier; the working principle of the system is not described in further detail here.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. An industry analysis method based on focused crawler technology, characterized by including the following steps:

Crawling target websites using focused web crawler technology to obtain structured and unstructured data about a target industry;

Performing page parsing, data cleaning and content extraction on the crawled data, removing duplicate information, and performing text word segmentation, feature extraction and keyword extraction to isolate useful information;

Using text classification and clustering algorithms to extract target information from the useful information, and forming topic datasets that respectively cover the following topics: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competition landscape, and industry capacity distribution map;

Presenting the content of each topic dataset as a multidimensional information visualization in the form of documents and/or charts, thereby forming an industry analysis report.

2. The industry analysis method based on focused crawler technology according to claim 1, characterized in that the steps by which the focused web crawler technology crawls the target websites are:

Setting a target website as the seed website, extracting the first-generation page content crawled from the seed website, and collecting all child links contained therein;

Determining, according to the topic feature vector, the relevance between the content of the web page behind each child link and the topic, and screening the links to be visited by relevance: links whose relevance reaches a set threshold are added to the to-visit URL list, and links whose relevance is below the set threshold are filtered out.

3. The industry analysis method based on focused crawler technology according to claim 2, characterized in that the relevance between the content of the web page behind a child link and the topic is measured by the cosine of the angle between the topic feature vector and the feature vector of the web page to be judged.

4. The industry analysis method based on focused crawler technology according to claim 2, characterized in that the topic feature vector is calculated by the TF-IDF algorithm, with the following steps:

First, a sample web page relevant to the topic is selected; each entry in a predetermined keyword set is treated as a feature term, i.e. a basic unit of the sample web page, and the weight of each feature term in the sample web page is calculated by the TF-IDF algorithm; the feature terms of the web page are thereby converted into an n-dimensional vector in which each dimension corresponds to one feature term and represents the weight of that term in the page; the n feature terms thus yield the vector W(d) = (w₁, w₂, w₃, ..., w_n), where w_i is the weight of the i-th feature term in the sample web page d, and this vector is the topic feature vector.

5. The industry analysis method based on focused crawler technology according to claim 1, characterized in that the text classification and clustering algorithms use a Gaussian mixture model.

6. An industry analysis system based on focused crawler technology, characterized by including:

A data acquisition module, configured to crawl target websites using focused web crawler technology to obtain structured and unstructured data about a target industry;

A data preprocessing module, configured to perform page parsing, data cleaning and content extraction on the crawled data, remove duplicate information, and perform text word segmentation, feature extraction and keyword extraction to isolate useful information;

A data analysis module, configured to use text classification and clustering methods to extract target information from the useful information and form topic datasets that respectively cover the following topics: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competition landscape, and industry capacity distribution map;

A data application module, configured to present the content of each topic dataset as a multidimensional information visualization in the form of documents and/or charts, thereby forming an industry analysis report.

7. The industry analysis system based on focused crawler technology according to claim 6, characterized in that the data application module further includes an information retrieval and query unit through which a user inputs retrieval and query information.