CN108334591A - Industry analysis method and system based on focused crawler technology - Google Patents
Industry analysis method and system based on focused crawler technology
- Publication number
- CN108334591A (application number CN201810088951.7A)
- Authority
- CN
- China
- Prior art keywords
- industry
- theme
- information
- data
- focused
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/904—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/06—Asset management; Financial planning or analysis
Abstract
The present invention discloses an industry analysis method and system based on focused crawler technology. The method uses focused web crawler technology to crawl target websites and obtain structured and unstructured data on a target industry; performs page parsing, data cleaning, and content extraction on the crawled data, removes duplicate information, and applies text word segmentation, feature extraction, and keyword extraction to isolate useful information; applies text classification and clustering algorithms to extract target information from the useful information and form theme datasets for the industry; and presents the content of each theme dataset as a multi-dimensional visualization, in the form of documents and/or charts, to produce an industry analysis report. The invention can help industry analysts monitor the policy and economic environment of an industry and the activities of competitors, discover fast-growing enterprises, and analyze industrial concentration, market size, growth rate, and development trends.
Description
Technical field
The present invention relates to the technical field of industry analysis, and in particular to an industry analysis method and system based on focused crawler technology.
Background
Industry analysis is an important tool for all kinds of investment decisions, and in-depth industry research is a prerequisite for successful investment. At present, most industry researchers obtain industry data mainly through internet searches, field investigations, and face-to-face interviews, and then process and analyze the data according to economic theory and industry experience. The internet is the most important data source for most industry researchers; taking the leading international consulting firm McKinsey as an example, more than 50% of its market research data is obtained from open-source online channels.
With the explosive growth of information in the internet era, the bottleneck of traditional data collection methods has become increasingly prominent, and it is increasingly difficult for industry analysts to find the information they need in massive network data with a single or simple query. Against this background, "big data", as an emerging data processing technology and way of thinking, has received great attention from the scientific and technological community, industry, and government departments worldwide. It is regarded as a powerful tool for collecting, mining, and analyzing massive data, and has become a research frontier and strategic planning priority in countries around the world. Introducing big data technology into industry analysis is of great significance for innovation and breakthroughs in the field.
Summary of the invention
In view of the technical deficiencies of the prior art, the object of the present invention is to provide an industry analysis method and system based on focused crawler technology.
The technical solution adopted to achieve this object is as follows:
An industry analysis method based on focused crawler technology includes the following steps:
Crawling target websites using focused web crawler technology to obtain structured and unstructured data on the target industry;
Performing page parsing, data cleaning, and content extraction on the crawled data, removing duplicate information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
Using text classification and clustering algorithms to extract target information from the useful information and form theme datasets covering, respectively, the industry laws and policies theme, industry development status theme, industry development environment theme, industry development level theme, industry market capacity theme, industry technology status theme, industry entry and exit barriers theme, industry life cycle theme, industry competitive landscape theme, and industry capacity distribution map theme;
Presenting the content of each theme dataset as a multi-dimensional visualization, in the form of documents and/or charts, to form an industry analysis report.
Wherein, the focused web crawler technology crawls the target website through the following steps:
Setting the target website as a seed website, extracting the content of the first-generation pages crawled from the seed website, and collecting all sub-links contained therein;
Determining, according to a theme feature vector, the degree of relevance between the content of the web page behind each sub-link and the theme, and screening the links to be visited accordingly: links whose relevance reaches a set threshold are added to the list of URLs to be visited, and links whose relevance is below the set threshold are filtered out;
Starting the theme crawler and crawling data from the links in the list of URLs to be visited in turn.
Wherein, the degree of relevance between the content of the web page behind a sub-link and the theme is measured by the cosine of the angle between the theme feature vector and the feature vector of the web page to be judged.
Wherein, the theme feature vector is calculated by the TF-IDF algorithm as follows:
First, a sample page relevant to the theme is selected. Each entry in a predetermined keyword set is treated as a feature item, i.e. a basic unit of the sample page, and the weight of each feature item in the sample page is calculated by the TF-IDF algorithm. The feature items of the page are then converted into an n-dimensional vector, each dimension of which corresponds to one feature item and represents the weight of that feature item in the page. The n feature items are thus converted into the vector $W(d) = (w_1, w_2, w_3, \ldots, w_n)$, where $w_i$ is the weight of the i-th feature item in the sample page $d$; this vector is the theme feature vector.
Wherein, the text classification and clustering algorithm uses a Gaussian mixture model.
A further object of the present invention is to provide an industry analysis system based on focused crawler technology, including:
A data acquisition module, for crawling target websites using focused web crawler technology to obtain structured and unstructured data on the target industry;
A data preprocessing module, for performing page parsing, data cleaning, and content extraction on the crawled data, removing duplicate information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
A data analysis module, for using text classification and clustering methods to extract target information from the useful information and form theme datasets covering, respectively, the industry laws and policies theme, industry development status theme, industry development environment theme, industry development level theme, industry market capacity theme, industry technology status theme, industry entry and exit barriers theme, industry life cycle theme, industry competitive landscape theme, and industry capacity distribution map theme;
A data application module, for presenting the content of each theme dataset as a multi-dimensional visualization, in the form of documents and/or charts, to form an industry analysis report.
The data application module further includes an information retrieval and query unit, through which a user inputs retrieval and query information.
The industry analysis method based on focused crawler technology proposed by the present invention uses focused crawler technology to crawl structured and unstructured data on a specific industry from target websites and similar sources, then performs text classification and clustering on the crawled content to form multiple theme datasets, including a law and policy theme, an economic environment theme, a market capacity theme, a technology status theme, an enterprise production and sales volume theme, and a competitive landscape theme. The content of each theme dataset is then presented in the form of documents and charts, and an industry report is generated automatically. This helps industry analysts monitor the policy and economic environment of an industry and the activities of competitors, discover fast-growing enterprises, and analyze industrial concentration, market size, growth rate, and development trends.
Brief description of the drawings
Fig. 1 is a schematic workflow diagram of the industry analysis method based on focused crawler technology;
Fig. 2 is a flow chart of the focused web crawler crawling information from a target website;
Fig. 3 is a schematic diagram of the principle of the industry analysis system based on focused crawler technology.
Detailed description
The present invention is described in further detail below with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
As shown in Figs. 1-2, the industry analysis method based on focused crawler technology includes the following steps:
Crawling target websites using focused web crawler technology to obtain structured and unstructured data on the target industry;
Performing page parsing, data cleaning, and content extraction on the crawled data, removing duplicate information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
Using text classification and clustering algorithms to extract target information from the useful information and form theme datasets covering, respectively, the industry laws and policies theme, industry development status theme, industry development environment theme, industry development level theme, industry market capacity theme, industry technology status theme, industry entry and exit barriers theme, industry life cycle theme, industry competitive landscape theme, and industry capacity distribution map theme;
Presenting the content of each theme dataset as a multi-dimensional visualization, in the form of documents and/or charts, to form a preliminary industry analysis report.
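The duplicate-removal step listed above is not specified further in the text. As a minimal sketch, assuming that "duplicate information" means pages whose text is identical after whitespace and case normalization, it could look like the following in Python. The function names `normalize` and `deduplicate` are illustrative and do not come from the patent.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct page text (assumed exact-match criterion)."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in pages:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Near-duplicate detection (for example, pages that differ only in boilerplate) would need a fuzzier criterion than the exact hash used here.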
Wherein, when extracting target information, special-topic detection can be carried out using keywords specified by corresponding algorithms and by manual means, refining and discovering target information such as the industry life cycle, industry development trends, industry distribution maps, capacity utilization rate, industrial policy, and market capacity, so as to form the corresponding theme datasets.
Wherein, the crawled information comes from open-source data, including open-source internet data, think-tank business data, and research data on energy, environment, policy, law, and the economy. Collecting these data yields structured and unstructured data, which can be stored in a distributed structured/unstructured database for big data analysis and processing.
In the present invention, the collected data are stored and processed in a distributed structured/unstructured database. A big data cloud storage platform with high reliability and good scalability can be built on Hadoop to support distributed storage and parallel computation of structured, semi-structured, and unstructured data and to enable batch and stream processing of PB-scale, multi-source, heterogeneous big data. Data from different sources are cleaned and format-checked by a Hadoop interaction module and then uploaded to HDFS for storage and processing.
The present invention combines big data technologies, including focused crawler technology, distributed storage technology, and cloud computing technology, with industry research. By building a data collection, processing, and mining analysis platform for the field of industry analysis, it collects, organizes, and stores open-source internet data, think-tank business data, and manufacturing enterprise data, and refines valuable information through mining analyses such as clustering, association, and regression. It performs intelligent analysis of questions such as the domestic and international state of the art of the target industry, the industry life cycle, industry development trends, industry distribution maps, capacity utilization rate, industrial policy, market capacity, the difficulty of market entry and exit, and the feasibility of specific projects, making possible a comprehensive, objective, and quantitative evaluation of the overall development of the industry.
In the present invention, focused web crawler technology is used together with the PEST model from industry analysis to build a theme benchmark model for the specific industry and to determine the range of links that the theme crawler needs to crawl. A search strategy guides the crawler to keep discovering new theme-related pages, and the set of crawled pages is finally analyzed, so that structured and unstructured data about the specific industry, such as text and pictures, are collected from the target websites purposefully and selectively.
Wherein, the focused web crawler technology crawls the target website through the following steps:
Setting the target website as a seed website, extracting the content of the first-generation pages crawled from the seed website, and collecting all sub-links contained therein;
Determining, according to a theme feature vector, the degree of relevance between the content of the web page behind each sub-link and the theme, and screening the links to be visited accordingly: the set of links processed by the relevance calculation is classified by a mark value, links whose relevance reaches a set threshold are added to the list of URLs to be visited, and links whose relevance is below the set threshold are filtered out;
Starting the theme crawler and crawling data from the links in the list of URLs to be visited in turn; other links are not processed further. The crawler terminates when the list of URLs to be visited is empty or when the crawler has crawled enough theme-related pages.
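The crawl loop described above can be sketched as a queue of URLs to be visited with a relevance filter. This is a minimal sketch under stated assumptions, not the patent's implementation: `fetch_links` and `relevance` are hypothetical callables supplied by the caller (no network access is performed here), and `max_pages` stands in for "enough theme-related pages". The sketch also assumes that sub-links of every crawled page, not only of the seed pages, are scored.

```python
from collections import deque
from typing import Callable, Iterable


def focused_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], list[str]],   # hypothetical: returns the sub-links of a page
    relevance: Callable[[str], float],         # hypothetical: theme relevance of a link's page
    threshold: float,
    max_pages: int,
) -> list[str]:
    """Return the URLs crawled, in order; links scoring below `threshold` are filtered out."""
    frontier = deque(seeds)        # list of URLs to be visited
    seen = set(frontier)           # avoid queueing the same URL twice
    crawled: list[str] = []
    # Terminate when the frontier is empty or enough pages have been crawled.
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            if relevance(link) >= threshold:
                frontier.append(link)
    return crawled
```

A first-in, first-out queue gives a breadth-first order; a priority queue keyed on relevance would be an alternative search strategy, which the text leaves open.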
Wherein, the theme feature vector is calculated by the TF-IDF algorithm as follows:
First, a sample page relevant to the theme is selected. Each entry in a predetermined keyword set is treated as a feature item, i.e. a basic unit of the sample page, and the weight of each feature item in the sample page is calculated by the TF-IDF algorithm. The feature items of the page are then converted into an n-dimensional vector, each dimension of which corresponds to one feature item and represents the weight of that feature item in the page. The n feature items are thus converted into the vector $W(d) = (w_1, w_2, w_3, \ldots, w_n)$, where $w_i$ is the weight of the i-th feature item in the sample page $d$; this vector is the theme feature vector.
Wherein, the keyword set can be determined as follows. Words that occur frequently in the field of industry analysis, such as "XX industry", "output", "sales volume", "consumption and sales volume", "capacity utilization rate", "industrial concentration", and "market share", are collected manually or by intelligent analysis to form an initial keyword set. The entries in the initial keyword set are then classified by columns such as website, manufacturer, product, and region to form the keyword set, or keyword list, of the specific industry to be monitored. The theme feature vector is then calculated from the entries of this keyword set by the TF-IDF algorithm described above.
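A classified keyword set of this kind can be held in a simple mapping from column to entries and flattened into the ordered list of feature items that defines the dimensions of the theme feature vector. The sketch below is purely illustrative: the column names follow the text, but the assignment of entries to columns is hypothetical, as is the helper name `feature_items`.

```python
# Hypothetical keyword set: the assignment of entries to columns is illustrative only.
KEYWORD_SET: dict[str, list[str]] = {
    "product": ["output", "sales volume", "market share"],
    "manufacturer": ["capacity utilization rate"],
    "region": ["industrial concentration"],
}


def feature_items(keyword_set: dict[str, list[str]]) -> list[str]:
    """Flatten the classified keyword set into an ordered list of unique feature items."""
    items: list[str] = []
    for column in sorted(keyword_set):      # fixed column order keeps vector dimensions stable
        for entry in keyword_set[column]:
            if entry not in items:
                items.append(entry)
    return items
```

Keeping the order deterministic matters because the i-th feature item must map to the same vector dimension for both the sample page and every page to be judged.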
The TF-IDF algorithm is a weighting technique commonly used in information retrieval and text mining to assess how important a keyword is to a web page. The importance of a keyword increases in proportion to the number of times it appears in the page, but decreases in inverse proportion to the frequency with which it appears in the corpus. The principle is as follows:
In a given web page, the term frequency (TF) is the number of times a given keyword appears in the page. This count is usually normalized to prevent a bias toward long pages (the same keyword is likely to have a higher count in a long page than in a short one, regardless of whether it is important).
The inverse document frequency (IDF) measures the general importance of a keyword. The IDF of a particular keyword is obtained by dividing the total number of web pages by the number of pages containing the keyword and taking the logarithm of the quotient. A high frequency of a keyword within a particular page, combined with a low frequency of pages containing it across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common keywords and retain important ones.
Formally, the term frequency (TF) of a keyword in a given web page is its term count normalized by the length of the page. For the i-th keyword $t_i$ in a particular web page, its importance is expressed as:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the number of occurrences of the i-th keyword $t_i$ in the j-th web page $d_j$, and the denominator is the sum of the occurrences of all words in $d_j$.
The inverse document frequency (IDF) of a keyword is obtained by dividing the total number of web pages by the number of pages containing the keyword and taking the logarithm of the quotient:
$$IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of relevant web pages and $|\{j : t_i \in d_j\}|$ is the number of pages containing the i-th keyword $t_i$. If no page contains the keyword, the divisor would be zero, so $|\{j : t_i \in d_j\}| + 1$ is generally used as the denominator.
Finally, TF-IDF is calculated as:
$$TFIDF_{i,j} = TF_{i,j} \times IDF_i$$
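The TF, IDF, and TF-IDF definitions above can be sketched in a few lines of Python. This is a minimal sketch assuming that each page has already been segmented into a list of tokens and that each keyword is a single token; the function names are illustrative. The `+ 1` in the IDF denominator follows the smoothing mentioned in the text.

```python
import math
from collections import Counter


def tf(term: str, doc: list[str]) -> float:
    """Term frequency: occurrences of `term` divided by the total token count of `doc`."""
    return Counter(doc)[term] / len(doc) if doc else 0.0


def idf(term: str, corpus: list[list[str]]) -> float:
    """Inverse document frequency with +1 in the denominator to avoid division by zero."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))


def tfidf_vector(keywords: list[str], doc: list[str], corpus: list[list[str]]) -> list[float]:
    """One TF-IDF weight per keyword, in keyword order (the page's feature vector)."""
    return [tf(t, doc) * idf(t, corpus) for t in keywords]
```

Note that with the `+ 1` smoothing the IDF becomes zero or negative for keywords that occur in nearly every page, which is consistent with TF-IDF suppressing common keywords.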
Wherein, the degree of relevance between the content of the web page behind a sub-link and the theme is measured by the cosine of the angle between the theme feature vector and the feature vector of the web page to be judged.
The feature vector $X(d)$ of the web page $d$ to be judged is first calculated by the TF-IDF algorithm and denoted:
$$X(d) = (x_1(d), x_2(d), x_3(d), \ldots, x_n(d))$$
where $x_k(d)$ is the weight of the k-th feature item in the page $d$. Let the relevance between the page to be judged and the theme be $sim$; then
$$sim = \cos\theta = \frac{\sum_{k=1}^{n} x_k w_k}{\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} w_k^2}}$$
where $x_k$ is the k-th component of the feature vector of the page being processed and $w_k$ is the k-th component of the theme feature vector. Only when $sim$ is greater than or equal to the specified relevance threshold is the page being processed judged to be a theme-related page.
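The cosine relevance test described above can be sketched as follows. The function names are illustrative, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero; the text does not address that case.

```python
import math


def cosine_similarity(x: list[float], w: list[float]) -> float:
    """Cosine of the angle between page vector `x` and theme vector `w`."""
    if len(x) != len(w):
        raise ValueError("vectors must have the same number of feature items")
    dot = sum(a * b for a, b in zip(x, w))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0   # assumed convention for zero vectors


def is_theme_related(x: list[float], w: list[float], threshold: float) -> bool:
    """A page counts as theme-related when sim >= threshold."""
    return cosine_similarity(x, w) >= threshold
```

Because cosine similarity ignores vector length, a long page and a short page with the same keyword proportions receive the same score.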
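Wherein, the text classification and clustering algorithm uses a Gaussian mixture model (GMM).
A GMM is formed by superimposing multiple Gaussian distributions and is a common way of describing a probability density, as shown in the following formula:
$$p(x) = \sum_{g=1}^{N} \omega_g \, \mathcal{N}(x; \mu_g, \Sigma_g)$$
where:
$$\mathcal{N}(x; \mu_g, \Sigma_g) = \frac{1}{(2\pi)^{D/2} |\Sigma_g|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_g)^{T} \Sigma_g^{-1} (x-\mu_g)\right)$$
In the above formulas, $\mu_g$ is the mean of the g-th Gaussian distribution, $\Sigma_g$ is its covariance, $\exp$ is the exponential function, $D$ is the dimension of $x$, and $\omega_g$ is the weight of the g-th Gaussian distribution, which must satisfy $0 \le \omega_g \le 1$ and $\sum_{g=1}^{N} \omega_g = 1$; $N$ is the total number of Gaussian distributions in the model. The steps of the GMM algorithm are as follows:
1) Train the model on sufficient training data to find initial values of $\mu_g$ and $\Sigma_g$ for each Gaussian component;
2) Estimate the probability that data point $x_i$ was generated by the k-th Gaussian component:
$$\gamma(i,k) = \frac{\omega_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_{g=1}^{N} \omega_g \, \mathcal{N}(x_i; \mu_g, \Sigma_g)}$$
$\gamma(i,k)$, $\mu_k$, and $\Sigma_k$ can be obtained by a variety of methods; the most common is to solve for the values that maximize the likelihood function.
3) Estimate the parameters by maximum likelihood:
$$\mu_k = \frac{1}{N_k}\sum_{i} \gamma(i,k)\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_{i} \gamma(i,k)\,(x_i-\mu_k)(x_i-\mu_k)^{T}$$
In the above formulas, $N_k = \sum_{i} \gamma(i,k)$, from which the weight $\omega_k = N_k / M$ used in the formula of step 2) can also be obtained, where $M$ is the number of data points.
Finally, steps 2) and 3) are repeated until the positions of all feature points no longer change.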
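The alternation between steps 2) and 3) is the expectation-maximization (EM) procedure. The sketch below illustrates it for one-dimensional data only, which is a simplification: the patent's feature vectors are multi-dimensional and would need full covariance matrices. The function names, the fixed iteration count (used here in place of a convergence check), and the small variance floor are all assumptions made for this illustration.

```python
import math


def gaussian_pdf(x: float, mu: float, var: float) -> float:
    """Density of a 1-D Gaussian with mean `mu` and variance `var`."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, mus, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture by EM, starting from the given initial parameters."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(iterations):
        # Step 2 (E-step): responsibility gamma[i][g] of component g for sample i.
        gamma = []
        for x in data:
            p = [weights[g] * gaussian_pdf(x, mus[g], variances[g]) for g in range(k)]
            total = sum(p)
            gamma.append([v / total for v in p])
        # Step 3 (M-step): re-estimate mean, variance, and weight of each component.
        for g in range(k):
            n_g = sum(row[g] for row in gamma)
            mus[g] = sum(row[g] * x for row, x in zip(gamma, data)) / n_g
            var = sum(row[g] * (x - mus[g]) ** 2 for row, x in zip(gamma, data)) / n_g
            variances[g] = max(var, 1e-6)   # assumed floor to keep the variance positive
            weights[g] = n_g / len(data)
    return mus, variances, weights
```

For example, on two well-separated groups of values the component means settle near the two group averages. The sketch does not guard against a component that receives no responsibility at all; a production implementation would need to handle that case and test for convergence explicitly.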
A further object of the present invention is to provide an industry analysis system based on focused crawler technology, including:
A data acquisition module, for crawling target websites using focused web crawler technology to obtain structured and unstructured data on the target industry;
A data preprocessing module, for performing page parsing, data cleaning, and content extraction on the crawled data, removing duplicate information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
A data analysis module, for using text classification and clustering methods to extract target information from the useful information and form theme datasets covering, respectively, the industry laws and policies theme, industry development status theme, industry development environment theme, industry development level theme, industry market capacity theme, industry technology status theme, industry entry and exit barriers theme, industry life cycle theme, industry competitive landscape theme, and industry capacity distribution map theme;
A data application module, for presenting the content of each theme dataset as a multi-dimensional visualization, in the form of documents and/or charts, to form an industry analysis report.
Wherein, the data application module further includes an information retrieval and query unit through which a user inputs retrieval and query information; the system then outputs the matching content from the corresponding theme dataset according to the retrieval information, and this content can be presented to the user in the form of pictures or documents.
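The text does not say how the query unit matches retrieval information against the theme datasets. As a minimal sketch, assuming a plain case-insensitive substring match over documents grouped by theme, it might look like the following; the function name `query_themes` and the dictionary layout are hypothetical.

```python
def query_themes(theme_datasets: dict[str, list[str]], query: str) -> dict[str, list[str]]:
    """Return, per theme, the documents that contain the query string (case-insensitive)."""
    needle = query.strip().lower()
    if not needle:
        return {}
    hits: dict[str, list[str]] = {}
    for theme, docs in theme_datasets.items():
        matched = [doc for doc in docs if needle in doc.lower()]
        if matched:
            hits[theme] = matched
    return hits
```

A real system would more likely query an index built over the stored data, but the input and output shape would be similar: retrieval text in, matching theme content out.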
The industry analysis system based on focused crawler technology further includes a distributed structured/unstructured database. A big data cloud storage platform with high reliability and good scalability can be built on Hadoop to support distributed storage and parallel computation of structured, semi-structured, and unstructured data and to enable batch and stream processing of PB-scale, multi-source, heterogeneous big data. Data from different sources are cleaned and format-checked by a Hadoop interaction module and then uploaded to HDFS for storage.
For the working principle and procedure of the above industry analysis system based on focused crawler technology, please refer to the workflow of the industry analysis method based on focused crawler technology described above; the working principle of the system is not described again in detail.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (7)
1. An industry analysis method based on focused crawler technology, characterized by comprising the following steps:
Crawling target websites using focused web crawler technology to obtain structured and unstructured data on the target industry;
Performing page parsing, data cleaning, and content extraction on the crawled data, removing duplicate information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
Using text classification and clustering algorithms to extract target information from the useful information and form theme datasets covering, respectively, the industry laws and policies theme, industry development status theme, industry development environment theme, industry development level theme, industry market capacity theme, industry technology status theme, industry entry and exit barriers theme, industry life cycle theme, industry competitive landscape theme, and industry capacity distribution map theme;
Presenting the content of each theme dataset as a multi-dimensional visualization, in the form of documents and/or charts, to form an industry analysis report.
2. The industry analysis method based on focused crawler technology according to claim 1, characterized in that the focused web crawler technology crawls the target website through the following steps:
Setting the target website as a seed website, extracting the content of the first-generation pages crawled from the seed website, and collecting all sub-links contained therein;
Determining, according to a theme feature vector, the degree of relevance between the content of the web page behind each sub-link and the theme, and screening the links to be visited accordingly: links whose relevance reaches a set threshold are added to the list of URLs to be visited, and links whose relevance is below the set threshold are filtered out;
Starting the theme crawler and crawling data from the links in the list of URLs to be visited in turn.
3. The industry analysis method based on focused crawler technology according to claim 2, characterized in that the degree of relevance between the content of the web page behind a sub-link and the theme is measured by the cosine of the angle between the theme feature vector and the feature vector of the web page to be judged.
4. The industry analysis method based on focused crawler technology according to claim 2, characterized in that the theme feature vector is calculated by the TF-IDF algorithm as follows:
First, a sample page relevant to the theme is selected. Each entry in a predetermined keyword set is treated as a feature item, i.e. a basic unit of the sample page, and the weight of each feature item in the sample page is calculated by the TF-IDF algorithm. The feature items of the page are then converted into an n-dimensional vector, each dimension of which corresponds to one feature item and represents the weight of that feature item in the page. The n feature items are thus converted into the vector $W(d) = (w_1, w_2, w_3, \ldots, w_n)$, where $w_i$ is the weight of the i-th feature item in the sample page $d$; this vector is the theme feature vector.
5. The industry analysis method based on focused crawler technology according to claim 1, characterized in that the text classification and clustering algorithm uses a Gaussian mixture model.
6. An industry analysis system based on focused crawler technology, characterized by comprising:
a data acquisition module, configured to crawl target websites using focused web crawler technology and obtain structured and unstructured data on the target industry;
a data preprocessing module, configured to perform page parsing, data cleaning and content extraction on the crawled data, deduplicate repeated information, and perform text segmentation, feature extraction and keyword extraction to isolate useful information;
a data analysis module, configured to extract target information from the useful information using text classification and clustering methods and to form topic datasets including an industry laws and policies topic, an industry development status topic, an industry development environment topic, an industry development level topic, an industry market capacity topic, an industry technology status topic, an industry entry and exit barriers topic, an industry life cycle topic, an industry competition pattern topic, and an industry production capacity distribution map topic;
a data application module, configured to present the content of each topic dataset in multidimensional visual form by means of documents and/or charts, and to form an industry analysis report.
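The page parsing, cleaning and deduplication steps of the data preprocessing module could be sketched as follows. This is an assumption-laden illustration using only the Python standard library: the class `TextExtractor`, the functions `extract_text` and `deduplicate`, and the choice of exact-match deduplication by SHA-256 hash of normalized text are hypothetical and not taken from the patent.

```python
import hashlib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def extract_text(html):
    """Parse a page and return its text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

def deduplicate(pages):
    """Keep the first page for each distinct normalized text."""
    seen, unique = set(), []
    for html in pages:
        text = extract_text(html)
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

# Hypothetical crawled pages; the second differs from the first only in markup.
raw_pages = [
    "<html><body><h1>Steel output</h1><p>Capacity rose.</p></body></html>",
    "<html><body><h1>Steel  output</h1>\n<p>Capacity rose.</p><script>x=1</script></body></html>",
    "<html><body><p>New tax policy announced.</p></body></html>",
]
clean_pages = deduplicate(raw_pages)
```

Exact hashing only catches pages whose extracted text is identical after normalization; near-duplicate detection would need a similarity measure, which the claim does not specify.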
7. The industry analysis system based on focused crawler technology according to claim 6, characterized in that the data application module further comprises an information retrieval and query unit, through which a user inputs retrieval and query information.
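One way an information retrieval and query unit over the topic datasets might work is a simple inverted index, sketched below. The patent does not describe the retrieval mechanism, so the functions `build_index` and `search`, the AND query semantics, whitespace tokenization (which would not suit unsegmented Chinese text) and the sample data are all assumptions.

```python
from collections import defaultdict

def build_index(topic_datasets):
    """Map each lowercase token to the set of (topic, doc_id) pairs containing it."""
    index = defaultdict(set)
    for topic, docs in topic_datasets.items():
        for doc_id, text in enumerate(docs):
            for token in text.lower().split():
                index[token].add((topic, doc_id))
    return index

def search(index, query):
    """Return the entries that contain every query token, in sorted order."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return sorted(hits)

# Hypothetical topic datasets keyed by topic name.
topic_datasets = {
    "industry laws and policies": ["new tax policy for steel exports"],
    "industry market capacity": ["steel market demand grows", "cement market shrinks"],
}
index = build_index(topic_datasets)
results = search(index, "steel market")
```

Each result identifies the topic dataset and the position of the matching document within it, so the caller can display the hit under the corresponding topic.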
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810088951.7A CN108334591A (en) | 2018-01-30 | 2018-01-30 | Industry analysis method and system based on focused crawler technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810088951.7A CN108334591A (en) | 2018-01-30 | 2018-01-30 | Industry analysis method and system based on focused crawler technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108334591A true CN108334591A (en) | 2018-07-27 |
Family
ID=62926710
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810088951.7A Pending CN108334591A (en) | 2018-01-30 | 2018-01-30 | Industry analysis method and system based on focused crawler technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108334591A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831234A (en) * | 2012-08-31 | 2012-12-19 | 北京邮电大学 | Personalized news recommendation device and method based on news content and theme feature |
CN106339378A (en) * | 2015-07-07 | 2017-01-18 | 中国科学院信息工程研究所 | Data collecting method based on keyword oriented topic web crawlers |
CN105550365A (en) * | 2016-01-15 | 2016-05-04 | 中国科学院自动化研究所 | Visualization analysis system based on text topic model |
CN106709052A (en) * | 2017-01-06 | 2017-05-24 | 电子科技大学 | Keyword based topic-focused web crawler design method |
CN107103043A (en) * | 2017-03-29 | 2017-08-29 | 国信优易数据有限公司 | A kind of Text Clustering Method and system |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165349A (en) * | 2018-08-22 | 2019-01-08 | 南京涌亿思信息技术有限公司 | Securities data monitoring method, apparatus and system |
CN109325860A (en) * | 2018-08-29 | 2019-02-12 | 中国科学院自动化研究所 | Network public-opinion detection method and system for overseas investment Risk-warning |
CN109522359A (en) * | 2018-11-02 | 2019-03-26 | 大连瀚闻资讯有限公司 | Visualization industrial analysis method based on big data |
CN109493265A (en) * | 2018-11-05 | 2019-03-19 | 北京奥法科技有限公司 | A kind of Policy Interpretation method and Policy Interpretation system based on deep learning |
CN110020092A (en) * | 2018-11-20 | 2019-07-16 | 皮商云集(厦门)科技有限公司 | Leather industry data center systems based on web crawlers |
CN109597928A (en) * | 2018-12-05 | 2019-04-09 | 云南电网有限责任公司信息中心 | Support the non-structured text acquisition methods based on Web network of subscriber policy configuration |
CN109597928B (en) * | 2018-12-05 | 2022-12-16 | 云南电网有限责任公司信息中心 | Unstructured text acquisition method supporting user policy configuration and based on Web network |
CN109684480A (en) * | 2018-12-30 | 2019-04-26 | 杭州翼兔网络科技有限公司 | A kind of clustering method based on industry |
CN109684480B (en) * | 2018-12-30 | 2021-01-05 | 北京人民在线网络有限公司 | Industry-based clustering method |
CN109740102A (en) * | 2019-01-24 | 2019-05-10 | 国家体育总局体育科学研究所 | A kind of body-building system and method based on knowledge of keeping fit library |
CN110009128A (en) * | 2019-01-28 | 2019-07-12 | 平安科技(深圳)有限公司 | Industry public opinion index prediction technique, device, computer equipment and storage medium |
CN110189170A (en) * | 2019-05-27 | 2019-08-30 | 中译语通科技股份有限公司 | Market sentiment analysis method and system |
CN110297961A (en) * | 2019-06-26 | 2019-10-01 | 广州博士信息技术研究院有限公司 | A kind of Quick Acquisition of policy information and optimization extracting method |
CN110413861A (en) * | 2019-07-23 | 2019-11-05 | 中南民族大学 | Link extracting method, device, equipment and storage medium based on web crawlers |
CN110413861B (en) * | 2019-07-23 | 2021-10-22 | 中南民族大学 | Link extraction method, device, equipment and storage medium based on web crawler |
CN110851562A (en) * | 2019-08-19 | 2020-02-28 | 湖南正宇软件技术开发有限公司 | Information acquisition method, system, equipment and storage medium |
CN110852085A (en) * | 2019-08-19 | 2020-02-28 | 湖南正宇软件技术开发有限公司 | Hotspot topic mining method and system |
CN110400101A (en) * | 2019-08-21 | 2019-11-01 | 苏州经贸职业技术学院 | Industry reports analysis system and method |
CN110704403A (en) * | 2019-08-27 | 2020-01-17 | 北京国联视讯信息技术股份有限公司 | Data acquisition and analysis system and method based on cloud computing |
CN111027318A (en) * | 2019-10-12 | 2020-04-17 | 中国平安财产保险股份有限公司 | Industry classification method, device, equipment and storage medium based on big data |
CN110837595A (en) * | 2019-11-05 | 2020-02-25 | 北京市燃气集团有限责任公司 | Enterprise information data processing method, system, terminal and storage medium |
CN110992168A (en) * | 2019-11-29 | 2020-04-10 | 交通银行股份有限公司 | Bank internal and external data fusion method and system |
CN111476030A (en) * | 2020-05-08 | 2020-07-31 | 中国科学院计算机网络信息中心 | Prospective factor screening method based on deep learning |
CN111767482A (en) * | 2020-05-21 | 2020-10-13 | 中国地质大学(武汉) | Self-adaptive crawling method for focused web crawler |
CN111767482B (en) * | 2020-05-21 | 2023-06-06 | 中国地质大学(武汉) | Self-adaptive crawling method for focused web crawlers |
CN112104656A (en) * | 2020-09-16 | 2020-12-18 | 杭州安恒信息安全技术有限公司 | Network threat data acquisition method, device, equipment and medium |
CN112231535A (en) * | 2020-10-23 | 2021-01-15 | 山东科技大学 | Method for making multi-modal data set in field of agricultural diseases and insect pests, processing device and storage medium |
CN112231535B (en) * | 2020-10-23 | 2022-11-15 | 山东科技大学 | Method for making multi-modal data set in field of agricultural diseases and insect pests, processing device and storage medium |
CN113312343A (en) * | 2021-06-11 | 2021-08-27 | 北京思特奇信息技术股份有限公司 | Business opportunity management method and system based on web crawler tool |
CN113313407A (en) * | 2021-06-16 | 2021-08-27 | 上海交通大学 | Enterprise power utilization behavior identification method and device |
CN113726900A (en) * | 2021-09-02 | 2021-11-30 | 四川启睿克科技有限公司 | System for judging age bracket of user child |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180727 |