CN108334591A - Industry analysis method and system based on focused crawler technology - Google Patents

Industry analysis method and system based on focused crawler technology

Info

Publication number
CN108334591A
Authority
CN
China
Prior art keywords
industry
theme
information
data
focused
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810088951.7A
Other languages
Chinese (zh)
Inventor
薛文芳
韩艳超
张德馨
郑浩楠
薛金鸽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co Ltd
Original Assignee
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co Ltd
Priority to CN201810088951.7A
Publication of CN108334591A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/904 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06 Asset management; Financial planning or analysis

Abstract

The present invention discloses an industry analysis method and system based on focused crawler technology. The method uses focused web crawler technology to crawl target websites and obtain structured and unstructured data about a target industry; performs page parsing, data cleaning, and content extraction on the crawled data, deduplicates repeated information, and performs text word segmentation, feature extraction, and keyword extraction to isolate useful information; applies text classification and clustering algorithms to extract target information from the useful information and form theme datasets for the industry; and presents the content of each theme dataset as a multidimensional information visualization, in the form of documents and/or charts, to form an industry analysis report. The invention can help industry analysts monitor the policy and economic environment of an industry, track competitors' activity, and discover fast-growing enterprises, and can support analysis of industry concentration, market scale, growth rate, and development trends.

Description

Industry analysis method and system based on focused crawler technology
Technical field
The present invention relates to the technical field of industry analysis, and in particular to an industry analysis method and system based on focused crawler technology.
Background technology
Industry research is an important tool for all kinds of investment decisions, and in-depth industry research is a prerequisite for successful investment. At present, most industry researchers obtain industry data mainly through internet searches, field investigations, and face-to-face interviews, and then process and analyze the data according to economic theory and industry experience. The internet is the most important channel through which most industry researchers obtain data; taking the leading international consulting firm McKinsey as an example, more than 50% of its market research data is obtained from open-source network channels.
Owing to the explosive growth of information in the internet era, the bottleneck of traditional data acquisition methods has become increasingly prominent, and it is increasingly difficult for industry analysts to find the information they need in massive network data in one pass or with simple queries. Against this background, "big data", as an emerging data processing technology and way of thinking, has received great attention from the global science and technology community, industry, and government departments. It is regarded as a powerful tool for collecting, mining, and analyzing massive data and has become a research frontier and strategic planning focus for countries around the world. Introducing big data technology into industry analysis is of great significance for innovation and breakthroughs in the field.
Summary of the invention
In view of the technical deficiencies of the prior art, the object of the present invention is to provide an industry analysis method and system based on focused crawler technology.
The technical solution adopted to achieve this object is:
An industry analysis method based on focused crawler technology, comprising the following steps:
Crawling target websites using focused web crawler technology to obtain structured and unstructured data information about the target industry;
Performing page parsing, data cleaning, and content extraction on the crawled data information, deduplicating repeated information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
Using text classification and clustering algorithms to extract target information from the useful information and form theme datasets covering the following themes: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competitive landscape, and industry production capacity distribution map;
Presenting the content of each theme dataset as a multidimensional information visualization, in the form of documents and/or charts, to form an industry analysis report.
Wherein, the steps by which the focused web crawler technology crawls the target website are:
Set the target website as the seed site, extract the content of the first-generation pages crawled from the seed site, and collect all the sub-links they contain;
Determine, according to the theme feature vector, the relevance between the web page content in each sub-link and the theme; screen the links to be visited by relevance, adding links whose relevance value reaches a set threshold to the to-visit URL list and filtering out links whose relevance is below the threshold;
Start the focused crawler and crawl data from the links in the to-visit URL list in turn.
Wherein, the relevance between the web page content in a sub-link and the theme, determined according to the theme feature vector, is measured by the cosine of the angle between the theme feature vector and the feature vector of the web page to be judged.
Wherein, the theme feature vector is calculated by the TF-IDF algorithm, with the following steps:
First select a sample web page related to the theme. Each entry in a predetermined keyword set is treated as a feature item, serving as a basic unit of the sample web page. The weight of each feature item in the sample web page is computed with the TF-IDF algorithm, and the feature items of the page are converted into an n-dimensional space vector in which each dimension corresponds to one feature item and represents that item's weight in the page. The n feature items are thus converted into the vector $W(d) = (w_1, w_2, w_3, \dots, w_n)$, where $w_i$ is the weight of the i-th feature item in the sample web page d; this vector is the theme feature vector.
Wherein, the text classification and clustering algorithm uses a Gaussian mixture model.
Another object of the present invention is to provide an industry analysis system based on focused crawler technology, comprising:
A data acquisition module, for crawling target websites using focused web crawler technology to obtain structured and unstructured data information about the target industry;
A data preprocessing module, for performing page parsing, data cleaning, and content extraction on the crawled data information, deduplicating repeated information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
A data analysis module, for using text classification and clustering methods to extract target information from the useful information and form theme datasets covering the following themes: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competitive landscape, and industry production capacity distribution map;
A data application module, for presenting the content of each theme dataset as a multidimensional information visualization, in the form of documents and/or charts, to form an industry analysis report.
The data application module further includes an information retrieval and query unit through which the user enters retrieval and query information.
The industry analysis method based on focused crawler technology proposed by the present invention uses focused crawler technology to crawl structured and unstructured data on a specific industry from target websites and similar sources, then applies text classification and clustering to the crawled content to form multiple theme datasets, including a laws and policies theme, an economic environment theme, a market capacity theme, a technology status theme, an enterprise production and sales volume theme, and a competitive landscape theme. The content of each theme dataset is then presented as documents and charts, and an industry report is generated automatically. This helps industry analysts monitor the policy and economic environment of an industry, track competitors' activity, and discover fast-growing enterprises, and supports analysis of industry concentration, market scale, growth rate, and development trends.
Description of the drawings
Fig. 1 is the workflow schematic diagram of the industry analysis method based on focused crawler technology;
Fig. 2 is the flow chart that focused web crawler carries out targeted website information scratching;
Fig. 3 is the principle schematic of the industry analysis system based on focused crawler technology.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
As shown in Figs. 1-2, the industry analysis method based on focused crawler technology includes the following steps:
Crawling target websites using focused web crawler technology to obtain structured and unstructured data information about the target industry;
Performing page parsing, data cleaning, and content extraction on the crawled data information, deduplicating repeated information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
Using text classification and clustering algorithms to extract target information from the useful information and form theme datasets covering the following themes: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competitive landscape, and industry production capacity distribution map;
Presenting the content of each theme dataset as a multidimensional information visualization, in the form of documents and/or charts, to form a preliminary industry analysis report.
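The deduplication part of the preprocessing step can be illustrated with a minimal Python sketch. The patent does not say how duplicates are detected; hashing whitespace-normalized, lower-cased text is an assumption made here for illustration, and the function names `normalize` and `deduplicate` are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Collapse runs of whitespace and lower-case the text before hashing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Keep the first occurrence of each page whose normalized text is new."""
    seen = set()
    unique = []
    for text in pages:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


pages = ["Steel  output rose", "steel output rose ", "Cement prices fell"]
unique_pages = deduplicate(pages)
```

Exact hashing only catches copies that differ in spacing or letter case; near-duplicate detection would need a different technique.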
Wherein, when extracting target information, keywords may be specified by corresponding algorithms and by manual means to carry out thematic detection, refining and discovering target information such as the industry life cycle, industry development trends, industry distribution map, capacity utilization rate, industrial policy, and market capacity, thereby forming the corresponding theme datasets.
Wherein, the crawled information comes from open-source data, including open-source internet data, think-tank business data, and research data on energy, environment, policy, law, and economics. Collecting these data yields structured and unstructured data information, which can be stored in a distributed structured/unstructured database for big data analysis and processing.
In the present invention, the collected data are stored and processed using a distributed structured/unstructured database. A big data cloud storage platform with high reliability and good scalability can be built on Hadoop to support distributed storage and parallel computation of structured, semi-structured, and unstructured data and to achieve batch and stream processing of PB-scale multi-source heterogeneous big data. Data from different sources are cleaned and format-checked by a Hadoop interaction module and then uploaded to HDFS for storage and processing.
The present invention combines big data technologies, including focused crawler technology, distributed storage technology, and cloud computing technology, with industry research. By building a data collection, processing, and mining analysis platform for the industry analysis field, it uses technical means to collect, organize, and store open-source internet data, think-tank business data, and manufacturing enterprise data, and refines valuable information through mining analyses such as clustering, association, and regression. It performs intelligent analysis of questions such as the domestic and international technology level of the target industry, the industry life cycle, industry development trends, the industry distribution map, capacity utilization rate, industrial policy, market capacity, the difficulty of market entry and exit, and the feasibility of specific projects, making possible a comprehensive, objective, and quantitative evaluation of the overall development of the industry.
In the present invention, focused web crawler technology is used together with the PEST model from industry analysis to build a theme benchmark model for a specific industry and to determine the range of links the focused crawler needs to crawl. A search strategy guides the crawler to keep discovering new theme-related pages, and the crawled page set is finally subjected to web page analysis, so that structured and unstructured data such as text and pictures about the specific industry are collected from target websites purposefully and selectively.
Wherein, the steps by which the focused web crawler technology crawls the target website are:
Set the target website as the seed site, extract the content of the first-generation pages crawled from the seed site, and collect all the sub-links they contain;
Determine, according to the theme feature vector, the relevance between the web page content in each sub-link and the theme; screen the links to be visited by relevance, classifying the link set processed by the relevance computation by a mark value, adding links whose relevance value reaches a set threshold to the to-visit URL list, and filtering out links whose relevance is below the threshold;
Start the focused crawler and crawl data from the links in the to-visit URL list in turn; other links are not processed further. The crawler terminates when the to-visit URL list is empty or when the crawler has crawled enough theme-related pages.
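The crawl loop described in these steps can be sketched roughly as follows. This is an illustrative Python sketch, not the patent's implementation: `fetch` and `relevance` are hypothetical callables supplied by the caller, the in-memory `PAGES` dictionary stands in for real HTTP requests, and the threshold and page limit are arbitrary values.

```python
from collections import deque


def focused_crawl(seed_url, fetch, relevance, threshold=0.3, max_pages=100):
    """Crawl outward from a seed site, following only theme-related links.

    fetch(url) -> (text, links) and relevance(text) -> float are assumed
    to be provided by the caller.
    """
    to_visit = deque()   # the to-visit URL list
    seen = {seed_url}    # links that have already been screened
    collected = {}       # url -> text of crawled theme-related pages

    def screen(links):
        # Judge the page behind each sub-link; keep it only if relevant enough.
        for link in links:
            if link in seen:
                continue
            seen.add(link)
            text, _ = fetch(link)
            if relevance(text) >= threshold:
                to_visit.append(link)

    _, seed_links = fetch(seed_url)
    screen(seed_links)
    # Stop when the list is empty or enough theme-related pages were crawled.
    while to_visit and len(collected) < max_pages:
        url = to_visit.popleft()
        text, links = fetch(url)
        collected[url] = text
        screen(links)
    return collected


# Toy in-memory "web" used in place of real network access.
PAGES = {
    "seed": ("portal page", ["a", "b"]),
    "a": ("steel output and sales volume", ["c"]),
    "b": ("celebrity gossip", []),
    "c": ("steel capacity utilization", ["a"]),
}
result = focused_crawl(
    "seed",
    fetch=lambda url: PAGES[url],
    relevance=lambda text: 1.0 if "steel" in text else 0.0,
    threshold=0.5,
)
```

In this sketch each page is fetched once for screening and once for crawling; a real crawler would likely cache the first response.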
Wherein, the theme feature vector is calculated by the TF-IDF algorithm, with the following steps:
First select a sample web page related to the theme. Each entry in a predetermined keyword set is treated as a feature item, serving as a basic unit of the sample web page. The weight of each feature item in the sample web page is computed with the TF-IDF algorithm, and the feature items of the page are converted into an n-dimensional space vector in which each dimension corresponds to one feature item and represents that item's weight in the page. The n feature items are thus converted into the vector $W(d) = (w_1, w_2, w_3, \dots, w_n)$, where $w_i$ is the weight of the i-th feature item in the sample web page d; this vector is the theme feature vector.
Wherein, the keyword set can be determined as follows. Manual or intelligent analysis is used to collect words that occur frequently in the industry analysis field, such as "XX industry", "output", "sales volume", "sales revenue", "capacity utilization rate", "industry concentration", and "market share", forming an initial keyword set. The entries in the initial keyword set are then classified by category, such as website, manufacturer, product, and region, to form the keyword set or keyword list for the specific industry to be monitored. The theme feature vector is then calculated from the entries of this keyword set by the TF-IDF algorithm described above.
The TF-IDF algorithm is a weighting technique commonly used in information retrieval and text mining to assess how important a keyword is to a web page. The importance of a keyword increases in proportion to the number of times it appears in the page but decreases in inverse proportion to the frequency with which it appears in the corpus. The principle is as follows:
In a given web page, the term frequency (TF) is the number of times a given keyword appears in that page. This count is usually normalized to prevent a bias toward long pages (the same keyword is likely to have a higher count in a long page than in a short one, regardless of whether it is important).
The inverse document frequency (IDF) measures the general importance of a keyword. The IDF of a particular keyword is obtained by dividing the total number of web pages by the number of web pages containing the keyword and taking the logarithm of the quotient. A high frequency of a keyword within a particular web page, combined with a low page frequency of that keyword across the whole collection of web pages, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common keywords and retain important ones.
In a given web page, the term frequency (TF) is the frequency with which a given keyword appears in that page. This value normalizes the raw term count to prevent a bias toward long pages (the same keyword is likely to have a higher count in a long page than in a short one, regardless of whether it is important). For the i-th keyword $t_i$ in a particular web page $d_j$, its importance is expressed as:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
In the formula, $n_{i,j}$ is the number of occurrences of the i-th keyword $t_i$ in the j-th web page $d_j$, and the denominator is the sum of the occurrences of all words in web page $d_j$.
The inverse document frequency (IDF) measures the general importance of a keyword. The IDF of a particular keyword is obtained by dividing the total number of web pages by the number of web pages containing the keyword and taking the logarithm of the quotient:
$$IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of related web pages and $|\{j : t_i \in d_j\}|$ is the number of web pages containing the i-th keyword $t_i$. If no web page contains the keyword, the divisor would be zero, so $|\{j : t_i \in d_j\}| + 1$ is generally used instead.
Finally, TFIDF is calculated:
$$TFIDF_{i,j} = TF_{i,j} \times IDF_i$$
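As a rough illustration of the TF, IDF, and TFIDF definitions, the following Python sketch computes one weight vector per page over a fixed keyword set. It assumes pages are already segmented into word lists, and it uses the "+1" divisor mentioned in the text; the function name `tf_idf` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(pages, keywords):
    """Return one TF-IDF weight vector per page, ordered like `keywords`.

    pages: list of token lists (word segmentation is assumed to be done).
    """
    n_pages = len(pages)
    # Number of pages containing each keyword, with +1 to avoid a zero divisor.
    idf = {}
    for term in keywords:
        containing = sum(1 for page in pages if term in page)
        idf[term] = math.log(n_pages / (containing + 1))
    vectors = []
    for page in pages:
        counts = Counter(page)
        total = len(page) or 1  # guard against empty pages
        vectors.append([counts[term] / total * idf[term] for term in keywords])
    return vectors


pages = [
    ["steel", "output", "steel"],
    ["weather"],
    ["sports"],
]
keywords = ["steel", "output"]
weights = tf_idf(pages, keywords)
```

With this "+1" form, a keyword that appears in all but one page gets a weight of zero, and a keyword that appears in every page gets a negative weight; whether that is acceptable depends on the keyword set in use.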
Wherein, the relevance between the web page content in a sub-link and the theme, determined according to the theme feature vector, is measured by the cosine of the angle between the theme feature vector and the feature vector of the web page to be judged.
First, the TF-IDF algorithm is used to calculate the feature vector X(d) of the web page d to be judged, denoted:
$$X(d) = (x_1(d), x_2(d), x_3(d), \dots, x_n(d))$$
where $x_n(d)$ is the weight of the n-th feature item in the web page d to be judged. Let the relevance between the page to be judged and the theme be sim:
$$sim = \frac{\sum_{i=1}^{n} x_i(d)\, w_i}{\sqrt{\sum_{i=1}^{n} x_i(d)^2}\,\sqrt{\sum_{i=1}^{n} w_i^2}}$$
In the formula, $x_i(d)$ is the i-th component of the feature vector of the page currently being processed and $w_i$ is the i-th component of the theme feature vector. Only when sim is greater than or equal to the specified relevance threshold is the page currently being processed deemed a theme-related page.
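The relevance test can be sketched in a few lines of Python. The function names and the default threshold of 0.5 are illustrative assumptions; the text only requires that sim reach a specified threshold.

```python
import math


def cosine_similarity(x, w):
    """Cosine of the angle between a page vector x and the theme vector w."""
    dot = sum(a * b for a, b in zip(x, w))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_x == 0 or norm_w == 0:
        return 0.0  # a page with no keyword weight is treated as unrelated
    return dot / (norm_x * norm_w)


def is_theme_related(x, w, threshold=0.5):
    """A page counts as theme-related when sim reaches the threshold."""
    return cosine_similarity(x, w) >= threshold


theme = [0.4, 0.2, 0.0]
page = [0.2, 0.1, 0.0]
sim = cosine_similarity(page, theme)
```

Because the cosine depends only on direction, a page whose weights are a scaled copy of the theme vector scores close to 1 regardless of page length.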
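Wherein, the text classification and clustering algorithm uses a Gaussian mixture model (GMM).
A GMM is produced by superposing multiple Gaussian distribution models and is a common description of a probability density, as shown in the following formula:
$$p(x) = \sum_{g=1}^{N_g} \omega_g \, \mathcal{N}(x; \mu_g, \Sigma_g)$$
Wherein:
$$\mathcal{N}(x; \mu_g, \Sigma_g) = \frac{1}{(2\pi)^{m/2} |\Sigma_g|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_g)^{T} \Sigma_g^{-1} (x - \mu_g)\right)$$
In the above formulas, $\mu_g$ is the mean of a Gaussian distribution, $\Sigma_g$ is its covariance, exp is the exponential function, and $m$ is the dimension of $x$.
$\omega_g$ is the weight of a Gaussian distribution; it must satisfy $0 \le \omega_g \le 1$ and $\sum_{g=1}^{N_g} \omega_g = 1$, where $N_g$ is the total number of Gaussian distributions in the model. The steps of the GMM algorithm are as follows:
1) Train the model with sufficient training data and find initial values of $\mu_g$, $\Sigma_g$ for the k Gaussian components;
2) Estimate the probability that data point $x_i$ was generated by the k-th Gaussian component:
$$\gamma(i,k) = \frac{\omega_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_{g=1}^{N_g} \omega_g \, \mathcal{N}(x_i; \mu_g, \Sigma_g)}$$
$\gamma(i,k)$ and $\mu_k$, $\Sigma_k$ can be obtained by several methods; the most common is to maximize the likelihood function.
3) Estimate the parameters by maximum likelihood:
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(i,k)\, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(i,k)\,(x_i - \mu_k)(x_i - \mu_k)^{T}$$
In the above formulas, $N_k = \sum_{i=1}^{N} \gamma(i,k)$, where $N$ is the number of data points; from this the weight used in the formula of step 2) is also obtained as $\omega_k = N_k / N$.
Finally, steps 2) and 3) are repeated until the positions of all feature points no longer change.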
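Steps 2) and 3) correspond to the usual expectation-maximization iteration. The Python sketch below restricts the model to one-dimensional data so that it needs no external libraries; the fixed iteration count, the variance floor, and all names are assumptions for illustration, and real text features would be multi-dimensional.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, mus, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM from the given initial values."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    n, k = len(xs), len(mus)
    gamma = []
    for _ in range(n_iter):
        # Step 2): probability that each point was generated by each component.
        gamma = []
        for x in xs:
            p = [weights[j] * gaussian_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            gamma.append([pj / total for pj in p])
        # Step 3): maximum-likelihood update of means, variances, and weights.
        for j in range(k):
            n_j = sum(g[j] for g in gamma)
            mus[j] = sum(g[j] * x for g, x in zip(gamma, xs)) / n_j
            var = sum(g[j] * (x - mus[j]) ** 2 for g, x in zip(gamma, xs)) / n_j
            variances[j] = max(var, 1e-6)  # variance floor (an added safeguard)
            weights[j] = n_j / n
    return mus, variances, weights, gamma


xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus, variances, weights, gamma = fit_gmm_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
# Assign each point to the component with the highest responsibility.
labels = [max(range(len(g)), key=g.__getitem__) for g in gamma]
```

Running a fixed number of iterations is a simplification; a fuller implementation would stop once the parameter estimates or the log-likelihood change by less than a tolerance.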
Another object of the present invention is to provide an industry analysis system based on focused crawler technology, comprising:
A data acquisition module, for crawling target websites using focused web crawler technology to obtain structured and unstructured data information about the target industry;
A data preprocessing module, for performing page parsing, data cleaning, and content extraction on the crawled data information, deduplicating repeated information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
A data analysis module, for using text classification and clustering methods to extract target information from the useful information and form theme datasets covering the following themes: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competitive landscape, and industry production capacity distribution map;
A data application module, for presenting the content of each theme dataset as a multidimensional information visualization, in the form of documents and/or charts, to form an industry analysis report.
Wherein, the data application module further includes an information retrieval and query unit through which the user enters retrieval and query information; the system then outputs the matching content from the corresponding theme dataset according to the retrieval information, and this content can be presented to the user in the form of pictures or documents.
The industry analysis system based on focused crawler technology further includes a distributed structured/unstructured database. A big data cloud storage platform with high reliability and good scalability can be built on Hadoop to support distributed storage and parallel computation of structured, semi-structured, and unstructured data and to achieve batch and stream processing of PB-scale multi-source heterogeneous big data. Data from different sources are cleaned and format-checked by a Hadoop interaction module and then uploaded to HDFS for storage.
For the working principle and process of the industry analysis system described above, see the workflow of the industry analysis method based on focused crawler technology described earlier; the working principle of the system is not described again in detail.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (7)

1. An industry analysis method based on focused crawler technology, characterized in that it comprises the following steps:
Crawling target websites using focused web crawler technology to obtain structured and unstructured data information about the target industry;
Performing page parsing, data cleaning, and content extraction on the crawled data information, deduplicating repeated information, and performing text word segmentation, feature extraction, and keyword extraction to isolate useful information;
Using text classification and clustering algorithms to extract target information from the useful information and form theme datasets covering the following themes: industry laws and policies, industry development status, industry development environment, industry development level, industry market capacity, industry technology status, industry entry and exit barriers, industry life cycle, industry competitive landscape, and industry production capacity distribution map;
Presenting the content of each theme dataset as a multidimensional information visualization, in the form of documents and/or charts, to form an industry analysis report.
2. The industry analysis method based on focused crawler technology according to claim 1, characterized in that the steps by which the focused web crawler technology crawls the target website are:
Set the target website as the seed site, extract the content of the first-generation pages crawled from the seed site, and collect all the sub-links they contain;
Determine, according to the theme feature vector, the relevance between the web page content in each sub-link and the theme; screen the links to be visited by relevance, adding links whose relevance value reaches a set threshold to the to-visit URL list and filtering out links whose relevance is below the threshold;
Start the focused crawler and crawl data from the links in the to-visit URL list in turn.
3. The industry analysis method based on focused crawler technology according to claim 2, characterized in that the relevance between the web page content in a sub-link and the theme, determined according to the theme feature vector, is measured by the cosine of the angle between the theme feature vector and the feature vector of the web page to be judged.
4. The industry analysis method based on focused crawler technology according to claim 2, characterized in that the theme feature vector is calculated by the TF-IDF algorithm, with the following steps:
First select a sample web page related to the theme. Each entry in a predetermined keyword set is treated as a feature item, serving as a basic unit of the sample web page. The weight of each feature item in the sample web page is computed with the TF-IDF algorithm, and the feature items of the page are converted into an n-dimensional space vector in which each dimension corresponds to one feature item and represents that item's weight in the page. The n feature items are thus converted into the vector $W(d) = (w_1, w_2, w_3, \dots, w_n)$, where $w_i$ is the weight of the i-th feature item in the sample web page d; this vector is the theme feature vector.
5. The industry analysis method based on focused crawler technology according to claim 1, characterized in that the text classification and clustering algorithm uses a Gaussian mixture model.
6. An industry analysis system based on focused crawler technology, characterized by comprising:
A data acquisition module for crawling information from target websites using focused web crawler technology, to obtain structured and unstructured data about the target industry;
A data preprocessing module for performing page parsing, data cleaning and content extraction on the captured data, deduplicating repeated information, and performing text word segmentation, feature extraction and keyword extraction to isolate the useful information;
A data analysis module for extracting target information from the useful information using text classification and clustering methods, and forming theme datasets that respectively cover an industry laws and policies theme, an industry development situation theme, an industry development environment theme, an industry development level theme, an industry market capacity theme, an industry technology status theme, an industry entry and exit barriers theme, an industry life cycle theme, an industry competition landscape theme, and an industry production capacity distribution map theme;
A data application module for presenting the content of each theme dataset through multi-dimensional information visualization, in the form of documents and/or charts, and forming an industry analysis report.
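The deduplication step performed by the data preprocessing module of claim 6 could look like the sketch below. The claim does not say how duplicates are detected, so hashing a whitespace- and case-normalized copy of each text is an assumption, as are the function names and the sample strings.

```python
import hashlib


def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["Steel output rises", "steel  output RISES", "New policy issued"]
unique_docs = deduplicate(docs)
```

Exact hashing only catches copies that are identical after normalization; near-duplicate detection (for example, shingling) would be a separate technique not described in the claim.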
7. The industry analysis system based on focused crawler technology as claimed in claim 6, characterized in that the data application module further comprises an information retrieval and query unit, through which a user inputs retrieval and query information.
CN201810088951.7A 2018-01-30 2018-01-30 Industry analysis method and system based on focused crawler technology Pending CN108334591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810088951.7A CN108334591A (en) 2018-01-30 2018-01-30 Industry analysis method and system based on focused crawler technology

Publications (1)

Publication Number Publication Date
CN108334591A true CN108334591A (en) 2018-07-27

Family

ID=62926710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810088951.7A Pending CN108334591A (en) 2018-01-30 2018-01-30 Industry analysis method and system based on focused crawler technology

Country Status (1)

Country Link
CN (1) CN108334591A (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165349A (en) * 2018-08-22 2019-01-08 南京涌亿思信息技术有限公司 Securities data monitoring method, apparatus and system
CN109325860A (en) * 2018-08-29 2019-02-12 中国科学院自动化研究所 Network public-opinion detection method and system for overseas investment Risk-warning
CN109493265A (en) * 2018-11-05 2019-03-19 北京奥法科技有限公司 A kind of Policy Interpretation method and Policy Interpretation system based on deep learning
CN109522359A (en) * 2018-11-02 2019-03-26 大连瀚闻资讯有限公司 Visualization industrial analysis method based on big data
CN109597928A (en) * 2018-12-05 2019-04-09 云南电网有限责任公司信息中心 Support the non-structured text acquisition methods based on Web network of subscriber policy configuration
CN109684480A (en) * 2018-12-30 2019-04-26 杭州翼兔网络科技有限公司 A kind of clustering method based on industry
CN109740102A (en) * 2019-01-24 2019-05-10 国家体育总局体育科学研究所 A kind of body-building system and method based on knowledge of keeping fit library
CN110009128A (en) * 2019-01-28 2019-07-12 平安科技(深圳)有限公司 Industry public opinion index prediction technique, device, computer equipment and storage medium
CN110020092A (en) * 2018-11-20 2019-07-16 皮商云集(厦门)科技有限公司 Leather industry data center systems based on web crawlers
CN110189170A (en) * 2019-05-27 2019-08-30 中译语通科技股份有限公司 Market sentiment analysis method and system
CN110297961A (en) * 2019-06-26 2019-10-01 广州博士信息技术研究院有限公司 A kind of Quick Acquisition of policy information and optimization extracting method
CN110400101A (en) * 2019-08-21 2019-11-01 苏州经贸职业技术学院 Industry reports analysis system and method
CN110413861A (en) * 2019-07-23 2019-11-05 中南民族大学 Link extracting method, device, equipment and storage medium based on web crawlers
CN110704403A (en) * 2019-08-27 2020-01-17 北京国联视讯信息技术股份有限公司 Data acquisition and analysis system and method based on cloud computing
CN110837595A (en) * 2019-11-05 2020-02-25 北京市燃气集团有限责任公司 Enterprise information data processing method, system, terminal and storage medium
CN110851562A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Information acquisition method, system, equipment and storage medium
CN110852085A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Hotspot topic mining method and system
CN110992168A (en) * 2019-11-29 2020-04-10 交通银行股份有限公司 Bank internal and external data fusion method and system
CN111027318A (en) * 2019-10-12 2020-04-17 中国平安财产保险股份有限公司 Industry classification method, device, equipment and storage medium based on big data
CN111476030A (en) * 2020-05-08 2020-07-31 中国科学院计算机网络信息中心 Prospective factor screening method based on deep learning
CN111767482A (en) * 2020-05-21 2020-10-13 中国地质大学(武汉) Self-adaptive crawling method for focused web crawler
CN112104656A (en) * 2020-09-16 2020-12-18 杭州安恒信息安全技术有限公司 Network threat data acquisition method, device, equipment and medium
CN112231535A (en) * 2020-10-23 2021-01-15 山东科技大学 Method for making multi-modal data set in field of agricultural diseases and insect pests, processing device and storage medium
CN113312343A (en) * 2021-06-11 2021-08-27 北京思特奇信息技术股份有限公司 Business opportunity management method and system based on web crawler tool
CN113313407A (en) * 2021-06-16 2021-08-27 上海交通大学 Enterprise power utilization behavior identification method and device
CN113726900A (en) * 2021-09-02 2021-11-30 四川启睿克科技有限公司 System for judging age bracket of user child

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831234A (en) * 2012-08-31 2012-12-19 北京邮电大学 Personalized news recommendation device and method based on news content and theme feature
CN105550365A (en) * 2016-01-15 2016-05-04 中国科学院自动化研究所 Visualization analysis system based on text topic model
CN106339378A (en) * 2015-07-07 2017-01-18 中国科学院信息工程研究所 Data collecting method based on keyword oriented topic web crawlers
CN106709052A (en) * 2017-01-06 2017-05-24 电子科技大学 Keyword based topic-focused web crawler design method
CN107103043A (en) * 2017-03-29 2017-08-29 国信优易数据有限公司 A kind of Text Clustering Method and system

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165349A (en) * 2018-08-22 2019-01-08 南京涌亿思信息技术有限公司 Securities data monitoring method, apparatus and system
CN109325860A (en) * 2018-08-29 2019-02-12 中国科学院自动化研究所 Network public-opinion detection method and system for overseas investment Risk-warning
CN109522359A (en) * 2018-11-02 2019-03-26 大连瀚闻资讯有限公司 Visualization industrial analysis method based on big data
CN109493265A (en) * 2018-11-05 2019-03-19 北京奥法科技有限公司 A kind of Policy Interpretation method and Policy Interpretation system based on deep learning
CN110020092A (en) * 2018-11-20 2019-07-16 皮商云集(厦门)科技有限公司 Leather industry data center systems based on web crawlers
CN109597928A (en) * 2018-12-05 2019-04-09 云南电网有限责任公司信息中心 Support the non-structured text acquisition methods based on Web network of subscriber policy configuration
CN109597928B (en) * 2018-12-05 2022-12-16 云南电网有限责任公司信息中心 Unstructured text acquisition method supporting user policy configuration and based on Web network
CN109684480A (en) * 2018-12-30 2019-04-26 杭州翼兔网络科技有限公司 A kind of clustering method based on industry
CN109684480B (en) * 2018-12-30 2021-01-05 北京人民在线网络有限公司 Industry-based clustering method
CN109740102A (en) * 2019-01-24 2019-05-10 国家体育总局体育科学研究所 A kind of body-building system and method based on knowledge of keeping fit library
CN110009128A (en) * 2019-01-28 2019-07-12 平安科技(深圳)有限公司 Industry public opinion index prediction technique, device, computer equipment and storage medium
CN110189170A (en) * 2019-05-27 2019-08-30 中译语通科技股份有限公司 Market sentiment analysis method and system
CN110297961A (en) * 2019-06-26 2019-10-01 广州博士信息技术研究院有限公司 A kind of Quick Acquisition of policy information and optimization extracting method
CN110413861A (en) * 2019-07-23 2019-11-05 中南民族大学 Link extracting method, device, equipment and storage medium based on web crawlers
CN110413861B (en) * 2019-07-23 2021-10-22 中南民族大学 Link extraction method, device, equipment and storage medium based on web crawler
CN110851562A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Information acquisition method, system, equipment and storage medium
CN110852085A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Hotspot topic mining method and system
CN110400101A (en) * 2019-08-21 2019-11-01 苏州经贸职业技术学院 Industry reports analysis system and method
CN110704403A (en) * 2019-08-27 2020-01-17 北京国联视讯信息技术股份有限公司 Data acquisition and analysis system and method based on cloud computing
CN111027318A (en) * 2019-10-12 2020-04-17 中国平安财产保险股份有限公司 Industry classification method, device, equipment and storage medium based on big data
CN110837595A (en) * 2019-11-05 2020-02-25 北京市燃气集团有限责任公司 Enterprise information data processing method, system, terminal and storage medium
CN110992168A (en) * 2019-11-29 2020-04-10 交通银行股份有限公司 Bank internal and external data fusion method and system
CN111476030A (en) * 2020-05-08 2020-07-31 中国科学院计算机网络信息中心 Prospective factor screening method based on deep learning
CN111767482A (en) * 2020-05-21 2020-10-13 中国地质大学(武汉) Self-adaptive crawling method for focused web crawler
CN111767482B (en) * 2020-05-21 2023-06-06 中国地质大学(武汉) Self-adaptive crawling method for focused web crawlers
CN112104656A (en) * 2020-09-16 2020-12-18 杭州安恒信息安全技术有限公司 Network threat data acquisition method, device, equipment and medium
CN112231535A (en) * 2020-10-23 2021-01-15 山东科技大学 Method for making multi-modal data set in field of agricultural diseases and insect pests, processing device and storage medium
CN112231535B (en) * 2020-10-23 2022-11-15 山东科技大学 Method for making multi-modal data set in field of agricultural diseases and insect pests, processing device and storage medium
CN113312343A (en) * 2021-06-11 2021-08-27 北京思特奇信息技术股份有限公司 Business opportunity management method and system based on web crawler tool
CN113313407A (en) * 2021-06-16 2021-08-27 上海交通大学 Enterprise power utilization behavior identification method and device
CN113726900A (en) * 2021-09-02 2021-11-30 四川启睿克科技有限公司 System for judging age bracket of user child

Similar Documents

Publication Publication Date Title
CN108334591A (en) Industry analysis method and system based on focused crawler technology
Wu et al. DeepDetect: A cascaded region-based densely connected network for seismic event detection
Yu et al. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering
CN111737495A (en) Middle-high-end talent intelligent recommendation system and method based on domain self-classification
CN106951498A (en) Text clustering method
Hordri et al. A systematic literature review on features of deep learning in big data analytics
CN108717408A (en) A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN109543067A (en) Enterprise's production status based on artificial intelligence monitors analysis system in real time
CN107291895B (en) Quick hierarchical document query method
CN110543595A (en) in-station search system and method
Sarwar et al. A survey of big data analytics in healthcare
Sugiharti et al. Predictive evaluation of performance of computer science students of unnes using data mining based on naÏve bayes classifier (NBC) algorithm
CN113688635A (en) Semantic similarity based class case recommendation method
CN106570196B (en) Video program searching method and device
Kiran et al. Prediction analysis of crime in India using a hybrid clustering approach
CN111144453A (en) Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data
CN114707068A (en) Method, device, equipment and medium for recommending intelligence base knowledge
Utama et al. SCIENTIFIC ARTICLES RECOMMENDATION SYSTEM BASED ON USER’S RELATEDNESS USING ITEM-BASED COLLABORATIVE FILTERING METHOD
Alhassan et al. Using data mining technique for scholarship disbursement
Alhaj et al. Predicting user entries by using data mining algorithms
CN113378023A (en) Visual system for mining and comparing public opinion and news information of people
CN106919700A (en) Semantics-driven crime clue real-time recommendation method based on parallelization CEP treatment
Revathy et al. Classifying Agricultural Crop Pest Data Using Hadoop MapReduce Based C5.0 Algorithm.
Boddu ELIMINATE THE NOISY DATA FROM WEB PAGES USING DATA MINING TECHNIQUES.
Bhandari et al. Enhanced Apriori Algorithm model in course suggestion system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180727