CN106126705A - Large-scale network data real-time crawling system - Google Patents
- Publication number: CN106126705A
- Application number: CN201610507120.XA
- Authority: CN (China)
- Prior art keywords: degree, information, page, module, large scale
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
A large-scale network data real-time crawling system, comprising: an initial seed optimization module, for entering the seed links of websites and periodically adding qualifying web page links to the seed set, which serves as the initial seed set; an integration module, for obtaining HTML web documents and annotating the information in their text; a web page hyperlink information integration module, for storing descriptive data about the hyperlinks of web pages; a web page relevance calculation module, for analyzing topic relevance through concrete numerical values and expressing the degree of relevance as numerical information; and a hyperlink importance calculation module, for using the calculated numerical information as a basis for judging relevance, likewise through quantitative analysis with concrete values. If the number of links contained in the current page reaches a certain value, the page is regarded as containing a number of links; if that number reaches a preset value, the topic resources the page contains are regarded as meeting the preset requirement.
Description
Technical field
The present invention relates to the technical field of big data and cloud computing, and in particular to a large-scale network data real-time crawling system.
Background art
With the rapid development and growing popularity of the Internet, the content offered by online information platforms has become ever richer, while users searching for the information they need face increasing search difficulty and spend more and more time and effort on filtering information. The emergence of search engines addressed the problem of retrieving information at massive scale. A search engine collects resources by means of crawlers. A web crawler crawls and collects web documents over network connections: starting from given URLs, it uses the HTTP protocol to fetch the required HTML documents, analyzes the hyperlinks contained in those documents, and then fetches the links not yet visited and the resources they contain. This repeats until no new URL remains.
However, owing to the rapid development of the mobile Internet, new web content is growing explosively, and traditional crawling systems cannot meet the demand for crawling network data at large scale.
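The traditional crawl loop described in the background (start from given URLs, fetch HTML, extract hyperlinks, visit unvisited links until none remain) can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `crawl`, `LinkExtractor`, `fetch` and `FAKE_WEB` are hypothetical, and an in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Breadth-first crawl that stops when no new URL remains.

    `fetch` is a hypothetical callable mapping a URL to HTML text (or None);
    a real crawler would issue an HTTP request here.
    """
    frontier = deque(seed_urls)
    visited = set()
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in visited:
                frontier.append(absolute)
    return visited


# Assumed sample data: a tiny in-memory "web" instead of real HTTP responses.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
```

The loop ends only when the frontier is empty, which is why such a crawler struggles once the set of reachable pages grows explosively.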
Summary of the invention
Therefore, it is necessary to provide a large-scale network data real-time crawling system capable of crawling large-scale network data in real time.
A large-scale network data real-time crawling system, comprising the following modules:
An initial seed optimization module, for entering the seed links of websites; feeding the best results back to the user by means of meta search; mining links that rank highly in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set;
An integration module, for obtaining HTML web documents and annotating the information in their text;
A web page hyperlink information integration module, for storing descriptive data about the hyperlinks of web pages. Given two pages A and B, if a hyperlink in A points to B, the information content of B is by default regarded as being of higher quality than that of A; and if a user's query points to the hyperlinks of both A and B at the same time, the information quality of A and B is by default regarded as equal;
A web page relevance calculation module, for analyzing topic relevance through concrete numerical values and expressing the degree of relevance as numerical information;
A hyperlink importance calculation module, for using the calculated numerical information as a basis for judging relevance, likewise through quantitative analysis with concrete values. If the number of links contained in the current page reaches a certain value, the page is regarded as containing a number of links; if that number reaches a preset value, the topic resources the page contains are regarded as meeting the preset requirement.
In the large-scale network data real-time crawling system of the present invention, the web page relevance calculation module comprises:
A single traversal unit, which takes as input the character string of the web page text, defined as m-scontent; searches in a loop, the condition being that the corresponding marker, defined as delimiters, is found; locates interception position 1 with a search function, defined as Find(); locates interception position 2 with the same search function; cuts the string between positions 1 and 2 and outputs it, the output string being defined as dest; and, when the traversal ends, outputs the string;
A repeating unit, for repeating the single traversal unit until all information mining points have been mined, and extracting the feature-vector keywords of the mined points by a plain-text classification and aggregation algorithm; the topic relevance of the mined information is computed with a vector space model.
In the large-scale network data real-time crawling system of the present invention, the vector space model is expressed as follows:
First the text of the web page is analyzed, and α = (w_1, w_2, …, w_n), i = 1, 2, …, n, is defined. The number of occurrences of each keyword is counted, and the most frequent keywords are taken as the reference; the frequency is defined as x_i. A component x_i w_i is built for each keyword, and the page topic vector is defined as β = (x_1 w_1, x_2 w_2, …, x_n w_n), i = 1, 2, …, n. The cosine of the angle between the two vectors then reflects the frequency with which the keywords occur, and the relevance is computed as:
$$\cos\langle\alpha,\beta\rangle = \frac{\alpha\cdot\beta}{|\alpha|\,|\beta|} = \frac{\sum_{i=1}^{n} w_i \cdot x_i w_i}{\sqrt{\sum_{i=1}^{n} w_i^{2}}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^{2}}}$$
The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency and the higher the relevance to the topic;
A threshold is set for the relevance between the current page and the topic; a value above the threshold indicates that the page is relevant to the topic, otherwise it is not. Pages relevant to the topic are classified and saved, and submitted to the database for indexing.
In the large-scale network data real-time crawling system of the present invention, setting the threshold for the relevance between the current page and the topic comprises:
Periodically taking random samples to obtain a preset number of original web page documents, analyzing the relevance of those pages manually, and calculating the accuracy;
Measuring the accuracy repeatedly: if the fluctuation of the accuracy over a predetermined number of measurements is less than a preset error value, the threshold is lowered to improve crawler coverage; if the fluctuation of the accuracy over the predetermined number of measurements is greater than or equal to the preset error value, the threshold is raised to improve crawler accuracy;
Repeating the accuracy measurements until the desired threshold is obtained.
In the large-scale network data real-time crawling system of the present invention, the hyperlink importance calculation module computes page importance with the following formula:
$p_u = w_1 \cos\langle\alpha,\beta\rangle + w_2\,\mathrm{Hub}(u)$, where Hub(u) represents the link importance of the page, CL(u) represents the number of links found for page u, and C_max represents its maximum. The weight of page relevance (w_1 in the formula) is denoted m1, and the weight of page link degree (w_2 in the formula) is denoted m2; m1 and m2 satisfy 0 < m1, m2 < 1 and m1 + m2 = 1.
In the large-scale network data real-time crawling system of the present invention, the integration module further performs the following:
The HTML web document is obtained and analyzed to determine whether it is a video HTML page. If it is, the module goes on to judge whether the URL belongs to a domain that is allowed to be crawled; if it does not, the process ends directly; if it does, the domain name of the URL is obtained, together with the video parsing class corresponding to that domain name. The module then judges whether the video parsing class is empty: if it is empty, the process ends; if not, the module goes on to judge whether the URL is a playback address of the video HTML page. If it is not a playback address, the process ends; if it is, the list of real video download addresses is obtained from the URL and the content, and when this list is not empty it is returned and the process ends. If the document is not a video HTML page, the information in its text is annotated and handed to the web page hyperlink information integration module.
Compared with the prior art, the large-scale network data real-time crawling system provided by the present invention has the following beneficial effects: the web page relevance calculation module analyzes topic relevance through concrete numerical values and expresses the degree of relevance as numerical information; the hyperlink importance calculation module uses the calculated numerical information as a basis for judging relevance, likewise through quantitative analysis with concrete values; if the number of links contained in the current page reaches a certain value, the page is regarded as containing a number of links, and if that number reaches a preset value, the topic resources it contains are regarded as meeting the preset requirement, so that the desired network data can be obtained from massive big data; and the integration module obtains HTML web documents and analyzes whether they are video HTML pages, so that ordinary pages and video pages can be distinguished and crawling becomes more efficient.
Brief description of the drawings
Fig. 1 is a block diagram of the large-scale network data real-time crawling system of an embodiment of the present invention.
Fig. 2 is a structural block diagram of the web page relevance calculation module in Fig. 1.
Detailed description
As shown in Figs. 1 and 2, a large-scale network data real-time crawling system comprises the following modules:
An initial seed optimization module, for entering the seed links of websites; feeding the best results back to the user by means of meta search; mining links that rank highly in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set.
Optionally, a max-priority queue is set up in the initial seed optimization module. The max-priority queue maintains a set, and each element of the set corresponds to a priority key. The max-priority queue supports the following operations:
Insert, Insert(set, e, key): inserts the element e with priority key into the set;
Maximum, Max(set): returns the element with the highest priority in the set;
Extract, Ext(set): returns the element with the highest priority in the set and deletes it from the set;
Increase (set, e, key): sets the priority of element e in the set to key.
In this embodiment, the queue can be implemented with a heap, which is highly efficient.
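The four queue operations listed above can be sketched with Python's standard `heapq` module. This is a minimal sketch rather than the patent's implementation: the class name `MaxPriorityQueue`, the method names, and the lazy-deletion strategy for changing a priority are assumptions. Because `heapq` is a min-heap, keys are stored negated.

```python
import heapq
import itertools


class MaxPriorityQueue:
    """Max-priority queue over (element, key) pairs, backed by a binary heap."""

    def __init__(self):
        self._heap = []                    # entries: (-key, sequence, element)
        self._current = {}                 # element -> (key, sequence) of its live entry
        self._counter = itertools.count()  # tie-breaker and entry identity

    def insert(self, e, key):
        """Insert(set, e, key): add element e with priority key."""
        seq = next(self._counter)
        self._current[e] = (key, seq)
        heapq.heappush(self._heap, (-key, seq, e))

    def _discard_stale(self):
        # Pop entries that were superseded by a priority change or already extracted.
        while self._heap:
            neg_key, seq, e = self._heap[0]
            if self._current.get(e) == (-neg_key, seq):
                return
            heapq.heappop(self._heap)

    def max(self):
        """Max(set): return the element with the highest priority."""
        self._discard_stale()
        if not self._heap:
            raise IndexError("queue is empty")
        return self._heap[0][2]

    def extract_max(self):
        """Ext(set): return the highest-priority element and remove it."""
        e = self.max()
        heapq.heappop(self._heap)
        del self._current[e]
        return e

    def set_priority(self, e, key):
        """Set the priority of an existing element e to key (old entry becomes stale)."""
        if e not in self._current:
            raise KeyError(e)
        self.insert(e, key)


# Usage with assumed sample seed links and relevance scores.
seeds = MaxPriorityQueue()
seeds.insert("http://example.test/news", 0.4)
seeds.insert("http://example.test/sports", 0.7)
seeds.set_priority("http://example.test/news", 0.9)
first = seeds.extract_max()
second = seeds.extract_max()
```

Insertion and extraction each cost O(log n) on a heap, which is the efficiency the embodiment refers to.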
An integration module, for obtaining HTML web documents and annotating the information in their text.
A web page hyperlink information integration module, for storing descriptive data about the hyperlinks of web pages. Given two pages A and B, if a hyperlink in A points to B, the information content of B is by default regarded as being of higher quality than that of A; and if a user's query points to the hyperlinks of both A and B at the same time, the information quality of A and B is by default regarded as equal.
A web page relevance calculation module, for analyzing topic relevance through concrete numerical values and expressing the degree of relevance as numerical information.
A hyperlink importance calculation module, for using the calculated numerical information as a basis for judging relevance, likewise through quantitative analysis with concrete values. If the number of links contained in the current page reaches a certain value, the page is regarded as containing a number of links; if that number reaches a preset value, the topic resources the page contains are regarded as meeting the preset requirement.
In the large-scale network data real-time crawling system of the present invention, the web page relevance calculation module comprises:
A single traversal unit, which takes as input the character string of the web page text, defined as m-scontent; searches in a loop, the condition being that the corresponding marker, defined as delimiters, is found; locates interception position 1 with a search function, defined as Find(); locates interception position 2 with the same search function; cuts the string between positions 1 and 2 and outputs it, the output string being defined as dest; and, when the traversal ends, outputs the string.
A repeating unit, for repeating the single traversal unit until all information mining points have been mined, and extracting the feature-vector keywords of the mined points by a plain-text classification and aggregation algorithm; the topic relevance of the mined information is computed with a vector space model.
In the large-scale network data real-time crawling system of the present invention, the vector space model is expressed as follows:
First the text of the web page is analyzed, and α = (w_1, w_2, …, w_n), i = 1, 2, …, n, is defined. The number of occurrences of each keyword is counted, and the most frequent keywords are taken as the reference; the frequency is defined as x_i. A component x_i w_i is built for each keyword, and the page topic vector is defined as β = (x_1 w_1, x_2 w_2, …, x_n w_n), i = 1, 2, …, n. The cosine of the angle between the two vectors then reflects the frequency with which the keywords occur, and the relevance is computed as:
$$\cos\langle\alpha,\beta\rangle = \frac{\alpha\cdot\beta}{|\alpha|\,|\beta|} = \frac{\sum_{i=1}^{n} w_i \cdot x_i w_i}{\sqrt{\sum_{i=1}^{n} w_i^{2}}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^{2}}}$$
The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency and the higher the relevance to the topic.
A threshold is set for the relevance between the current page and the topic; a value above the threshold indicates that the page is relevant to the topic, otherwise it is not. Pages relevant to the topic are classified and saved, and submitted to the database for indexing.
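The vector space model and the threshold test can be sketched as follows, assuming that w_i is a per-keyword weight and x_i the observed frequency of keyword i, as the definitions of α and β suggest. The function names and the sample numbers are hypothetical, and returning 0.0 for a zero-length vector is a choice made here, not something the text specifies.

```python
import math


def cosine_relevance(weights, frequencies):
    """Cosine of the angle between alpha = (w_i) and beta = (x_i * w_i)."""
    alpha = list(weights)
    beta = [x * w for x, w in zip(frequencies, weights)]
    dot = sum(a * b for a, b in zip(alpha, beta))
    norm = math.sqrt(sum(a * a for a in alpha)) * math.sqrt(sum(b * b for b in beta))
    # ASSUMPTION: a page with no keyword occurrences gets relevance 0.0.
    return dot / norm if norm else 0.0


def is_relevant(weights, frequencies, threshold):
    """A page is treated as on-topic when its relevance exceeds the threshold."""
    return cosine_relevance(weights, frequencies) > threshold


# Assumed sample data: two keywords with equal weight.
even_page = cosine_relevance([1.0, 1.0], [3, 3])    # both keywords equally frequent
skewed_page = cosine_relevance([1.0, 1.0], [1, 0])  # only one keyword present
```

With equal weights, a page whose keyword frequencies are balanced points in the same direction as α (small angle, high relevance), while a page containing only one of the keywords sits at a larger angle.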
In the large-scale network data real-time crawling system of the present invention, setting the threshold for the relevance between the current page and the topic comprises:
Periodically taking random samples to obtain a preset number of original web page documents, analyzing the relevance of those pages manually, and calculating the accuracy.
Measuring the accuracy repeatedly: if the fluctuation of the accuracy over a predetermined number of measurements is less than a preset error value, the threshold is lowered to improve crawler coverage; if the fluctuation of the accuracy over the predetermined number of measurements is greater than or equal to the preset error value, the threshold is raised to improve crawler accuracy.
Repeating the accuracy measurements until the desired threshold is obtained.
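One adjustment round of this threshold procedure might look like the sketch below. Several details are assumptions that the text does not fix: fluctuation is measured here as the spread between the highest and lowest sampled accuracy, the threshold moves by a fixed `step`, and the result is clamped to [0, 1] because it is compared against a cosine value. The function name `adjust_threshold` is hypothetical.

```python
def adjust_threshold(threshold, accuracies, max_fluctuation, step=0.05):
    """Return the threshold after one adjustment round.

    accuracies: accuracy values from the predetermined number of manual samples.
    max_fluctuation: the preset error value.
    """
    # ASSUMPTION: fluctuation is the spread between best and worst measurement.
    fluctuation = max(accuracies) - min(accuracies)
    if fluctuation < max_fluctuation:
        new_threshold = threshold - step   # stable accuracy: widen coverage
    else:
        new_threshold = threshold + step   # unstable accuracy: tighten for accuracy
    return min(1.0, max(0.0, new_threshold))


# Assumed sample measurements.
lowered = adjust_threshold(0.5, [0.90, 0.91, 0.92], max_fluctuation=0.05)
raised = adjust_threshold(0.5, [0.70, 0.90, 0.80], max_fluctuation=0.05)
```

Calling the function once per sampling period, with fresh measurements each time, corresponds to repeating the accuracy measurements until an acceptable threshold is reached.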
In the large-scale network data real-time crawling system of the present invention, the hyperlink importance calculation module computes page importance with the following formula:
$p_u = w_1 \cos\langle\alpha,\beta\rangle + w_2\,\mathrm{Hub}(u)$, where Hub(u) represents the link importance of the page, CL(u) represents the number of links found for page u, and C_max represents its maximum. The weight of page relevance (w_1 in the formula) is denoted m1, and the weight of page link degree (w_2 in the formula) is denoted m2; m1 and m2 satisfy 0 < m1, m2 < 1 and m1 + m2 = 1.
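The weighted combination of relevance and link importance can be sketched as follows. The definition of Hub(u) is not reproduced in this text, so the sketch assumes Hub(u) = CL(u) / C_max, which is only one reading consistent with the symbols named above. The function names and default weights are likewise hypothetical.

```python
import math


def hub(cl_u, c_max):
    """Link importance of page u.

    ASSUMPTION: the link count CL(u) is normalised by its maximum C_max.
    """
    return cl_u / c_max if c_max else 0.0


def page_importance(relevance, cl_u, c_max, m1=0.6, m2=0.4):
    """p_u = m1 * cos<alpha, beta> + m2 * Hub(u), with 0 < m1, m2 < 1 and m1 + m2 = 1."""
    if not (0 < m1 < 1 and 0 < m2 < 1) or not math.isclose(m1 + m2, 1.0):
        raise ValueError("weights must lie in (0, 1) and sum to 1")
    return m1 * relevance + m2 * hub(cl_u, c_max)


# Assumed sample values: relevance 0.8, 50 links out of a maximum of 100.
score = page_importance(0.8, cl_u=50, c_max=100)
```

Because the two weights sum to 1 and both terms lie in [0, 1] under this assumption, the resulting score also lies in [0, 1], which makes scores comparable across pages.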
In the large-scale network data real-time crawling system of the present invention, the integration module further performs the following:
The HTML web document is obtained and analyzed to determine whether it is a video HTML page. If it is, the module goes on to judge whether the URL belongs to a domain that is allowed to be crawled; if it does not, the process ends directly; if it does, the domain name of the URL is obtained, together with the video parsing class corresponding to that domain name. The module then judges whether the video parsing class is empty: if it is empty, the process ends; if not, the module goes on to judge whether the URL is a playback address of the video HTML page. If it is not a playback address, the process ends; if it is, the list of real video download addresses is obtained from the URL and the content, and when this list is not empty it is returned and the process ends. If the document is not a video HTML page, the information in its text is annotated and handed to the web page hyperlink information integration module.
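The decision flow for video pages can be sketched as a chain of early exits. Everything named here is hypothetical and introduced only for illustration: `resolve_video`, the `is_video_page` predicate, the `DemoVideoParser` class with its two methods, and the sample domain. The non-video branch (annotation and hand-off to the hyperlink module) is not shown.

```python
from urllib.parse import urlparse


class DemoVideoParser:
    """Hypothetical video parsing class for a single domain."""

    def is_play_url(self, url):
        return "/play/" in url

    def download_urls(self, url, html):
        return [url.replace("/play/", "/files/") + ".mp4"]


def resolve_video(url, html, is_video_page, crawlable_domains, parsers):
    """Return the list of real download addresses, or None when the flow ends early."""
    if not is_video_page(html):
        return None                      # not a video page: handled by the other branch
    domain = urlparse(url).hostname
    if domain not in crawlable_domains:
        return None                      # domain is not allowed to be crawled
    parser = parsers.get(domain)
    if parser is None:
        return None                      # no parsing class registered for this domain
    if not parser.is_play_url(url):
        return None                      # URL is not a playback address
    addresses = parser.download_urls(url, html)
    return addresses or None             # an empty list is treated as no result


# Assumed sample configuration and page.
domains = {"video.example.test"}
parsers = {"video.example.test": DemoVideoParser()}
looks_like_video = lambda html: "<video" in html

found = resolve_video(
    "http://video.example.test/play/42", "<video></video>",
    looks_like_video, domains, parsers,
)
```

Keeping one parsing class per domain means that support for a new video site can be added by registering another entry in `parsers` without touching the flow itself.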
Compared with the prior art, the large-scale network data real-time crawling system provided by the present invention has the following beneficial effects: the web page relevance calculation module analyzes topic relevance through concrete numerical values and expresses the degree of relevance as numerical information; the hyperlink importance calculation module uses the calculated numerical information as a basis for judging relevance, likewise through quantitative analysis with concrete values; if the number of links contained in the current page reaches a certain value, the page is regarded as containing a number of links, and if that number reaches a preset value, the topic resources it contains are regarded as meeting the preset requirement, so that the desired network data can be obtained from massive big data; and the integration module obtains HTML web documents and analyzes whether they are video HTML pages, so that ordinary pages and video pages can be distinguished and crawling becomes more efficient.
It will be understood that those of ordinary skill in the art can make various other corresponding changes and modifications according to the technical concept of the present invention, and all such changes and modifications shall fall within the protection scope of the claims of the present invention.
Claims (6)
1. A large-scale network data real-time crawling system, characterized in that it comprises the following modules:
An initial seed optimization module, for entering the seed links of websites; feeding the best results back to the user by means of meta search; mining links that rank highly in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set;
An integration module, for obtaining HTML web documents and annotating the information in their text;
A web page hyperlink information integration module, for storing descriptive data about the hyperlinks of web pages, wherein, given two pages A and B, if a hyperlink in A points to B, the information content of B is by default regarded as being of higher quality than that of A, and if a user's query points to the hyperlinks of both A and B at the same time, the information quality of A and B is by default regarded as equal;
A web page relevance calculation module, for analyzing topic relevance through concrete numerical values and expressing the degree of relevance as numerical information;
A hyperlink importance calculation module, for using the calculated numerical information as a basis for judging relevance, likewise through quantitative analysis with concrete values; if the number of links contained in the current page reaches a certain value, the page is regarded as containing a number of links, and if that number reaches a preset value, the topic resources the page contains are regarded as meeting the preset requirement.
2. The large-scale network data real-time crawling system of claim 1, characterized in that the web page relevance calculation module comprises:
A single traversal unit, which takes as input the character string of the web page text, defined as m-scontent; searches in a loop, the condition being that the corresponding marker, defined as delimiters, is found; locates interception position 1 with a search function, defined as Find(); locates interception position 2 with the same search function; cuts the string between positions 1 and 2 and outputs it, the output string being defined as dest; and, when the traversal ends, outputs the string;
A repeating unit, for repeating the single traversal unit until all information mining points have been mined, and extracting the feature-vector keywords of the mined points by a plain-text classification and aggregation algorithm; the topic relevance of the mined information is computed with a vector space model.
3. The large-scale network data real-time crawling system of claim 2, characterized in that the vector space model is expressed as follows:
First the text of the web page is analyzed, and α = (w_1, w_2, …, w_n), i = 1, 2, …, n, is defined. The number of occurrences of each keyword is counted, and the most frequent keywords are taken as the reference; the frequency is defined as x_i. A component x_i w_i is built for each keyword, and the page topic vector is defined as β = (x_1 w_1, x_2 w_2, …, x_n w_n), i = 1, 2, …, n. The cosine of the angle between the two vectors then reflects the frequency with which the keywords occur, and the relevance is computed as:
$$\cos\langle\alpha,\beta\rangle = \frac{\alpha\cdot\beta}{|\alpha|\,|\beta|} = \frac{\sum_{i=1}^{n} w_i \cdot x_i w_i}{\sqrt{\sum_{i=1}^{n} w_i^{2}}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^{2}}}$$
The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency and the higher the relevance to the topic;
A threshold is set for the relevance between the current page and the topic; a value above the threshold indicates that the page is relevant to the topic, otherwise it is not. Pages relevant to the topic are classified and saved, and submitted to the database for indexing.
4. The large-scale network data real-time crawling system of claim 3, characterized in that setting the threshold for the relevance between the current page and the topic comprises:
Periodically taking random samples to obtain a preset number of original web page documents, analyzing the relevance of those pages manually, and calculating the accuracy;
Measuring the accuracy repeatedly: if the fluctuation of the accuracy over a predetermined number of measurements is less than a preset error value, the threshold is lowered to improve crawler coverage; if the fluctuation of the accuracy over the predetermined number of measurements is greater than or equal to the preset error value, the threshold is raised to improve crawler accuracy;
Repeating the accuracy measurements until the desired threshold is obtained.
5. The large-scale network data real-time crawling system of claim 4, characterized in that the hyperlink importance calculation module computes page importance with the following formula:
$p_u = w_1 \cos\langle\alpha,\beta\rangle + w_2\,\mathrm{Hub}(u)$, where Hub(u) represents the link importance of the page, CL(u) represents the number of links found for page u, and C_max represents its maximum. The weight of page relevance (w_1 in the formula) is denoted m1, and the weight of page link degree (w_2 in the formula) is denoted m2; m1 and m2 satisfy 0 < m1, m2 < 1 and m1 + m2 = 1.
6. The large-scale network data real-time crawling system of claim 5, wherein the integration module further performs the following:
The HTML web document is obtained and analyzed to determine whether it is a video HTML page. If it is, the module goes on to judge whether the URL belongs to a domain that is allowed to be crawled; if it does not, the process ends directly; if it does, the domain name of the URL is obtained, together with the video parsing class corresponding to that domain name. The module then judges whether the video parsing class is empty: if it is empty, the process ends; if not, the module goes on to judge whether the URL is a playback address of the video HTML page. If it is not a playback address, the process ends; if it is, the list of real video download addresses is obtained from the URL and the content, and when this list is not empty it is returned and the process ends. If the document is not a video HTML page, the information in its text is annotated and handed to the web page hyperlink information integration module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610507120.XA CN106126705A (en) | 2016-07-01 | 2016-07-01 | A kind of large scale network data crawl system in real time |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610507120.XA CN106126705A (en) | 2016-07-01 | 2016-07-01 | A kind of large scale network data crawl system in real time |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106126705A true CN106126705A (en) | 2016-11-16 |
Family
ID=57467666
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610507120.XA Pending CN106126705A (en) | 2016-07-01 | 2016-07-01 | A kind of large scale network data crawl system in real time |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106126705A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103186676A (en) * | 2013-04-08 | 2013-07-03 | 湖南农业大学 | Method for searching thematic knowledge self growth form focused crawlers |
Non-Patent Citations (2)
Title |
---|
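Wang Yange (王艳阁): "Design and Implementation of a Topic-Focused Microblog Crawler" (主题微博爬虫的设计与实现), China Master's Theses Full-text Database, Information Science and Technology * |
Qiu Hehua (邱荷花): "Design and Implementation of a Hadoop-Based Video Crawler System" (基于Hadhoop的视频爬虫系统的设计与实现), China Master's Theses Full-text Database, Information Science and Technology * |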
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107480275A (en) * | 2017-08-21 | 2017-12-15 | 成都西维数码科技有限公司 | A kind of harmful information monitoring method and system based on big data |
CN108052517A (en) * | 2017-10-19 | 2018-05-18 | 福建中金在线信息科技有限公司 | Data search method and system |
CN108710672A (en) * | 2018-05-17 | 2018-10-26 | 南京大学 | A kind of Theme Crawler of Content method based on increment bayesian algorithm |
CN108710672B (en) * | 2018-05-17 | 2020-04-14 | 南京大学 | Theme crawler method based on incremental Bayesian algorithm |
CN108985068A (en) * | 2018-06-26 | 2018-12-11 | 广东电网有限责任公司信息中心 | Loophole quick sensing, positioning and the method and system of verifying |
CN109446396A (en) * | 2018-10-17 | 2019-03-08 | 珠海市智图数研信息技术有限公司 | A kind of intelligent crawler frame system of line business information |
CN109739848A (en) * | 2018-12-28 | 2019-05-10 | 杭州铭智云教育科技有限公司 | A kind of data extraction method |
CN109739848B (en) * | 2018-12-28 | 2021-11-09 | 深圳市科联汇通科技有限公司 | Data extraction method |
CN109948019A (en) * | 2019-01-10 | 2019-06-28 | 中央财经大学 | A kind of deep layer Network Data Capture method |
CN109948019B (en) * | 2019-01-10 | 2021-10-08 | 中央财经大学 | Deep network data acquisition method |
CN111767482A (en) * | 2020-05-21 | 2020-10-13 | 中国地质大学(武汉) | Self-adaptive crawling method for focused web crawler |
CN111767482B (en) * | 2020-05-21 | 2023-06-06 | 中国地质大学(武汉) | Self-adaptive crawling method for focused web crawlers |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106126705A (en) | A kind of large scale network data crawl system in real time | |
CN101894170B (en) | Semantic relationship network-based cross-mode information retrieval method | |
US7640488B2 (en) | System, method, and service for using a focused random walk to produce samples on a topic from a collection of hyper-linked pages | |
CN101256596B (en) | Method and system for instation guidance | |
Abebe et al. | Generic metadata representation framework for social-based event detection, description, and linkage | |
CN110020189A (en) | A kind of article recommended method based on Chinese Similarity measures | |
CN106484764A (en) | User's similarity calculating method based on crowd portrayal technology | |
CN104008203A (en) | User interest discovering method with ontology situation blended in | |
CN105045901A (en) | Search keyword push method and device | |
CN103678412A (en) | Document retrieval method and device | |
CN102880723A (en) | Searching method and system for identifying user retrieval intention | |
CN105159930A (en) | Search keyword pushing method and apparatus | |
CN102402566A (en) | Web user behavior analysis method based on Chinese webpage automatic classification technology | |
CN104598607A (en) | Method and system for recommending search phrase | |
CN109271514A (en) | Generation method, classification method, device and the storage medium of short text disaggregated model | |
CN106484797A (en) | Accident summary abstracting method based on sparse study | |
CN105939359A (en) | Method and device for detecting privacy leakage of mobile terminal | |
CN107239512A (en) | The microblogging comment spam recognition methods of relational network figure is commented in a kind of combination | |
Ye et al. | A web services classification method based on GCN | |
CN104281619A (en) | System and method for ordering search results | |
CN104346382A (en) | Text analysis system and method employing language query | |
KR100917458B1 (en) | Method and system of providing recommended words | |
Geng et al. | Research on improved focused crawler and its application in food safety public opinion analysis | |
CN104462241A (en) | Population property classification method and device based on anchor texts and peripheral texts in URLs | |
Yang et al. | A topic-specific web crawler with web page hierarchy based on HTML Dom-Tree |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161116 |