CN106528802A - Data collecting method and device - Google Patents
- Publication number
- CN106528802A (application CN201610998106.4A)
- Authority
- CN
- China
- Prior art keywords
- target
- gathered data
- data
- gathered
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a data collecting method and device. The method comprises the following steps: a target topic and a target collection website are determined; the target web page links corresponding to the target topic are determined among a plurality of web page links included in the target collection website; the content of the web page corresponding to each target web page link is collected to obtain a plurality of pieces of collected data; and a result data set is determined according to the matching degree between the target topic and each piece of collected data. Because the target web page links corresponding to the target topic are determined in a targeted manner, less content is collected from the web pages corresponding to those links and that content is more relevant to the target topic, which improves the precision of data collection and the data value density.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a data collection method and device.
Background
With the rapid development of Internet technology, big data is being applied more and more widely, and in big data scenarios the demand for data collection keeps growing.
In the prior art, when data on a certain topic is needed, a large amount of data is usually obtained from the Internet by an undirected crawler, and the data related to the topic is then filtered out of that mass of data by a complicated data matching algorithm.
This method has drawbacks. The volume of base data is too large and the proportion of irrelevant data is high, so it is often difficult to correctly pick out the data closely related to the topic, and precision is low. In the big data era, the resulting data value density is low.
Summary of the invention
An object of the present invention is to provide a data collection method and device, so as to improve the precision of data collection and the data value density.
To solve the above technical problem, the present invention provides the following technical solutions:
A data collection method, including:
determining a target topic and a target collection website;
determining, among a plurality of web page links included in the target collection website, the target web page links corresponding to the target topic;
collecting the content of the web page corresponding to each target web page link to obtain a plurality of pieces of collected data;
determining a result data set according to the matching degree between the target topic and each piece of collected data.
In a specific embodiment of the present invention, after the determining of the target web page links corresponding to the target topic and before the collecting of the content of the web page corresponding to each target web page link, the method further includes:
filtering the determined target web page links corresponding to the target topic.
In a specific embodiment of the present invention, the determining of the target topic and the target collection website includes:
determining the target topic and the target collection website according to a keyword input by a user.
In a specific embodiment of the present invention, the determining of the result data set according to the matching degree between the target topic and each piece of collected data includes:
determining the keywords of each piece of collected data;
determining the text similarity between the target topic and the keywords of each piece of collected data;
for each piece of collected data, if the text similarity between the target topic and the keywords of that collected data is higher than a preset first threshold, merging that collected data into the result data set.
In a specific embodiment of the present invention, the determining of the keywords of each piece of collected data includes:
for each piece of collected data, performing word segmentation on the collected data to obtain the set of basic words of the collected data;
determining the frequency with which each basic word occurs in the collected data;
determining the basic words whose frequency is higher than a preset second threshold to be the keywords of the collected data.
A data collection device, including:
a target determination module, configured to determine a target topic and a target collection website;
a link determination module, configured to determine, among a plurality of web page links included in the target collection website, the target web page links corresponding to the target topic;
a collected data obtaining module, configured to collect the content of the web page corresponding to each target web page link to obtain a plurality of pieces of collected data;
a result data determination module, configured to determine a result data set according to the matching degree between the target topic and each piece of collected data.
In a specific embodiment of the present invention, the device further includes:
a link filtering module, configured to filter the determined target web page links corresponding to the target topic, after the target web page links corresponding to the target topic are determined and before the content of the web page corresponding to each target web page link is collected.
In a specific embodiment of the present invention, the target determination module is specifically configured to:
determine the target topic and the target collection website according to a keyword input by a user.
In a specific embodiment of the present invention, the result data determination module includes:
a keyword determination submodule, configured to determine the keywords of each piece of collected data;
a text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;
a result data determination submodule, configured to, for each piece of collected data, merge that collected data into the result data set if the text similarity between the target topic and the keywords of that collected data is higher than a preset first threshold.
In a specific embodiment of the present invention, the keyword determination submodule is specifically configured to:
for each piece of collected data, perform word segmentation on the collected data to obtain the set of basic words of the collected data;
determine the frequency with which each basic word occurs in the collected data;
determine the basic words whose frequency is higher than a preset second threshold to be the keywords of the collected data.
With the technical solution provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target web page links corresponding to the target topic are determined among the plurality of web page links included in the target collection website, the content of the web page corresponding to each target web page link is collected to obtain a plurality of pieces of collected data, and the result data set can be determined according to the matching degree between the target topic and each piece of collected data. Because the target web page links corresponding to the target topic are determined in a directed manner, less content is collected from the web pages corresponding to those links and that content is more relevant to the target topic, which improves the precision of data collection and the data value density.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an implementation of a data collection method in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data collection device in an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a data collection method that can be applied in scenarios where a search engine provides retrieval services to users. A search engine is a system that collects information from the Internet, organizes and processes it, provides retrieval services to users, and presents to the user the information relevant to the user's search.
The technical solution provided by the embodiments of the present invention can carry out data collection intelligently. According to the determined target topic, and using the directed filtering capability of a search engine combined with a secondary content filtering method, it can accurately filter out the content in the target collection website that is closely related to the target topic.
Referring to Fig. 1, which is a flowchart of an implementation of the data collection method provided by an embodiment of the present invention, the method may include the following steps:
S110: Determine a target topic and a target collection website.
When a user needs to collect data, the target topic and the target collection website of the data to be collected can be determined first.
In a specific embodiment of the present invention, the target topic and the target collection website can be determined according to a keyword input by the user.
In an embodiment of the present invention, an input interface can be provided to the user, through which the user can input a keyword according to his or her needs. The keyword may be any one or more nouns, such as a company name, a person's name, an event, or a relationship. The keyword input by the user can be used directly as the target topic.
The user can also input the link address of the target collection website through the input interface, so that the target collection website can be determined from the link address input by the user.
Alternatively, the target collection website can be determined automatically from the determined target topic. For example, a large number of correspondences between topics and websites are established in advance; once the target topic is determined, the target collection website corresponding to it can be looked up in the pre-established correspondences.
The embodiment of the present invention is applicable to data collection for any topic and any collection website, and is therefore highly general.
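As an illustration only, step S110 might be sketched as follows in Python. The names `Target`, `determine_target` and `TOPIC_TO_SITE`, and the example URLs, are hypothetical and do not come from the patent; the sketch assumes the keyword is used directly as the target topic and the website comes either from a user-supplied link address or from a pre-built topic-to-website correspondence.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical pre-built correspondence between topics and websites
# (illustrative values only; the patent does not list any).
TOPIC_TO_SITE: Dict[str, str] = {
    "example corp": "https://news.example.com",
    "city marathon": "https://sports.example.org",
}


@dataclass(frozen=True)
class Target:
    topic: str
    site: str


def determine_target(keyword: str,
                     site_url: Optional[str] = None,
                     mapping: Dict[str, str] = TOPIC_TO_SITE) -> Target:
    """Use the keyword as the topic; take the site from the user or the mapping."""
    topic = keyword.strip()
    if not topic:
        raise ValueError("a non-empty keyword is required")
    if site_url:
        # The user supplied the link address of the collection website.
        return Target(topic, site_url)
    # Otherwise look the topic up in the pre-built correspondence.
    site = mapping.get(topic.lower())
    if site is None:
        raise LookupError(f"no collection website is mapped to topic {topic!r}")
    return Target(topic, site)
```

A caller could write `determine_target("Example Corp")` to rely on the mapping, or pass a second argument to name the website explicitly.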
S120: Among the plurality of web page links included in the target collection website, determine the target web page links corresponding to the target topic.
In step S110, the target topic and the target collection website are determined. Every website includes a plurality of web page links, and the web pages corresponding to different links contain different content. The target collection website likewise includes a plurality of web page links.
Among the plurality of web page links included in the target collection website, the target web page links corresponding to the target topic can be determined. Specifically, taking the target collection website as the search scope, a series of target web page links related to the target topic can be filtered out. There may be one or more target web page links, and the content reached through each target web page link is related to the target topic.
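The patent does not spell out how relevance between a link and the topic is judged in step S120. The following minimal sketch assumes, purely for illustration, that each link on the site comes with its anchor text and that a link is related to the topic when the anchor text mentions every term of the topic. The function name `select_target_links` and the sample links are hypothetical; a site-restricted search engine query would be another way to obtain the same kind of list.

```python
from typing import Iterable, List, Tuple


def select_target_links(topic: str,
                        links: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the URLs whose anchor text mentions every term of the topic."""
    terms = topic.lower().split()
    selected = []
    for url, anchor_text in links:
        text = anchor_text.lower()
        # An empty topic selects nothing; otherwise all terms must appear.
        if terms and all(term in text for term in terms):
            selected.append(url)
    return selected


# (url, anchor text) pairs as they might be extracted from the target site.
site_links = [
    ("https://news.example.com/a1", "Example Corp opens new plant"),
    ("https://news.example.com/a2", "Weather forecast for the weekend"),
    ("https://news.example.com/a3", "Analysts review Example Corp earnings"),
]
target_links = select_target_links("Example Corp", site_links)
```

Here only the first and third links are kept, because the second anchor text does not mention the topic.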
S130: Collect the content of the web page corresponding to each target web page link to obtain a plurality of pieces of collected data.
In an embodiment of the present invention, for each target web page link, the full content of the web page corresponding to that link can be collected in an undirected manner, yielding a plurality of pieces of collected data.
In practical applications, multiple threads can be started, each collecting the web page content of a different target web page link, which avoids resource contention and improves collection efficiency.
Because the target web page links corresponding to the target topic are determined first and only then is the content of the corresponding web pages collected, the amount of collected content is smaller, which reduces the difficulty of subsequent processing.
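The multithreaded collection described for step S130 could look roughly like the sketch below, which uses Python's standard `concurrent.futures.ThreadPoolExecutor`. The function `collect_pages` and the offline stand-in `fake_fetch` are hypothetical; in a real collector the `fetch` argument would be a function that downloads a page, which the patent does not specify.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def collect_pages(links: List[str],
                  fetch: Callable[[str], str],
                  max_workers: int = 4) -> Dict[str, str]:
    """Fetch every link on a worker thread and return {link: page content}."""
    if not links:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so zip pairs each link
        # with the content fetched for it.
        return dict(zip(links, pool.map(fetch, links)))


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP download so the sketch runs offline."""
    return f"content of {url}"


collected = collect_pages(
    ["https://news.example.com/a1", "https://news.example.com/a3"],
    fake_fetch,
)
```

Each link is handled by its own task, so one slow page does not block the others; the pool size is an arbitrary choice for the example.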
In a specific embodiment of the present invention, the following step may also be included after step S120 and before step S130:
filtering the determined target web page links corresponding to the target topic.
After the target web page links corresponding to the target topic are determined, they can be filtered. Specifically, the correctness of the target web page links can be analyzed so that the correct links are kept and repeated links, invalid links and the like are deleted.
Further, in step S130, the content of the web pages corresponding to the web page links that remain after filtering is collected, which improves the efficiency of data collection.
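A minimal sketch of the link filtering step follows. It assumes, as an interpretation rather than something stated in the patent, that a "correct" link is an http or https URL with a host name, and that duplicates are exact string repeats. The function name `filter_links` is hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def filter_links(links: Iterable[str]) -> List[str]:
    """Drop malformed and repeated links, keeping first-seen order."""
    seen = set()
    kept = []
    for link in links:
        link = link.strip()
        parts = urlsplit(link)
        # Assumption: a valid link is an http(s) URL that names a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if link in seen:
            continue  # repeated link
        seen.add(link)
        kept.append(link)
    return kept
```

A stricter filter could also probe each link to confirm that it still resolves, at the cost of extra network requests before collection starts.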
S140: Determine a result data set according to the matching degree between the target topic and each piece of collected data.
The target topic is determined according to the user's needs, and the data ultimately obtained should be the data that best matches the target topic.
A plurality of pieces of collected data are obtained in step S130, and the matching degree between the target topic and each piece of collected data can be calculated. The result data set can then be determined according to that matching degree.
In a specific embodiment of the present invention, step S140 may include the following steps:
Step one: determine the keywords of each piece of collected data;
Step two: determine the text similarity between the target topic and the keywords of each piece of collected data;
Step three: for each piece of collected data, if the text similarity between the target topic and the keywords of that collected data is higher than a preset first threshold, merge that collected data into the result data set.
For ease of description, the above three steps are explained together.
Each piece of collected data can be regarded as being made up of a number of basic words. For each piece of collected data, its keywords can be determined from the basic words it contains.
In a specific embodiment of the present invention, step one above may include the following steps:
First step: for each piece of collected data, perform word segmentation on the collected data to obtain the set of basic words of the collected data;
Second step: determine the frequency with which each basic word occurs in the collected data;
Third step: determine the basic words whose frequency is higher than a preset second threshold to be the keywords of the collected data.
For each piece of collected data, the set of basic words of the collected data can be obtained by performing word segmentation on it. In an embodiment of the present invention, basic words are words that carry practical meaning, such as person names, place names, actions and objects of actions; function words without practical meaning, such as Chinese structural particles (for example "的", "地" and "得"), can be excluded.
It can be understood that the more frequently a basic word occurs in a piece of collected data, the better that basic word represents the meaning the collected data is intended to express. For a given basic word of a piece of collected data, its frequency in that collected data is the number of times the basic word occurs in the collected data divided by the sum of the numbers of occurrences of all basic words of the collected data.
For each piece of collected data, after the set of basic words of the collected data is obtained, the frequency with which each basic word occurs in the collected data can be determined, and the basic words whose frequency is higher than the preset second threshold are determined to be the keywords of the collected data.
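The keyword determination described above can be sketched as follows. This is only an illustration under stated assumptions: `simple_tokenize` is a whitespace tokenizer standing in for a real word segmenter (Chinese text would need one), `FUNCTION_WORDS` is a hypothetical English stop list standing in for the excluded function words, and the threshold values are arbitrary.

```python
from collections import Counter
from typing import Callable, FrozenSet, List, Set

# Hypothetical stop list standing in for "function words without
# practical meaning"; not taken from the patent.
FUNCTION_WORDS: FrozenSet[str] = frozenset(
    {"the", "a", "an", "of", "in", "and", "to", "is"}
)


def simple_tokenize(text: str) -> List[str]:
    """Whitespace tokenizer; a stand-in for a real word segmenter."""
    return [w.strip(".,;:!?\"'()").lower() for w in text.split()]


def extract_keywords(text: str,
                     second_threshold: float,
                     tokenize: Callable[[str], List[str]] = simple_tokenize,
                     function_words: FrozenSet[str] = FUNCTION_WORDS) -> Set[str]:
    """Return basic words whose relative frequency exceeds the threshold."""
    basic_words = [w for w in tokenize(text) if w and w not in function_words]
    counts = Counter(basic_words)
    total = sum(counts.values())  # occurrences of all basic words
    if total == 0:
        return set()
    # frequency = occurrences of the word / occurrences of all basic words
    return {w for w, c in counts.items() if c / total > second_threshold}


sample = ("Example Corp opens a plant. The plant is in Springfield "
          "and the plant hires.")
keywords = extract_keywords(sample, second_threshold=0.2)
```

In the sample text there are eight basic words, of which "plant" occurs three times (frequency 0.375) and every other word once (0.125), so only "plant" clears a threshold of 0.2.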
Further, the text similarity between the target topic and the keywords of each piece of collected data can be determined. Specifically, an existing text similarity algorithm from the prior art can be used, which is not described further in the embodiments of the present invention.
For each piece of collected data, if the text similarity between the target topic and the keywords of that collected data is higher than the preset first threshold, this indicates that the collected data is relatively relevant to the target topic, and the collected data can be merged into the result data set.
It should be noted that the first threshold and the second threshold can be set and adjusted according to the actual situation, and the embodiments of the present invention place no limitation on them.
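Since the patent leaves the choice of text similarity algorithm open, the sketch below substitutes Jaccard similarity between the set of topic terms and the keyword set of each piece of collected data. That choice, the names `jaccard_similarity` and `build_result_set`, and the sample keyword sets are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_result_set(topic: str,
                     keywords_by_item: Dict[str, Set[str]],
                     first_threshold: float) -> List[str]:
    """Keep the items whose keywords are similar enough to the topic."""
    topic_terms = set(topic.lower().split())
    return [
        item
        for item, item_keywords in keywords_by_item.items()
        if jaccard_similarity(topic_terms, item_keywords) > first_threshold
    ]


# Keyword sets as they might come out of the keyword determination step.
keywords_by_item = {
    "article-1": {"example", "corp", "plant"},
    "article-2": {"weather", "weekend"},
    "article-3": {"example", "earnings", "analysts", "review"},
}
results = build_result_set("Example Corp", keywords_by_item, first_threshold=0.5)
```

With a first threshold of 0.5, only "article-1" (similarity 2/3) enters the result set; "article-3" shares one term with the topic but scores just 1/5.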
With the method provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target web page links corresponding to the target topic are determined among the plurality of web page links included in the target collection website, the content of the web page corresponding to each target web page link is collected to obtain a plurality of pieces of collected data, and the result data set can be determined according to the matching degree between the target topic and each piece of collected data. Because the target web page links corresponding to the target topic are determined in a directed manner, less content is collected from the web pages corresponding to those links and that content is more relevant to the target topic, which improves the precision of data collection and the data value density.
In addition, by relying on the millisecond-level search capability of a search engine, the embodiments of the present invention can complete a directed collection task within a few seconds.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a data collection device. The data collection device described below and the data collection method described above may be referred to in correspondence with each other.
Referring to Fig. 2, the device may include the following modules:
a target determination module 210, configured to determine a target topic and a target collection website;
a link determination module 220, configured to determine, among a plurality of web page links included in the target collection website, the target web page links corresponding to the target topic;
a collected data obtaining module 230, configured to collect the content of the web page corresponding to each target web page link to obtain a plurality of pieces of collected data;
a result data determination module 240, configured to determine a result data set according to the matching degree between the target topic and each piece of collected data.
With the device provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target web page links corresponding to the target topic are determined among the plurality of web page links included in the target collection website, the content of the web page corresponding to each target web page link is collected to obtain a plurality of pieces of collected data, and the result data set can be determined according to the matching degree between the target topic and each piece of collected data. Because the target web page links corresponding to the target topic are determined in a directed manner, less content is collected from the web pages corresponding to those links and that content is more relevant to the target topic, which improves the precision of data collection and the data value density.
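One way to picture how modules 210 to 240 cooperate is a small class that holds one callable per module and runs them in order. This is a hypothetical software arrangement, not the structure claimed in the patent; the class name `DataCollectionDevice`, its field names and the lambda stand-ins are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DataCollectionDevice:
    # One callable per module; the numbers refer to Fig. 2.
    determine_target: Callable[[str], Tuple[str, str]]               # 210
    determine_links: Callable[[str, str], List[str]]                 # 220
    collect: Callable[[List[str]], Dict[str, str]]                   # 230
    determine_results: Callable[[str, Dict[str, str]], List[str]]    # 240

    def run(self, keyword: str) -> List[str]:
        """Run the four modules in the order S110 to S140."""
        topic, site = self.determine_target(keyword)
        links = self.determine_links(topic, site)
        collected = self.collect(links)
        return self.determine_results(topic, collected)


# Trivial stand-ins so the sketch runs without a network or a segmenter.
device = DataCollectionDevice(
    determine_target=lambda kw: (kw, "https://news.example.com"),
    determine_links=lambda topic, site: [f"{site}/a1", f"{site}/a2"],
    collect=lambda links: {link: f"page about {link[-2:]}" for link in links},
    determine_results=lambda topic, pages: [
        url for url, text in pages.items() if "a1" in text
    ],
)
output = device.run("Example Corp")
```

Passing the modules in as callables keeps each one replaceable, which mirrors the description's point that the functions may be realized in hardware, software, or a combination of the two.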
In a specific embodiment of the present invention, the device further includes:
a link filtering module, configured to filter the determined target web page links corresponding to the target topic, after the target web page links corresponding to the target topic are determined and before the content of the web page corresponding to each target web page link is collected.
In a specific embodiment of the present invention, the target determination module 210 is specifically configured to:
determine the target topic and the target collection website according to a keyword input by a user.
In a specific embodiment of the present invention, the result data determination module 240 includes:
a keyword determination submodule, configured to determine the keywords of each piece of collected data;
a text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;
a result data determination submodule, configured to, for each piece of collected data, merge that collected data into the result data set if the text similarity between the target topic and the keywords of that collected data is higher than a preset first threshold.
In a specific embodiment of the present invention, the keyword determination submodule is specifically configured to:
for each piece of collected data, perform word segmentation on the collected data to obtain the set of basic words of the collected data;
determine the frequency with which each basic word occurs in the collected data;
determine the basic words whose frequency is higher than a preset second threshold to be the keywords of the collected data.
In this specification, the embodiments are described progressively; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Specific examples are used herein to explain the principles and implementations of the present invention, and the above description of the embodiments is only intended to help in understanding the technical solution of the present invention and its core idea. It should be pointed out that those of ordinary skill in the art can make a number of improvements and modifications to the present invention without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.
Claims (10)
1. A data collection method, characterized by comprising:
determining a target topic and a target collection website;
determining, among a plurality of web page links included in the target collection website, the target web page links corresponding to the target topic;
collecting the content of the web page corresponding to each target web page link to obtain a plurality of pieces of collected data;
determining a result data set according to the matching degree between the target topic and each piece of collected data.
2. The data collection method according to claim 1, characterized in that, after the determining of the target web page links corresponding to the target topic and before the collecting of the content of the web page corresponding to each target web page link, the method further comprises:
filtering the determined target web page links corresponding to the target topic.
3. The data collection method according to claim 1, characterized in that the determining of the target topic and the target collection website comprises:
determining the target topic and the target collection website according to a keyword input by a user.
4. The data collection method according to any one of claims 1 to 3, characterized in that the determining of the result data set according to the matching degree between the target topic and each piece of collected data comprises:
determining the keywords of each piece of collected data;
determining the text similarity between the target topic and the keywords of each piece of collected data;
for each piece of collected data, if the text similarity between the target topic and the keywords of that collected data is higher than a preset first threshold, merging that collected data into the result data set.
5. The data collection method according to claim 4, characterized in that the determining of the keywords of each piece of collected data comprises:
for each piece of collected data, performing word segmentation on the collected data to obtain the set of basic words of the collected data;
determining the frequency with which each basic word occurs in the collected data;
determining the basic words whose frequency is higher than a preset second threshold to be the keywords of the collected data.
6. A data collection device, characterized by comprising:
a target determination module, configured to determine a target topic and a target collection website;
a link determination module, configured to determine, among a plurality of web page links included in the target collection website, the target web page links corresponding to the target topic;
a collected data obtaining module, configured to collect the content of the web page corresponding to each target web page link to obtain a plurality of pieces of collected data;
a result data determination module, configured to determine a result data set according to the matching degree between the target topic and each piece of collected data.
7. The data collection device according to claim 6, characterized by further comprising:
a link filtering module, configured to filter the determined target web page links corresponding to the target topic, after the target web page links corresponding to the target topic are determined and before the content of the web page corresponding to each target web page link is collected.
8. The data collection device according to claim 6, characterized in that the target determination module is specifically configured to:
determine the target topic and the target collection website according to a keyword input by a user.
9. The data collection device according to any one of claims 6 to 8, characterized in that the result data determination module comprises:
a keyword determination submodule, configured to determine the keywords of each piece of collected data;
a text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;
a result data determination submodule, configured to, for each piece of collected data, merge that collected data into the result data set if the text similarity between the target topic and the keywords of that collected data is higher than a preset first threshold.
10. The data collection device according to claim 9, characterized in that the keyword determination submodule is specifically configured to:
for each piece of collected data, perform word segmentation on the collected data to obtain the set of basic words of the collected data;
determine the frequency with which each basic word occurs in the collected data;
determine the basic words whose frequency is higher than a preset second threshold to be the keywords of the collected data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610998106.4A CN106528802A (en) | 2016-11-11 | 2016-11-11 | Data collecting method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610998106.4A CN106528802A (en) | 2016-11-11 | 2016-11-11 | Data collecting method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106528802A true CN106528802A (en) | 2017-03-22 |
Family
ID=58351460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610998106.4A Pending CN106528802A (en) | 2016-11-11 | 2016-11-11 | Data collecting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106528802A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107704535A (en) * | 2017-09-21 | 2018-02-16 | 广州大学 | Web page information acquisition method, apparatus and system based on topic similarity |
CN109446425A (en) * | 2018-10-30 | 2019-03-08 | 郑州市景安网络科技股份有限公司 | Network information gathering and dissemination method and system |
CN110297994A (en) * | 2019-06-03 | 2019-10-01 | 北京金蝶管理软件有限公司 | Web data acquisition method, device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561814A (en) * | 2009-05-08 | 2009-10-21 | 华中科技大学 | Topic crawler system based on social labels |
CN101630327A (en) * | 2009-08-14 | 2010-01-20 | 昆明理工大学 | Design method of theme network crawler system |
CN103714140A (en) * | 2013-12-23 | 2014-04-09 | 北京锐安科技有限公司 | Searching method and device based on topic-focused web crawler |
CN104978408A (en) * | 2015-08-05 | 2015-10-14 | 许昌学院 | Berkeley DB database based topic crawler system |
CN105653546A (en) * | 2014-11-11 | 2016-06-08 | 北大方正集团有限公司 | Method and system for searching target theme |
- 2016-11-11: CN application CN201610998106.4A filed, published as CN106528802A (en); status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106503014B (en) | | Real-time information recommendation method, device and system |
CN103870461B (en) | | Subject recommending method, device and server |
KR102080362B1 (en) | | Query expansion |
CN105138558B (en) | | Real-time personalized information collection method based on user-accessed content |
CN103500213B (en) | | Page hot-spot resource updating method and device based on pre-reading |
CN102831193A (en) | | Topic detecting device and topic detecting method based on distributed multistage clustering |
CN103116638B (en) | | Webpage screening method and device thereof |
CN106021418B (en) | | Clustering method and device for media events |
CN103984757B (en) | | Method and system for inserting news information entries into search results pages |
CN104021140B (en) | | Method and device for processing Internet video |
CN105095211A (en) | | Acquisition method and device for multimedia data |
CN103744877A (en) | | Public opinion monitoring application system deployed in internet and application method |
CN105528422A (en) | | Focused crawler processing method and apparatus |
CN106844640A (en) | | Web data analysis and processing method |
CN106528802A (en) | | Data collecting method and device |
CN103425650A (en) | | Recommendation searching method and recommendation searching system |
CN104182482A (en) | | Method for judging news list page and method for screening news list page |
CN107277115A (en) | | Content delivery method and device |
CN109635084A (en) | | Real-time fast deduplication method and system for multi-source data documents |
CN107688563B (en) | | Synonym recognition method and recognition device |
CN108536700A (en) | | Method for collecting logs without embedded tracking points |
CN104239285A (en) | | New article chapter detecting method and device |
CN110008393B (en) | | Method and equipment for acquiring website information |
CN103595747A (en) | | User-information recommending method and system |
CN107070897A (en) | | Network log storage method based on multi-attribute hash deduplication in an intrusion detection system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170322 |