CN106528802A - Data collecting method and device - Google Patents


Info

Publication number
CN106528802A
Authority
CN
China
Prior art keywords
target
gathered data
data
gathered
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610998106.4A
Other languages
Chinese (zh)
Inventor
陈桓
蔡晓胜
张良杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kingdee Software China Co Ltd
Original Assignee
Kingdee Software China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kingdee Software China Co Ltd
Priority to CN201610998106.4A
Publication of CN106528802A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a data collection method and device. The method comprises the following steps: a target topic and a target collection website are determined; among the multiple web page links included in the target collection website, the target web page links corresponding to the target topic are determined; the content of the web page corresponding to each target web page link is collected to obtain multiple pieces of collected data; and a result data set is determined according to the matching degree between the target topic and each piece of collected data. Because the target web page links corresponding to the target topic are determined in a directed manner, less content is collected from the corresponding web pages and that content is more relevant to the target topic, which improves the precision of data collection and the value density of the data.

Description

A data collection method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a data collection method and device.
Background
With the rapid development of Internet technology, big data is being applied more and more widely. In big data scenarios, the demand for data collection is gradually increasing.
In the prior art, when data on a certain topic is needed, a large amount of data is usually obtained from the Internet by a non-directional crawler, and the data related to the topic is then filtered out of that mass of data by a complex data matching algorithm.
This approach has certain shortcomings: the volume of base data is too large and the proportion of irrelevant data is high, so it is often difficult to correctly pick out the data closely related to the topic, and precision is low. In the big data era, the value density of data obtained this way is low.
Summary of the invention
The object of the present invention is to provide a data collection method and device that improve the precision of data collection and the value density of the data.
To solve the above technical problem, the present invention provides the following technical solution:
A data collection method, comprising:
Determining a target topic and a target collection website;
Determining, among the multiple web page links included in the target collection website, the target web page links corresponding to the target topic;
Collecting the content of the web page corresponding to each target web page link to obtain multiple pieces of collected data;
Determining a result data set according to the matching degree between the target topic and each piece of collected data.
In a specific embodiment of the present invention, after the target web page links corresponding to the target topic are determined and before the content of the web page corresponding to each target web page link is collected, the method further comprises:
Filtering the determined target web page links corresponding to the target topic.
In a specific embodiment of the present invention, determining the target topic and the target collection website comprises:
Determining the target topic and the target collection website according to a keyword input by a user.
In a specific embodiment of the present invention, determining the result data set according to the matching degree between the target topic and each piece of collected data comprises:
Determining the keywords of each piece of collected data;
Determining the text similarity between the target topic and the keywords of each piece of collected data;
For each piece of collected data, if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold, adding that piece of collected data to the result data set.
In a specific embodiment of the present invention, determining the keywords of each piece of collected data comprises:
For each piece of collected data, performing word segmentation on it to obtain its set of basic words;
Determining the frequency with which each basic word occurs in that piece of collected data;
Determining the basic words whose frequency is higher than a preset second threshold as the keywords of that piece of collected data.
A data collection device, comprising:
A target determination module, configured to determine a target topic and a target collection website;
A link determination module, configured to determine, among the multiple web page links included in the target collection website, the target web page links corresponding to the target topic;
A collected data obtaining module, configured to collect the content of the web page corresponding to each target web page link to obtain multiple pieces of collected data;
A result data determination module, configured to determine a result data set according to the matching degree between the target topic and each piece of collected data.
In a specific embodiment of the present invention, the device further comprises:
A link filtering module, configured to filter the determined target web page links corresponding to the target topic after those links are determined and before the content of the web page corresponding to each target web page link is collected.
In a specific embodiment of the present invention, the target determination module is specifically configured to:
Determine the target topic and the target collection website according to a keyword input by a user.
In a specific embodiment of the present invention, the result data determination module comprises:
A keyword determination submodule, configured to determine the keywords of each piece of collected data;
A text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;
A result data determination submodule, configured to, for each piece of collected data, add that piece of collected data to the result data set if the text similarity between the target topic and its keywords is higher than a preset first threshold.
In a specific embodiment of the present invention, the keyword determination submodule is specifically configured to:
For each piece of collected data, perform word segmentation on it to obtain its set of basic words;
Determine the frequency with which each basic word occurs in that piece of collected data;
Determine the basic words whose frequency is higher than a preset second threshold as the keywords of that piece of collected data.
With the technical solution provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target web page links corresponding to the target topic are determined among the multiple web page links included in the target collection website, the content of the web page corresponding to each target web page link is collected to obtain multiple pieces of collected data, and the result data set is determined according to the matching degree between the target topic and each piece of collected data. Because the target web page links corresponding to the target topic are determined in a directed manner, less content is collected from the corresponding web pages and that content is more relevant to the target topic, which improves the precision of data collection and the value density of the data.
Brief description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an implementation of a data collection method in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data collection device in an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a data collection method that can be applied in scenarios where a search engine provides retrieval services to users. A search engine is a system that collects information from the Internet, organizes and processes it, provides retrieval services to users, and presents the information relevant to a user's query to that user.
The technical solution provided by the embodiments of the present invention can collect data intelligently: based on the determined target topic, it uses the directional filtering capability of a search engine, combined with a secondary content filtering method, to accurately pick out the content of the target collection website that is closely related to the target topic.
Referring to Fig. 1, a flowchart of an implementation of the data collection method provided by an embodiment of the present invention, the method may include the following steps:
S110: Determine a target topic and a target collection website.
When a user needs to collect data, the target topic and the target collection website of the data to be collected can be determined first.
In a specific embodiment of the present invention, the target topic and the target collection website can be determined according to a keyword input by the user.
In the embodiments of the present invention, an input interface can be provided to the user, through which the user can enter a keyword according to their needs. The keyword can be any one or more terms such as an enterprise name, a person's name, an event, or a relationship. The keyword entered by the user can be used directly as the target topic.
The user can also enter the link address of the target collection website through the input interface, so that the target collection website can be determined from the link address entered by the user.
Alternatively, the target collection website can be determined automatically from the determined target topic. For example, a large number of correspondences between topics and websites are established in advance; once the target topic is determined, the target collection website corresponding to it can be looked up in the pre-established correspondences.
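The two ways of determining the target collection website described above (a link address supplied by the user, or a lookup in pre-established topic-to-website correspondences) can be sketched as follows. This is only an illustration: the mapping contents, the example URLs, and the function name `resolve_target_site` are assumptions and do not come from the patent.

```python
from typing import Optional

# Hypothetical pre-established correspondences between topics and websites.
TOPIC_TO_SITE = {
    "enterprise news": "https://news.example.com",
    "court rulings": "https://rulings.example.org",
}


def resolve_target_site(topic: str, user_url: Optional[str] = None) -> Optional[str]:
    """Prefer a link address entered by the user; otherwise look the topic up."""
    if user_url:
        return user_url
    # Normalise the topic so that lookups ignore case and surrounding spaces.
    return TOPIC_TO_SITE.get(topic.strip().lower())
```

A topic with no entry in the mapping yields `None`, which a caller could treat as a prompt to ask the user for a link address.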
The embodiments of the present invention are applicable to data collection for any topic and any collection website, and are therefore highly general.
S120: Among the multiple web page links included in the target collection website, determine the target web page links corresponding to the target topic.
In step S110, the target topic and the target collection website were determined. Every website includes multiple web page links, and the web pages corresponding to different links contain different content. The target collection website likewise includes multiple web page links.
Among the multiple web page links included in the target collection website, the target web page links corresponding to the target topic can be determined. Specifically, taking the target collection website as the scope, a series of target web page links related to the target topic can be filtered out. There may be one or more target web page links, and the content behind each target web page link is related to the target topic.
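As a rough illustration of step S120, the sketch below keeps only the links of a site whose anchor text mentions the target topic. The patent relies on the directional filtering capability of a search engine for this step; the substring match used here is a deliberately simple stand-in, and the function name and the `(url, anchor_text)` input shape are assumptions.

```python
from typing import Iterable, List, Tuple


def select_target_links(topic: str, links: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the URLs whose anchor text contains the topic (case-insensitive)."""
    needle = topic.casefold()
    return [url for url, anchor_text in links if needle in anchor_text.casefold()]
```

For example, given the pairs `("https://s.example/a", "Acme releases cloud ERP")` and `("https://s.example/b", "Weather today")`, the topic `"acme"` selects only the first URL.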
S130: Collect the content of the web page corresponding to each target web page link to obtain multiple pieces of collected data.
In the embodiments of the present invention, for each target web page link, the entire content of the corresponding web page can be collected in a non-directional manner, yielding multiple pieces of collected data.
In practical applications, multiple threads can be started, each collecting the web page content of different target web page links, which avoids resource contention and improves collection efficiency.
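The multi-threaded collection described above might look like the following sketch, which uses a thread pool from the Python standard library. The function names, the worker count, and the UTF-8 decoding are assumptions; the `fetch` parameter is injectable so that the downloader can be replaced (for example in tests).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List
from urllib.request import urlopen


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it as text (assumes UTF-8, replacing bad bytes)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect_pages(
    urls: List[str],
    fetch: Callable[[str], str] = fetch_page,
    workers: int = 4,
) -> Dict[str, str]:
    """Fetch every URL on a worker thread and map each URL to its page content."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        contents = list(pool.map(fetch, urls))
    return dict(zip(urls, contents))
```

Each link is handled by exactly one worker, so the threads do not compete for the same page. In this sketch a failed download raises its exception to the caller; a production crawler would more likely record the failure and continue.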
Because the target web page links corresponding to the target topic are determined first, and only then is the content of the web page behind each target web page link collected, the amount of content collected is smaller, which reduces the difficulty of subsequent processing.
In a specific embodiment of the present invention, the following step may also be included after step S120 and before step S130:
Filter the determined target web page links corresponding to the target topic.
After the target web page links corresponding to the target topic are determined, they can be filtered. Specifically, the correctness of the target web page links can be analyzed so as to keep the correct links and delete duplicate links, invalid links, and the like.
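A minimal sketch of this link filtering step is shown below. It treats a link as invalid when it is not a well-formed `http` or `https` URL, and treats two links as duplicates when they differ only in their fragment. Both rules are assumptions made for illustration; the patent does not define how correctness is checked, and this sketch does not test whether a link is actually reachable.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urlparse


def filter_links(links: Iterable[str]) -> List[str]:
    """Drop malformed and duplicate links, keeping the first-seen order."""
    seen = set()
    kept = []
    for link in links:
        # Strip whitespace and the "#fragment" part, which does not change the page.
        url, _fragment = urldefrag(link.strip())
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # treated as an invalid link in this sketch
        if url in seen:
            continue  # duplicate link
        seen.add(url)
        kept.append(url)
    return kept
```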
Further, in step S130, the content of the web page corresponding to each link that remains after filtering is collected, which improves the efficiency of data collection.
S140: Determine a result data set according to the matching degree between the target topic and each piece of collected data.
The target topic is determined according to the user's needs, and the data finally obtained should be the data that best matches the target topic.
Multiple pieces of collected data are obtained in step S130, and the matching degree between the target topic and each piece can be calculated. The result data set can then be determined according to these matching degrees.
In a specific embodiment of the present invention, step S140 may include the following steps:
Step 1: Determine the keywords of each piece of collected data;
Step 2: Determine the text similarity between the target topic and the keywords of each piece of collected data;
Step 3: For each piece of collected data, if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold, add that piece of collected data to the result data set.
For ease of description, the three steps above are explained together.
Each piece of collected data can be regarded as being made up of multiple basic words. For each piece of collected data, its keywords can be determined from the basic words it contains.
In a specific embodiment of the present invention, Step 1 above may include the following sub-steps:
First sub-step: For each piece of collected data, perform word segmentation on it to obtain its set of basic words;
Second sub-step: Determine the frequency with which each basic word occurs in that piece of collected data;
Third sub-step: Determine the basic words whose frequency is higher than a preset second threshold as the keywords of that piece of collected data.
For each piece of collected data, after word segmentation is performed on it, the set of its basic words is obtained. In the embodiments of the present invention, basic words are words with substantive meaning, such as person names, place names, actions, and the objects of actions; function words that carry no substantive meaning, such as the particle "得", can be excluded.
It can be understood that the more frequently a basic word occurs in a piece of collected data, the better it represents the meaning that piece of collected data is meant to express. For a given basic word of a piece of collected data, its frequency in that piece is: (the number of times the basic word occurs in that piece of collected data) / (the sum of the numbers of occurrences of all basic words of that piece of collected data).
For each piece of collected data, after the set of its basic words is obtained, the frequency with which each basic word occurs in it can be determined, and the basic words whose frequency is higher than the preset second threshold are determined as the keywords of that piece of collected data.
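The keyword determination just described (word segmentation, relative frequency, second threshold) can be sketched as follows. The regular-expression tokenizer and the small English stop-word list are stand-ins chosen for illustration: Chinese text would need a dedicated word segmenter, and the threshold value of 0.2 is an arbitrary assumption.

```python
import re
from collections import Counter
from typing import Dict, List

# Illustrative function words to exclude; not taken from the patent.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def basic_words(text: str) -> List[str]:
    """Split text into lowercase tokens and drop function words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def word_frequencies(text: str) -> Dict[str, float]:
    """Frequency of a word = its count / total count of all basic words."""
    counts = Counter(basic_words(text))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def extract_keywords(text: str, second_threshold: float = 0.2) -> List[str]:
    """Keep the basic words whose frequency is higher than the second threshold."""
    freqs = word_frequencies(text)
    return sorted(w for w, f in freqs.items() if f > second_threshold)
```

Because the frequencies of all basic words in one piece of collected data sum to 1, the second threshold effectively limits how many keywords a piece can have: with a threshold of 0.2, at most four words can qualify.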
Further, the text similarity between the target topic and the keywords of each piece of collected data can be determined. Specifically, an existing text similarity algorithm from the prior art can be used, which is not described further here.
For each piece of collected data, if the text similarity between the target topic and its keywords is higher than the preset first threshold, that piece of collected data is relatively close to the target topic and can be added to the result data set.
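The thresholding against the first threshold can be sketched as below. Since the patent leaves the choice of text similarity algorithm open, Jaccard similarity between the topic's words and a record's keywords is used here purely as one possible stand-in; the function names, the record identifiers, and the threshold of 0.3 are all assumptions.

```python
from typing import Dict, Iterable, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_result_set(
    topic: str,
    keywords_by_record: Dict[str, Iterable[str]],
    first_threshold: float = 0.3,
) -> List[str]:
    """Return the ids of records whose keywords are similar enough to the topic."""
    topic_words = set(topic.lower().split())
    return [
        record_id
        for record_id, keywords in keywords_by_record.items()
        if jaccard(topic_words, {k.lower() for k in keywords}) > first_threshold
    ]
```

Any similarity measure that returns a comparable score (cosine similarity over term vectors, for instance) could be substituted for `jaccard` without changing the thresholding logic.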
It should be noted that the first threshold and the second threshold can be set and adjusted according to the actual situation, and the embodiments of the present invention place no limitation on them.
With the method provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target web page links corresponding to the target topic are determined among the multiple web page links included in the target collection website, the content of the web page corresponding to each target web page link is collected to obtain multiple pieces of collected data, and the result data set is determined according to the matching degree between the target topic and each piece of collected data. Because the target web page links corresponding to the target topic are determined in a directed manner, less content is collected from the corresponding web pages and that content is more relevant to the target topic, which improves the precision of data collection and the value density of the data.
In addition, by relying on the millisecond-level search capability of a search engine, the embodiments of the present invention can complete a directed collection task within a few seconds.
Corresponding to the above method embodiment, an embodiment of the present invention also provides a data collection device. The data collection device described below and the data collection method described above may be referred to in correspondence with each other.
Referring to Fig. 2, the device may include the following modules:
A target determination module 210, configured to determine a target topic and a target collection website;
A link determination module 220, configured to determine, among the multiple web page links included in the target collection website, the target web page links corresponding to the target topic;
A collected data obtaining module 230, configured to collect the content of the web page corresponding to each target web page link to obtain multiple pieces of collected data;
A result data determination module 240, configured to determine a result data set according to the matching degree between the target topic and each piece of collected data.
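One way to picture how modules 210 to 240 fit together is as four pluggable callables chained in order, as in the sketch below. The class name `DataCollectionDevice`, the field names, and the signatures are hypothetical; the patent describes the modules functionally and does not prescribe any software structure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DataCollectionDevice:
    # Module 210: user keyword -> (target topic, target collection website)
    determine_target: Callable[[str], Tuple[str, str]]
    # Module 220: (topic, website) -> target web page links
    determine_links: Callable[[str, str], List[str]]
    # Module 230: target web page links -> pieces of collected data
    obtain_data: Callable[[List[str]], List[str]]
    # Module 240: (topic, collected data) -> result data set
    determine_results: Callable[[str, List[str]], List[str]]

    def run(self, user_keyword: str) -> List[str]:
        """Run the four modules in the order S110, S120, S130, S140."""
        topic, site = self.determine_target(user_keyword)
        links = self.determine_links(topic, site)
        records = self.obtain_data(links)
        return self.determine_results(topic, records)
```

Keeping each module behind a plain callable makes it straightforward to insert an optional link filtering module between `determine_links` and `obtain_data`.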
With the device provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target web page links corresponding to the target topic are determined among the multiple web page links included in the target collection website, the content of the web page corresponding to each target web page link is collected to obtain multiple pieces of collected data, and the result data set is determined according to the matching degree between the target topic and each piece of collected data. Because the target web page links corresponding to the target topic are determined in a directed manner, less content is collected from the corresponding web pages and that content is more relevant to the target topic, which improves the precision of data collection and the value density of the data.
In a specific embodiment of the present invention, the device further includes:
A link filtering module, configured to filter the determined target web page links corresponding to the target topic after those links are determined and before the content of the web page corresponding to each target web page link is collected.
In a specific embodiment of the present invention, the target determination module 210 is specifically configured to:
Determine the target topic and the target collection website according to a keyword input by a user.
In a specific embodiment of the present invention, the result data determination module 240 includes:
A keyword determination submodule, configured to determine the keywords of each piece of collected data;
A text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;
A result data determination submodule, configured to, for each piece of collected data, add that piece of collected data to the result data set if the text similarity between the target topic and its keywords is higher than a preset first threshold.
In a specific embodiment of the present invention, the keyword determination submodule is specifically configured to:
For each piece of collected data, perform word segmentation on it to obtain its set of basic words;
Determine the frequency with which each basic word occurs in that piece of collected data;
Determine the basic words whose frequency is higher than a preset second threshold as the keywords of that piece of collected data.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Specific examples are used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the technical solution of the present invention and its core idea. It should be pointed out that a person of ordinary skill in the art can make a number of improvements and modifications to the present invention without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims (10)

1. A data collection method, characterized by comprising:
Determining a target topic and a target collection website;
Determining, among the multiple web page links included in the target collection website, the target web page links corresponding to the target topic;
Collecting the content of the web page corresponding to each target web page link to obtain multiple pieces of collected data;
Determining a result data set according to the matching degree between the target topic and each piece of collected data.
2. The data collection method according to claim 1, characterized in that, after the target web page links corresponding to the target topic are determined and before the content of the web page corresponding to each target web page link is collected, the method further comprises:
Filtering the determined target web page links corresponding to the target topic.
3. The data collection method according to claim 1, characterized in that determining the target topic and the target collection website comprises:
Determining the target topic and the target collection website according to a keyword input by a user.
4. The data collection method according to any one of claims 1 to 3, characterized in that determining the result data set according to the matching degree between the target topic and each piece of collected data comprises:
Determining the keywords of each piece of collected data;
Determining the text similarity between the target topic and the keywords of each piece of collected data;
For each piece of collected data, if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold, adding that piece of collected data to the result data set.
5. The data collection method according to claim 4, characterized in that determining the keywords of each piece of collected data comprises:
For each piece of collected data, performing word segmentation on it to obtain its set of basic words;
Determining the frequency with which each basic word occurs in that piece of collected data;
Determining the basic words whose frequency is higher than a preset second threshold as the keywords of that piece of collected data.
6. A data collection device, characterized by comprising:
A target determination module, configured to determine a target topic and a target collection website;
A link determination module, configured to determine, among the multiple web page links included in the target collection website, the target web page links corresponding to the target topic;
A collected data obtaining module, configured to collect the content of the web page corresponding to each target web page link to obtain multiple pieces of collected data;
A result data determination module, configured to determine a result data set according to the matching degree between the target topic and each piece of collected data.
7. The data collection device according to claim 6, characterized by further comprising:
A link filtering module, configured to filter the determined target web page links corresponding to the target topic after those links are determined and before the content of the web page corresponding to each target web page link is collected.
8. The data collection device according to claim 6, characterized in that the target determination module is specifically configured to:
Determine the target topic and the target collection website according to a keyword input by a user.
9. The data collection device according to any one of claims 6 to 8, characterized in that the result data determination module comprises:
A keyword determination submodule, configured to determine the keywords of each piece of collected data;
A text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;
A result data determination submodule, configured to, for each piece of collected data, add that piece of collected data to the result data set if the text similarity between the target topic and its keywords is higher than a preset first threshold.
10. The data collection device according to claim 9, characterized in that the keyword determination submodule is specifically configured to:
For each piece of collected data, perform word segmentation on it to obtain its set of basic words;
Determine the frequency with which each basic word occurs in that piece of collected data;
Determine the basic words whose frequency is higher than a preset second threshold as the keywords of that piece of collected data.
CN201610998106.4A 2016-11-11 2016-11-11 Data collecting method and device Pending CN106528802A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610998106.4A CN106528802A (en) 2016-11-11 2016-11-11 Data collecting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610998106.4A CN106528802A (en) 2016-11-11 2016-11-11 Data collecting method and device

Publications (1)

Publication Number Publication Date
CN106528802A (en) 2017-03-22

Family

ID=58351460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610998106.4A Pending CN106528802A (en) 2016-11-11 2016-11-11 Data collecting method and device

Country Status (1)

Country Link
CN (1) CN106528802A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704535A (en) * 2017-09-21 2018-02-16 广州大学 Info web acquisition methods, apparatus and system based on Topic Similarity
CN109446425A (en) * 2018-10-30 2019-03-08 郑州市景安网络科技股份有限公司 A kind of network information gathering and dissemination method, system
CN110297994A (en) * 2019-06-03 2019-10-01 北京金蝶管理软件有限公司 Acquisition method, device, computer equipment and the storage medium of web data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561814A (en) * 2009-05-08 2009-10-21 华中科技大学 Topic crawler system based on social labels
CN101630327A (en) * 2009-08-14 2010-01-20 昆明理工大学 Design method of theme network crawler system
CN103714140A (en) * 2013-12-23 2014-04-09 北京锐安科技有限公司 Searching method and device based on topic-focused web crawler
CN104978408A (en) * 2015-08-05 2015-10-14 许昌学院 Berkeley DB database based topic crawler system
CN105653546A (en) * 2014-11-11 2016-06-08 北大方正集团有限公司 Method and system for searching target theme

Similar Documents

Publication Publication Date Title
CN106503014B (en) Real-time information recommendation method, device and system
CN103870461B (en) Subject recommending method, device and server
KR102080362B1 (en) Query expansion
CN105138558B (en) The real time individual information collecting method of content is accessed based on user
CN103500213B (en) Page hot-spot resource updating method and device based on pre-reading
CN102831193A (en) Topic detecting device and topic detecting method based on distributed multistage cluster
CN103116638B (en) Webpage screening method and device thereof
CN106021418B (en) The clustering method and device of media event
CN103984757B (en) Search results pages is inserted the method and system of news information entry
CN104021140B (en) A kind of processing method and processing device of Internet video
CN105095211A (en) Acquisition method and device for multimedia data
CN103744877A (en) Public opinion monitoring application system deployed in internet and application method
CN105528422A (en) Focused crawler processing method and apparatus
CN106844640A (en) A kind of web data analysis and processing method
CN106528802A (en) Data collecting method and device
CN103425650A (en) Recommendation searching method and recommendation searching system
CN104182482A (en) Method for judging news list page and method for screening news list page
CN107277115A (en) A kind of content delivery method and device
CN109635084A (en) A kind of real-time quick De-weight method of multi-source data document and system
CN107688563B (en) Synonym recognition method and recognition device
CN108536700A (en) A kind of method that nothing buries a collector journal
CN104239285A (en) New article chapter detecting method and device
CN110008393B (en) Method and equipment for acquiring website information
CN103595747A (en) User-information recommending method and system
CN107070897A (en) Network log storage method based on many attribute Hash duplicate removals in intruding detection system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170322)