CN107239558A - Common interconnection network collecting method - Google Patents


Info

Publication number
CN107239558A
Authority
CN
China
Prior art keywords
affairs
collection
web page
web
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710433582.6A
Other languages
Chinese (zh)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201710433582.6A
Publication of CN107239558A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a general-purpose Internet data collection method. The method comprises: performing transaction scheduling; determining the type of a collection transaction and, if it is a media or file link, performing the corresponding file collection processing; if the address accessed by a web page collection transaction is not in the crawl history store, collecting the page as a newly discovered web page; if the address is in the crawl history store, obtaining the most recent collection information for that address from the store; and, if the time elapsed since the last visit exceeds the update frequency, comparing the current content size of the page with its size at the last visit and, if the two differ, fetching the page source, updating the collection information for the address in the crawl history store, and performing page cleaning and extraction. The method uses a transaction control strategy to collect data efficiently and mines data based on the coupling relationships among multi-dimensional objects.

Description

Common interconnection network collecting method
Technical field
The present invention relates to data retrieval, and more particularly to a general-purpose Internet data collection method.
Background
With the continuing development of Web technology, network information resources are growing at a geometric rate. How to quickly retrieve the useful, user-relevant data from the massive amount of information on the Internet has become an urgent problem. Search engines grew out of information retrieval technology. A search engine helps us better express and store basic information about the real world, and, by analyzing the link information it holds, it can serve as a useful tool for mining hidden information. Existing search engines rely solely on a limited set of search terms to express the user's need, so the need is expressed incompletely. Even for the same search term, different users may want different results. Take a microblog system as an example: if the relationships between microblogs and the objects they interact with are considered, the system can be abstracted as a heterogeneous network containing nodes such as microblogs, messages, tags, and users. Follow and follower relationships exist between microblogs; posting and forwarding relationships exist between microblogs and messages; a containment relationship exists between microblogs and tags; and an ownership relationship exists between users and microblogs. Existing search tools do not take the complex environment formed by such multi-dimensional objects into account when mining data.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a general-purpose Internet data collection method, comprising:
Step 1. The transaction obtains a non-empty collection transaction object from the transaction queue; if an empty transaction object is obtained, transaction scheduling is performed.
Step 2. Determine whether the depth of the collection transaction exceeds the maximum depth: the transaction reads, from the current collection transaction object, the collection depth at which that object sits; if the collection depth does not exceed the site collection depth configured in the system, the transaction proceeds to step 3.
Step 3. Determine the type of the collection transaction: if it is a web page collection transaction, perform step 4; otherwise perform step 5.
Step 4. Determine whether the link is a new web page or an unfinished web page link. If the address accessed by this collection transaction is not in the crawl history store, collect it as a newly discovered web page (step 7). If it is in the crawl history store, obtain from the store the most recent collection information for this address, namely the access address, access time, page size, update frequency, and root domain name. Compute whether the interval between the last access time and the current time exceeds the update frequency; if it does, compare the current content size of the page with its size at the last visit; if they are equal, do not collect; if they differ, continue with step 6.
Step 5. If the link is a media or file link, perform the corresponding file collection processing; if it is an illegal link, record it as an abnormal link.
Step 6. Fetch the page source of this web page link, update the collection information for this address in the crawl history store, and perform step 8.
Step 7. Collect the new web page: fetch the page source of this link and add an access record for this address to the crawl history store.
Step 8. Perform page cleaning and extraction. This step extracts the specified feature information from the page source: it removes useless information and noise data from the page source, then extracts the required information from the cleaned data.
Compared with the prior art, the present invention has the following advantage:
The present invention proposes a general-purpose Internet data collection method that uses a transaction control strategy to collect data efficiently and mines data based on the coupling relationships among multi-dimensional objects.
Brief description of the drawings
Fig. 1 is a flow chart of the general-purpose Internet data collection method according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the invention is provided below together with the drawing that illustrates the principle of the invention. The invention is described in connection with these embodiments but is not limited to any of them. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Many specific details are set out in the following description to provide a thorough understanding of the invention. These details are provided as examples, and the invention can be practiced according to the claims without some or all of them.
One aspect of the present invention provides a general-purpose Internet data collection method. Fig. 1 is a flow chart of the method according to an embodiment of the present invention.
The search engine of the present invention can be divided into a collection module and an analysis module. The collection module comprises a crawl-site database, a crawl-site scheduling unit, a transaction management controller, transaction containers, a data controller, and a basic database. The transaction management controller creates, starts, controls the running of, and destroys the multiple crawl transactions.
Each transaction has its own independent container for managing its resources. The container comprises: a crawl cache unit, which builds an in-memory queue to cache the sites the transaction is to crawl; a transaction cache unit, which caches the transaction's own data; a storage cache unit, which caches data waiting to be written to the database; a collection processing unit, which loads the collected data and handles data updates, link deduplication, and storage; a data cleaning and extraction unit, which cleans the collected code and extracts the useful information, namely the page title, keywords, summary, body text, the font types appearing in the page, and media feature information, and which also obtains the information used to rate page quality and the new sites to crawl found in the page; and a data storage and analysis unit, which converts the cleaned and extracted data into a form that is easy to store, compresses it, and builds the database query string to be executed.
The data controller handles data exchange between the program and the database. It comprises a global data-source cache unit, a data scheduling unit, and a data access management unit. The global crawl cache unit reduces the time that multiple transactions spend waiting on the critical resource and the number of database accesses they make; each web crawler has exactly one global crawl cache unit instance. The data access management unit handles the data interaction between the database and the program. The data scheduling unit schedules crawl data for individual transactions: when a transaction's crawl cache unit has nothing left to crawl, the data scheduling unit takes a number of crawl sites from the global crawl cache unit and hands them to that transaction's crawl cache unit. There is exactly one data scheduling unit instance in the whole program.
When the search engine's web crawler runs, it first reads the program configuration file and preloads into the cache the data that will be used during collection. The transaction manager initializes each transaction according to the configuration and controls its running. When a transaction obtains a task, it first checks the link for duplication, then analyzes the type of the link and applies the processing appropriate to that type. During collection it determines whether the task is a new collection or an update; after fetching the page source of the link, it cleans and filters the source and extracts the useful information according to the feature rules for web page information. The transaction converts the extracted information and caches it; when the amount of cached data waiting to be saved reaches a threshold, the transaction writes the cached data to the database. Meanwhile, the transaction manager periodically monitors the execution state of every transaction and manages any transaction that behaves abnormally.
Before crawling, all possible domain name combinations are enumerated in turn according to the domain name generation rules, and each candidate is probed to identify valid and invalid domain names, from which the root domain name store is built. The page source of navigation websites is then fetched, root site addresses and link text are extracted from it according to the root domain name composition rules, and the root domain name store is updated.
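The enumeration step can be sketched as follows. This is only an illustration under assumptions: the text does not state the domain name generation rules, so the alphabet, maximum label length, and suffix list below are hypothetical parameters, and the probing of each candidate (for example by a DNS lookup) is left out.

```python
import itertools
import string


def candidate_domains(max_len=2, alphabet=string.ascii_lowercase, suffixes=(".com",)):
    """Yield every label of length 1..max_len over `alphabet`, combined with each suffix.

    All three parameters are assumptions made for illustration; the source
    does not specify the actual generation rules.
    """
    for length in range(1, max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            label = "".join(chars)
            for suffix in suffixes:
                yield label + suffix


# Example with a tiny alphabet so the output stays readable.
print(list(candidate_domains(2, "ab", (".com",))))
```

Because the generator is lazy, candidates can be probed one at a time without holding the full combination space in memory.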
Each web crawler has exactly one global collection scheduling unit. When a transaction's crawl cache is empty, the transaction requests, or waits for, the data scheduling unit to deliver crawl sites from the global collection cache or from the database. The scheduling process is as follows: the transaction acquires control of the data scheduling unit and checks whether the global collection cache is empty; if it is empty, a number of crawl sites are read from the database into the global collection cache; if it is not empty, a number of crawl sites are moved from the global cache into the current transaction's crawl cache unit.
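A minimal sketch of this scheduling step is given below. The class name `Scheduler`, the `batch_size` parameter, and the use of an in-memory `deque` in place of the crawl-site database are all assumptions for illustration. The sketch also refills the global cache and hands a batch to the transaction in one call, which is one possible reading of the description.

```python
import threading
from collections import deque


class Scheduler:
    """Single global scheduler that hands batches of crawl sites to transactions."""

    def __init__(self, database_sites, batch_size=2):
        self.database = deque(database_sites)  # stands in for the crawl-site database
        self.global_cache = deque()            # global collection cache
        self.batch_size = batch_size           # hypothetical batch size
        self.lock = threading.Lock()           # "control of the data scheduling unit"

    def refill(self, transaction_cache):
        with self.lock:
            # Global cache empty: load one batch from the database first.
            if not self.global_cache:
                for _ in range(min(self.batch_size, len(self.database))):
                    self.global_cache.append(self.database.popleft())
            # Move up to one batch from the global cache to the transaction's cache.
            for _ in range(min(self.batch_size, len(self.global_cache))):
                transaction_cache.append(self.global_cache.popleft())
        return transaction_cache


scheduler = Scheduler(["a.example", "b.example", "c.example"])
local_cache = deque()
scheduler.refill(local_cache)
print(list(local_cache))
```

Keeping one lock around both steps means only one transaction at a time touches the global cache, which matches the single-instance scheduler described above.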
For a single transaction, the collection flow is as follows:
1. The transaction obtains a non-empty collection transaction object from the transaction queue. If an empty transaction object is obtained, transaction scheduling is performed.
2. Determine whether the depth of the collection transaction exceeds the maximum depth: the transaction reads, from the current collection transaction object, the collection depth at which that object sits. If the collection depth exceeds the site collection depth configured in the system, collection for the current transaction ends. If it does not, the transaction proceeds to step 3.
3. Determine the type of the collection transaction: if it is a web page collection transaction, perform step 4; otherwise perform step 5.
4. Determine whether the link is a new web page or an unfinished web page link. If the address accessed by this collection transaction is not in the crawl history store, collect it as a newly discovered web page (step 7). If it is in the crawl history store, obtain from the store the most recent collection information for this address: access address, access time, page size, update frequency, and root domain name. Compute whether the interval between the last access time and the current time exceeds the update frequency. If it does not, no collection is performed and the flow ends. If it does, compare the current content size of the page with its size at the last visit; if they are equal, do not collect; if they differ, continue with step 6.
5. If the link is a media or file link, perform the corresponding file collection processing; if it is an illegal link, record it as an abnormal link.
6. Fetch the page source of this web page link, update the collection information for this address in the crawl history store, and perform step 8.
7. Collect the new web page: fetch the page source of this link and add an access record for this address to the crawl history store.
8. Perform page cleaning and extraction. This step extracts the specified feature information from the page source: it removes useless information and noise data from the page source, then extracts the required information from the cleaned data. Further, according to a text similarity algorithm, the title, keywords, and description text defined in the page, together with the headings, body text, links, and media resources in the page, are extracted for ranking. During cleaning, the collection program first obtains the page code of the current transaction, then starts and initializes the cleaner. It removes the style code and comment code from the page code; it removes the script code and, at the same time, uses the script code to identify whether the page contains media files, saving any that are found. The extracted information is compressed, converted into a form that is easy to store, and stored.
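The branching in steps 2 to 7 can be summarized as a small decision function. The sketch below is an illustration only: the names `HistoryRecord` and `decide`, the string action labels, and the representation of the update frequency as a number of seconds are assumptions, and how the current page size is obtained (for instance from a response header) is outside the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryRecord:
    url: str
    access_time: float      # time of the last visit, in seconds
    page_size: int          # page content size at the last visit
    update_interval: float  # "update frequency", assumed here to be seconds between refreshes


def decide(kind: str, depth: int, max_depth: int,
           record: Optional[HistoryRecord], now: float, current_size: int) -> str:
    """Return the action for one collection transaction, following steps 2 to 7."""
    if depth > max_depth:
        return "stop"            # step 2: configured depth exceeded
    if kind == "media":
        return "collect_file"    # step 5: media or file link
    if kind != "page":
        return "log_bad_link"    # step 5: illegal link
    if record is None:
        return "collect_new"     # step 4 to step 7: address not in the history store
    if now - record.access_time <= record.update_interval:
        return "skip"            # step 4: not yet due for a refresh
    if current_size == record.page_size:
        return "skip"            # step 4: size unchanged, treated as unmodified
    return "update"              # step 6: re-fetch and update the history record


rec = HistoryRecord("http://example.com/", access_time=0.0, page_size=100, update_interval=60.0)
print(decide("page", 1, 3, rec, now=120.0, current_size=200))
```

Comparing sizes is a cheap change test; a page whose content changes without changing size would be skipped under this rule, which is a trade-off inherent in the described method.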
To handle repeated links within pages, each transaction is given its own deduplication container, which stores only the mapping codes of the link addresses that the transaction itself has visited. The container only needs to record links visited under the same root domain; web page addresses that do not belong to the current root domain are discarded. When the transaction starts collecting another root site, the container's access history is cleared and access records for the new root site are recorded from scratch. The collector sets a threshold on the crawl depth for a site, so that while each transaction runs, the memory actually occupied by its deduplication container is bounded by the crawl depth threshold.
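A deduplication container of this kind might look like the sketch below. It assumes that the "mapping code" is a hash of the URL (SHA-256 is used here purely as an example) and that membership in a root domain is decided by a simple hostname suffix test; neither detail is given in the text, and the class and method names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


class DedupContainer:
    """Per-transaction record of visited links under one root domain."""

    def __init__(self):
        self.root = None
        self.seen = set()

    def start_site(self, root_domain):
        # Switching to another root site clears the access history.
        self.root = root_domain
        self.seen.clear()

    def should_visit(self, url):
        host = urlsplit(url).hostname or ""
        if host != self.root and not host.endswith("." + self.root):
            return False  # not under the current root domain: discard
        key = hashlib.sha256(url.encode("utf-8")).digest()  # assumed mapping code
        if key in self.seen:
            return False  # already visited by this transaction
        self.seen.add(key)
        return True


container = DedupContainer()
container.start_site("example.com")
print(container.should_visit("http://www.example.com/a"))
print(container.should_visit("http://www.example.com/a"))
```

Storing fixed-size digests rather than full URLs keeps the per-link memory cost constant, which fits the stated goal of bounding the container's memory.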
To reduce transaction waiting time, the present invention uses a multi-layer cache structure, with the size of each layer configured according to the computer's memory. The first layer is the global crawl cache: when connecting to the crawl database, a batch of crawl sites is fetched in a single access and cached. The second layer is each transaction's own crawl cache: every transaction has its own cache region for its collection data source. The third layer caches the data a transaction produces during processing, including the web page and media link addresses already visited, which are cached for the link deduplication check. The last layer caches data waiting to be saved: only once the amount of such data reaches a threshold does the transaction write it to storage.
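The last cache layer, which batches records before writing them to storage, can be sketched as follows. The class name `WriteBuffer`, the `flush_threshold` parameter, and the `sink` callable that stands in for the database write are assumptions for illustration.

```python
class WriteBuffer:
    """Holds records in memory and writes them out in batches."""

    def __init__(self, flush_threshold, sink):
        self.flush_threshold = flush_threshold  # hypothetical batch size
        self.sink = sink                        # callable that receives a list of records
        self.pending = []

    def add(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write whatever is pending, if anything, and empty the buffer.
        if self.pending:
            self.sink(list(self.pending))
            self.pending.clear()


batches = []
buffer = WriteBuffer(3, batches.append)
for i in range(7):
    buffer.add(i)
buffer.flush()  # write the remainder at shutdown
print(batches)
```

A final explicit `flush()` is needed when the transaction ends, otherwise records below the threshold would never reach storage.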
The analysis module of the search engine analyzes the basic text and media data returned by collection and builds an index over keywords to support the retrieval system. When extracting keywords from text, the analysis module processes the text sequentially. A continuous run of Arabic or Chinese numerals is treated as a single keyword. English letters are split wherever a non-letter character such as a space is encountered. Sentences made up of Chinese characters are handled in the following order: (1) if the characters are a continuous run of Chinese numerals, they are kept together; (2) if three consecutive Chinese characters are each independent and form no phrase, the three characters are grouped as one new phrase; (3) if neither case (1) nor case (2) applies, the text is segmented with a dictionary-based algorithm. Any other computer symbols that are not numerals, English letters, or simplified Chinese characters are treated as special characters, each of which is its own keyword.
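The character-class rules above can be approximated with a single regular expression, as in the sketch below. It covers only the coarse split into numeral runs, letter runs, Chinese text runs, and special characters; the three-character grouping rule and the dictionary-based segmentation are not implemented here. The list of Chinese numeral characters and the `\u4e00-\u9fff` range used for Chinese text are assumptions.

```python
import re

# Chinese numeral characters treated as digits (an assumed, non-exhaustive list).
CN_NUM = "零一二三四五六七八九十百千万亿"

TOKEN = re.compile(
    "[0-9]+"                                # run of Arabic digits: one keyword
    f"|[{CN_NUM}]+"                         # run of Chinese numerals: one keyword
    "|[A-Za-z]+"                            # run of English letters, split at any non-letter
    f"|(?:(?![{CN_NUM}])[\u4e00-\u9fff])+"  # other Chinese text, left for dictionary segmentation
    r"|\S"                                  # any other symbol: one keyword per character
)


def coarse_tokens(text):
    """Split text into coarse tokens; whitespace is dropped."""
    return TOKEN.findall(text)


print(coarse_tokens("2017年Web crawler共三百页!"))
```

The runs of Chinese text returned by this pass would then be handed to the dictionary-based step for finer segmentation.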
Keyword extraction from text further comprises:
Load the Chinese phrase dictionary. Obtain the text to be segmented from the segmentation object. Obtain the current analysis position and character, and check whether the analysis position is the end of the text. If it is the last position, append a separator and then the last character of the text to the already segmented text to form the new segmented text; segmentation is then complete. Otherwise, starting from the analysis position, find the position of the next separator. Once it is found, take the characters between the analysis position and the separator position, and append the separator followed by those characters to the already segmented text.
The basic Chinese dictionary is stored in a two-level hash table. The first level uses the first character of each phrase as the key and another hash table as the value; that second-level table holds what remains of the phrases beginning with the key after their first character is removed. The second-level table uses the second character of the phrase as the key and a linked array as the value. The linked array stores, for all dictionary phrases sharing the same first two characters, the differing character sequences that begin at the third character.
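The two-level structure can be sketched with nested Python dictionaries, using a list in place of the linked array. The function names `build_dictionary` and `contains` are hypothetical, and single-character entries are ignored in this sketch because the described layout keys on the first two characters.

```python
from collections import defaultdict


def build_dictionary(phrases):
    """Map first character -> second character -> list of remaining suffixes."""
    index = defaultdict(lambda: defaultdict(list))
    for phrase in phrases:
        if len(phrase) >= 2:
            index[phrase[0]][phrase[1]].append(phrase[2:])
    return index


def contains(index, phrase):
    """Check whether a phrase of two or more characters is in the dictionary."""
    if len(phrase) < 2:
        return False
    second_level = index.get(phrase[0])
    if second_level is None:
        return False
    return phrase[2:] in second_level.get(phrase[1], [])


index = build_dictionary(["搜索引擎", "搜索", "数据库", "数据采集"])
print(index["数"]["据"])
print(contains(index, "搜索引擎"), contains(index, "搜索引"))
```

A two-character phrase is stored with an empty suffix, so the same lookup path serves phrases of every length.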
Text is segmented against the dictionary by checking whether a candidate phrase exists in it. If the phrase exists in the dictionary, matching continues; if it does not, matching ends and the text is split at that point.
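One common way to realize dictionary-based segmentation is forward maximum matching, sketched below. Note that this is a related standard technique rather than a transcription of the procedure described: it tries the longest candidate first instead of extending a match character by character. The `max_len` parameter and the use of a plain set as the dictionary are assumptions.

```python
def segment(text, words, max_len=4):
    """Greedy forward maximum matching against a set of dictionary words."""
    result = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character when nothing matches
        # Try the longest candidate first, down to two characters.
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in words:
                match = text[i:j]
                break
        result.append(match)
        i += len(match)
    return result


words = {"搜索", "搜索引擎", "采集", "网页"}
print(segment("搜索引擎采集网页", words))
```

Preferring the longest match means "搜索引擎" is kept whole even though its prefix "搜索" is also a dictionary word.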
Further, after a search returns a set of web pages, feature-based similarity graphs are constructed separately from the content features of the searches and of the web pages, and a search-page bipartite graph is built from the interest relationships between searches and pages. Given category labels for a small number of searches and pages, the categories of the unlabeled searches and pages are then predicted.
A graph is built first. Nodes represent the sample data, edges represent the relationships between samples derived from the data and their connections, and edge weights represent how closely two samples are related. A search engine containing several kinds of objects can be expressed as $G = (V, E)$, where $V = Q \cup D$ is the set of vertices of different types and $E$ is the set of edges connecting them. $Q$ is the set of searches and $D$ is the set of web pages. $E = E_{QQ} \cup E_{QD} \cup E_{DD}$, where $E_{QQ} = Q \times Q$, $E_{QD} = Q \times D$, and $E_{DD} = D \times D$. Let $G_Q = (Q, E_{QQ})$, $G_{QD} = (Q, D, E_{QD})$, and $G_D = (D, E_{DD})$; then $G = G_Q \cup G_{QD} \cup G_D$. Here $G_Q$ is the subgraph built from search nodes, $G_D$ is the subgraph built from web page nodes, and $G_{QD}$ is the subgraph built from search nodes and web page nodes according to their interest relationships.
The weight $w_{ij}$ of an edge is defined from the following distance function between nodes:
$$w_{ij} = \exp\left(-\frac{d(x_i, x_j)}{2\sigma^2}\right)$$
where $x_i$ and $x_j$ are nodes that are among each other's $k$ nearest neighbors, $d(x_i, x_j)$ is the distance function $\lVert x_i - x_j \rVert$, and $\sigma$ is a tuning parameter. For text, the measure is computed from the cosine of the angle between the two text vectors $x_i$ and $x_j$.
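The weight formula can be evaluated as in the sketch below. Two distance functions are provided: the Euclidean norm given in the definition, and a cosine-based distance (one minus the cosine of the angle), which is an assumed way of turning the cosine measure mentioned for text into a distance. The restriction to mutual $k$ nearest neighbors is not implemented here.

```python
import math


def euclidean_distance(x, y):
    """d(x, y) = ||x - y||, as in the definition."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """1 - cos(angle between x and y); an assumed choice for text vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0


def edge_weight(x, y, sigma=1.0, distance=euclidean_distance):
    """w = exp(-d(x, y) / (2 * sigma^2))."""
    return math.exp(-distance(x, y) / (2 * sigma ** 2))


print(edge_weight([1, 0], [1, 0]))
print(edge_weight([1, 0], [0, 1], distance=cosine_distance))
```

Identical vectors have distance zero and therefore weight 1, and the weight decays toward 0 as the distance grows, with `sigma` controlling how fast.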
The original graph structure is modified using the discriminative information of the samples labeled in advance:
1. Construct $G_Q$, $G_D$, and $G_{QD}$. Compute $w_q$, the average weight of the edges between all nodes of $G_Q$, and $w_d$, the average weight of the edges between all nodes of $G_D$.
2. Suppose the labeled search data are divided into $c$ classes, expressed as $P_q = \{P_q^1, P_q^2, \ldots, P_q^c\}$, where $P_q^i$ is the core set of labeled searches of class $i$. Let $M_q^i$ be the pairwise constraint set for search class $i$: if $x \in P_q^i$ and $y \in P_q^i$, add $(x, y)$ to $M_q^i$. Likewise, suppose the labeled web page data are divided into $c$ classes, expressed as $P_d = \{P_d^1, P_d^2, \ldots, P_d^c\}$, where $P_d^i$ is the core set of labeled web pages of class $i$. Let $M_d^i$ be the pairwise constraint set for web page class $i$: if $x \in P_d^i$ and $y \in P_d^i$, add $(x, y)$ to $M_d^i$.
3. If a search pair $(q_l, q_m) \in M_q^i$, $q_k$ is a neighbor of $q_l$ and $q_m$, and $w_{lk} < w_q$ and $w_{mk} > w_q$, then add $q_k$ to $P_q^i$ and add $(q_l, q_k)$ and $(q_m, q_k)$ to $M_q^i$.
4. Repeat step 3 until $P_q$ no longer changes.
5. If a web page pair $(d_l, d_m) \in M_d^i$, $d_k$ is a neighbor of $d_l$ and $d_m$, and $w_{lk} < w_d$ and $w_{mk} > w_d$, then add $d_k$ to $P_d^i$ and add $(d_l, d_k)$ and $(d_m, d_k)$ to $M_d^i$.
6. Repeat step 5 until $P_d$ no longer changes.
7. If a search pair $(q_l, q_m) \in M_q^i$, modify $G_Q$ so that $w_{lm} = 1$; if $q_l \in P_q^i$, modify $G_Q$ so that $w_{lm} = 0$. If a web page pair $(d_l, d_m) \in M_d^i$, modify $G_D$ so that $w_{lm} = 1$; if $d_l \in P_d^i$, modify $G_D$ so that $w_{lm} = 0$.
8. If search $q_l \in P_q^j$ and web page $d_m \in P_d^j$, modify $G_{QD}$ so that $w_{lm} = 1$.
Here $w_{lk}$, $w_{mk}$, and $w_{lm}$ are the weights of searches $q_l$, $q_m$, and $q_l$ with respect to web pages $d_k$, $d_k$, and $d_m$, respectively.
The above process enriches the relationships between searches and web pages, making the connections between nodes of the same class tighter and those between nodes of different classes looser, so that classification can be performed better using the following procedure:
1. For classes $j \in \{1, \ldots, c\}$ and nodes $i \in \{1, \ldots, n\}$ of the subgraphs above, construct the $n \times c$ initial label matrix $Y$.
2. Construct the adjacency matrices $W_{QQ}$ and $W_{DD}$ of the homogeneous networks from the similarity measure between nodes of the same type, and construct the adjacency matrix $W_{QD}$ and its transpose $W_{DQ}$ from the relationships between nodes of different types.
3. Construct the matrix $S$ [the defining formula is missing from the source text].
4. Take $F(0) = Y$ and iterate $F(t+1) = \mu_\alpha S F(t) + (1 - \mu_\alpha) Y$, where $\mu_\alpha$ is a parameter in $(0, 1)$.
5. Let $F^*$ be the limit of the sequence $\{F(t)\}$; each node $v_i$ of the graph $G$ is then labeled with the class $y_i = \arg\max_{j \le c} F^*_{ij}$.
In this iterative process, each node of the graph repeatedly propagates its label information to its neighbors until a stable state is reached.
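The iteration can be sketched in plain Python as follows. Because the definition of $S$ is not available in this text, the sketch assumes the symmetric normalization $S = D^{-1/2} W D^{-1/2}$ (with $D$ the diagonal degree matrix), which is a common choice for this kind of label propagation but is not confirmed by the source. The function name, the fixed iteration count, and the toy graph are likewise illustrative.

```python
def propagate(W, Y, mu=0.5, iterations=200):
    """Iterate F(t+1) = mu * S * F(t) + (1 - mu) * Y and return a class index per node.

    W is an n x n symmetric weight matrix and Y an n x c initial label matrix.
    S = D^{-1/2} W D^{-1/2} is an assumption; the source does not define S.
    """
    n, c = len(Y), len(Y[0])
    degree = [sum(row) for row in W]
    S = [[W[i][j] / (degree[i] * degree[j]) ** 0.5 if degree[i] and degree[j] else 0.0
          for j in range(n)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iterations):
        F = [[mu * sum(S[i][k] * F[k][j] for k in range(n)) + (1 - mu) * Y[i][j]
              for j in range(c)] for i in range(n)]
    # Label each node with the class that has the largest score.
    return [max(range(c), key=lambda j: F[i][j]) for i in range(n)]


# Toy graph: nodes 0-1 and 2-3 are strongly linked, 1-2 weakly linked.
W = [[0, 1, 0, 0],
     [1, 0, 0.1, 0],
     [0, 0.1, 0, 1],
     [0, 0, 1, 0]]
Y = [[1, 0], [0, 0], [0, 0], [0, 1]]  # only nodes 0 and 3 are labeled
print(propagate(W, Y))
```

In the toy graph the two unlabeled nodes take the label of the labeled node they are strongly connected to, which is the behavior the propagation step is meant to produce.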
In summary, the present invention proposes a general-purpose Internet data collection method that uses a transaction control strategy to collect data efficiently and mines data based on the coupling relationships among multi-dimensional objects.
Those skilled in the art will appreciate that the modules and steps of the invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. The invention is therefore not restricted to any specific combination of hardware and software.
It should be understood that the embodiments described above serve only to illustrate or explain the principle of the invention and do not limit it. Any modification, equivalent substitution, or improvement made without departing from the spirit and scope of the invention therefore falls within its scope of protection. In addition, the appended claims are intended to cover all changes and modifications that fall within the scope and boundary of the claims or the equivalents of that scope and boundary.

Claims (1)

1. A general-purpose Internet data collection method, in which a single transaction of a search engine collection module collects basic data from websites, characterized by comprising:
Step 1. The transaction obtains a non-empty collection transaction object from the transaction queue; if an empty transaction object is obtained, transaction scheduling is performed;
Step 2. Determine whether the depth of the collection transaction exceeds the maximum depth: the transaction reads, from the current collection transaction object, the collection depth at which that object sits; if the collection depth does not exceed the site collection depth configured in the system, the transaction proceeds to step 3;
Step 3. Determine the type of the collection transaction: if it is a web page collection transaction, perform step 4; otherwise perform step 5;
Step 4. Determine whether the link is a new web page or an unfinished web page link; if the address accessed by this collection transaction is not in the crawl history store, collect it as a newly discovered web page (step 7); if it is in the crawl history store, obtain from the store the most recent collection information for this address, namely the access address, access time, page size, update frequency, and root domain name; compute whether the interval between the last access time and the current time exceeds the update frequency; if it does, compare the current content size of the page with its size at the last visit; if they are equal, do not collect; if they differ, continue with step 6;
Step 5. If the link is a media or file link, perform the corresponding file collection processing; if it is an illegal link, record it as an abnormal link;
Step 6. Fetch the page source of this web page link, update the collection information for this address in the crawl history store, and perform step 8;
Step 7. Collect the new web page: fetch the page source of this link and add an access record for this address to the crawl history store;
Step 8. Perform page cleaning and extraction, which extracts the specified feature information from the page source by removing useless information and noise data from the page source and then extracting the required information from the cleaned data.
CN201710433582.6A 2017-06-09 2017-06-09 Common interconnection network collecting method Pending CN107239558A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710433582.6A CN107239558A (en) 2017-06-09 2017-06-09 Common interconnection network collecting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710433582.6A CN107239558A (en) 2017-06-09 2017-06-09 Common interconnection network collecting method

Publications (1)

Publication Number Publication Date
CN107239558A true CN107239558A (en) 2017-10-10

Family

ID=59986106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710433582.6A Pending CN107239558A (en) 2017-06-09 2017-06-09 Common interconnection network collecting method

Country Status (1)

Country Link
CN (1) CN107239558A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753169A (en) * 2020-06-29 2020-10-09 金电联行(北京)信息技术有限公司 Data acquisition system based on internet
CN113535568A (en) * 2021-07-22 2021-10-22 工银科技有限公司 Verification method, device, equipment and medium for application deployment version
CN114925259A (en) * 2022-04-20 2022-08-19 北京网景盛世技术开发中心 Information acquisition and extraction method and system based on government portal and new media
CN116361362A (en) * 2023-05-30 2023-06-30 江西顶易科技发展有限公司 User information mining method and system based on webpage content identification

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090204610A1 (en) * 2008-02-11 2009-08-13 Hellstrom Benjamin J Deep web miner
CN104376063A (en) * 2014-11-11 2015-02-25 南京邮电大学 Multithreading web crawler method based on sort management and real-time information updating system
CN106294402A (en) * 2015-05-21 2017-01-04 阿里巴巴集团控股有限公司 The data search method of a kind of heterogeneous data source and device thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周庭安: "Research and Implementation of a Distributed Search Engine", China Master's Theses Full-text Database *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753169A (en) * 2020-06-29 2020-10-09 金电联行(北京)信息技术有限公司 Data acquisition system based on internet
CN113535568A (en) * 2021-07-22 2021-10-22 工银科技有限公司 Verification method, device, equipment and medium for application deployment version
CN113535568B (en) * 2021-07-22 2023-09-05 工银科技有限公司 Verification method, device, equipment and medium for application deployment version
CN114925259A (en) * 2022-04-20 2022-08-19 北京网景盛世技术开发中心 Information acquisition and extraction method and system based on government portal and new media
CN116361362A (en) * 2023-05-30 2023-06-30 江西顶易科技发展有限公司 User information mining method and system based on webpage content identification
CN116361362B (en) * 2023-05-30 2023-08-11 江西顶易科技发展有限公司 User information mining method and system based on webpage content identification

Similar Documents

Publication Publication Date Title
JP5338238B2 (en) Automatic ontology generation using word similarity
CN102053991B (en) Method and system for multi-language document retrieval
Liu et al. Identifying web spam with the wisdom of the crowds
Fujimura et al. Topigraphy: visualization for large-scale tag clouds
CN110543595B (en) In-station searching system and method
CN107239558A (en) Common interconnection network collecting method
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN104182482B (en) A kind of news list page determination methods and the method for screening news list page
CN107506472B (en) Method for classifying browsed webpages of students
CN110990676A (en) Social media hotspot topic extraction method and system
CN111611464A (en) Big data-based public opinion monitoring platform
JP4769151B2 (en) Document set analysis apparatus, document set analysis method, program implementing the method, and recording medium storing the program
CN110334343A (en) The method and system that individual privacy information extracts in a kind of contract
CN109284441B (en) Dynamic self-adaptive network sensitive information detection method and device
CN113343012A (en) News matching method, device, equipment and storage medium
Huang et al. Design a batched information retrieval system based on a concept-lattice-like structure
Li et al. Geospatial data mining on the web: Discovering locations of emergency service facilities
CN107133366A (en) Data based on meta-search engine find method
CN107220367A (en) Internet data full-text search method
JP5321258B2 (en) Information collecting system, information collecting method and program thereof
Hardik et al. Link analysis of Wikipedia documents using mapreduce
CN107169065B (en) Method and device for removing specific content
CN114707003A (en) Method, equipment and storage medium for dissimilarity of names of thesis authors
JP2010244341A (en) Attribute expression acquisition method, device, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171010
