CN107239558A - General Internet data collection method - Google Patents
General Internet data collection method
- Publication number
- CN107239558A CN107239558A CN201710433582.6A CN201710433582A CN107239558A CN 107239558 A CN107239558 A CN 107239558A CN 201710433582 A CN201710433582 A CN 201710433582A CN 107239558 A CN107239558 A CN 107239558A
- Authority
- CN
- China
- Prior art keywords
- transaction
- collection
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a general Internet data collection method. The method includes: performing transaction scheduling and judging the type of each collection transaction; if the transaction is a media or file link, performing the corresponding file collection process; if it is a web page collection transaction whose access address is not in the crawl history database, collecting it as a newly discovered page; if the transaction is in the crawl history database, obtaining the last collection record of the page address from that database; if the elapsed interval exceeds the update frequency, comparing the current content size of the page address with the last recorded content size and, if they differ, obtaining the page source code of the link, updating the collection record of the page address in the history access database, and performing page cleaning and extraction. The invention thus performs efficient data collection using a transaction control strategy and carries out data mining over the coupling relations among multi-dimensional objects.
Description
Technical field
The present invention relates to data retrieval, and in particular to a general Internet data collection method.
Background technology
With the continuing development of Web technologies, network information resources are growing at a geometric rate, and how to quickly retrieve the data relevant to a user from the massive information on the Internet has become an urgent problem. Search engines grew up on the basis of information retrieval technology: a search engine helps to express and store essential information about the real world, and by analyzing the link information it holds, hidden information can be mined, making it a useful tool. Existing search engines rely solely on limited search terms to express a user's need, which is an incomplete form of expression; even for the same search term, different users may expect different results. Take a microblog system as an example: if the relations between microblogs and the objects they interact with are considered, the system can be abstracted as a heterogeneous network containing nodes such as microblogs, messages, tags and users. Follow and follower relations exist between microblogs, publishing and forwarding relations exist between microblogs and messages, an inclusion relation exists between microblogs and tags, and an ownership relation exists between users and microblogs. Existing search tools do not perform data mining over the complex environment formed by such multi-dimensional objects.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a general Internet data collection method, including:
Step 1. A transaction obtains a non-empty collection transaction object from the transaction queue; if an empty transaction object is obtained, transaction scheduling is performed.
Step 2. Judge whether the depth of the collection transaction exceeds the maximum depth: the transaction obtains, from the current collection transaction object, the collection depth at which the object sits; if the collection depth does not exceed the site collection depth configured in the system, the transaction continues with step 3.
Step 3. Judge the type of the collection transaction; if it is a web page collection transaction, perform step 4; if it is not a web page collection transaction, perform step 5.
Step 4. Judge whether the link is a new or unfinished web page link. If the address accessed by this collection transaction is not in the crawl history database, collect it as a newly discovered page, i.e. step 7. If the transaction is in the crawl history database, obtain the last collection record of the page address from the database, i.e. the access address, access time, page size, update frequency and root domain name. Compute whether the interval between the last access time and the current access time exceeds the update frequency; if it does, compare the current content size of the page address with the last recorded content size: if they are equal, the page is not collected; if they differ, continue with step 6.
Step 5. If the link is a media or file link, perform the corresponding file collection process; if it is an illegal link, record it as abnormal.
Step 6. Obtain the page source code of the link, update the collection record of the page address in the history access database, and perform step 8.
Step 7. Collect the new-task web page, obtain the page source code of the link, and add an access record for the page address in the history access database.
Step 8. Perform page cleaning and extraction. The cleaning and extraction step extracts the specified feature information from the page source code, removes the useless or noisy data in the source code, and then extracts the required information from the cleaned data.
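The eight steps above can be sketched as a single dispatch function. The sketch below is illustrative only: the transaction object, the history database and the fetch routine are stand-ins (a dict, a `Record` dataclass and a callable) for the components the method describes, and the size comparison is done after fetching rather than through a separate size probe.

```python
from dataclasses import dataclass

@dataclass
class Record:
    last_visit: float       # last access time
    page_size: int          # content size recorded at last collection
    update_interval: float  # update frequency, in seconds

def process(txn, history, fetch, now, max_depth=5):
    """Sketch of the collection flow; returns the action taken."""
    if txn["depth"] > max_depth:                     # step 2: depth check
        return "depth-exceeded"
    if txn["kind"] != "webpage":                     # step 3: transaction type
        # step 5: media/file links get file collection, others are abnormal
        return "file-collected" if txn["kind"] in ("media", "file") else "abnormal-link"
    rec = history.get(txn["url"])                    # step 4: known address?
    if rec is None:                                  # step 7: newly discovered page
        source = fetch(txn["url"])
        history[txn["url"]] = Record(now, len(source), 3600.0)
        return "collected-new"
    if now - rec.last_visit <= rec.update_interval:  # within the update frequency
        return "not-due"
    source = fetch(txn["url"])                       # step 6: re-collect the page
    if len(source) == rec.page_size:                 # sizes equal: skip
        return "unchanged"
    history[txn["url"]] = Record(now, len(source), rec.update_interval)
    return "updated"                                 # step 8 (cleaning) would follow
```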
Compared with the prior art, the present invention has the following advantage: it proposes a general Internet data collection method that performs efficient data collection using a transaction control strategy and carries out data mining over the coupling relations among multi-dimensional objects.
Brief description of the drawings
Fig. 1 is a flow chart of the general Internet data collection method according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below, together with the accompanying drawing illustrating the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment; its scope is limited only by the claims, and it covers many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may be practiced according to the claims without some or all of these details.
One aspect of the present invention provides a general Internet data collection method. Fig. 1 is a flow chart of the method according to an embodiment of the present invention.
The search engine of the present invention can be divided into a collection module and an analysis module. The collection module includes a crawl site database, a crawl site scheduling unit, a transaction management controller, transaction containers, a data controller and a base database. The transaction management controller is responsible for the creation, startup, run control and destruction of multiple crawl transactions. Each transaction has its own independent container for managing transaction resources, specifically including: a crawl site cache unit, which maintains an in-memory queue caching the site data the transaction is to crawl; a transaction cache unit, which caches the data of the transaction itself; a storage cache unit, which caches the related data to be stored to the database; a collection transaction processing unit, which loads the collected data and performs data updating, link de-duplication and storage processing; a data cleaning and extraction unit, which cleans the collected code and extracts effective information — the title, keywords, summary, body text, font types and media feature information appearing in the page, the information used to grade page quality, and the new crawl sites found in the page; and a data storage analysis unit, which converts the cleaned, extracted data to an easily stored form, compresses it and composes the pending database search strings. The data controller handles the data exchange between the program and the database and includes a global data source cache unit, a data scheduling unit and a data access management unit. The global crawl cache unit handles the waiting of multiple transactions when accessing critical resources, reducing the number of database accesses made by the transactions; each web crawler has exactly one global crawl cache unit instance. The data access management unit handles the data interaction between the database and the program. The data scheduling unit schedules the crawling of a single transaction: when a transaction's crawl cache unit has no sites left to crawl, the data scheduling unit obtains some sites to crawl from the global crawl cache unit into the transaction's crawl cache unit. There is exactly one data scheduling unit instance in the whole program.
When the web crawler of the search engine runs, it first reads the program configuration file and preloads into the cache the data to be used during collection. The transaction manager initializes each transaction according to the configuration information and controls its operation. A transaction processing a collection task first performs a de-duplication check on the link to be crawled, analyzes the type of the link, and applies a different handling mode to each crawl type; during collection it determines whether the task is a new collection transaction or an update task, and after the page source code of the link has been obtained, it cleans and filters the collected source code and extracts effective information according to page-feature rules. The transaction converts the extracted information and caches it; when the cached data to be saved reaches a certain amount, the transaction performs a batch load of the cached data. Meanwhile, the transaction manager periodically monitors the execution state of each transaction and manages abnormal transactions.
Before crawling, all possible combination domain names are traversed in turn according to domain-name generation rules, each combination domain name is probed in turn, effective and invalid domain names are identified, and a root domain name database is established. Then the page source code of navigation sites is obtained, root site addresses and link texts are extracted from the source code according to root-domain composition rules, and the root domain name database is updated.
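A minimal sketch of this pre-crawl step follows, assuming a `probe` callable that stands in for the real domain test (for example a DNS lookup) and restricting the generation rule to short all-letter names:

```python
from itertools import product
from string import ascii_lowercase

def candidate_domains(tld=".com", max_len=2):
    """Enumerate all letter combinations up to max_len as candidate root domains."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters) + tld

def build_root_domain_db(probe, tld=".com", max_len=2):
    """Probe each candidate in turn and keep the effective domains.
    `probe` is a stand-in for a real reachability test."""
    return [d for d in candidate_domains(tld, max_len) if probe(d)]
```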
Each web crawler has exactly one global collection transaction scheduling unit. When a transaction's crawl site cache is empty, the transaction requests, or waits for, the data scheduling unit to deliver sites to crawl from the global collection cache or from the database. The scheduling process is specifically: the transaction obtains the control authority of the data scheduling unit; whether the global collection cache is empty is judged; if it is empty, a number of sites to crawl are obtained from the database into the global collection cache; if it is not empty, a number of sites to crawl are obtained from the global cache into the current transaction's crawl cache unit.
For a single transaction, the collection flow is specifically as follows:
1. The transaction obtains a non-empty collection transaction object from the transaction queue. If an empty transaction object is obtained, transaction scheduling is performed.
2. Judge whether the depth of the collection transaction exceeds the maximum depth: the transaction obtains, from the current collection transaction object, the collection depth at which the object sits. If the collection depth exceeds the site collection depth configured in the system, the collection of the current transaction ends; if it does not, the transaction continues with step 3.
3. Judge the type of the collection transaction; if it is a web page collection transaction, perform step 4; if it is not, perform step 5.
4. Judge whether the link is a new or unfinished web page link. If the address accessed by this collection transaction is not in the crawl history database, collect it as a newly discovered page, i.e. step 7. If the transaction is in the crawl history database, obtain the last collection record of the page address from the database: access address, access time, page size, update frequency and root domain name. Compute whether the interval between the last access time and the current access time exceeds the update frequency; if it does not, the page is not collected and collection ends; if it does, compare the current content size of the page address with the last recorded content size: if they are equal, the page is not collected; if they differ, continue with step 6.
5. If the link is a media or file link, perform the corresponding file collection process; if it is an illegal link, record it as abnormal.
6. Obtain the page source code of the link, update the collection record of the page address in the history access database, and perform step 8.
7. Collect the new-task web page, obtain the page source code of the link, and add an access record for the page address in the history access database.
8. Perform page cleaning and extraction. The cleaning and extraction step extracts the specified feature information from the page source code, removes the useless or noisy data in the source code, and then extracts the required information from the cleaned data. Further, according to a text-similarity algorithm, the title, keywords and descriptive text, the headings and body text defined in the page, and the links and media resources in the page are extracted for index processing. During cleaning, the transaction crawl program first obtains the page code of the current transaction, then starts and initializes the cleaner. The style code and comment code in the page code are removed; the script code in the page code is removed, and at the same time whether the current page references media files is recognized from the script code information; if so, they are saved. The extracted information is compressed and converted to a form that is easy to store.
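The cleaning pass described here — dropping style, comment and script code while noticing media references — can be approximated with the standard library's `html.parser`; the sketch below is a simplified stand-in, not the patent's cleaner:

```python
from html.parser import HTMLParser

class PageCleaner(HTMLParser):
    """Strip <style>/<script> content, keep visible text, record media sources.
    Comments are dropped automatically (handle_comment is not overridden)."""
    def __init__(self):
        super().__init__()
        self.text, self.media, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1                       # suppress embedded code
        if tag in ("img", "video", "audio"):      # record media references
            src = dict(attrs).get("src")
            if src:
                self.media.append(src)

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())

def clean(source):
    p = PageCleaner()
    p.feed(source)
    return " ".join(p.text), p.media
```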
For repeated page links, a dedicated de-duplication container is deployed for each transaction; each container stores only the mapping codes of the link addresses the transaction itself has accessed. The de-duplication container only needs to record the links accessed under the same root domain site, and page addresses not belonging to this root domain are discarded. When the transaction starts to collect another root site, the historical access records of the de-duplication container are emptied and the access records of the new root site are recorded afresh. The information collector sets a threshold on the site crawl depth, so that during the run of each transaction, the memory actually occupied by the de-duplication container is controlled by the crawl depth threshold.
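A per-transaction de-duplication container as described could look like the following sketch; the class and method names are illustrative, and Python's built-in `hash` stands in for the "mapping code" of a link:

```python
class DedupContainer:
    """Per-transaction duplicate-link filter scoped to one root domain."""
    def __init__(self, root_domain):
        self.root = root_domain
        self.seen = set()           # mapping codes of visited links under this root

    def switch_root(self, root_domain):
        self.root = root_domain     # new root site: empty the history
        self.seen.clear()

    def should_visit(self, url):
        if self.root not in url:    # not under this root domain: discard
            return False
        key = hash(url)             # stand-in for the link's mapping code
        if key in self.seen:
            return False
        self.seen.add(key)
        return True
```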
To reduce transaction waiting, the present invention uses a multi-layer cache structure, the size of each layer being configured according to the memory size of the computer. The first layer is the global crawl cache: in the access connection to the crawl database, a batch of crawl targets is obtained at once and cached. The second layer is the crawl cache of each transaction itself: each transaction owns its own collection data source cache area. The third layer caches the data generated by a transaction during processing, including the page and media link addresses recorded for the link de-duplication check. The last layer caches the data to be saved: only after the data to be saved reaches a certain amount does the transaction persist it to storage.
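The last cache layer — persisting only once enough data has accumulated — can be sketched as follows (class name and threshold are illustrative):

```python
class SaveBuffer:
    """Sketch of the data-to-save cache layer: flush to storage in batches."""
    def __init__(self, store, threshold=3):
        self.store = store            # callable that persists one batch
        self.threshold = threshold
        self.pending = []

    def add(self, item):
        self.pending.append(item)
        if len(self.pending) >= self.threshold:
            self.store(self.pending)  # one bulk write instead of many small ones
            self.pending = []
```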
The analysis module of the search engine analyzes and processes the collected basic data of the returned texts and media, and builds an index for keywords to facilitate searching by the search system. When the analysis module performs keyword extraction on a text, it extracts keywords in order. For digits and Chinese numerals, a contiguous run is treated as a single keyword. English letters are split when a non-English character such as a space is encountered. Sentences composed of Chinese characters are handled in the following order: 1. if the characters form a contiguous run of Chinese numerals, the run is kept together; 2. if three consecutive Chinese characters are all independent and form no phrase, the three independent characters are treated as a new phrase; 3. if neither case 1 nor case 2 applies, a dictionary-based algorithm is used for segmentation. Other symbols that are neither digits, English letters nor simplified Chinese characters are treated as special characters, each special character being a keyword.
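The digit, letter and special-character rules can be sketched with a single regular expression; the dictionary-based segmentation of Chinese runs (rules 1-3) is left out of this illustration:

```python
import re

def extract_keywords(text):
    # Contiguous digits -> one keyword; contiguous ASCII letters -> one
    # keyword; any other non-space symbol -> one keyword each. (Chinese
    # characters would each match the last alternative here; the document's
    # dictionary segmentation would merge them into phrases instead.)
    return re.findall(r"[0-9]+|[A-Za-z]+|[^\sA-Za-z0-9]", text)
```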
Performing the keyword extraction step on a text further includes: loading the Chinese phrases; obtaining the text to be segmented from the segmentation object; obtaining the analysis position and character and judging whether the analysis position is the end of the text to be segmented — if it is the last position of the text, a segmentation symbol and the last character of the text are appended to the already-segmented text, and segmentation of the text is now complete. Otherwise, starting from the analysis position, the position of the next segmentation symbol is found; once found, the characters between the analysis position and the symbol position are intercepted and, together with the segmentation symbol, appended to the already-segmented text.
The basic Chinese dictionary is stored in a two-level hash table structure. The first level uses the first Chinese character of each phrase as the key and another hash table as the value; that value hash table stores, for the phrases beginning with the key, the remainder of each phrase after its first character is removed. The value structure for a first character uses the second Chinese character of each word as the key and a chained array as the value: the words stored in one chained array of the dictionary share the same first two characters and form different character sequences from the third character onward.
Segmentation of a text is based on the dictionary: a phrase is looked up to see whether it exists in the dictionary. If it exists, the matching process continues; if it is not present, matching ends and a split is made.
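The two-level hash storage can be sketched with nested dicts: the first character keys an inner table whose keys are the second character and whose values are the word tails from the third character on (function names are illustrative):

```python
def build_dictionary(phrases):
    """Two-level hash storage: first char -> second char -> list of tails
    (the rest of each word from the third character on)."""
    dictionary = {}
    for phrase in phrases:
        first, second, tail = phrase[0], phrase[1], phrase[2:]
        dictionary.setdefault(first, {}).setdefault(second, []).append(tail)
    return dictionary

def contains(dictionary, phrase):
    """Dictionary lookup used during segmentation (phrases of length >= 2)."""
    tails = dictionary.get(phrase[0], {}).get(phrase[1])
    return tails is not None and phrase[2:] in tails
```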
Further, after a search obtains a collection of web pages, feature-based similarity graphs are constructed from the content features of the searches and of the pages respectively, while a search-page binary relation graph is built from the interest relations between searches and pages; given a small number of search and page category labels, the categories of the unlabeled searches and pages are predicted.
A graph is first built from the sample data and their relations: nodes represent the sample data, edges represent the relations between samples, and the weight of an edge represents the tightness of the relation between two samples. A search engine containing multiple kinds of objects can be expressed as G = (V, E), where V = Q ∪ D is the set of vertices of different types and E is the set of edges connecting the vertices; Q is the set of searches and D is the set of pages. E = E_QQ ∪ E_QD ∪ E_DD, where E_QQ = Q × Q, E_QD = Q × D and E_DD = D × D. Let G_Q = (Q, E_QQ), G_QD = (Q, D, E_QD) and G_D = (D, E_DD); then G = G_Q ∪ G_QD ∪ G_D, where G_Q is the subgraph built from the search nodes, G_D the subgraph built from the page nodes, and G_QD the subgraph built from the search and page nodes according to the interest relations.
The weight w_ij of an edge is defined from the following distance function between nodes:
w_ij = exp(−d(x_i, x_j) / 2σ²)
where d(x_i, x_j) is the distance function ||x_i − x_j|| (in text computation it is the cosine of the angle between the two text vectors x_i and x_j), σ is a tuning parameter, and x_i and x_j are nodes in each other's k-nearest neighbourhoods.
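For vector-valued nodes the weight formula can be computed directly; the sketch below uses the Euclidean distance ||x_i − x_j|| (the text substitutes the cosine of the angle for text vectors):

```python
import math

def edge_weight(xi, xj, sigma=1.0):
    """w_ij = exp(-d(x_i, x_j) / (2 * sigma**2)) with Euclidean distance."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
    return math.exp(-d / (2 * sigma ** 2))
```

Closer nodes get weights nearer 1, so tightly related samples end up on heavier edges.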
The original graph structure is modified according to the discriminant information of the pre-labeled samples:
1. Construct G_Q, G_D and G_QD. Compute the average weight w_q of the edges between all nodes of G_Q and the average weight w_d of the edges between all nodes of G_D.
2. If the labeled search data are divided into c classes, express them as P_q = {P_q^1, P_q^2, ..., P_q^c}, where P_q^i is the core set of labeled searches of the i-th class. Let M_q^i denote the pairwise constraint set of the i-th search class: if x ∈ P_q^i and y ∈ P_q^i, add (x, y) to M_q^i. Likewise, if the labeled page data are divided into c classes, express them as P_d = {P_d^1, P_d^2, ..., P_d^c}, where P_d^i is the core set of labeled pages of the i-th class, and let M_d^i denote the pairwise constraint set of the i-th page class: if x ∈ P_d^i and y ∈ P_d^i, add (x, y) to M_d^i.
3. If a search sample pair (q_l, q_m) ∈ M_q^i and q_k is a neighbour of both q_l and q_m with w_lk < w_q and w_mk > w_q, add q_k to P_q^i and add (q_l, q_k) and (q_m, q_k) to M_q^i.
4. Repeat step 3 until P_q no longer changes.
5. If a page sample pair (d_l, d_m) ∈ M_d^i and d_k is a neighbour of both d_l and d_m with w_lk < w_d and w_mk > w_d, add d_k to P_d^i and add (d_l, d_k) and (d_m, d_k) to M_d^i.
6. Repeat step 5 until P_d no longer changes.
7. If a search sample pair (q_l, q_m) ∈ M_q^i, modify G_Q so that w_lm = 1; otherwise, if only q_l ∈ P_q^i, modify G_Q so that w_lm = 0. Likewise, if a page sample pair (d_l, d_m) ∈ M_d^i, modify G_D so that w_lm = 1; otherwise, if only d_l ∈ P_d^i, modify G_D so that w_lm = 0.
8. If search q_l ∈ P_q^i and page d_m ∈ P_d^i, modify G_QD so that w_lm = 1.
Here w_lk, w_mk and w_lm denote the edge weights between the corresponding node pairs named in each step.
The above process enriches the relations between searches and pages, making the connections between nodes of the same class tighter and those between nodes of different classes looser, so that classification can then be performed better with the following procedure:
1. For the classes j ∈ {1, ..., c} and the nodes i ∈ {1, ..., n} of the above subgraphs, construct the n × c initial label matrix Y.
2. Construct the adjacency matrices W_QQ and W_DD of the homogeneous networks from the similarity measures between nodes of the same type, and construct the adjacency matrix W_QD (with its transpose W_DQ) from the relations between the heterogeneous nodes.
3. Construct the matrix S = D^(−1/2) W D^(−1/2), where W is the block adjacency matrix composed of W_QQ, W_QD, W_DQ and W_DD, and D is its diagonal degree matrix.
4. Take F(0) = Y and iterate F(t+1) = μ_α S F(t) + (1 − μ_α) Y, where μ_α is a parameter in (0, 1).
5. Let F* be the limit of the sequence {F(t)}; each node v_i of the graph G is then labeled according to y_i = argmax_{j≤c} F*_ij.
In this iterative process, each node of the graph continually broadcasts its label information to its neighbour nodes until the nodes reach a stable state.
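This label-propagation procedure can be sketched with NumPy, using the normalization S = D^(−1/2) W D^(−1/2), which is consistent with the iteration F(t+1) = μ_α S F(t) + (1 − μ_α) Y in the text; here W is taken as the already-assembled block adjacency matrix and the parameter names are illustrative:

```python
import numpy as np

def label_propagation(W, Y, mu=0.9, iters=200):
    """S = D^(-1/2) W D^(-1/2); iterate F <- mu*S*F + (1-mu)*Y; then label
    each node by the class with the largest score."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = mu * S @ F + (1 - mu) * Y   # broadcast labels to neighbours
    return F.argmax(axis=1)
```

On a toy graph of two connected pairs with one labeled node each, the unlabeled nodes inherit their neighbour's class.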
In summary, the present invention proposes a general Internet data collection method that performs efficient data collection using a transaction control strategy and carries out data mining over the coupling relations among multi-dimensional objects.
Obviously, those skilled in the art should appreciate that the modules or steps of the present invention described above may be implemented with a general-purpose computing system; they may be concentrated in a single computing system or distributed over a network formed by multiple computing systems, and may optionally be implemented with program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is thus not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are used only to exemplify or explain the principles of the present invention and are not to be construed as limiting the invention. Any modification, equivalent substitution, improvement and the like made without departing from the spirit and scope of the present invention shall therefore be included within the scope of protection. Moreover, the appended claims are intended to cover all changes and modifications falling within the scope and boundary of the claims, or the equivalents of such scope and boundary.
Claims (1)
1. A general Internet data collection method for collecting basic site data with a single transaction of a search engine collection module, characterized by including:
Step 1. The transaction obtains a non-empty collection transaction object from the transaction queue; if an empty transaction object is obtained, transaction scheduling is performed;
Step 2. Whether the depth of the collection transaction exceeds the maximum depth is judged: the transaction obtains, from the current collection transaction object, the collection depth at which the object sits; if the collection depth does not exceed the site collection depth configured in the system, the transaction continues with step 3;
Step 3. The type of the collection transaction is judged: if it is a web page collection transaction, step 4 is performed; if it is not a web page collection transaction, step 5 is performed;
Step 4. Whether the link is a new or unfinished web page link is judged: if the address accessed by this collection transaction is not in the crawl history database, it is collected as a newly discovered page, i.e. step 7; if the transaction is in the crawl history database, the last collection record of the page address, i.e. the access address, access time, page size, update frequency and root domain name, is obtained from the database; whether the interval between the last access time and the current access time exceeds the update frequency is computed; if it does, the current content size of the page address is compared with the last recorded content size: if they are equal, the page is not collected; if they differ, step 6 is continued;
Step 5. If the link is a media or file link, the corresponding file collection process is performed; if it is an illegal link, it is recorded as abnormal;
Step 6. The page source code of the link is obtained, the collection record of the page address in the history access database is updated, and step 8 is performed;
Step 7. The new-task web page is collected, the page source code of the link is obtained, and an access record for the page address is added in the history access database;
Step 8. Page cleaning and extraction is performed; the cleaning and extraction step extracts the specified feature information from the page source code, removes the useless or noisy data in the source code, and then extracts the required information from the cleaned data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710433582.6A CN107239558A (en) | 2017-06-09 | 2017-06-09 | Common interconnection network collecting method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710433582.6A CN107239558A (en) | 2017-06-09 | 2017-06-09 | Common interconnection network collecting method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107239558A true CN107239558A (en) | 2017-10-10 |
Family
ID=59986106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710433582.6A Pending CN107239558A (en) | 2017-06-09 | 2017-06-09 | Common interconnection network collecting method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107239558A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753169A (en) * | 2020-06-29 | 2020-10-09 | 金电联行(北京)信息技术有限公司 | Data acquisition system based on internet |
CN113535568A (en) * | 2021-07-22 | 2021-10-22 | 工银科技有限公司 | Verification method, device, equipment and medium for application deployment version |
CN114925259A (en) * | 2022-04-20 | 2022-08-19 | 北京网景盛世技术开发中心 | Information acquisition and extraction method and system based on government portal and new media |
CN116361362A (en) * | 2023-05-30 | 2023-06-30 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090204610A1 (en) * | 2008-02-11 | 2009-08-13 | Hellstrom Benjamin J | Deep web miner |
CN104376063A (en) * | 2014-11-11 | 2015-02-25 | 南京邮电大学 | Multithreading web crawler method based on sort management and real-time information updating system |
CN106294402A (en) * | 2015-05-21 | 2017-01-04 | 阿里巴巴集团控股有限公司 | The data search method of a kind of heterogeneous data source and device thereof |
- 2017-06-09: CN CN201710433582.6A patent/CN107239558A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090204610A1 (en) * | 2008-02-11 | 2009-08-13 | Hellstrom Benjamin J | Deep web miner |
CN104376063A (en) * | 2014-11-11 | 2015-02-25 | 南京邮电大学 | Multithreading web crawler method based on sort management and real-time information updating system |
CN106294402A (en) * | 2015-05-21 | 2017-01-04 | 阿里巴巴集团控股有限公司 | The data search method of a kind of heterogeneous data source and device thereof |
Non-Patent Citations (1)
Title |
---|
ZHOU Ting'an: "Research and Implementation of a Distributed Search Engine", China Master's Theses Full-text Database *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753169A (en) * | 2020-06-29 | 2020-10-09 | 金电联行(北京)信息技术有限公司 | Data acquisition system based on internet |
CN113535568A (en) * | 2021-07-22 | 2021-10-22 | 工银科技有限公司 | Verification method, device, equipment and medium for application deployment version |
CN113535568B (en) * | 2021-07-22 | 2023-09-05 | 工银科技有限公司 | Verification method, device, equipment and medium for application deployment version |
CN114925259A (en) * | 2022-04-20 | 2022-08-19 | 北京网景盛世技术开发中心 | Information acquisition and extraction method and system based on government portal and new media |
CN116361362A (en) * | 2023-05-30 | 2023-06-30 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
CN116361362B (en) * | 2023-05-30 | 2023-08-11 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5338238B2 (en) | Automatic ontology generation using word similarity | |
CN102053991B (en) | Method and system for multi-language document retrieval | |
Liu et al. | Identifying web spam with the wisdom of the crowds | |
Fujimura et al. | Topigraphy: visualization for large-scale tag clouds | |
CN110543595B (en) | In-station searching system and method | |
CN107239558A (en) | Common interconnection network collecting method | |
CN106815307A (en) | Public Culture knowledge mapping platform and its use method | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
CN104182482B (en) | A kind of news list page determination methods and the method for screening news list page | |
CN107506472B (en) | Method for classifying browsed webpages of students | |
CN110990676A (en) | Social media hotspot topic extraction method and system | |
CN111611464A (en) | Big data-based public opinion monitoring platform | |
JP4769151B2 (en) | Document set analysis apparatus, document set analysis method, program implementing the method, and recording medium storing the program | |
CN110334343A (en) | The method and system that individual privacy information extracts in a kind of contract | |
CN109284441B (en) | Dynamic self-adaptive network sensitive information detection method and device | |
CN113343012A (en) | News matching method, device, equipment and storage medium | |
Huang et al. | Design a batched information retrieval system based on a concept-lattice-like structure | |
Li et al. | Geospatial data mining on the web: Discovering locations of emergency service facilities | |
CN107133366A (en) | Data based on meta-search engine find method | |
CN107220367A (en) | Internet data full-text search method | |
JP5321258B2 (en) | Information collecting system, information collecting method and program thereof | |
Hardik et al. | Link analysis of Wikipedia documents using mapreduce | |
CN107169065B (en) | Method and device for removing specific content | |
CN114707003A (en) | Method, equipment and storage medium for dissimilarity of names of thesis authors | |
JP2010244341A (en) | Attribute expression acquisition method, device, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20171010 |