CN107239558A

CN107239558A - General-purpose Internet data collection method

Info

Publication number: CN107239558A
Application number: CN201710433582.6A
Authority: CN
Inventors: 张鹏
Original assignee: BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date: 2017-06-09
Filing date: 2017-06-09
Publication date: 2017-10-10

Abstract

The invention provides a general-purpose Internet data collection method. The method includes: performing transaction scheduling and determining the type of the collection transaction; if the target is a media or file link, performing the corresponding file collection processing; if it is a web page collection transaction whose access address is not in the crawl history database, collecting it as a newly discovered web page; if the collection transaction is already in the crawl history database, obtaining the last collection information for this web page address from the crawl history database; if the elapsed interval exceeds the update frequency, comparing the current page content size of the web page address with the previous page content size and, if they differ, fetching the page source of the link, updating the collection information for this web page address in the access history database, and performing web page cleaning and extraction. The method collects data efficiently using a transaction control strategy and mines data from the coupling relationships among multi-dimensional objects.

Description

General-purpose Internet data collection method

Technical field

The present invention relates to data retrieval, and in particular to a general-purpose Internet data collection method.

Background

With the continuing development of Web technology, network information resources are growing at a geometric rate. How to quickly retrieve the useful data relevant to a user from the massive amount of information on the Internet has become a pressing problem. Search engines grew out of information retrieval technology. A search engine helps people better express and store the essential information of the real world, and analysing the link information in a search engine makes it a useful tool for mining hidden information. Existing search engines rely solely on a limited set of search terms to express the user's need, so the need is expressed incompletely. Even for the same search term, different users may expect different results. Take a microblog system as an example: considering the relationships between microblogs and the objects they interact with, it can be abstracted as a heterogeneous network containing nodes such as microblogs, messages, tags and users. There are follow and follower relationships between microblogs, posting and forwarding relationships between microblogs and messages, a containment relationship between microblogs and tags, and an ownership relationship between users and microblogs. Existing search tools do not take the complex environment formed by such multi-dimensional objects into account when mining data.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes a general-purpose Internet data collection method, including:

Step 1. The transaction obtains a non-empty collection transaction object from the transaction queue; if it gets an empty transaction object, transaction scheduling is performed;

Step 2. Determine whether the depth of the collection transaction exceeds the maximum depth; the transaction reads, from the current collection transaction object, the collection depth at which that object lies; if the collection depth does not exceed the site collection depth configured in the system, the transaction continues with step 3;

Step 3. Determine the type of the collection transaction; if it is a web page collection transaction, perform step 4; if it is not a web page collection transaction, perform step 5;

Step 4. Determine whether this is a new web page or an unfinished web page link; if the access address of this collection transaction is not in the crawl history database, collect it as a newly discovered web page, that is, go to step 7; if this collection transaction is in the crawl history database, obtain from the crawl history database the last collection information for this web page address, namely access address, access time, page size, update frequency and root domain name; calculate whether the interval between the last access time and the current access time already exceeds the update frequency; if it does, compare the current page content size of the web page address with the previous page content size; if they are equal, do not collect; if they are not equal, continue with step 6;

Step 5. If the target is a media or file link, perform the corresponding file collection processing; if it is an illegal link, record the link as abnormal;

Step 6. Fetch the page source of this web page link, update the collection information for this web page address in the access history database, and perform step 8;

Step 7. Collect the new-task web page: fetch the page source of this web page link and add an access record for this web page address to the access history database;

Step 8. Perform web page cleaning and extraction. This step extracts the specified feature information from the page source: useless information and noise data are removed from the page source, and the required information is then extracted from the cleaned data.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes a general-purpose Internet data collection method that collects data efficiently using a transaction control strategy and mines data from the coupling relationships among multi-dimensional objects.

Brief description of the drawings

Fig. 1 is a flow chart of the general-purpose Internet data collection method according to an embodiment of the present invention.

Detailed description

A detailed description of one or more embodiments of the invention is provided below, together with the accompanying drawing that illustrates the principle of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications and equivalents. Many specific details are set out in the following description in order to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention can also be practised according to the claims without some or all of them.

One aspect of the present invention provides a general-purpose Internet data collection method. Fig. 1 is a flow chart of the general-purpose Internet data collection method according to an embodiment of the present invention.

The search engine of the present invention can be divided into a collection module and an analysis module. The collection module includes a crawl site database, a crawl site scheduling unit, a transaction management controller, transaction containers, a data controller and a basic database. The transaction management controller is responsible for creating, starting, controlling and destroying the multiple crawl transactions.

Each transaction has its own independent container for managing the transaction's resources. The container includes: a crawl source cache unit, which builds a queue in memory to cache the site data the transaction is to crawl; a transaction cache unit, which caches the transaction's own data; a storage cache unit, which caches the data that is to be stored to the database; a collection transaction processing unit, which implements loading of collected data, data updates, link deduplication and storage processing; a data cleaning and extraction unit, which cleans the collected code and extracts the effective information; and a data storage analysis unit, which converts the cleaned and extracted data into a form that is easy to store, compresses it, and builds the database query strings to be executed. The effective information includes the title, keywords, summary and body text of the web page, and the font types and media feature information that appear in the page. The data cleaning and extraction unit also obtains the information used to evaluate the quality grade of the page and obtains new sites to crawl from the page.

The data controller handles data exchange between the program and the database. It includes a global data source cache unit, a data scheduling unit and a data access management unit. The global crawl cache unit deals with the waiting time of multiple transactions when they access critical resources and reduces the number of database accesses made by those transactions. Each web crawler has exactly one instance of the global crawl cache unit. The data access management unit handles the data interaction between the database and the program. The data scheduling unit schedules the crawl sources of individual transactions: when a transaction's crawl cache unit has nothing left to crawl, the data scheduling unit fetches a number of crawl sources from the global crawl cache unit into the transaction's crawl cache unit. There is only one instance of the data scheduling unit in the whole program.

When the web crawler of the search engine runs, it first reads the program configuration file and preloads into the cache the data that will be used during collection. The transaction manager initializes each transaction according to the configuration information and controls the running of the transactions. When a transaction obtains a task to process, it first performs the deduplication check on the crawl link and analyses the type of the link, applying a different processing mode to each link type. During collection it determines whether the task is a new collection task or an update task. After the page source of the link has been fetched, it cleans and filters the collected source and extracts the effective information according to the feature rules for web page information. The transaction converts the extracted information and caches it. When the cached data waiting to be saved reaches a certain amount, the transaction writes the cached data to storage. Meanwhile, the transaction manager periodically monitors the execution state of each transaction and manages abnormal transactions.

Before crawling, all possible domain name combinations are enumerated in turn according to the domain name generation rules, and each combined domain name is probed in turn to identify valid and invalid domain names, building the root domain name database. The page source of navigation sites is then fetched, root site addresses and link text are extracted from that source according to the root domain name composition rules, and the root domain name database is updated.

Each web crawler has exactly one global collection transaction scheduling unit. When a transaction's crawl source cache is empty, the transaction requests, or waits for, the data scheduling unit to deliver crawl sites from the global collection cache or from the database. The transaction scheduling process is as follows: the transaction obtains control of the data scheduling unit; it checks whether the global collection cache is empty; if it is empty, a batch of crawl sites is fetched from the database and cached in the global collection cache; if it is not empty, a batch of crawl sites is taken from the global cache into the current transaction's crawl cache unit.
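The scheduling step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `CrawlScheduler`, the `load_from_db` callable and the two batch sizes are hypothetical, a lock stands in for "obtaining control of the data scheduling unit", and, as a simplification, a batch is handed out immediately after the global cache has been refilled.

```python
import threading
from collections import deque


class CrawlScheduler:
    """Single global scheduler that hands batches of crawl sites to transactions."""

    def __init__(self, load_from_db, db_batch=100, tx_batch=10):
        # load_from_db is a hypothetical callable: n -> list of at most n sites.
        self._load_from_db = load_from_db
        self._global_cache = deque()      # global collection cache
        self._lock = threading.Lock()     # stands in for scheduler control
        self._db_batch = db_batch
        self._tx_batch = tx_batch

    def refill(self, tx_cache):
        """Move up to tx_batch sites into a transaction's own crawl cache."""
        with self._lock:
            if not self._global_cache:
                # Global cache empty: fetch one batch from the database.
                self._global_cache.extend(self._load_from_db(self._db_batch))
            moved = min(self._tx_batch, len(self._global_cache))
            for _ in range(moved):
                tx_cache.append(self._global_cache.popleft())
            return moved
```

A transaction would call `refill(own_queue)` whenever its own queue runs dry; a return value of 0 means that neither the global cache nor the database had anything left.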

For a single transaction, the collection flow is as follows:

1. The transaction obtains a non-empty collection transaction object from the transaction queue. If it gets an empty transaction object, transaction scheduling is performed.

2. Determine whether the depth of the collection transaction exceeds the maximum depth. The transaction reads, from the current collection transaction object, the collection depth at which that object lies. If the collection depth exceeds the site collection depth configured in the system, collection for the current transaction ends. If the collection depth does not exceed the configured site collection depth, the transaction continues with step 3.

3. Determine the type of the collection transaction. If it is a web page collection transaction, perform step 4; if it is not a web page collection transaction, perform step 5.

4. Determine whether this is a new web page or an unfinished web page link. If the access address of this collection transaction is not in the crawl history database, collect it as a newly discovered web page, that is, go to step 7. If this collection transaction is in the crawl history database, obtain from the crawl history database the last collection information for this web page address: access address, access time, page size, update frequency and root domain name. Calculate whether the interval between the last access time and the current access time already exceeds the update frequency. If it does not, no collection is performed and collection ends. If it does, compare the current page content size of the web page address with the previous page content size; if they are equal, do not collect; if they are not equal, continue with step 6.

5. If the target is a media or file link, perform the corresponding file collection processing. If it is an illegal link, record the link as abnormal.

6. Fetch the page source of this web page link, update the collection information for this web page address in the access history database, and perform step 8.

7. Collect the new-task web page: fetch the page source of this web page link and add an access record for this web page address to the access history database.

8. Perform web page cleaning and extraction. This step extracts the specified feature information from the page source: useless information and noise data are removed from the page source, and the required information is then extracted from the cleaned data. Further, according to a text similarity algorithm, the title, keywords and descriptive text defined in the page, together with the headings, body text, links and media resources in the page, are extracted from the web page for ranking. During cleaning, the transaction's capture routine first obtains the page code of the current transaction, then starts and initializes the cleaner. Style code and comment code are removed from the page code. Script code is removed as well and, at the same time, the script information is used to identify whether the current page contains media files; if it does, they are saved. The extracted information is compressed, converted into a form that is convenient to store, and stored.
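The branching in steps 2 to 7 can be condensed into one decision function. The sketch below is illustrative only: the `HistoryRecord` fields mirror the collection information listed in step 4, but the function name `decide`, the string labels it returns, the `kind` values and the use of seconds for the update frequency are all assumptions made for this example. How the current page size is obtained is outside the sketch, so it is passed in as a plain integer.

```python
from dataclasses import dataclass


@dataclass
class HistoryRecord:
    url: str                # access address
    access_time: float      # time of the last visit, in seconds
    page_size: int          # page content size seen on the last visit
    update_interval: float  # "update frequency", assumed here to be seconds
    root_domain: str


def decide(kind, url, depth, max_depth, history, now, current_size):
    """Return the action for one collection transaction object.

    kind is one of "page", "media", "file", "illegal" (hypothetical labels);
    history maps url -> HistoryRecord and plays the crawl history database.
    """
    if depth > max_depth:                      # step 2: depth limit exceeded
        return "stop_depth"
    if kind != "page":                         # steps 3 and 5
        return "collect_file" if kind in ("media", "file") else "log_illegal"
    record = history.get(url)                  # step 4: known address?
    if record is None:
        return "collect_new"                   # step 7
    if now - record.access_time <= record.update_interval:
        return "skip_not_due"                  # interval not yet exceeded
    if current_size == record.page_size:
        return "skip_unchanged"                # same size: treated as unchanged
    return "recollect"                         # step 6
```

Comparing only page sizes is a cheap change test; it is the test the text describes, and it will miss edits that leave the size unchanged.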

To handle duplicate links within pages, each transaction is given its own dedicated deduplication container, and each container stores only the mapping codes of the link addresses that it has itself visited. A deduplication container only needs to record the links visited under a single root-domain site; web page addresses that do not belong to this root domain are discarded. When the transaction starts collecting information from another root site, the container's history of access records is cleared and the access records of the new root site are recorded afresh. A threshold is set on the site crawl depth of the information collector, so that while each transaction runs, the memory actually occupied by its deduplication container can be controlled through the crawl depth threshold.
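A per-transaction deduplication container along these lines might look as follows. This is a sketch under stated assumptions: the text does not say how the "mapping code" of a link is computed, so a short BLAKE2b digest is used here purely as a stand-in, and the class and method names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


class DedupContainer:
    """Remembers the links one transaction has visited under one root domain."""

    def __init__(self, root_domain):
        self._seen = set()
        self.start_site(root_domain)

    def start_site(self, root_domain):
        # Switching to another root site clears the previous history.
        self.root = root_domain.lower()
        self._seen.clear()

    def accept(self, url):
        """Return True only for a not-yet-seen link under the current root domain."""
        host = (urlsplit(url).hostname or "").lower()
        if host != self.root and not host.endswith("." + self.root):
            return False  # outside the current root domain: discard
        # ASSUMPTION: an 8-byte BLAKE2b digest stands in for the mapping code.
        code = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
        if code in self._seen:
            return False  # already visited
        self._seen.add(code)
        return True
```

Storing fixed-size digests rather than full URLs keeps the memory per link constant, which fits the stated aim of bounding the container's memory through the crawl depth threshold.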

To reduce the time transactions spend waiting, the present invention uses a multi-layer cache structure, with the size of each cache layer configured according to the memory size of the computer. The first layer is the global crawl cache: when connecting to the crawl database, a batch of crawl results is fetched in a single access and cached. The second layer is the crawl cache of each individual transaction: every transaction has its own cache region for collection data sources. The third layer caches the data that a transaction generates during processing, including the web page and media link addresses already visited, which are cached for the link deduplication check. The last layer caches the data waiting to be saved: only after the data to be saved reaches a certain amount does the transaction write it to storage.
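The last cache layer, which defers writes until enough records have accumulated, can be illustrated with a small buffer. The names `StorageBuffer` and `save_batch` and the default threshold are hypothetical; `save_batch` stands for whatever routine actually writes a list of records to the database.

```python
class StorageBuffer:
    """Collects records and writes them out in batches once a threshold is reached."""

    def __init__(self, save_batch, threshold=100):
        self._save_batch = save_batch  # hypothetical callable: list of records -> None
        self._threshold = threshold
        self._pending = []

    def add(self, record):
        self._pending.append(record)
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self):
        """Write out whatever is pending; does nothing when the buffer is empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._save_batch(batch)
```

A transaction would call `flush()` once more when it finishes, so that a final partial batch is not lost.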

The analysis module of the search engine analyses the basic text and media data returned by collection and builds an index over keywords to support the retrieval system's searches. When extracting keywords from text, the analysis module processes the text sequentially. Digits and Chinese numeral characters that occur consecutively are treated as a single keyword. English letters are split whenever a non-letter character such as a space is encountered. Sentences composed of Chinese characters are handled in the following order: 1. If the characters are consecutive Chinese numerals, the consecutive numerals are kept together. 2. If three consecutive Chinese characters are each independent and do not form a phrase, the three independent characters are treated as one new phrase. 3. If neither case 1 nor case 2 applies, the text is segmented with the dictionary-based algorithm. Any other computer symbol that is not a digit, an English letter or a simplified Chinese character is treated as a special character, and each special character is a keyword.
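The character-class rules above can be sketched as a first, coarse pass that splits text into numeric runs, English-letter runs, Chinese runs and single special characters. This is an illustration only: the set of Chinese numeral characters, the Unicode range used for Chinese characters and the function names are assumptions, and rule 2 (grouping three independent characters) and the dictionary pass over Chinese runs are not covered here.

```python
# ASSUMPTION: an illustrative, not exhaustive, set of Chinese numeral characters.
CN_DIGITS = set("零〇一二三四五六七八九十百千万亿两")


def char_class(ch):
    """Classify one character as num, en, zh, space or sym."""
    if (ch.isascii() and ch.isdigit()) or ch in CN_DIGITS:
        return "num"
    if ch.isascii() and ch.isalpha():
        return "en"
    if "\u4e00" <= ch <= "\u9fff":  # basic CJK block, used here for Chinese text
        return "zh"
    if ch.isspace():
        return "space"
    return "sym"


def coarse_split(text):
    """Split text into (class, run) pairs; each special character is its own token."""
    tokens = []
    run, run_cls = "", None
    for ch in text:
        cls = char_class(ch)
        if cls == run_cls and cls in ("num", "en", "zh"):
            run += ch  # extend the current run
            continue
        if run:
            tokens.append((run_cls, run))
        run, run_cls = "", None
        if cls == "sym":
            tokens.append(("sym", ch))
        elif cls != "space":
            run, run_cls = ch, cls
    if run:
        tokens.append((run_cls, run))
    return tokens
```

Runs tagged `zh` would then be handed to the dictionary-based segmentation; `num`, `en` and `sym` tokens are already keywords.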

The keyword extraction step for text further includes the following.

Load the Chinese phrases. Obtain the text to be segmented from the segmentation object. Obtain the analysis position and its character, and determine whether the analysis position is the end of the text to be segmented. If it is the last position of the text, the segmented text so far, plus a separator, plus the last character of the text to be segmented forms the new segmented text, and segmentation of the text is complete. Otherwise, starting from the analysis position, find the position of the separator. Once the separator position has been found, take the characters between the analysis position and the separator position; the segmented text so far, plus a separator, plus these characters forms the new segmented text.

The basic Chinese dictionary is stored in a two-level hash table. The first character of a phrase is the key, and another hash table is stored as its value; that inner hash table holds the remainder, after the first character has been removed, of every phrase that begins with the key. Within the inner table, the second character of the phrase is the key and a linked array is the value. The linked array stores, for all dictionary words that share the same first two characters, the differing character sequences that begin at the third character of each word.

The text is segmented with the dictionary as the reference, by looking up whether a phrase exists in the dictionary. If it exists in the dictionary, the matching process continues; if it does not exist in the dictionary, matching ends and the text is split at that point.
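A compact sketch of the two-level dictionary and a greedy matcher built on it is shown below. It follows the layout described above (first character, then second character, then a list of tails from the third character on), but the matching strategy is an assumption: it uses plain forward maximum matching, emits unmatched characters one at a time, and ignores single-character dictionary entries. All function names and the `/` separator are hypothetical.

```python
from collections import defaultdict


def build_lexicon(words):
    """first char -> second char -> list of tails (the word from its 3rd char on)."""
    lex = defaultdict(lambda: defaultdict(list))
    for w in words:
        if len(w) >= 2:
            lex[w[0]][w[1]].append(w[2:])
    return lex


def longest_match(lex, text, pos):
    """Length of the longest dictionary word starting at text[pos], or 1 if none."""
    if pos + 1 >= len(text):
        return 1
    tails = lex.get(text[pos], {}).get(text[pos + 1])
    if not tails:
        return 1
    best = 1
    for tail in tails:
        # An empty tail means the two-character word itself is in the dictionary.
        if text.startswith(tail, pos + 2):
            best = max(best, 2 + len(tail))
    return best


def segment(lex, text, sep="/"):
    """Greedy forward maximum matching; returns the tokens joined by sep."""
    out, pos = [], 0
    while pos < len(text):
        n = longest_match(lex, text, pos)
        out.append(text[pos:pos + n])
        pos += n
    return sep.join(out)
```

With this layout, two hash lookups narrow the candidates to the words sharing the first two characters, so only a short list of tails has to be compared at each position.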

Further, after a search has returned a set of web pages, feature-based similarity graphs are constructed separately from the content features of the searches and of the web pages themselves, and a bipartite search and web page graph is built from the interest relations between searches and web pages. Given category labels for a small number of searches and web pages, the categories of the unlabelled searches and web pages are then predicted.

First a graph is built: nodes represent the sample data, edges represent the relations between samples derived from the sample data and their associations, and edge weights represent how close the relation between two samples is. A search engine containing several different kinds of objects can be represented as $G = (V, E)$, where $V = Q \cup D$ is the set of vertices of different types and $E$ is the set of edges connecting the vertices. $Q$ is the set of searches and $D$ is the set of web pages. $E = E_{QQ} \cup E_{QD} \cup E_{DD}$, where $E_{QQ} = Q \times Q$, $E_{QD} = Q \times D$ and $E_{DD} = D \times D$. Let $G_Q = (Q, E_{QQ})$, $G_{QD} = (Q, D, E_{QD})$ and $G_D = (D, E_{DD})$; then $G = G_Q \cup G_{QD} \cup G_D$. Here $G_Q$ is the subgraph built from search nodes, $G_D$ is the subgraph built from web page nodes, and $G_{QD}$ is the subgraph built from search nodes and web page nodes according to their interest relations.

The edge weight $w_{ij}$ is defined from the following distance function between nodes:

$$w_{ij} = \exp\left(-\frac{d(x_i, x_j)}{2\sigma^2}\right)$$

The weight applies when $x_i$ and $x_j$ are among each other's $k$ nearest neighbours. Here $d(x_i, x_j)$ is the distance function $\|x_i - x_j\|$, which for text is computed from the cosine of the angle between the two text vectors $x_i$ and $x_j$, and $\sigma$ is a tuning parameter.
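The weight definition can be made concrete with a small pure-Python sketch. It assumes feature vectors given as equal-length tuples, uses the Euclidean norm for $d$ (a cosine-based distance could be substituted for text vectors), and reads the neighbour condition as mutual: an edge gets a non-zero weight only when each node is among the other's $k$ nearest neighbours. The function name is hypothetical.

```python
import math


def knn_gaussian_weights(points, k, sigma):
    """Weight matrix with w_ij = exp(-d(x_i, x_j) / (2 sigma^2)) for mutual k-NN pairs."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbours = []
    for i in range(n):
        # Indices of the other points, nearest first.
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(set(others[:k]))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbours[i]:
            if i in neighbours[j]:  # keep the edge only if the relation is mutual
                W[i][j] = math.exp(-dist[i][j] / (2.0 * sigma ** 2))
    return W
```

The mutual condition makes the matrix symmetric and keeps it sparse, since isolated outliers that merely have near neighbours do not gain edges.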

The original graph structure is then modified according to the discriminative information in the samples labelled in advance:

1. Construct $G_Q$, $G_D$ and $G_{QD}$. Compute the average weight $w_q$ of the edges between all nodes of $G_Q$ and the average weight $w_d$ of the edges between all nodes of $G_D$.

2. Suppose the labelled search data is divided into $c$ classes, written $P_q = \{P_q^1, P_q^2, \ldots, P_q^c\}$, where $P_q^i$ is the core set of labelled searches of the $i$-th class. Let $M_q^i$ be the pairwise constraint set of the $i$-th search class; if $x \in P_q^i$ and $y \in P_q^i$, add $(x, y)$ to $M_q^i$. Likewise, suppose the labelled web page data is divided into $c$ classes, written $P_d = \{P_d^1, P_d^2, \ldots, P_d^c\}$, where $P_d^i$ is the core set of labelled web pages of the $i$-th class. Let $M_d^i$ be the pairwise constraint set of the $i$-th web page class; if $x \in P_d^i$ and $y \in P_d^i$, add $(x, y)$ to $M_d^i$.

3. If the search pair $(q_l, q_m) \in M_q^i$, $q_k$ is a neighbour of $q_l$ and $q_m$, $w_{lk} < w_q$ and $w_{mk} > w_q$, then add $q_k$ to $P_q^i$ and add $(q_l, q_k)$ and $(q_m, q_k)$ to $M_q^i$.

4. Repeat step 3 until $P_q$ no longer changes.

5. If the web page pair $(d_l, d_m) \in M_d^i$, $d_k$ is a neighbour of $d_l$ and $d_m$, $w_{lk} < w_d$ and $w_{mk} > w_d$, then add $d_k$ to $P_d^i$ and add $(d_l, d_k)$ and $(d_m, d_k)$ to $M_d^i$.

6. Repeat step 5 until $P_d$ no longer changes.

7. If the search pair $(q_l, q_m) \in M_q^i$, modify $G_Q$ so that $w_{lm} = 1$; if $q_l \in P_q^i$, modify $G_Q$ so that $w_{lm} = 0$. If the web page pair $(d_l, d_m) \in M_d^i$, modify $G_D$ so that $w_{lm} = 1$; if $d_l \in P_d^i$, modify $G_D$ so that $w_{lm} = 0$.

8. If search $q_l \in P_q^j$ and web page $d_m \in P_d^j$, modify $G_{QD}$ so that $w_{lm} = 1$.

Here $w_{lk}$, $w_{mk}$ and $w_{lm}$ are the weights between searches $q_l$, $q_m$, $q_l$ and web pages $d_k$, $d_k$, $d_m$ respectively.

The above process enriches the relations between searches and web pages, making the connections between nodes of the same class tighter and the connections between nodes of different classes looser, so that classification with the following procedure works better:

1. For classes $j \in \{1, \ldots, c\}$ and nodes $i \in \{1, \ldots, n\}$ of the above subgraphs, construct the $n \times c$ initial label matrix $Y$.

2. Construct the adjacency matrices $W_{QQ}$ and $W_{DD}$ of the homogeneous networks from the similarity measure between nodes of the same type, and construct the adjacency matrix $W_{QD}$ and its transpose $W_{DQ}$ from the relations between nodes of different types.

3. Construct the matrix $S$ (its defining formula is not reproduced in this text).

4. Take $F(0) = Y$ and iterate $F(t+1) = \mu_\alpha S F(t) + (1 - \mu_\alpha) Y$, where $\mu_\alpha$ is a parameter in $(0, 1)$.

5. Let $F^*$ be the limit of the sequence $\{F(t)\}$; each node $v_i$ of graph $G$ is then labelled according to $y_i = \arg\max_{j \le c} F^*_{ij}$.

During this iteration, every node of the graph repeatedly propagates its label information to its neighbouring nodes until a stable state is reached.
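The iteration can be sketched in pure Python as follows. Two points are assumptions rather than statements from the text: the full adjacency matrix is assembled as the block matrix with $W_{QQ}$ and $W_{QD}$ in the top row and $W_{DQ}$ and $W_{DD}$ in the bottom row, and, because the formula for $S$ is not available, the symmetric normalisation $S = D^{-1/2} W D^{-1/2}$ (with $D$ the diagonal degree matrix) commonly used in label spreading is substituted. The function names, the fixed iteration count and the default `alpha` are hypothetical.

```python
import math


def block_adjacency(W_QQ, W_QD, W_DD):
    """ASSUMPTION: combine the sub-matrices as [[W_QQ, W_QD], [W_DQ, W_DD]]."""
    W_DQ = [list(col) for col in zip(*W_QD)]  # transpose of W_QD
    top = [rq + rqd for rq, rqd in zip(W_QQ, W_QD)]
    bottom = [rdq + rd for rdq, rd in zip(W_DQ, W_DD)]
    return top + bottom


def propagate(W, Y, alpha=0.5, iters=200):
    """Iterate F <- alpha * S F + (1 - alpha) * Y and return the arg-max class per node."""
    n, c = len(Y), len(Y[0])
    deg = [sum(row) for row in W]
    # ASSUMPTION: S = D^{-1/2} W D^{-1/2}; the source formula for S is not available.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    F = [row[:] for row in Y]  # F(0) = Y
    for _ in range(iters):
        F = [[alpha * sum(S[i][m] * F[m][j] for m in range(n)) + (1 - alpha) * Y[i][j]
              for j in range(c)] for i in range(n)]
    return [max(range(c), key=lambda j: F[i][j]) for i in range(n)]
```

With `alpha` strictly between 0 and 1 and this choice of `S`, the update is a contraction, so a fixed number of iterations is enough for a sketch; a production version would stop once `F` changes by less than a tolerance.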

In summary, the present invention proposes a general-purpose Internet data collection method that collects data efficiently using a transaction control strategy and mines data from the coupling relationships among multi-dimensional objects.

Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with a general-purpose computing system. They can be concentrated on a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. The invention is therefore not restricted to any specific combination of hardware and software.

It should be understood that the above embodiments of the invention serve only to illustrate or explain the principle of the invention and do not limit it. Any modification, equivalent substitution or improvement made without departing from the spirit and scope of the invention should therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundary of the claims, or within equivalents of that scope and boundary.

Claims

1. A general-purpose Internet data collection method, in which a single transaction of the collection module of a search engine collects basic website data, characterized by comprising:

Step 1. The transaction obtains a non-empty collection transaction object from the transaction queue; if it gets an empty transaction object, transaction scheduling is performed;

Step 2. Determine whether the depth of the collection transaction exceeds the maximum depth; the transaction reads, from the current collection transaction object, the collection depth at which that object lies; if the collection depth does not exceed the site collection depth configured in the system, the transaction continues with step 3;

Step 3. Determine the type of the collection transaction; if it is a web page collection transaction, perform step 4; if it is not a web page collection transaction, perform step 5;

Step 4. Determine whether this is a new web page or an unfinished web page link; if the access address of this collection transaction is not in the crawl history database, collect it as a newly discovered web page, that is, go to step 7; if this collection transaction is in the crawl history database, obtain from the crawl history database the last collection information for this web page address, namely access address, access time, page size, update frequency and root domain name; calculate whether the interval between the last access time and the current access time already exceeds the update frequency; if it does, compare the current page content size of the web page address with the previous page content size; if they are equal, do not collect; if they are not equal, continue with step 6;

Step 5. If the target is a media or file link, perform the corresponding file collection processing; if it is an illegal link, record the link as abnormal;

Step 6. Fetch the page source of this web page link, update the collection information for this web page address in the access history database, and perform step 8;

Step 7. Collect the new-task web page: fetch the page source of this web page link and add an access record for this web page address to the access history database;

Step 8. Perform web page cleaning and extraction, wherein the web page cleaning and extraction step extracts the specified feature information from the page source: useless information and noise data are removed from the page source, and the required information is then extracted from the cleaned data.