CN108388669A - Distributed computing method for data mining - Google Patents

Info

Publication number
CN108388669A
Authority
CN
China
Prior art keywords
data
homepage
page
secondary page
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810226620.5A
Other languages
Chinese (zh)
Inventor
张悠
陈熹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Ico Huizhi Technology Co Ltd
Original Assignee
Sichuan Ico Huizhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Ico Huizhi Technology Co Ltd
Priority to CN201810226620.5A
Publication of CN108388669A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2452 Query translation
    • G06F16/24522 Translation of natural language queries to structured queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a distributed computing method for data mining. The method includes: establishing an index structure according to the association between homepage data and secondary page data; and obtaining the set of hit homepage IDs corresponding to the segmentation filter terms, together with the set of hit secondary page IDs corresponding to the hit homepage IDs. The method optimizes data search engine technology: while preserving high-performance search over multidimensional data, it reduces the cost of updating data and implements multidimensional search efficiently.

Description

Distributed computing method for data mining
Technical field
The present invention relates to big data search, and in particular to a distributed computing method for data mining.
Background
Big data is undergoing enormous change. Customer data, transaction data, social media data, and network behavior data all contain business information of great value, which shapes the future and development of an enterprise. Demand for real-time search over big data keeps growing. However, current open-source real-time search engines for big data environments still carry application risks for reasons of performance, stability, and accumulated experience, and their time and space costs are excessive when performing high-performance search over multidimensional data.
Summary of the invention
To solve the above problems in the prior art, the present invention proposes a distributed computing method for data mining, including:
establishing an index structure according to the association between homepage data and secondary page data; and obtaining the set of hit homepage IDs corresponding to the segmentation filter terms, and the set of hit secondary page IDs corresponding to the hit homepage IDs.
Preferably, establishing the index structure according to the association between homepage data and secondary page data further comprises establishing a homepage inverted index table and a secondary page inverted index table according to that association;
The homepage data and secondary page data are associated by recording the storage location of each associated secondary page in the homepage inverted index table, and recording the storage location of each associated homepage in the secondary page inverted index table;
The homepage inverted index table includes: a primary term index, and a homepage record set corresponding to the primary term index;
Each homepage record stores the page ID of a target homepage that contains the primary term, and information on the secondary pages associated with that target homepage;
The secondary page inverted index table includes: a secondary term index, and a secondary page record set corresponding to the secondary term index;
Each secondary page record stores the page ID of a target secondary page that contains the secondary term, and information on the homepage associated with that target secondary page ID.
Preferably, the crawled homepage data and secondary page data are stored in different Hadoop storage nodes;
Also, within the Hadoop storage nodes, different pages correspond to different page IDs.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a distributed computing method for data mining that optimizes data search engine technology. While preserving high-performance search over multidimensional data, it reduces the cost of updating data and implements multidimensional search efficiently.
Description of the drawings
Fig. 1 is a flowchart of the distributed computing method for data mining according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the invention is provided below, together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with these embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a distributed computing method for data mining. Fig. 1 is a flowchart of the method according to an embodiment of the invention. The invention applies to establishing an index structure for multidimensional search, and specifically includes:
The homepage data and secondary page data crawled by the crawler module are stored in different Hadoop storage nodes. Optionally, to improve search performance, the secondary pages that belong to the same homepage may be stored contiguously in consecutive physical blocks within a Hadoop storage node.
According to the association between homepage data and secondary page data, a homepage inverted index table and a secondary page inverted index table are established for multidimensional search.
The homepage inverted index table records the storage locations of the secondary pages associated with each homepage, and the secondary page inverted index table records the storage location of the homepage associated with each secondary page. Recording these locations in both tables allows homepage data and secondary page data to be associated quickly even though they are stored separately.
The homepage inverted index table may include: a primary term index, and a homepage record set corresponding to the primary term index, where each homepage record stores the page ID of a target homepage that contains the primary term, and information on the secondary pages associated with that target homepage;
The secondary page inverted index table includes: a secondary term index, and a secondary page record set corresponding to the secondary term index, where each secondary page record stores the page ID of a target secondary page that contains the secondary term, and information on the homepage associated with that target secondary page ID. In the Hadoop storage nodes, different pages correspond to different page IDs.
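As an illustration of the two tables described above, the following Python sketch builds them in memory. It is not the patented implementation: the function name `build_indexes`, the dictionary layout, and whitespace tokenization (standing in for a real word segmenter) are all assumptions made for the example, and page IDs stand in for storage locations.

```python
from collections import defaultdict

def build_indexes(homepages, secondary_pages):
    """Build a homepage table and a secondary page table (illustrative only).

    homepages:       {homepage_id: text}
    secondary_pages: {secondary_id: (parent_homepage_id, text)}
    """
    # Group secondary page IDs under their parent homepage (the association).
    children = defaultdict(list)
    for sid, (hid, _text) in secondary_pages.items():
        children[hid].append(sid)

    # term -> [(homepage_id, [associated secondary page IDs])]
    homepage_index = defaultdict(list)
    for hid, text in homepages.items():
        for term in set(text.split()):  # assumption: whitespace tokenization
            homepage_index[term].append((hid, sorted(children[hid])))

    # term -> [(secondary_id, parent homepage_id)]
    secondary_index = defaultdict(list)
    for sid, (hid, text) in secondary_pages.items():
        for term in set(text.split()):
            secondary_index[term].append((sid, hid))
    return dict(homepage_index), dict(secondary_index)

homepages = {1: "red shoes", 2: "blue shoes"}
secondary_pages = {10: (1, "size 42"), 11: (1, "size 43"), 20: (2, "size 42")}
homepage_index, secondary_index = build_indexes(homepages, secondary_pages)
```

Each homepage record carries its secondary page IDs and each secondary page record carries its parent homepage ID, so either table can reach the other side of the association without a separate join.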
During an actual search, the search engine parses the multidimensional search request entered by the user and obtains the segmentation filter terms corresponding to the request. The segmentation filter terms include first-level segmentation filter terms and/or secondary segmentation filter terms.
The index structure is searched according to the segmentation filter terms to obtain the set of hit homepage IDs corresponding to the filter terms and the set of hit secondary page IDs corresponding to the hit homepage IDs.
Each first-level or secondary segmentation filter term corresponds to one or more term attributes, and each term attribute corresponds to a term index in the homepage inverted index table or the secondary page inverted index table of the index structure.
In the search engine's MapReduce framework, the corresponding homepage ID set is first determined from the first-level segmentation filter terms. The corresponding mapping document set is then determined from the secondary segmentation filter terms; in the mapping document set, each homepage ID is a Key and the set of secondary page IDs corresponding to that homepage ID is its Value. Once the homepage ID set and the mapping document set have been obtained, intersecting the homepage ID set with the Keys of the mapping document set yields the final set of hit homepage IDs that satisfy the conditions. The Values in the mapping document set that correspond to the hit homepage IDs then give the set of hit secondary page IDs.
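The intersection step described above can be sketched in a few lines of Python. This is a single-process illustration, not a MapReduce job; the name `resolve_hits` and the use of a plain dict for the mapping document set are assumptions for the example.

```python
def resolve_hits(first_level_ids, mapping_docs):
    """Intersect the homepage ID set with the Keys of the mapping document set.

    first_level_ids: set of homepage IDs matched by the first-level filter terms
    mapping_docs:    {homepage_id: set of secondary page IDs} (Key -> Value)
    """
    hit_homepages = first_level_ids & set(mapping_docs)
    # The Values under the surviving Keys are the hit secondary page IDs.
    hit_secondary = {hid: mapping_docs[hid] for hid in hit_homepages}
    return hit_homepages, hit_secondary

hits, secondary = resolve_hits({1, 2, 3}, {2: {20, 21}, 3: {30}, 4: {40}})
```

Homepage 1 is dropped because no secondary page matched under it, and homepage 4 is dropped because it failed the first-level filter.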
The hit homepages corresponding to the hit homepage ID set are sorted according to a predefined sorting rule, and the sorted result is displayed together with the hit secondary pages corresponding to the hit secondary page ID set.
In addition to the hit homepages and hit secondary pages, other display attributes corresponding to them, such as visit counts and user ratings, can be displayed at the same time so that the user gets a clearer picture of the search results.
Searching the index structure according to the segmentation filter terms to obtain the hit homepage ID set and the corresponding hit secondary page ID set proceeds as follows:
The homepage inverted index table in the index structure is searched according to the first-level search term attribute corresponding to the first-level segmentation filter term, yielding a first homepage ID set corresponding to the first-level filter term. The secondary page inverted index table is searched according to the secondary search term attribute corresponding to the secondary segmentation filter term, yielding a first target mapping document set. The hit homepage ID set, and the hit secondary page ID set corresponding to it, are then determined from the first homepage ID set and the first target mapping document set.
Optionally, obtaining the first homepage ID set corresponding to the first-level segmentation filter terms may include:
searching the homepage inverted index table in the index structure according to the first-level search term attributes corresponding to at least two first-level segmentation filter terms, to obtain the homepage ID set corresponding to each first-level filter term;
intersecting the at least two homepage ID sets obtained, to produce the first homepage ID set corresponding to the first-level segmentation filter terms.
Likewise, obtaining the first target mapping document set corresponding to the secondary segmentation filter terms may include:
searching the secondary page inverted index table in the index structure according to the secondary search term attributes corresponding to at least two secondary segmentation filter terms, to obtain at least two candidate mapping document sets; intersecting the Keys contained in the at least two candidate mapping document sets to obtain the target Keys; intersecting, for each target Key, the corresponding Values contained in the at least two candidate mapping document sets to obtain the target Values; and generating the first target mapping document set from the target Keys and target Values.
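The two-stage intersection of candidate mapping document sets (Keys first, then the Values under each surviving Key) might look like the sketch below. The function name and the dict-of-sets representation are assumptions; the text does not say whether a Key whose Values intersect to an empty set is kept, so this sketch simply keeps it.

```python
from functools import reduce

def intersect_mapping_docs(candidates):
    """candidates: non-empty list of {homepage_id: set of secondary page IDs}."""
    # Stage 1: intersect the Key sets of all candidate mapping document sets.
    target_keys = reduce(set.intersection, (set(c) for c in candidates))
    # Stage 2: for each target Key, intersect its Values across all candidates.
    return {
        key: reduce(set.intersection, (set(c[key]) for c in candidates))
        for key in target_keys
    }

candidates = [
    {1: {10, 11}, 2: {20}},   # matches for the first secondary filter term
    {1: {11, 12}, 3: {30}},   # matches for the second secondary filter term
]
target = intersect_mapping_docs(candidates)
```

Only homepage 1 appears under both filter terms, and only secondary page 11 satisfies both of them.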
At least one sort field is determined according to the sorting rule, and the value of the sort field for each hit homepage ID is looked up in a pre-stored page ID search attribute mapping table.
From the sort field values corresponding to each hit homepage ID, a ranking score is calculated for each hit homepage, and the hit homepages are sorted by ranking score. According to the display parameters, the hit homepages to be displayed on the current page are determined, and the hit secondary pages to be displayed that correspond to them are obtained from the first target mapping document set.
The display parameters may be the size of the display screen and the size of the display font. From these parameters, the number of hit homepages that each display page can show is determined; the hit homepages and hit secondary pages to be displayed on the current page are then determined from the total number of hit homepage IDs and the current page ID.
Search display items are constructed from the hit homepages and hit secondary pages to be displayed, and each search display item is shown on the current page.
The index structure is further described as follows. Each associated pair of homepage data and secondary page data is indexed as a separate page record. In the homepage inverted index table, each inverted index record that a term points to stores the page ID of the homepage data, together with the starting page ID and offset of its secondary page data. Secondary page data belonging to the same homepage must be stored contiguously to form one logical block. Each record in the secondary page inverted index table stores the page ID of the record that the term points to and the page ID of the homepage data it belongs to.
Specifically, in the search process, the sort fields of the homepage data and of the secondary page data are defined first. The search request entered by the user is then parsed to form the filter terms for the first-level data and the secondary data, along with the result page number PN and the per-page data size PS. The search result must also return the set of secondary page IDs. The following search process is carried out according to the filter terms:
For each filter term i of the homepage data, locate the homepage inverted index table corresponding to the filter term and search it using the term attribute of that filter term in the search condition, obtaining the homepage ID set U_i for that term attribute. With N search conditions there are N page ID sets, U_i ∈ U, i ∈ [1, N], where U is the homepage ID set;
For each filter term j of the secondary data, locate the secondary page inverted index table corresponding to the filter term and search it using the term index of that filter term in the search condition, obtaining the secondary page ID set L_j for that term index. From L_j, build the mapping document set Pmap_j keyed by the page ID of the homepage to which each secondary page belongs: the homepage ID is the Key and the secondary page ID set is the Value. If there are M search conditions, the M sets Pmap_j are intersected: first the Key sets are intersected, then the Values corresponding to each Key are intersected, yielding the final mapping document set Pmap*;
The page ID set U and the homepage ID set in Pmap* are merged by intersection to obtain the final homepage ID set R;
While the data in set R is generated, for each homepage added to R, its page ID is used to look up, in the page ID search attribute mapping table, each sort field required by the homepage ranking formula; the formula is evaluated to obtain the score, which is stored in the corresponding record in R;
Set R is sorted in descending order of score, and the results R* between positions PN*PS and (PN+1)*PS are taken. For each homepage ID in each result record RC in that range, the following is done:
a. Search Pmap* to obtain the set of secondary page IDs of the homepage ID under the filter terms, and set it as a display item in RC; b. According to the display attributes requested in the search request, look up the page ID search attribute mapping table, fill in the values of the display attributes, and set them as a display item in RC;
The result data R* is returned to the front end for display.
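The scoring, descending sort, and PN*PS to (PN+1)*PS slice described above can be sketched as follows. The ranking formula is not given in the text, so `score` here is a made-up example; the function names, the attribute table layout, and the zero-based page number are likewise assumptions.

```python
def rank_and_page(hit_ids, attributes, score_fn, pn, ps):
    """Return the homepage IDs on zero-based page `pn` with page size `ps`."""
    scored = [(score_fn(attributes[hid]), hid) for hid in hit_ids]
    # Descending by score; ties broken by ascending ID for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [hid for _score, hid in scored[pn * ps:(pn + 1) * ps]]

def build_display_items(page_ids, pmap_star, attributes, display_fields):
    """Attach secondary page IDs and requested display attributes to each hit."""
    return [
        {"homepage": hid,
         "secondary": sorted(pmap_star.get(hid, ())),
         **{field: attributes[hid][field] for field in display_fields}}
        for hid in page_ids
    ]

# Hypothetical attribute table and ranking formula, for illustration only.
attributes = {
    1: {"visits": 5, "rating": 4.0},
    2: {"visits": 50, "rating": 3.0},
    3: {"visits": 1, "rating": 4.5},
}
pmap_star = {1: {10, 11}, 2: {20}, 3: {30}}
score = lambda attrs: attrs["visits"] + 10 * attrs["rating"]

page0 = rank_and_page({1, 2, 3}, attributes, score, pn=0, ps=2)
items = build_display_items(page0, pmap_star, attributes, ["visits"])
```

With these sample values the scores are 45.0, 80.0, and 46.0, so the first page holds homepages 2 and 3 and the second page holds homepage 1.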
Preferably, the search of the index described above further includes the following steps in the search engine:
Monitor the crawler data, convert the crawler data into a fixed-delimiter format, obtain the crawler data in the Hadoop framework, and generate index information for the crawler data;
Encode the crawler data according to a preset encoding scheme to generate encoded blocks, and generate the corresponding inverted index, where the inverted index includes the original offset and the offset after encoding;
Write the index information to a data warehouse in the data warehouse format, where the data warehouse includes an index table header, index blocks, a range search block, and an index table trailer. Specifically, the index table header is written first, then the indexes are written in a loop while the file information, range search information, and index table trailer information are recorded. Once all indexes have been written, the important file offsets, file information, range search information, and index table trailer are recorded when the file is closed.
After an index search request is received, parse the request, call the data warehouse, and obtain the index information of the crawler data by reading the information in the range search block;
According to the index information and the inverted index: when the data warehouse is read, the index table header and trailer are read first to obtain the parameters and other information of all indexes; the range search information is read next to locate the index block information; finally the index block information is read, with comparison during reading to find the block containing the index. The search thereby obtains the position of the crawler data in the encoded block, and the crawler data is retrieved from that position.
Encoding the crawler data according to the preset encoding scheme and generating the corresponding inverted index specifically includes:
1. MapReduce monitors the directory and encodes the crawler data to produce encoded file blocks; each block records the size of the crawler data and its size after encoding.
2. Generate the index of the encoded blocks and record it in a first data warehouse;
3. Set up a second data warehouse that records the original offset and the offset after encoding, and define the second data warehouse as the inverted index.
Specifically, the first data warehouse records the size of the current block before and after encoding, while the second data warehouse records the offset relative to the start of the file, which is a cumulative value.
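The relationship between per-block sizes (first data warehouse) and cumulative offsets (second data warehouse) can be illustrated with the sketch below. The patent does not name the encoding scheme, so `zlib` compression is used purely as a stand-in; the function names and the list-based stores are assumptions.

```python
import bisect
import zlib

def encode_blocks(data, block_size):
    """Split `data` into blocks, encode each, and record sizes and offsets."""
    blocks, sizes, offsets = [], [], []
    raw_off = enc_off = 0
    for start in range(0, len(data), block_size):
        raw = data[start:start + block_size]
        enc = zlib.compress(raw)                # assumption: zlib as the encoder
        blocks.append(enc)
        sizes.append((len(raw), len(enc)))      # first store: size before/after
        offsets.append((raw_off, enc_off))      # second store: cumulative offsets
        raw_off += len(raw)
        enc_off += len(enc)
    return blocks, sizes, offsets

def read_byte(blocks, offsets, position):
    """Return the original byte at `position` by locating its encoded block."""
    raw_starts = [raw for raw, _enc in offsets]
    idx = bisect.bisect_right(raw_starts, position) - 1
    return zlib.decompress(blocks[idx])[position - raw_starts[idx]]

data = bytes(range(256)) * 4                    # 1024 bytes of sample data
blocks, sizes, offsets = encode_blocks(data, 300)
```

Because the offsets are cumulative, a binary search over the original offsets finds the block for any position in the unencoded data, and only that block has to be decoded.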
After the inverted index is generated, MapReduce merges the inverted indexes by index type. The data can be sorted at the same time using MapReduce's sorting process, and the merged, sorted index data is better suited to block indexing.
Writing the index information to the data warehouse specifically includes:
1. Write the index table header of the data warehouse. The index table header holds the basic information of the data warehouse, including the version number, block size, index type, and index name.
2. Write the index values and location information of the index information to the index blocks in the data warehouse. Each index entry contains an index value, a location count, a record count, and individual location information; each location consists of a file ID and an offset;
3. Update the file index block information and the range search block information;
4. Write the count of index entries written, the range search block offset, and the file index block offset to the index table trailer of the data warehouse.
Parsing the search request, calling the data warehouse, and obtaining the index information of the crawler data by reading the information in the range search block specifically includes:
1. After the index search request is received, parse the request to obtain the search index;
2. Read the index table header and trailer of the data warehouse to obtain its range search block offset and file index block offset; the parameters and information of all indexes are obtained from the trailer and header.
3. Using the range search block offset and the index value of the search index, find the index block that contains the search index;
4. Find the index information corresponding to the search request in that index block.
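The header, index blocks, range search block, and trailer can be modeled in memory to show how the lookup steps above fit together. This sketch does not reproduce any on-disk format: the dict layout, the choice of each block's first index value as its range search entry, and the assumption that index values are unique are all illustrative.

```python
import bisect

def build_store(entries, block_size):
    """entries: list of (index_value, (file_id, offset)) in any order."""
    entries = sorted(entries)
    blocks = [entries[i:i + block_size] for i in range(0, len(entries), block_size)]
    return {
        "header": {"version": 1, "block_size": block_size},
        "blocks": blocks,
        # Assumption: the range search block holds each block's first index value.
        "range_block": [block[0][0] for block in blocks],
        "trailer": {"count": len(entries)},
    }

def lookup(store, index_value):
    """Locate the block via the range search block, then scan that block."""
    i = bisect.bisect_right(store["range_block"], index_value) - 1
    if i < 0:
        return None  # smaller than every indexed value
    for value, location in store["blocks"][i]:
        if value == index_value:
            return location
    return None

store = build_store(
    [("b", (1, 10)), ("a", (1, 0)), ("d", (2, 5)), ("c", (1, 20)), ("e", (2, 9))],
    block_size=2,
)
```

The range search block keeps the lookup from scanning every index block: one binary search selects a single block, and only that block is compared entry by entry.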
Pages to be collected, up to the maximum crawl page count, are selected at random from the current secondary pages. The page score of each secondary page under the current homepage is calculated according to a preset evaluation function, and each secondary page whose score exceeds a preset page score threshold is chosen as a candidate page. For each secondary page, its degree of association with the current keyword is obtained as follows: segment the text of the secondary page into words to build the word set for the page text; then, from the words in that set, calculate the degree of association between the secondary page and the current keyword using a Bayes algorithm;
Using the current keyword and the set of collected pages related to it as training samples, a page-similarity crawling process for the current keyword is obtained, and this process is used to search for pages related to the current keyword;
The set of collected pages related to the current keyword is built from the pages whose degree of association with the current keyword exceeds a preset association threshold;
The weight of each word in the word set for the page text is calculated, and the word set is reduced in dimension according to those weights and updated. The similarity between the words in the word set is calculated with a vector space model algorithm; a text clustering algorithm then merges the words whose mutual similarity exceeds a preset similarity threshold, and the word set for the page text is updated.
The evaluation function is as follows:
$$F_{ns}(link_i) = f_i^{sim} + f_i^{link} + f_i^{parent} + f_i^{label} + f_i^{relevant/total}$$
where $F_{ns}(link_i)$ is the page score of the i-th page; $f_i^{sim}$ is the predicted topic relevance of the i-th web page; $f_i^{link}$ is the link analysis value of the URL of the i-th page; $f_i^{parent}$ is the relevance of the homepage of the i-th page; $f_i^{label}$ is the label weight of the URL of the i-th page; $f_i^{relevant/total}$ is the ratio of the number of pages related to the current keyword to the total number of pages; and λ is a dynamically, adaptively adjusted value.
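A direct reading of the evaluation function, combined with the threshold-based choice of candidate pages, is sketched below. How each component is computed and how λ enters the score are not specified in the text, so the components are passed in as plain numbers and no weighting is applied; the function names and sample values are hypothetical.

```python
def page_score(f_sim, f_link, f_parent, f_label, relevant, total):
    """Sum the five components; the last one is the relevant/total page ratio."""
    return f_sim + f_link + f_parent + f_label + relevant / total

def select_candidates(pages, threshold):
    """Keep the pages whose score exceeds the preset page score threshold."""
    return [url for url, parts in pages.items() if page_score(**parts) > threshold]

# Hypothetical component values, for illustration only.
pages = {
    "a": dict(f_sim=0.5, f_link=0.25, f_parent=0.25, f_label=0.5, relevant=1, total=4),
    "b": dict(f_sim=0.0, f_link=0.25, f_parent=0.0, f_label=0.0, relevant=0, total=4),
}
candidates = select_candidates(pages, threshold=1.0)
```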
Correspondingly, the crawler module of the search engine preferably includes multiple crawlers and at least one collector. Each crawler hosts a web page collection process and at least one application, and each collector is communicatively connected to at least one crawler. A crawler uses its web page collection process to collect, in real time, the web page data generated while the application runs, and sends the collected data to the collector with which it has established a connection. The collector integrates the web page data received from the crawlers and stores the integrated data in the Hadoop framework.
The web page collection process runs in the background and monitors and collects, in real time, the web page data generated while applications run on the application server. The collector receives and stores the web page data sent by the crawlers; the collector and the application server that sends the web page data are different servers.
In the search engine cloud platform, when a new host is created, a crawler can be deployed on it, which enables elastic deployment of hosts in the cloud platform. Before integrating the web page data, the collector also filters the web page data from the multiple crawlers. The collector identifies users based on the user interactions in the web page data and, for each identified user, identifies interaction behavior according to the time order of that user's interactions in the corresponding web page data. To simplify subsequent processing, the invention processes the received web page data with MapReduce: for example, data filtering and user identification are performed in the Map stage, and interaction behavior identification in the Reduce stage.
When web page data is filtered in the Map stage, the collector reads each user interaction in each piece of web page data received; if the collector cannot read the current user interaction, that interaction is removed from the received web page data.
After data filtering, the web page data is converted to <key, value> form. Map-stage processing is described as map(key1, value1) -> list<key2, value2>, where key1 is a numeric identifier giving the position of the web page data; value1 is the user interaction stored at that position; key2 is the user's current keyword; and value2 is the record information after data filtering. For example, if key2 is currently the user name, value2 can hold the specific user name information identified from the web page data.
During user identification, a user whose web page data records user name information is determined to be a registered user, and a user whose web page data has IP address information but no user name information is determined to be an unregistered user. Users with the same user name information are the same registered user, and users with the same IP address information are the same unregistered user.
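The Map-stage filtering and user identification can be sketched as a plain Python function. This is not Hadoop code: the record fields `username`, `ip`, and `time`, the use of `None` to represent an unreadable interaction, and the tuple-shaped key2 are assumptions made for the illustration.

```python
def map_stage(records):
    """records: list of (key1, value1), where value1 is a dict or unreadable."""
    output = []
    for _position, interaction in records:
        if not isinstance(interaction, dict):
            continue  # data filtering: drop interactions that cannot be read
        if interaction.get("username"):
            key2 = ("registered", interaction["username"])
        elif interaction.get("ip"):
            key2 = ("unregistered", interaction["ip"])
        else:
            continue  # neither a user name nor an IP address: cannot identify
        output.append((key2, interaction))
    return output

records = [
    (0, {"username": "alice", "ip": "10.0.0.1", "time": 5}),
    (1, None),                                   # unreadable interaction
    (2, {"ip": "10.0.0.2", "time": 7}),
    (3, {"username": "alice", "ip": "10.0.0.9", "time": 9}),
]
mapped = map_stage(records)
```

Both of alice's interactions share one key even though their IP addresses differ, so a later Reduce stage would receive them together.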
For each identified user, interaction behavior is identified as follows: the user's interactions are sorted chronologically by the operation times recorded in the corresponding web page data; among the sorted interactions, one or more interactions that satisfy a preset condition are determined to be one session. The preset condition includes: the time differences between the operation time of an interaction and the operation times of the adjacent preceding interaction and the adjacent following interaction are both greater than a set threshold;
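One common way to realize this kind of session identification is gap-based splitting: sort the operation times and start a new session whenever the gap to the previous interaction exceeds the threshold. The sketch below uses that approach as an assumed interpretation of the preset condition; it is consistent with it in that an interaction separated from both neighbors by more than the threshold ends up as a session of its own.

```python
def split_sessions(times, gap):
    """Group operation times into sessions separated by gaps larger than `gap`."""
    sessions = []
    for t in sorted(times):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)   # close enough to the previous interaction
        else:
            sessions.append([t])     # gap exceeded: start a new session
    return sessions

sessions = split_sessions([100, 5, 7, 50, 52, 8], gap=10)
```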
Further, the collecting device divides the integrated web data into multiple web page fragments and extracts the keyword information corresponding to each fragment. The fragments are stored in the corresponding data warehouses according to a preset distribution principle, and the correspondence between each fragment's keyword information and its storage location is also stored. Specifically, each collecting device is communicatively connected to at least one data warehouse, and each Hadoop storage area may host one data warehouse or several. The distribution principle can be a load balancing principle, that is, web page fragments are allocated evenly to the data warehouses so that load is balanced across them. The correspondence can be a mapping between a fragment's keyword information and the identifier of its data warehouse.
When dividing the integrated web data into multiple web page fragments, the division can be based on a set data size, on set web page association information, or on a set data type.
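The size-based split and the load-balancing distribution can be sketched as follows. Round-robin placement is used here as one simple way to allocate fragments evenly; the patent text does not prescribe it, and the warehouse names are hypothetical.

```python
from typing import Dict, List, Sequence


def split_by_size(data: bytes, fragment_size: int) -> List[bytes]:
    """Cut integrated web data into fragments of at most fragment_size bytes."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [data[i:i + fragment_size] for i in range(0, len(data), fragment_size)]


def distribute(fragment_count: int, warehouses: Sequence[str]) -> Dict[str, List[int]]:
    """Assign fragment indexes to warehouses round-robin (assumed policy)."""
    if not warehouses:
        raise ValueError("at least one warehouse is required")
    placement: Dict[str, List[int]] = {w: [] for w in warehouses}
    for idx in range(fragment_count):
        placement[warehouses[idx % len(warehouses)]].append(idx)
    return placement
```

The returned placement doubles as the storage-location side of the keyword correspondence: pairing each fragment's keywords with the warehouse identifier it was assigned to gives the lookup table described above.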
After data crawling and collection are complete, when scheduling Hadoop storage nodes, the present invention obtains every node in the storage layer with $D_k < U$ or $U_k < U$, where $k \in \{1, \dots, K\}$, K is the number of nodes in the storage layer, $D_k$ is the download speed of the k-th node, $U_k$ is the upload speed of the k-th node, and U is the preset minimum bandwidth for data transmission in the storage layer. For each such node, the following scheduling is performed:
Obtain every node in the storage layer with $D_k > U'$ and $U_k > U'$ as the spare storage areas, where $U'$ is the preset minimum bandwidth of a storage-layer node that is to receive data;
Obtain the data storage cost $\mathrm{consum}_q$ of each spare storage area from the formula

$$\mathrm{consum}_q = n_q \times (\mathrm{consum}'_q + \mathrm{consum}''_q) + x_q \times \mathrm{consum}'''_q$$

where $q \in \{1, \dots, Q\}$, Q is the number of spare storage areas, $\mathrm{consum}_q$ is the data storage cost of the q-th spare storage area, $\mathrm{consum}'_q$ is the unit storage cost of data in the q-th spare storage area, $\mathrm{consum}''_q$ is the unit transmission cost of the q-th spare storage area, $\mathrm{consum}'''_q$ is the data request cost of the q-th spare storage area, $n_q$ is the data storage capacity required in the q-th spare storage area, and $x_q$ is the number of requests to the q-th spare storage area. Then obtain every data storage cost that is below the preset node data storage cost threshold; the spare storage areas corresponding to these costs form the available storage area set S.
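As a worked illustration of the cost formula and the threshold filter that builds set S, consider the sketch below. The `Area` class and its field names are hypothetical; only the formula itself comes from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Area:
    name: str
    n: float              # required data storage capacity, n_q
    x: int                # number of requests, x_q
    unit_storage: float   # consum'_q
    unit_transfer: float  # consum''_q
    unit_request: float   # consum'''_q


def storage_cost(a: Area) -> float:
    """consum_q = n_q * (consum'_q + consum''_q) + x_q * consum'''_q"""
    return a.n * (a.unit_storage + a.unit_transfer) + a.x * a.unit_request


def available_areas(areas: List[Area], cost_threshold: float) -> List[Area]:
    """Build set S: spare areas whose storage cost is below the threshold."""
    return [a for a in areas if storage_cost(a) < cost_threshold]
```

For example, an area with n = 10, x = 5, unit costs 0.5, 0.25 and 2.0 has cost 10 × 0.75 + 5 × 2.0 = 17.5, so it enters S for any threshold above 17.5.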
Take a node at random from set S and delete it from the available storage area set S; obtain the storage data size of that node, initialize m to the preset migration size, and proceed to step D.
If m is no greater than the storage data size of the available storage area, obtain or update the search time $t_1$ that would result, hypothetically, from moving data of size m from the highest-priority data set in the node to that available storage area.
In step D, the highest-priority data set in the node is obtained as follows. First, for each data set in the node, obtain the search q of the data set, the number of searches c on the data set, the improvement j in intermediate data transmission delay brought by the data set, the reduction t' in maximum search time brought by moving the data set, and the cost consum of moving the data set. Then, for each data set in the node, obtain the value of the data set, take the ratio of that value to the cost of moving the data set as its score, and rank the data sets from highest to lowest priority in descending order of score.
Obtain or update the search time $t_2$ that would result from the hypothetical data movement, and judge whether $t_2$ is less than $t_1$. If so, update m by the preset migration data increment; otherwise, take $t_1$ as the minimum search time for migrating data from the node to that available storage area, and record the migration data size corresponding to that minimum search time.
If m is greater than the storage data size of the available storage area, judge whether any node remains in the available storage area set S. If one does, the method is executed iteratively. If none does, then among the minimum search times obtained, find the smallest one and take its available storage area and migration data size: the available storage area becomes the target node and the migration data size becomes the target migration data size, and data of the target migration data size is moved from the highest-priority data set in the node to the target node.
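The ranking step can be sketched as a sort on the value-to-cost ratio. The text does not state how a data set's value is computed from q, c, j and t', so the sketch takes the value as a precomputed input; the tuple layout is an assumption.

```python
from typing import List, Tuple


def rank_datasets(datasets: List[Tuple[str, float, float]]) -> List[str]:
    """Rank data sets by score = value / move_cost, highest first.

    Each entry is (name, value, move_cost). How value is derived from the
    measured quantities is left to the caller.
    """
    for name, _value, cost in datasets:
        if cost <= 0:
            raise ValueError(f"move cost of {name!r} must be positive")
    ranked = sorted(datasets, key=lambda d: d[1] / d[2], reverse=True)
    return [name for name, _value, _cost in ranked]
```

The first element of the returned list is the highest-priority data set, the one from which data is moved during migration.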
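The migration-size search above can be read as a hill climb per available storage area, followed by picking the area with the smallest resulting search time. The sketch below follows that reading. The `search_time` estimator is a hypothetical callable supplied by the caller, and the areas are visited in dictionary order rather than at random, which does not change the result.

```python
from typing import Callable, Dict, Optional, Tuple


def best_migration(capacities: Dict[str, float],
                   search_time: Callable[[str, float], float],
                   initial: float,
                   step: float) -> Optional[Tuple[str, float, float]]:
    """Return (target area, migration size, search time), or None.

    For each area, m grows by `step` while the estimated search time keeps
    falling and m still fits in the area's storage data size.
    """
    best: Optional[Tuple[str, float, float]] = None
    for area, capacity in capacities.items():
        m = initial
        if m > capacity:
            continue  # nothing of the initial size fits in this area
        t1 = search_time(area, m)
        while m + step <= capacity:
            t2 = search_time(area, m + step)
            if t2 >= t1:
                break  # no further improvement: t1 is this area's minimum
            m, t1 = m + step, t2
        if best is None or t1 < best[2]:
            best = (area, m, t1)
    return best
```

Because each area's loop stops at the first non-improving step, the sketch finds a local minimum of the search time in m, which matches the stopping rule in the text.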
The data scheduling strategy above stores data on nearby nodes, which reduces bandwidth consumption and the search response delay caused by data transmission.
During content search in the search engine, the filtering items with a high recent search frequency are counted. These include both filtering items that maintain a high search frequency and filtering items searched heavily within a recent short period. A search frequency FRQ is defined for each filtering item to distinguish how often different filtering items are searched. For a filtering item a, the search frequency FRQ of a is calculated as follows:
From the search records, obtain the number of times filtering item a was searched on day t, where t is the number of days between the record time and the start of the valid recording period;
Set the recent-record ratio $q_{rec}$. For the search records $D_t$ of day t: when $t \ge q_{rec} \times T$, the records in $D_t$ are treated as recent records; when $t < q_{rec} \times T$, they are treated as historical records. For historical records, set a search count upper threshold $Max_{his}$; when the search count exceeds $Max_{his}$, the search count $Count_{std}(t)$ takes the value $Max_{his}$. For recent records, set a search count upper threshold $Max_{rec}$; when the search count exceeds $Max_{rec}$, $Count_{std}(t)$ takes the value $Max_{rec}$. To capture frequent items searched heavily within a recent short period, $Max_{rec} > Max_{his}$ should hold.
Weight $Count_{std}(t)$ using the weighting function W(t), which satisfies the following condition:
where b is an adjustment parameter greater than 1.
Compute the search frequency of filtering item a:
$$FRQ = \sum_{t} W(t) \times Count_{std}(t)$$
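The FRQ computation can be sketched as a clipped, weighted sum. Because the exact form of W(t) is not reproduced in this text, the sketch takes the weighting function as a caller-supplied parameter; the argument names are assumptions.

```python
from typing import Callable, Dict


def search_frequency(counts: Dict[int, int],
                     total_days: int,
                     q_rec: float,
                     max_his: int,
                     max_rec: int,
                     weight: Callable[[int], float]) -> float:
    """FRQ = sum over t of W(t) * Count_std(t).

    counts maps day t to the raw search count of one filtering item.
    Days with t >= q_rec * total_days are recent and clipped at max_rec;
    earlier days are historical and clipped at max_his.
    """
    frq = 0.0
    for t, count in counts.items():
        cap = max_rec if t >= q_rec * total_days else max_his
        frq += weight(t) * min(count, cap)
    return frq
```

With a constant weight of 1, counts of 50 on day 1 and 500 on day 9 of a 10-day window, a ratio of 0.8 and caps of 20 and 100, the historical day contributes 20 and the recent day contributes 100, giving FRQ = 120.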
Based on the computed FRQ values, a level-wise iterative search is used. First, the set of frequent 1-itemsets is found according to the search frequency of the filtering items; this set is denoted A1. A1 is used to find the set A2 of frequent 2-itemsets, and so on, finding the k-itemsets Ak by iteration until no larger frequent itemset exists. Finding each Ak requires one scan of the entire web database. Every non-empty subset of any frequent itemset must itself be frequent.
The returned set of frequent attribute groups contains filtering item sets of different lengths. The data corresponding to the filtering item combinations of different lengths that meet the definition of frequent is cached in memory, and a tree index is built on the more frequently searched result attributes to handle the case where only some attributes hit in the cache. When a user performs a search, if the searched data block is in memory, the search runs directly in memory and returns the result.
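The level-wise search described above has the shape of the classic Apriori algorithm. The sketch below uses plain support counts over a list of searches as the frequency measure, which is a simplification: the text ranks filtering items by FRQ, and how FRQ extends to multi-item sets is not spelled out here.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set


def apriori(transactions: Iterable[Iterable[str]],
            min_support: int) -> List[Set[FrozenSet[str]]]:
    """Return [A1, A2, ...], the frequent k-itemsets for k = 1, 2, ..."""
    rows = [frozenset(t) for t in transactions]

    def support(itemset: FrozenSet[str]) -> int:
        return sum(1 for row in rows if itemset <= row)  # one scan of the data

    items = {item for row in rows for item in row}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    levels: List[Set[FrozenSet[str]]] = []
    k = 1
    while current:
        levels.append(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return levels
```

The prune step is where the subset property is used: a candidate with any infrequent subset is discarded before the database is scanned for it.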
When real data is searched, there are three different hit cases.
(1) Miss: none of the filtering items in the search are in the memory cache. The data blocks are loaded from disk again to perform the search.
(2) Partial hit: only some of the filtering items in the search are in the memory cache. The attribute data already in memory is used to filter the records in the database, reducing the amount of data that must be searched further in the database. The index values of the corresponding result attributes are provided to speed up the data search.
(3) Full hit: all of the filtering items in the search are in the memory cache. The search result is returned directly.
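The three hit cases reduce to a set comparison between the filtering items of a search and the items held in the cache, as in this sketch. The function name and the returned labels are illustrative.

```python
from typing import Iterable, Set


def classify_hit(filter_items: Iterable[str], cached: Set[str]) -> str:
    """Classify a search as "miss", "partial" or "full" against the cache."""
    wanted = set(filter_items)
    hits = wanted & cached
    if not hits:
        return "miss"     # nothing cached: load data blocks from disk
    if hits == wanted:
        return "full"     # everything cached: answer from memory
    return "partial"      # pre-filter with cached attributes, then search
```

A caller would dispatch on the returned label: disk search for a miss, cache-assisted filtering followed by a narrower database search for a partial hit, and a direct in-memory answer for a full hit.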
The cache update operation should be carried out during the period when node access load is lowest. The update process includes:
(1) Disable the in-memory cache data. Any data search request that arrives now searches the underlying stored data blocks directly, and the cache attribute hit rate is 0.
(2) Obtain the frequently searched data blocks and the set of frequently searched attribute groups.
(3) Lock the frequently searched data blocks to restrict modification of the data in those blocks.
(4) Compare the frequently searched data to be cached with the data already cached in memory. Retain the cached data that is identical; for cached data that differs, remove the old high-frequency data and load the new high-frequency data.
(5) Unlock the frequently searched data blocks to restore permission to modify the data in those blocks.
(6) Update the index structure of the cached data and enable the in-memory cache data.
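The update steps above can be sketched as a small cache class. This is an illustration under assumptions: a single `threading.Lock` stands in for the data block locks, the `load` callable is a hypothetical loader for new high-frequency data, and rebuilding the index structure is left out.

```python
import threading
from typing import Callable, Dict, Set


class FilterCache:
    """In-memory cache that is refreshed by diffing old and new key sets."""

    def __init__(self) -> None:
        self.enabled = True
        self.data: Dict[str, object] = {}
        self._lock = threading.Lock()

    def refresh(self, new_keys: Set[str], load: Callable[[str], object]) -> None:
        self.enabled = False                        # step 1: disable the cache
        with self._lock:                            # steps 3 and 5: lock, then unlock
            for key in set(self.data) - new_keys:   # step 4: drop old entries
                del self.data[key]
            for key in new_keys - set(self.data):   # step 4: load new entries
                self.data[key] = load(key)
        self.enabled = True                         # step 6: enable the cache
```

Entries present in both the old and new key sets are never reloaded, which is the point of the comparison in step (4): only the difference costs I/O.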
In conclusion the present invention proposes a kind of distributed computing method for data mining, data search is optimized Engine technique reduces the update cost of data while ensureing that search engine searches for multidimensional data high-performance, efficiently real Multi-dimensional search is showed.
Obviously, it should be appreciated by those skilled in the art, each module of the above invention or each steps can be with general Computing system realize that they can be concentrated in single computing system, or be distributed in multiple computing systems and formed Network on, optionally, they can be realized with the program code that computing system can perform, it is thus possible to they are stored It is executed within the storage system by computing system.In this way, the present invention is not limited to any specific hardware and softwares to combine.
It should be understood that the above-mentioned specific implementation mode of the present invention is used only for exemplary illustration or explains the present invention's Principle, but not to limit the present invention.Therefore, that is done without departing from the spirit and scope of the present invention is any Modification, equivalent replacement, improvement etc., should all be included in the protection scope of the present invention.

Claims (3)

1. A distributed computing method for data mining, for realizing multidimensional data search in a search engine, characterized by comprising:
establishing an index structure according to the association between homepage data and secondary page data; and obtaining the set of hit homepage IDs corresponding to the tokenized filtering items, and the set of hit secondary page IDs corresponding to the hit homepage IDs.
2. The method according to claim 1, characterized in that establishing the index structure according to the association between homepage data and secondary page data further comprises: establishing a homepage inverted index table and a secondary page inverted index table according to the association between homepage data and secondary page data;
associating the homepage data with the secondary page data by recording the storage location of the associated secondary page in the homepage inverted index table, and recording the storage location of the associated homepage in the secondary page inverted index table;
the homepage inverted index table includes: a subject term index, and a homepage record set corresponding to the subject term index;
wherein a homepage record stores the page ID of the target homepage that contains the subject term index, and the secondary page information associated with the target homepage;
the secondary page inverted index table includes: a secondary term index, and a secondary page record set corresponding to the secondary term index;
wherein a secondary page record stores the page ID of the target secondary page that contains the secondary term index, and the homepage information associated with the target secondary page ID.
3. The method according to claim 1, characterized by further comprising:
storing the crawled homepage data and secondary page data in different Hadoop storage nodes;
and, in the Hadoop storage nodes, different pages correspond to different page IDs.
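The two linked inverted index tables described in the claims can be illustrated with the following sketch. It is not the claimed implementation: page IDs stand in for storage locations, and the input layout (homepage ID to terms, secondary page ID to parent homepage and terms) is an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_indexes(homepages: Dict[str, List[str]],
                  secondary: Dict[str, Tuple[str, List[str]]]):
    """Build the homepage and secondary page inverted index tables.

    homepages: homepage ID -> subject terms.
    secondary: secondary page ID -> (parent homepage ID, secondary terms).
    """
    children: Dict[str, Set[str]] = defaultdict(set)
    for sid, (parent, _terms) in secondary.items():
        children[parent].add(sid)

    # term -> {homepage ID: IDs of its associated secondary pages}
    home_index: Dict[str, Dict[str, Set[str]]] = defaultdict(dict)
    for hid, terms in homepages.items():
        for term in terms:
            home_index[term][hid] = set(children[hid])

    # term -> {secondary page ID: ID of its associated homepage}
    sec_index: Dict[str, Dict[str, str]] = defaultdict(dict)
    for sid, (parent, terms) in secondary.items():
        for term in terms:
            sec_index[term][sid] = parent
    return home_index, sec_index


def lookup(home_index, term: str) -> Tuple[Set[str], Set[str]]:
    """Return (hit homepage IDs, hit secondary page IDs) for one term."""
    hits = home_index.get(term, {})
    return set(hits), set().union(*hits.values())
```

A lookup on a subject term returns the hit homepage ID set together with the secondary page ID set reachable from those homepages, which mirrors the two sets named in claim 1.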

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810226620.5A CN108388669A (en) 2018-03-19 2018-03-19 Distributed computing method for data mining

Publications (1)

Publication Number Publication Date
CN108388669A true CN108388669A (en) 2018-08-10

Family

ID=63067202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810226620.5A Withdrawn CN108388669A (en) 2018-03-19 2018-03-19 Distributed computing method for data mining

Country Status (1)

Country Link
CN (1) CN108388669A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20180810)