CN108388669A - Distributed computing method for data mining - Google Patents
Distributed computing method for data mining
- Publication number
- CN108388669A, CN201810226620.5A
- Authority
- CN
- China
- Prior art keywords
- data
- homepage
- page
- secondary page
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2264—Multidimensional index structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2452—Query translation
- G06F16/24522—Translation of natural language queries to structured queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a distributed computing method for data mining. The method comprises: establishing an index structure according to the association between homepage data and secondary page data; and obtaining the set of hit homepage IDs corresponding to the token filter items, together with the set of hit secondary page IDs corresponding to the hit homepage IDs. The proposed method optimizes search engine technology: it preserves high-performance search over multidimensional data while reducing the cost of updating data, so that multidimensional search is implemented efficiently.
Description
Technical field
The present invention relates to big data search, and in particular to a distributed computing method for data mining.
Background
Big data is changing enormously. Customer data, transaction data, social media data, network behavior data, and similar data all carry business information of great value that shapes the future and development of an enterprise. Demand for real-time search over big data keeps rising. However, the open-source real-time search engines currently available for big data environments carry application risks because of their performance, stability, and limited accumulated experience, and their time and space costs are too high when they perform high-performance search over multidimensional data.
Summary of the invention
To solve the problems of the prior art described above, the present invention proposes a distributed computing method for data mining, comprising:
Establishing an index structure according to the association between homepage data and secondary page data; and obtaining the set of hit homepage IDs corresponding to the token filter items, together with the set of hit secondary page IDs corresponding to the hit homepage IDs.
Preferably, establishing the index structure according to the association between homepage data and secondary page data further comprises: establishing a homepage inverted index table and a secondary page inverted index table according to that association;
Associating the homepage data with the secondary page data by recording the storage location of each associated secondary page in the homepage inverted index table and the storage location of each associated homepage in the secondary page inverted index table;
The homepage inverted index table comprises: primary term indexes, and the homepage record set corresponding to each primary term index;
Each homepage record stores the page ID of a target homepage that contains the primary term, and information on the secondary pages associated with that target homepage;
The secondary page inverted index table comprises: secondary term indexes, and the secondary page record set corresponding to each secondary term index;
Each secondary page record stores the page ID of a target secondary page that contains the secondary term, and information on the homepage associated with that target secondary page ID.
Preferably, the crawled homepage data and secondary page data are stored in different Hadoop storage nodes;
Within the Hadoop storage nodes, different pages correspond to different page IDs.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a distributed computing method for data mining that optimizes search engine technology: it preserves high-performance search over multidimensional data while reducing the cost of updating data, so that multidimensional search is implemented efficiently.
Description of the drawings
Fig. 1 is a flowchart of the distributed computing method for data mining according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principle of the invention. The invention is described in connection with these embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a distributed computing method for data mining. Fig. 1 is a flowchart of the distributed computing method for data mining according to an embodiment of the present invention. The invention applies to establishing an index structure for multidimensional search, and specifically includes the following:
The homepage data and secondary page data crawled by the crawler module are stored in different Hadoop storage nodes. Optionally, to improve search performance, the secondary pages belonging to the same homepage may be stored contiguously, in consecutive physical blocks of a Hadoop storage node.
According to the association between homepage data and secondary page data, a homepage inverted index table and a secondary page inverted index table are established for multidimensional search.
The homepage inverted index table records the storage locations of the secondary pages associated with each homepage, and the secondary page inverted index table records the storage location of the homepage associated with each secondary page. Because each table records the storage locations of the associated pages in the other, homepage data and secondary page data can still be associated quickly even though they are stored separately.
The homepage inverted index table may include: primary term indexes, and the homepage record set corresponding to each primary term index, where each homepage record stores the page ID of a target homepage that contains the primary term, together with information on the secondary pages associated with that target homepage.
The secondary page inverted index table includes: secondary term indexes, and the secondary page record set corresponding to each secondary term index, where each secondary page record stores the page ID of a target secondary page that contains the secondary term, together with information on the homepage associated with that target secondary page ID. Within the Hadoop storage nodes, different pages correspond to different page IDs.
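The two tables described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the toy corpus `PAGES`, the record field names, and the use of plain page IDs in place of physical storage locations are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy corpus: homepage ID -> (homepage terms, {secondary page ID: terms}).
PAGES = {
    "H1": ({"hadoop", "search"}, {"S11": {"index", "block"}, "S12": {"query"}}),
    "H2": ({"search"}, {"S21": {"index"}}),
}

def build_indexes(pages):
    """Build both inverted index tables, each cross-referencing the other side."""
    homepage_index = defaultdict(list)   # primary term -> homepage records
    secondary_index = defaultdict(list)  # secondary term -> secondary page records
    for home_id, (home_terms, children) in pages.items():
        for term in sorted(home_terms):
            # Assumption: secondary page IDs stand in for storage locations.
            homepage_index[term].append(
                {"page_id": home_id, "secondary_ids": sorted(children)}
            )
        for sec_id, sec_terms in children.items():
            for term in sorted(sec_terms):
                secondary_index[term].append(
                    {"page_id": sec_id, "homepage_id": home_id}
                )
    return dict(homepage_index), dict(secondary_index)

homepage_index, secondary_index = build_indexes(PAGES)
```

Looking up `homepage_index["search"]` returns one record per homepage containing that term, and each record already names the associated secondary pages, which is what allows the two separately stored data sets to be joined without a scan.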
During an actual search, the search engine parses the multidimensional search request entered by the user and obtains the token filter items corresponding to that request. The token filter items include primary token filter items and/or secondary token filter items.
The index structure is then searched according to the token filter items to obtain the set of hit homepage IDs corresponding to the token filter items, together with the set of hit secondary page IDs corresponding to the hit homepage IDs.
Each primary or secondary token filter item corresponds to one or more term attributes, and each term attribute corresponds to a term index in the homepage inverted index table or the secondary page inverted index table of the index structure.
In the MapReduce framework of the search engine, the corresponding homepage ID set is first determined from the primary token filter items. The corresponding mapping document set is then determined from the secondary token filter items; in the mapping document set, each Key is a homepage ID and its Value is the set of secondary page IDs corresponding to that homepage ID. Once the homepage ID set and the mapping document set have been obtained, intersecting the homepage ID set with the Keys of the mapping document set yields the set of hit homepage IDs that finally satisfy the conditions. The Values that the mapping document set holds for those hit homepage IDs then give the set of hit secondary page IDs corresponding to the hit homepage IDs.
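The intersection step in this paragraph can be sketched as follows. The function name `resolve_hits` and the sample IDs are hypothetical, and the sketch runs in a single process rather than inside a MapReduce job.

```python
def resolve_hits(homepage_ids, mapping_docs):
    """Intersect the homepage ID set with the Keys of the mapping document set.

    mapping_docs maps a homepage ID (Key) to its set of secondary page IDs (Value).
    """
    hit_homepages = set(homepage_ids) & set(mapping_docs)
    # The Values for the surviving Keys are the hit secondary page IDs.
    hit_secondary = {h: set(mapping_docs[h]) for h in hit_homepages}
    return hit_homepages, hit_secondary

homepage_ids = {"H1", "H2", "H3"}
mapping_docs = {"H2": {"S21"}, "H3": {"S31", "S32"}, "H4": {"S41"}}
hits, secondary = resolve_hits(homepage_ids, mapping_docs)
```

Here `H1` is dropped because no secondary page matched, and `H4` is dropped because its homepage did not match the primary filter.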
The hit homepages corresponding to the hit homepage ID set are sorted according to a predefined ordering rule, and the sorted result is displayed together with the hit secondary pages corresponding to the hit secondary page ID set.
In addition to the hit homepages and hit secondary pages, other display attributes corresponding to a hit homepage or hit secondary page, such as access count and user rating, may be displayed at the same time so that the user can understand the search results more intuitively.
Searching the index structure according to the token filter items to obtain the hit homepage ID set and the corresponding hit secondary page ID set specifically comprises:
Searching the homepage inverted index table in the index structure according to the primary search term attribute corresponding to the primary token filter item, to obtain the first homepage ID set corresponding to the primary token filter item; searching the secondary page inverted index table in the index structure according to the secondary search term attribute corresponding to the secondary token filter item, to obtain the first target mapping document set corresponding to the secondary token filter item; and determining, from the first homepage ID set and the first target mapping document set, the hit homepage ID set and the hit secondary page ID set corresponding to the hit homepage IDs.
Optionally, obtaining the first homepage ID set corresponding to the primary token filter items may include:
Searching the homepage inverted index table in the index structure according to the primary search term attributes corresponding to at least two primary token filter items, to obtain the homepage ID set corresponding to each primary token filter item;
Intersecting the at least two homepage ID sets obtained, to obtain the first homepage ID set corresponding to the primary token filter items.
Likewise, obtaining the first target mapping document set corresponding to the secondary token filter items may include:
Searching the secondary page inverted index table in the index structure according to the secondary search term attributes corresponding to at least two secondary token filter items, to obtain at least two candidate mapping document sets; intersecting the Keys contained in the at least two candidate mapping document sets to obtain the target Keys; for each target Key, intersecting the Values that the at least two candidate mapping document sets hold for that Key to obtain the target Value; and generating the first target mapping document set from the target Keys and target Values.
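The two-stage intersection (Keys first, then the Values held for each surviving Key) can be sketched as below. The function name `intersect_mapping_docs` and the sample data are hypothetical.

```python
from functools import reduce

def intersect_mapping_docs(candidates):
    """Intersect candidate mapping document sets: Keys first, then Values per Key.

    Each candidate maps a homepage ID (Key) to a set of secondary page IDs (Value).
    """
    if not candidates:
        return {}
    target_keys = reduce(set.intersection, (set(c) for c in candidates))
    return {
        key: reduce(set.intersection, (set(c[key]) for c in candidates))
        for key in target_keys
    }

a = {"H1": {"S11", "S12"}, "H2": {"S21"}}
b = {"H1": {"S12", "S13"}, "H3": {"S31"}}
target = intersect_mapping_docs([a, b])
```

A Key whose Values have an empty intersection is kept with an empty set in this sketch; the text does not say whether such Keys should be dropped, so that choice is left open.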
At least one sort attribute is determined according to the ordering rule, and the value of each sort attribute for each hit homepage ID is obtained from a pre-stored page ID search attribute mapping table.
From the sort attribute values corresponding to each hit homepage ID, a sort score is calculated for each hit homepage, and the hit homepages are sorted according to their sort scores. According to the display parameters, the hit homepages to be displayed on the current page are determined, and the hit secondary pages to be displayed that correspond to those homepages are obtained from the first target mapping document set.
The display parameters may be the size of the display screen and the size of the display font. From these display parameters, the number of hit homepages that can be shown on each display page is determined; then, from the total number of hit homepage IDs and the current page ID, the hit homepages and hit secondary pages to be displayed on the current page are determined.
Search display items are constructed from the hit homepages and hit secondary pages to be displayed, and each search display item is shown on the current page.
The index structure is further described as follows. Each homepage data item and its associated secondary page data are indexed as one individual page record. In the homepage inverted index table, each inverted index record that a term points to stores the page ID of the homepage data, together with the start and offset of the page IDs of its secondary page data. Secondary page data belonging to the same homepage data must be stored contiguously so as to form one logical block. Each record in a secondary page inverted index table stores the page ID of the record that the term points to and the page ID of the homepage data to which it belongs.
Specifically, in the search process of the search engine, the sort attributes of the homepage data and the sort attributes of the secondary page data are defined first. The search request entered by the user is then parsed to form the filter items for the primary data and the secondary data, along with the page number PN in the results and the per-page data size PS. The search result must also return the secondary page ID sets. According to the filter items, the following search process is carried out:
For each filter item $i$ of the homepage data, the homepage inverted index table corresponding to the filter item is located and searched using the term attribute of that filter item in the search conditions, which yields the homepage ID set $U_i$ corresponding to the term attribute. With $N$ search conditions there are $N$ page ID sets, $U_i \in U$, $i \in [1, N]$, where $U$ is the homepage ID set;
For each filter item $j$ of the secondary data, the secondary page inverted index table corresponding to the filter item is located and searched using the term index of that filter item in the search conditions, which yields the secondary page ID set $L_j$ corresponding to the term index. From $L_j$, the mapping document set $Pmap_j$ is built, keyed by the page ID of the homepage data to which each secondary page belongs: the homepage ID is the Key and the set of secondary page IDs is the Value. If there are $M$ search conditions, the $M$ sets $Pmap_j$ are intersected: first the Key sets are intersected, then the Values corresponding to each remaining Key are intersected, which yields the final mapping document set $Pmap^*$;
The homepage ID set $U$ and the homepage IDs (Keys) of $Pmap^*$ are merged by intersection to obtain the final homepage ID set $R$;
While the data in set $R$ is being generated, for each homepage added to $R$, its page ID is used to look up the page ID search attribute mapping table and obtain each sort attribute required by the homepage data sorting formula; the formula is then evaluated to obtain a score, which is stored in the corresponding record in $R$;
Set $R$ is sorted in descending order of the scores obtained, and the results $R^*$ between PN*PS and (PN+1)*PS are taken. For each homepage ID in each result record RC contained in that range, the following is performed:
A. $Pmap^*$ is searched to obtain the set of secondary page IDs of the homepage ID under the filter items, which is set as one display item in RC;
B. According to the display attributes that the search request asks to be returned, the page ID search attribute mapping table is searched to fill in the attribute values of the display attributes, which are set as one display item in RC;
The result data $R^*$ is returned to the front end for display.
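The scoring, descending sort, and PN*PS window described in this procedure can be sketched as below. The attribute names, the scoring formula in `score_fn`, and the record layout are hypothetical, since the text does not give the sorting formula; PN is treated as zero-based, which is what the window PN*PS to (PN+1)*PS implies.

```python
def rank_and_page(hit_ids, attrs, pmap, score_fn, pn, ps):
    """Score each hit homepage, sort in descending order, return result page pn."""
    scored = [(score_fn(attrs[h]), h) for h in hit_ids]
    scored.sort(key=lambda t: (-t[0], t[1]))  # descending score, ID breaks ties
    window = scored[pn * ps:(pn + 1) * ps]
    return [
        {
            "homepage_id": h,
            "score": s,
            "secondary_ids": sorted(pmap.get(h, ())),  # step A: IDs from Pmap*
            "display": dict(attrs[h]),                 # step B: display attributes
        }
        for s, h in window
    ]

# Hypothetical attribute mapping table and scoring formula.
attrs = {
    "H1": {"visits": 10, "rating": 4.0},
    "H2": {"visits": 50, "rating": 3.0},
    "H3": {"visits": 5, "rating": 5.0},
}
pmap = {"H1": {"S11"}, "H2": {"S21", "S22"}, "H3": {"S31"}}
score_fn = lambda a: a["visits"] + 10 * a["rating"]

page0 = rank_and_page({"H1", "H2", "H3"}, attrs, pmap, score_fn, pn=0, ps=2)
page1 = rank_and_page({"H1", "H2", "H3"}, attrs, pmap, score_fn, pn=1, ps=2)
```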
Preferably, in the search over the index described above, the search engine further performs the following steps:
Monitoring the crawled data, converting the data format of the crawled data into a fixed-delimiter format, obtaining the crawled data in the Hadoop framework, and generating index information for the crawled data;
Encoding the crawled data according to a preset encoding scheme to generate encoded blocks, and generating the corresponding inverted index, where the inverted index contains the original offset and the offset after encoding;
Writing the index information to a data warehouse according to the data warehouse format, where the data warehouse comprises an index table header, index blocks, a range search block, and an index table tail. Specifically, the index table header information is written first; the indexes are then written in a loop while the file information, range search information, and index table tail information are recorded. After all indexes have been written, the important file offsets, file information, range search information, and index table tail are recorded at the end of the file.
After an index search request is received, parsing the search request, calling the data warehouse, and reading the information in the range search block to find the index information of the crawled data;
According to the index information and the inverted index, reading the data warehouse as follows: the index table header and index table tail are read first to obtain the relevant parameters and other information for all indexes; the range search information is read next to locate the index block information; and the index block information is read last. Comparison starts during this read and finds the block that contains the index; the search thereby obtains the position of the crawled data in the encoded blocks, and the crawled data is retrieved according to that position.
Encoding the crawled data according to the preset encoding scheme and generating the corresponding inverted index specifically comprises:
1. MapReduce monitors the directory and encodes the crawled data to obtain encoded file blocks, each block containing the size of the crawled data and its size after encoding.
2. The index of the encoded blocks is generated and recorded in a first data warehouse.
3. A second data warehouse is set up to record the original offset and the offset after encoding, and the second data warehouse is defined as the inverted index.
Specifically, the first data warehouse records the sizes of the current encoded block before and after encoding, while the second data warehouse records offsets relative to the start of the file, which are cumulative values.
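The relation between the per-block sizes (first data warehouse) and the cumulative offsets (second data warehouse) can be sketched as below. The use of `zlib` as the "preset encoding scheme", the function names, and the in-memory lists standing in for the two data warehouses are assumptions for illustration only.

```python
import zlib

def encode_blocks(data: bytes, block_size: int):
    """Encode data block by block, recording sizes and cumulative offsets."""
    sizes = []    # first "data warehouse": (original size, encoded size) per block
    offsets = []  # second "data warehouse": cumulative (original, encoded) offsets
    encoded = []
    orig_off = enc_off = 0
    for start in range(0, len(data), block_size):
        raw = data[start:start + block_size]
        enc = zlib.compress(raw)  # assumption: zlib stands in for the preset encoding
        sizes.append((len(raw), len(enc)))
        offsets.append((orig_off, enc_off))
        encoded.append(enc)
        orig_off += len(raw)
        enc_off += len(enc)
    return b"".join(encoded), sizes, offsets

def read_block(blob: bytes, sizes, offsets, i: int) -> bytes:
    """Decode block i using its encoded offset and encoded size."""
    enc_off = offsets[i][1]
    return zlib.decompress(blob[enc_off:enc_off + sizes[i][1]])

data = b"abc" * 1000
blob, sizes, offsets = encode_blocks(data, 1024)
```

Because the offsets are running totals measured from the start of the file, any block can be decoded on its own without reading the blocks before it.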
After the inverted index is generated, MapReduce merges the inverted indexes by index type; the sorting phase of MapReduce can be used at the same time to sort the data, and index data that has been merged and sorted is better suited to block indexing.
Writing the index information to the data warehouse specifically comprises:
1. Writing the index table header information of the data warehouse, where the index table header information is the basic information of the data warehouse and includes the version number, block size, index type, and index name.
2. Writing the index values and location information of the index information to the index blocks in the data warehouse. Each single index contains an index value, a location count, a record count, and the individual location entries; each location entry contains a file ID and an offset.
3. Updating the file index block information and the range search block information.
4. Writing the number of index entries written, the range search block offset, and the file index block offset to the index table tail of the data warehouse.
Parsing the search request, calling the data warehouse, and reading the information in the range search block to find the index information of the crawled data specifically comprises:
1. After the index search request is received, the request is parsed to obtain the search index.
2. The index table header and index table tail of the data warehouse are read to obtain the range search block offset and the file index block offset of the data warehouse. Specifically, the relevant parameters and information for all indexes are obtained from the index table tail and index table header.
3. According to the range search block offset and the index value of the search index, the index block that contains the search index is located.
4. The index information corresponding to the search request is found in that index block.
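The lookup path in the steps above (range search block first, then a single index block) can be sketched with an in-memory stand-in for the data warehouse. The dictionary layout, the field names, and the choice of storing the first index value of each block in the range search block are assumptions; the text does not specify the on-disk format.

```python
from bisect import bisect_right

def build_warehouse(entries, block_size):
    """entries: list of (index_value, [(file_id, offset), ...]) sorted by index_value."""
    blocks = [entries[i:i + block_size] for i in range(0, len(entries), block_size)]
    return {
        "header": {"version": 1, "block_size": block_size},
        "blocks": blocks,
        # Assumption: the range search block keeps the first index value of each block.
        "range_search": [block[0][0] for block in blocks],
        "tail": {"index_count": len(entries), "block_count": len(blocks)},
    }

def lookup(warehouse, index_value):
    """Locate the candidate block via the range search block, then scan that block."""
    pos = bisect_right(warehouse["range_search"], index_value) - 1
    if pos < 0:
        return None  # smaller than every indexed value
    for value, locations in warehouse["blocks"][pos]:
        if value == index_value:
            return locations
    return None

entries = [
    ("apple", [(1, 0)]),
    ("banana", [(1, 40)]),
    ("cherry", [(2, 0)]),
    ("date", [(2, 16)]),
    ("fig", [(3, 8)]),
]
warehouse = build_warehouse(entries, block_size=2)
```

Only one index block is scanned per lookup, which is why the index data is sorted before it is split into blocks.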
From the current secondary pages, pages to be collected are extracted at random, up to the maximum number of pages to crawl. The page score of each secondary page under the current homepage is calculated with a preset evaluation function, and each secondary page whose page score exceeds a preset page score threshold is selected as a candidate page. For each such secondary page, its degree of association with the current keyword is obtained as follows: the text of the secondary page is tokenized to build the token set corresponding to the page text; and, from the tokens in that token set, the degree of association between the secondary page and the current keyword is calculated using a Bayes algorithm.
The current keyword and the collected page set related to the current keyword are used as training samples to obtain the page similarity crawling process corresponding to the current keyword, and pages related to the current keyword are searched for through this page similarity crawling process.
The pages whose degree of association with the current keyword exceeds a preset association threshold are used to build the collected page set related to the current keyword.
The weight of each token in the token set corresponding to the page text is calculated, and the token set is reduced in dimension according to these weights, updating the token set corresponding to the page text. The similarity between the tokens in the token set is calculated using a vector space model algorithm. Then, according to the similarity between tokens, a text clustering algorithm merges the tokens whose mutual similarity exceeds a preset similarity threshold, and the token set corresponding to the page text is updated.
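As a rough illustration of merging tokens by similarity, the sketch below uses cosine similarity over sparse token vectors and a greedy single-pass grouping. Both choices are assumptions: the text names only "a vector space model algorithm" and "a text clustering algorithm" without specifying them, and the sample vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

def merge_similar(vectors, threshold):
    """Greedy pass: attach each token to the first group whose representative
    (the group's first member) is more similar than the threshold."""
    groups = []
    for token in vectors:
        for group in groups:
            if cosine(vectors[token], vectors[group[0]]) > threshold:
                group.append(token)
                break
        else:
            groups.append([token])
    return groups

# Hypothetical token vectors (for example, co-occurrence weights).
vectors = {
    "car": {"road": 1.0, "drive": 1.0},
    "auto": {"road": 1.0, "drive": 0.9},
    "apple": {"fruit": 1.0},
}
groups = merge_similar(vectors, threshold=0.9)
```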
The evaluation function is as follows:
$$F_{ns}(link_i) = f_i^{sim} + f_i^{link} + f_i^{parent} + f_i^{label} + f_i^{relevant/total} + \lambda$$
where $F_{ns}(link_i)$ is the page score of the $i$-th page; $f_i^{sim}$ is the predicted topic relevance of the $i$-th page; $f_i^{link}$ is the link analysis value of the URL of the $i$-th page; $f_i^{parent}$ is the relevance of the homepage of the $i$-th page; $f_i^{label}$ is the label weight of the URL of the $i$-th page; $f_i^{relevant/total}$ is the ratio of the number of pages related to the current keyword to the total number of pages; and $\lambda$ is an adaptively adjusted dynamic value.
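The evaluation function is a plain sum of its components, so it can be sketched directly. The parameter names and the sample values are hypothetical, and how each component and the adaptive term are actually computed is not given in the text.

```python
def page_score(f_sim, f_link, f_parent, f_label, relevant, total, lam=0.0):
    """Sum the components of the evaluation function for one page.

    relevant / total is the ratio of keyword-related pages to all pages;
    lam is the adaptively adjusted term (held constant here as an assumption).
    """
    ratio = relevant / total if total else 0.0
    return f_sim + f_link + f_parent + f_label + ratio + lam

def select_candidates(scores, threshold):
    """Keep the pages whose score exceeds the preset page score threshold."""
    return [page for page, score in scores.items() if score > threshold]

scores = {
    "S11": page_score(0.5, 0.25, 0.25, 0.5, relevant=1, total=4),
    "S12": page_score(0.25, 0.0, 0.25, 0.0, relevant=1, total=4),
}
candidates = select_candidates(scores, threshold=1.0)
```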
Correspondingly, the crawler module of the search engine preferably includes multiple crawlers and at least one aggregator. A web collection process and at least one application are deployed on each crawler, and each aggregator is communicatively connected to at least one crawler. Through the web collection process, the crawler collects in real time the web data generated while the application runs and sends the collected web data to the aggregator with which it has established a communication connection. The aggregator integrates the web data received from the crawlers and stores the integrated web data in the Hadoop framework.
The web collection process runs as a background process that monitors and collects, in real time, the web data generated while applications run on the application server. The aggregator receives and stores the web data sent by the crawlers; the aggregator and the application server that sends the web data are different servers.
In the search engine cloud platform, when a new host is created, a crawler can be deployed on the new host, which enables elastic deployment of hosts in the cloud platform. Before integrating the web data, the aggregator further filters the web data from the multiple crawlers; it identifies users based on the user interactions in the web data; and, for each identified user, it identifies interaction behavior according to the time order of that user's interactions in the corresponding web data. To simplify subsequent processing of the web data, the present invention processes the received web data with MapReduce: for example, data filtering and user identification are performed in the Map phase, and interaction behavior identification is performed in the Reduce phase.
When data filtering is performed on the web data in the Map phase, for each piece of web data received, the aggregator reads every user interaction in that web data; if the aggregator cannot read the current user interaction, that interaction is removed from the received web data.
The filtered web data is converted into <key, value> form. The processing in the Map phase is described as Map(key1, value1) -> list<key2, value2>, where key1 is a numeric identifier that marks the position of the web data; value1 is the user interaction stored at that position; key2 is the user's current keyword; and value2 is the record information after data filtering. For example, if key2 is currently the user name, value2 can represent the specific user name information identified from the web data.
When users are identified, a user whose record in the web data has user name information is determined to be a registered user, and a user who has IP address information in the web data but no user name information is determined to be an unregistered user. Users with identical user name information are the same registered user, and users with identical IP address information are the same unregistered user.
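The Map-phase filtering and user identification can be sketched as a generator that emits (user key, record) pairs. The record fields `username` and `ip`, and the modelling of an unreadable interaction as a non-dict entry, are assumptions made for this illustration; the sketch runs in plain Python rather than on Hadoop.

```python
def identify_user(record):
    """Return (kind, identity) for one record, or None if neither field is present."""
    if record.get("username"):
        return ("registered", record["username"])
    if record.get("ip"):
        return ("unregistered", record["ip"])
    return None

def map_phase(records):
    """Drop unreadable entries, then emit (user key, record) pairs."""
    for record in records:
        if not isinstance(record, dict):  # assumption: unreadable means not a dict
            continue
        key = identify_user(record)
        if key is not None:
            yield key, record

records = [
    {"username": "alice", "ip": "10.0.0.1", "action": "click"},
    None,  # unreadable entry, filtered out
    {"ip": "10.0.0.2", "action": "view"},
    {"username": "alice", "ip": "10.0.0.9", "action": "buy"},
    {"action": "noop"},  # no user name and no IP address
]
pairs = list(map_phase(records))
```

Both `alice` records receive the same key even though their IP addresses differ, matching the rule that identical user names denote the same registered user.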
For each identified user, interaction behavior is identified as follows: the user's interactions are sorted in time order according to the operation times recorded for that user in the corresponding web data; and, among the sorted interactions, at least one interaction that meets a preset condition is determined to be one session. The preset condition is that the time differences between the operation time of an interaction and the operation times of the adjacent preceding interaction and the adjacent following interaction are both greater than a set threshold.
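One common way to turn time-ordered interactions into sessions is gap-based splitting, sketched below. This is an assumed reading rather than the method as stated: an interaction whose gaps to both neighbours exceed the threshold ends up as a session by itself, which agrees with the preset condition, while the grouping of closer interactions into a shared session is an addition of this sketch. The function name and sample timestamps are hypothetical.

```python
def split_sessions(times, threshold):
    """Sort operation times and start a new session whenever the gap to the
    previous interaction exceeds the threshold."""
    sessions = []
    for t in sorted(times):
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

# Operation times in seconds for one user, deliberately out of order.
sessions = split_sessions([100, 0, 10, 500, 20], threshold=30)
```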
Further, the aggregator also divides the integrated web data into multiple web page fragments and extracts the keyword information corresponding to each fragment; stores the fragments in the corresponding data warehouses according to a preset distribution principle; and stores the correspondence between the keyword information of each fragment and the storage location of that fragment. Specifically, each aggregator is communicatively connected to at least one data warehouse, and one or more data warehouses may be deployed in each Hadoop storage area. The distribution principle may be a load balancing principle, that is, fragments are distributed evenly across the data warehouses so that the load on the data warehouses stays balanced. The correspondence may be between the keyword information of a fragment and the identification information of a data warehouse.
When the integrated web data is divided into multiple web page fragments, the division may be based on a set data size; alternatively, on set web page association information; alternatively, on a set data type.
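As a rough illustration of the fragment handling described above, the sketch below splits data by a set size, assigns fragments to data warehouses in round-robin order (one simple way to distribute evenly), and records which warehouse holds each keyword. The helper names, the whitespace-based keyword extraction, and the warehouse IDs are all assumptions for the example, not details from the patent.

```python
from collections import defaultdict
from itertools import cycle

def split_by_size(data: str, size: int) -> list:
    """Divide integrated web data into fragments of at most `size` characters."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def extract_keywords(fragment: str) -> set:
    """Placeholder keyword extraction: lower-cased whitespace tokens."""
    return set(fragment.lower().split())

def distribute(fragments: list, warehouse_ids: list):
    """Assign fragments to warehouses round-robin and index keywords by warehouse ID."""
    placement = {}                     # fragment index -> warehouse ID
    keyword_index = defaultdict(set)   # keyword -> IDs of warehouses holding it
    for idx, (fragment, wid) in enumerate(zip(fragments, cycle(warehouse_ids))):
        placement[idx] = wid
        for kw in extract_keywords(fragment):
            keyword_index[kw].add(wid)
    return placement, keyword_index

fragments = split_by_size("a b c d e f", 4)            # three fragments
placement, keyword_index = distribute(fragments, ["dw1", "dw2"])
```

A lookup for a keyword then only needs to contact the warehouses listed in `keyword_index`, rather than every warehouse.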
After data crawling and collection are completed, for the scheduling of the Hadoop framework storage nodes, the present invention obtains each node of the storage layer for which $D_k < U$ or $U_k < U$, $k \in \{1, \ldots, K\}$, where $K$ denotes the number of nodes in the storage layer, $D_k$ denotes the download speed of the k-th node, $U_k$ denotes the upload speed of the k-th node, and $U$ denotes the preset minimum bandwidth for storage-layer data transmission. For each such node, the following scheduling is performed:
Obtain each node of the storage layer for which $D_k > U'$ and $U_k > U'$ as the spare storage areas, where $U'$ denotes the preset minimum bandwidth of a storage-layer node that is to receive data;
Obtain the data storage cost $consum_q$ of each spare storage area according to the formula:
$$consum_q = n_q \times (consum'_q + consum''_q) + x_q \times consum'''_q$$
where $q \in \{1, \ldots, Q\}$, $Q$ denotes the number of spare storage areas, $consum_q$ denotes the data storage cost of the q-th spare storage area, $consum'_q$ denotes the unit storage cost of data stored in the q-th spare storage area, $consum''_q$ denotes the unit transmission cost of the q-th spare storage area, $consum'''_q$ denotes the data request cost of the q-th spare storage area, $n_q$ denotes the data storage amount required in the q-th spare storage area, and $x_q$ denotes the number of requests to the q-th spare storage area. Then obtain each data storage cost that is below the preset node data storage cost threshold, and build the available storage area set S from the spare storage areas corresponding to those data storage costs.
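The cost formula and the threshold filter that builds the set S can be sketched as follows. The `SpareArea` dataclass, its field names, and the sample numbers are hypothetical; only the arithmetic follows the formula in the text.

```python
from dataclasses import dataclass

@dataclass
class SpareArea:
    name: str
    n: float              # n_q: data storage amount required
    unit_storage: float   # consum'_q: unit storage cost
    unit_transfer: float  # consum''_q: unit transmission cost
    requests: int         # x_q: number of requests
    request_cost: float   # consum'''_q: data request cost

def storage_cost(area: SpareArea) -> float:
    """consum_q = n_q * (consum'_q + consum''_q) + x_q * consum'''_q"""
    return area.n * (area.unit_storage + area.unit_transfer) + area.requests * area.request_cost

def available_set(areas: list, threshold: float) -> list:
    """Keep the spare areas whose storage cost is below the preset threshold (set S)."""
    return [a for a in areas if storage_cost(a) < threshold]

areas = [
    SpareArea("q1", n=10, unit_storage=0.5, unit_transfer=0.25, requests=4, request_cost=1.0),
    SpareArea("q2", n=20, unit_storage=1.0, unit_transfer=0.5, requests=10, request_cost=2.0),
]
S = available_set(areas, threshold=20.0)
```

With these sample values the first area costs 10 × 0.75 + 4 × 1.0 = 11.5 and the second 20 × 1.5 + 10 × 2.0 = 50.0, so only the first enters S.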
Take a node at random from the set S, delete that node from the available storage area set S, obtain the size of the data stored on that node, and initialize m to the preset migration amount; then go to step D.
If m is not greater than the size of the data stored in the available storage area, obtain or update the search time $t_1$ that would result if data of size m were moved from the highest-priority data set in the node to that available storage area;
In step D, the highest-priority data set in the node is obtained as follows. First, for each data set in the node, obtain the searches q made against the data set, the number of searches c of the data set, the improvement j in intermediate data transmission delay brought about by the data set, the reduction t' in maximum search time brought about by moving the data set, and the cost consum required to move the data set. Then, for each data set in the node, obtain the value of the data set and, from the ratio of the data set's value to the cost required to move it, obtain the score of the data set. Arranging the data sets in order of score from high to low gives their priority from high to low.
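The ranking step can be illustrated as below. The text does not say how a data set's value is computed from q, c, j, and t', so this sketch simply takes the value as a given number; the dict keys `value` and `move_cost` and the sample figures are assumptions for the example.

```python
def rank_by_score(datasets: list) -> list:
    """Order data sets by score = value / move cost, highest score (priority) first.

    Each data set is a dict with 'name', 'value', and 'move_cost' (assumed layout).
    How 'value' is derived is not specified in the text and is left to the caller.
    """
    return sorted(datasets, key=lambda d: d["value"] / d["move_cost"], reverse=True)

datasets = [
    {"name": "A", "value": 8.0, "move_cost": 4.0},  # score 2.0
    {"name": "B", "value": 9.0, "move_cost": 3.0},  # score 3.0
    {"name": "C", "value": 2.0, "move_cost": 2.0},  # score 1.0
]
ranked = rank_by_score(datasets)
highest_priority = ranked[0]
```

Dividing by the move cost favors data sets that deliver the most benefit per unit of migration effort, which is why B outranks A even though their values are close.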
Obtain or update the search time $t_2$ corresponding to the assumed data move, and judge whether $t_2$ is less than $t_1$. If so, update m by the preset migration data increment; otherwise, take $t_1$ as the minimum search time for migrating data from the node to that available storage area, and record the migration data size corresponding to this minimum search time.
If m is greater than the size of the data stored in the available storage area, judge whether any node remains in the available storage area set S. If so, the method is executed iteratively. If not, obtain the smallest of the recorded minimum search times together with its corresponding available storage area and migration data size, take that available storage area as the target node and that migration data size as the target migration data size, and move data of the target migration data size from the highest-priority data set in the node to the target node.
The above data scheduling strategy stores data on nodes close to the data, which can greatly reduce bandwidth consumption and reduce the search response delay caused by data transmission.
During content search by the search engine, filtering items with a high recent search frequency are counted, including filtering items that maintain a high search frequency and filtering items that were searched heavily within a recent short period. A search frequency FRQ is defined for filtering items to distinguish how frequently different filtering items are searched. For a search filtering item a, the search frequency FRQ of a is calculated as follows:
From the search web pages, obtain the number of times the term a was searched on day t, where t denotes the number of days from the start time of the valid records to the time of the web page record;
Set a recent-record ratio $q_{rec}$. For the search web pages $D_t$ of day t, when $t \geq q_{rec} \times T$, the records contained in $D_t$ are treated as recent records; when $t < q_{rec} \times T$, the records contained in $D_t$ are treated as historical records. For historical records, set a search count upper threshold $Max_{his}$; when the search count is higher than $Max_{his}$, $Count_{std}(t)$ takes the value $Max_{his}$. For recent records, set a search count upper threshold $Max_{rec}$; when the search count is higher than $Max_{rec}$, $Count_{std}(t)$ takes the value $Max_{rec}$. To capture the frequent items searched heavily within a recent short period, $Max_{rec} > Max_{his}$ should hold.
$Count_{std}(t)$ is weighted according to a weighting function $W(t)$. The weighting function $W(t)$ satisfies the following conditions:
where b is an adjustment parameter greater than 1.
Calculate the frequency of attribute a:
$$FRQ = \sum_{t} W(t) \times Count_{std}(t)$$
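The FRQ calculation can be sketched as follows: daily counts are clipped at the historical or recent cap and then summed with weights. The conditions on W(t) are not reproduced in this text, so the weighting function is passed in by the caller; the `placeholder_weight` used in the example is purely an assumption for illustration, as are the function names and sample numbers. T is taken here to be the total number of days covered, which is also an assumption.

```python
from typing import Callable, Dict

def clipped_count(count: int, t: int, total_days: int,
                  q_rec: float, max_his: int, max_rec: int) -> int:
    """Count_std(t): cap the day-t count at Max_rec (recent) or Max_his (historical)."""
    is_recent = t >= q_rec * total_days
    return min(count, max_rec if is_recent else max_his)

def search_frequency(daily_counts: Dict[int, int], total_days: int,
                     weight: Callable[[int], float],
                     q_rec: float, max_his: int, max_rec: int) -> float:
    """FRQ = sum over t of W(t) * Count_std(t)."""
    return sum(
        weight(t) * clipped_count(c, t, total_days, q_rec, max_his, max_rec)
        for t, c in sorted(daily_counts.items())
    )

T = 4                                   # assumed: total number of days covered
counts = {1: 100, 2: 5, 3: 50, 4: 8}    # searches of filtering item a per day
placeholder_weight = lambda t: t / T    # NOT the patent's W(t); illustration only
frq = search_frequency(counts, T, placeholder_weight,
                       q_rec=0.5, max_his=10, max_rec=20)
```

With these inputs day 1 is historical and clipped to 10, days 2 to 4 are recent with day 3 clipped to 20, giving 10 × 0.25 + 5 × 0.5 + 20 × 0.75 + 8 × 1.0 = 28.0. The lower historical cap keeps an old burst from dominating the result.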
On the basis of the FRQ results, an iterative level-by-level search method is used. First, the set of frequent 1-itemsets is found according to the search frequency of the filtering items; this set is denoted A1. A1 is used to find the set A2 of frequent 2-itemsets, and the k-itemsets Ak are found by iterating in this way until no larger frequent itemsets exist. Finding each Ak requires one scan of the entire web database. All non-empty subsets of any frequent itemset must also be frequent.
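The level-by-level search described above matches the general shape of the Apriori algorithm. The sketch below is a minimal version of that technique under a simplifying assumption: frequency is measured by a plain support count over a list of searches, rather than by the FRQ score the text uses for the 1-itemsets. The function name and sample data are hypothetical.

```python
from itertools import combinations

def apriori(transactions: list, min_support: int) -> list:
    """Return [A1, A2, ...], where Ak is the set of frequent k-itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset: frozenset) -> int:
        # One pass over all transactions for each candidate.
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    levels, k = [], 1
    while current:
        levels.append(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return levels

searches = [
    {"price", "brand"},
    {"price", "brand", "color"},
    {"price", "color"},
    {"brand"},
]
levels = apriori(searches, min_support=2)
```

Here `levels[0]` holds the three single filtering items and `levels[1]` holds {price, brand} and {price, color}. The 3-item candidate is pruned because {brand, color} is not frequent, which is the subset property stated in the text.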
The returned set of frequent attribute groups contains filtering itemsets of different lengths. The data corresponding to the filtering item combinations of different lengths that meet the frequency requirement are cached in memory, and a tree index is built over the result attributes that are searched more frequently, in order to handle the case where attributes are only partially hit in the cache. When a user performs a search, if the searched data block is in memory, the search is performed directly in memory and the search result is returned.
In actual data searches, there are three different hit cases.
(1) Miss: none of the filtering items in the search are in the memory cache. The data block is loaded again from disk to perform the search operation.
(2) Partial hit: only some of the filtering items in the search are in the memory cache. The records in the database are filtered according to the attribute data already in memory, reducing the amount of data that needs further searching in the database. Index values for the corresponding result attributes are provided to accelerate the data search.
(3) Full hit: all of the filtering items in the search are in the memory cache. The search result is returned directly.
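The three hit cases reduce to a set comparison between the filtering items of a search and those held in the cache. A minimal sketch, with the function name and return labels chosen for the example:

```python
def classify_hit(query_filters, cached_filters) -> str:
    """Classify a search as 'miss', 'partial', or 'full' against the memory cache."""
    query = set(query_filters)
    hit = query & set(cached_filters)
    if not hit:
        return "miss"      # nothing cached: load the data block from disk
    if hit == query:
        return "full"      # everything cached: answer directly from memory
    return "partial"       # pre-filter with cached attributes, then search the rest

cached = {"price", "brand"}
results = [
    classify_hit({"color"}, cached),
    classify_hit({"price", "color"}, cached),
    classify_hit({"price", "brand"}, cached),
]
```

A caller would dispatch on the returned label to choose between the disk path, the pre-filtered database search, and the in-memory answer.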
Cache update operations should be carried out during the period when the node access load is lowest. The update process includes:
(1) Disable the memory cache data. Data search requests that arrive during this time search the data blocks in the underlying storage directly, and the cache attribute hit rate is 0;
(2) Obtain the frequently searched data blocks and the set of frequently searched attribute groups.
(3) Lock the frequently searched data blocks, restricting modification of the data in the blocks.
(4) Compare the frequently searched data to be cached with the data already cached in memory. Keep the cached data that is identical; for cached data that differs, remove the old high-frequency data and load the new high-frequency data.
(5) Unlock the frequently searched data blocks, restoring permission to modify the data in the blocks.
(6) Update the index structure of the cached data and enable the memory cache data.
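The update steps above can be sketched as a small cache class. This is a simplified illustration under stated assumptions: the `FilterCache` class and its `loader` callable are hypothetical, a single in-process lock stands in for locking the data blocks, and the index rebuild of step (6) is omitted.

```python
import threading

class FilterCache:
    """In-memory cache of frequently searched data, refreshed in place."""

    def __init__(self, loader):
        self._loader = loader          # hypothetical callable: key -> data from storage
        self._data = {}
        self._enabled = True
        self._lock = threading.Lock()  # stands in for the data-block lock

    def get(self, key):
        # While the cache is disabled, every request goes to underlying storage.
        if self._enabled and key in self._data:
            return self._data[key]
        return self._loader(key)

    def refresh(self, frequent_keys):
        self._enabled = False                      # step (1): disable the cache
        new_keys = set(frequent_keys)              # step (2): new frequent set
        with self._lock:                           # steps (3) and (5): lock, unlock
            for key in set(self._data) - new_keys:
                del self._data[key]                # step (4): drop old entries
            for key in new_keys - set(self._data):
                self._data[key] = self._loader(key)  # step (4): load new entries
        self._enabled = True                       # step (6): enable the cache

loaded = []
def load_from_storage(key):
    loaded.append(key)
    return key.upper()

cache = FilterCache(load_from_storage)
cache.refresh({"a", "b"})
cache.refresh({"b", "c"})   # keeps "b", evicts "a", loads "c"
```

Because identical entries are kept rather than reloaded, the second refresh touches storage only for "c", which is the saving step (4) is after.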
In conclusion the present invention proposes a kind of distributed computing method for data mining, data search is optimized
Engine technique reduces the update cost of data while ensureing that search engine searches for multidimensional data high-performance, efficiently real
Multi-dimensional search is showed.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing systems. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented with program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the present invention and do not limit the present invention. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall be included in the protection scope of the present invention.
Claims (3)
1. A distributed computing method for data mining, for implementing multidimensional data search in a search engine, characterized by comprising:
establishing an index structure according to the association between homepage data and secondary page data; and obtaining a set of hit homepage IDs corresponding to the term filtering items, and a set of hit secondary page IDs corresponding to the hit homepage IDs.
2. The method according to claim 1, characterized in that establishing the index structure according to the association between homepage data and secondary page data further comprises: establishing a homepage reverse indexing table and a secondary page reverse indexing table according to the association between homepage data and secondary page data;
associating homepage data and secondary page data by recording the storage location of the associated secondary page in the homepage reverse indexing table, and recording the storage location of the associated homepage in the secondary page reverse indexing table;
the homepage reverse indexing table includes: a subject term index, and a homepage record set corresponding to the subject term index;
wherein a homepage record stores the page ID of the target homepage that contains the indexed subject term, and the secondary page information associated with the target homepage;
the secondary page reverse indexing table includes: a secondary term index, and a secondary page record set corresponding to the secondary term index;
wherein a secondary page record stores the page ID of the target secondary page that contains the indexed secondary term, and the homepage information associated with the target secondary page ID.
3. The method according to claim 1, characterized by further comprising:
storing the crawled homepage data and secondary page data in different Hadoop storage nodes respectively;
and, in the Hadoop storage nodes, different pages correspond to different page IDs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810226620.5A CN108388669A (en) | 2018-03-19 | 2018-03-19 | Distributed computing method for data mining |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108388669A true CN108388669A (en) | 2018-08-10 |
Family
ID=63067202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810226620.5A Withdrawn CN108388669A (en) | 2018-03-19 | 2018-03-19 | Distributed computing method for data mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108388669A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160012134A1 (en) * | 2011-01-27 | 2016-01-14 | International Business Machines Corporation | Distributed multi-system management |
CN105677904A (en) * | 2016-02-04 | 2016-06-15 | 杭州数梦工场科技有限公司 | Distributed file system based small file storage method and device |
CN107229631A (en) * | 2016-03-24 | 2017-10-03 | 北京京东尚科信息技术有限公司 | A kind of method and apparatus for capturing website data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109739849B (en) | Data-driven network sensitive information mining and early warning platform | |
CN110704411B (en) | Knowledge graph building method and device suitable for art field and electronic equipment | |
US7502780B2 (en) | Information storage and retrieval | |
US7779001B2 (en) | Web page ranking with hierarchical considerations | |
JP5092165B2 (en) | Data construction method and system | |
CN104376052B (en) | A kind of same money commodity merging method based on commodity image | |
US7516397B2 (en) | Methods, apparatus and computer programs for characterizing web resources | |
US20060095852A1 (en) | Information storage and retrieval | |
CN106202514A (en) | Accident based on Agent is across the search method of media information and system | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
US20090094194A1 (en) | Method and system for optimizing database performance | |
Xie et al. | Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb | |
US20120233096A1 (en) | Optimizing an index of web documents | |
CN110390352A (en) | A kind of dark data value appraisal procedure of image based on similitude Hash | |
US7668853B2 (en) | Information storage and retrieval | |
CN113254630B (en) | Domain knowledge map recommendation method for global comprehensive observation results | |
CN102855245A (en) | Image similarity determining method and image similarity determining equipment | |
CN110287201A (en) | Data access method, device, equipment and storage medium | |
CN101251857A (en) | Information storage and research | |
CN108427759A (en) | Real time data computational methods for mass data processing | |
CN113343012B (en) | News matching method, device, equipment and storage medium | |
CN107169020B (en) | directional webpage collecting method based on keywords | |
CN107133321B (en) | Method and device for analyzing search characteristics of page | |
CN108319626B (en) | Object classification method and device based on name information | |
CN113222109A (en) | Internet of things edge algorithm based on multi-source heterogeneous data aggregation technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20180810 |